Web Crawler
Crawl billions of pages: fetch, parse, dedupe, and feed an index, politely.
Open the interactive version → diagrams, practice & moreRequirements
Functional
- Seed → BFS frontier
- Fetch & parse links
- Dedupe URLs/content
- Respect robots.txt & politeness
Non-functional
- Massive throughput
- No traps/loops
- Freshness
Scale
Billions of pages
The approach
A distributed URL frontier (priority + politeness queues), fetcher workers, content dedup (hash/simhash), URL dedup (bloom filter), and a parser that extracts new links back into the frontier.
Key components
Frontier (queues) → fetchers → parser → dedup (bloom) → storage/index
Numbers that matter
- Google's crawler is estimated to index ~5-10 billion URLs in its active frontier at any time, with fresh crawl cycles for high-priority pages as short as hours
- A Bloom filter for 1 billion URLs at 1% false positive rate requires approximately ~1.14 GB of memory (9.6 bits/element) — fits in RAM on a single machine
- robots.txt must be fetched and cached per domain; a crawler hitting 100M distinct domains needs a robots cache of that size — evict by LRU with a 24-hour TTL to respect policy changes
- DNS resolution is a hidden bottleneck: resolving a new domain takes ~20-200ms; a dedicated DNS caching layer (with 1-hour TTL) can handle ~100K lookups/second to prevent this from gating fetch throughput
Senior deep-dive
The URL frontier is the core engineering problem — not fetching; fetching is just HTTP. Managing billions of URLs with priority, politeness constraints, and deduplication without a single bottleneck is the hard part.
Politeness is not optional: a crawler that ignores robots.txt and hammers a single domain becomes a DDoS attack; per-domain fetch queues with configurable crawl delay (default 1-10 seconds between hits to the same host) are mandatory.
Deduplication has two layers: URL dedup (have I seen this URL before?) via Bloom filter for speed, and content dedup (have I seen this content before?) via SimHash for near-duplicate detection — both are necessary because different URLs often serve identical content.
URL frontier: priority + politeness, together
The frontier must balance two competing concerns: priority (crawl high-value pages first — high PageRank, fresh content) and politeness (don't hit the same host more than once per crawl-delay interval). The standard architecture: a priority queue of per-domain queues. The scheduler picks the highest-priority domain whose next-crawl-time has passed, pops a URL from its queue, and dispatches it. This is not a simple global priority queue — you need the domain-level routing to enforce politeness, which is why Mercator and Nutch both use this two-level queue design.
URL dedup: Bloom filter is not enough alone
A Bloom filter gives O(1) approximate membership test — 'have I seen this URL before?' — with no false negatives and a tunable false positive rate. False positives mean occasionally skipping a URL you haven't seen; that's acceptable (you'll likely encounter it via another path). The trap: URL canonicalization must happen before Bloom filter insertion — `http://example.com/page` and `https://example.com/page?` and `HTTP://EXAMPLE.COM/page` must map to the same canonical form, or your dedup misses obvious duplicates. This normalization step (lowercase, sort params, strip fragments, resolve redirects) is where most crawler bugs live.
Content dedup: SimHash for near-duplicates
Many pages serve nearly identical content — boilerplate, scrapers, mirrors, pagination variants. Exact hash (MD5/SHA) deduplicates identical content; SimHash (Charikar, 2002) deduplicates near-identical content by producing a 64-bit fingerprint where similar documents have fingerprints with few bit differences (Hamming distance ≤ 3). Google uses SimHash to deduplicate crawled pages at scale. The lookup table for SimHash is the challenge: finding all fingerprints within Hamming distance 3 of a new fingerprint in a billion-entry table requires bit-permutation indexing (split 64 bits into k tables, each sorted by a permuted prefix).
Distributed fetcher fleet: the worker architecture
Fetchers are stateless workers that: resolve DNS, establish TCP/TLS, send HTTP GET, follow redirects (max N hops), respect timeout budgets, and return (URL, content, headers, status). Each fetcher maintains a persistent connection pool per host to amortize TCP handshake cost across multiple fetches from the same domain. The fetcher fleet should be geographically distributed — crawling from a single location misses geo-targeted content and creates hot-spots on paths between the crawler datacenter and popular server locations.
robots.txt and ethical crawling
Every domain's `robots.txt` must be fetched before crawling any URL from that domain. The file specifies per-user-agent crawl permissions and crawl-delay directives. Crawl-delay is often 1-10 seconds for common user agents. At 1 second per page, a single-domain crawl of a million-page site takes 11 days — per-domain parallelism is limited by politeness, not compute. Cache `robots.txt` aggressively (24h TTL), handle missing files as 'allow all', and never retry a 404 on `robots.txt` more than once per TTL cycle.
What breaks at scale
Spider traps are infinite URL spaces generated dynamically (calendars with infinite 'next day' links, session-parameterized URLs, infinite pagination) — your crawler will loop forever without a max-URL-depth limit and URL canonicalization that strips session tokens. DNS amplification: a naive crawler resolves the same hostname for every URL in its queue; at billions of URLs, a small number of fetchers can generate millions of DNS queries per second — mitigate with a dedicated DNS cache cluster and aggressive TTL-based caching. Finally, content pipeline backpressure: if the parser/indexer is slower than the fetcher, the fetched-but-unprocessed queue grows unboundedly — use backpressure signaling from the parser to the frontier scheduler to throttle fetch rate.
In production
Google's original architecture (from the 1998 paper) had a single crawler thread; modern Googlebot is a massively distributed system across many datacenters. Common Crawl (the public web corpus) runs open-source crawlers (Apache Nutch, Heritrix) and releases petabyte-scale snapshots monthly. Cloudflare and Akamai run crawlers to pre-populate their CDNs. The real engineering challenge is keeping the frontier fresh at internet scale: a page that changes every hour must be re-crawled hourly; one that never changes can be crawled monthly. Adaptive crawl scheduling (predict re-crawl interval from historical change frequency) is what separates search-engine crawlers from simple scrapers.
Common mistakes
- Single global queue (no per-domain politeness)
- Exact-match only dedup (misses near-dupes)
- Crawler traps / infinite calendars