Search Engine (Elasticsearch)
Full-text search over huge document sets with relevance ranking.
Open the interactive version → diagrams, practice & moreRequirements
Functional
- Index documents
- Full-text + filters
- Relevance ranking
- Aggregations
Non-functional
- Fast search
- Near-realtime indexing
- Scalable
Scale
Billions of docs
The approach
Documents tokenized into an inverted index (term→postings); index sharded across nodes with replicas; queries scatter-gather across shards, score by relevance (BM25/TF-IDF), merge top-k; segments + refresh give near-realtime.
Key components
Indexer → sharded inverted index (+replicas) → scatter-gather query → ranker
Numbers that matter
- A single Elasticsearch shard can hold ~50GB of index data comfortably; above that, merge amplification and JVM heap pressure cause latency to degrade non-linearly.
- Elasticsearch bulk indexing throughput peaks at roughly ~10,000–50,000 docs/sec per node depending on doc size and mapping complexity; single-document indexing is ~5–10× slower per doc due to per-request overhead.
- A full-text search across 1 billion documents on a 10-shard cluster can return in <100ms if the index fits in the OS page cache — cold queries (no cache) against SSDs take 500ms–2s.
- Elasticsearch's JVM heap should be set to 31GB (just below the 32GB compressed OOP threshold); above 32GB Java switches to 64-bit pointers and heap efficiency drops ~20–30%.
Senior deep-dive
Scatter-gather over shards is the query model: every search fans out to all primary shards in parallel, each returns its local top-k, and the coordinating node merges — latency is bounded by the slowest shard, so an imbalanced cluster (one hot shard, eleven cold) defeats the whole design.
The inverted index is immutable once written: Lucene segments are never modified in place; updates are a delete marker plus a new document, and merging compacts segments periodically — too many small segments from bulk indexing is the most common performance problem.
Relevance ranking (BM25) is computed per-shard without global IDF, so in a sharded index a rare term that happens to be common in one shard gets artificially low scores — a known accuracy defect that matters for small indices.
Lucene segments: the immutable building block
Each Elasticsearch index is a Lucene index composed of immutable segments — a segment is a complete inverted index over a subset of documents. New documents are buffered in memory (the indexing buffer), then flushed to a new segment on disk. A refresh (default every 1s) makes the segment searchable; a flush fsync's the segment to disk. Segments are merged in the background by tiered merge policy — too many small segments cause every query to open thousands of file handles.
Query execution: the two-phase scatter-gather
Phase 1 (query): the coordinating node broadcasts the search to all shards; each shard scores and returns its local top-k document IDs and scores. Phase 2 (fetch): the coordinator merges all results, takes the global top-k, then fetches the full `_source` for those documents from the shards that hold them. The insight: only the `_source` for the final top-k is transferred, not all documents — but phase 1 still scans all shards, so a 100-shard index has 100× the phase-1 work of a 1-shard index.
Relevance: BM25 and its sharding defect
BM25 scores a document by how rare the query terms are (IDF) and how often they appear in the document (TF). IDF is computed per-shard, not globally — a term that appears in 10% of documents globally but 90% in one shard gets a near-zero IDF score in that shard. This per-shard IDF skew is most pronounced in small, imbalanced shards. The fix is `search_type=dfs_query_then_fetch` which pre-computes global IDF in a scatter phase before scoring — correct but 2× the shard round-trips.
Shard sizing and the golden rule
The Elastic recommendation is 10–50GB per shard with 20 shards per GB of JVM heap as the upper limit. Over-sharding (hundreds of small shards) is a common mistake: each shard requires heap for its segment metadata, and having 1000 shards on a 32GB-heap node exhausts memory before data does. Time-based indices (one index per day) with ILM rollover is the standard pattern: each index is small enough to stay healthy, and old indices are deleted or moved to cold tier automatically.
Ingest pipelines and the mapping contract
Ingest nodes run enrichment pipelines (geoip, user-agent parsing, field renaming) before documents hit primary shards — this offloads transformation from the application. Dynamic templates define how auto-discovered fields are mapped: a catch-all template can mark all string fields `keyword` (no full-text) unless they match specific patterns, preventing the mapping explosion of a schema-less log pipeline. Once a field is mapped, changing its type requires a reindex — there is no ALTER COLUMN in Elasticsearch.
What breaks at scale
GC pauses at >75% JVM heap cause nodes to drop out of the cluster transiently — the master detects the node as failed, initiates shard recovery, which generates additional heap pressure, creating a GC death spiral. Hot shard imbalance: if a timestamp field is used for routing (or the default `_id` hash produces an unlucky distribution), one shard gets all the recent writes while others are idle, capping write throughput to one shard's capacity. Split-brain prevention requires `discovery.zen.minimum_master_nodes = (N/2)+1` — forgetting this on a 2-node cluster causes both nodes to elect themselves master and the cluster diverges silently.
In production
Elasticsearch powers search at GitHub (code search), Uber (log search), and Wikipedia (article search). OpenSearch (AWS fork post-Elastic license change) is functionally identical for most use cases. Solr is the older Lucene-based alternative, now mostly eclipsed. The real challenge at scale is index mapping explosion: Elasticsearch's dynamic mapping auto-creates fields from JSON documents, and a schema-less log stream with variable keys can generate thousands of field mappings, exhausting JVM heap in the mapping memory. The fix is strict mapping (reject unknown fields) combined with a `_source`-only fallback for ad-hoc fields.
Common mistakes
- Using it as a primary datastore
- Too many shards (overhead) or too few (no parallelism)
- Ignoring analyzer/tokenization for relevance