How to design a distributed search engine for large datasets
A distributed search engine takes a query from a user, searches across billions of documents partitioned across hundreds of nodes, ranks results by relevance, and returns the top matches—all within 200 milliseconds. The core data structure powering this is the inverted index: a mapping from every word to the list of documents containing that word. When you search for "distributed caching strategies," the engine looks up each term in the inverted index, finds the intersection of document lists, scores each document using ranking algorithms like BM25, and returns the highest-scoring results. In system design interviews, the search engine problem tests your understanding of inverted indexes, sharding strategies, the separation of indexing and serving pipelines, ranking algorithms, and the trade-off between indexing latency and search freshness. Elasticsearch—a distributed search engine built on Apache Lucene—is the standard reference architecture for this problem.
Key Takeaways
- The search engine has two separate pipelines: the indexing pipeline (ingest documents, tokenize, build inverted index) and the serving pipeline (receive query, scatter to shards, gather and merge results). These pipelines must be decoupled—indexing runs continuously in the background while serving responds in under 200ms.
- The inverted index is the core data structure. It maps terms to document lists with position and frequency metadata. Building and maintaining this index efficiently at scale is the primary engineering challenge.
- Document partitioning (sharding by document) is the standard strategy. Every query fans out to all shards in parallel, each shard returns its local top-K results, and a coordinator merges them into the global top-K. This is what Elasticsearch, Google, and Solr use.
- BM25 is the default ranking algorithm—it improves on TF-IDF with document length normalization and term frequency saturation. For modern search, a two-stage ranking pipeline combines BM25 for initial scoring with ML re-ranking (BERT, LambdaMART) for final ordering.
- Elasticsearch is a high-level orchestration framework for Apache Lucene. In interviews, reference Elasticsearch for the distributed systems layer (cluster coordination, sharding, replication) and Lucene for the search mechanics (inverted index, tokenization, scoring).
Step 1: Requirements and Scope
Functional requirements:
- Full-text search: Given a text query, return the most relevant documents from a corpus of billions.
- Filtering: Support filtering by metadata (date range, category, author, language).
- Sorting: Sort by relevance score (default), recency, or custom fields.
- Autocomplete/typeahead: Suggest completions as the user types.
- Near-real-time indexing: New documents become searchable within seconds of ingestion.
Non-functional requirements:
- Latency: Search results returned within 200ms (p99).
- Scalability: Support a corpus of 10B+ documents totaling 100 TB+.
- Availability: 99.99% uptime—search is the primary user interaction.
- Freshness: New content searchable within 1–5 seconds of indexing.
- Throughput: Handle 50,000 search queries per second at peak.
Interview tip: Clarify with the interviewer: "Are we designing a web-scale search engine like Google, or a product search engine for an e-commerce platform?" The scope changes dramatically—web search adds crawling, PageRank, and spam detection. Product search focuses on structured filtering and merchandising.
Step 2: The Inverted Index — Core Data Structure
An inverted index maps every term to a posting list containing the documents where that term appears, along with metadata like term frequency and position.
Example: Given three documents:
- Doc 1: "distributed caching strategies"
- Doc 2: "caching invalidation patterns"
- Doc 3: "distributed systems design"
The inverted index:
| Term | Posting List |
|---|---|
| distributed | Doc 1 (pos: 0), Doc 3 (pos: 0) |
| caching | Doc 1 (pos: 1), Doc 2 (pos: 0) |
| strategies | Doc 1 (pos: 2) |
| invalidation | Doc 2 (pos: 1) |
| patterns | Doc 2 (pos: 2) |
| systems | Doc 3 (pos: 1) |
| design | Doc 3 (pos: 2) |
A query for "distributed caching" looks up both terms, finds the intersection (Doc 1 appears in both posting lists), and returns Doc 1 as the most relevant result.
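To make the mechanics concrete, here is a minimal sketch of building the inverted index and intersecting posting lists, using plain Python dictionaries and sets (the function names and toy corpus are illustrative, not tied to any library):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} for the toy corpus above."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def search(index, query):
    """Intersect the posting lists of every query term."""
    postings = [set(index.get(term, {})) for term in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)

docs = {
    1: "distributed caching strategies",
    2: "caching invalidation patterns",
    3: "distributed systems design",
}
index = build_inverted_index(docs)
print(search(index, "distributed caching"))  # {1} -> only Doc 1 contains both terms
```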
Text processing pipeline (before indexing):
- Tokenization: Split text into individual terms ("distributed caching" → ["distributed", "caching"]).
- Lowercasing: Convert to lowercase for case-insensitive search.
- Stop word removal: Remove common words ("the," "is," "and") that add noise.
- Stemming/lemmatization: Reduce words to root form ("running" → "run," "strategies" → "strategy").

This ensures "caching strategies" matches documents containing "cache strategy."
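A hedged sketch of these normalization steps, with a tiny suffix-stripping function standing in for a real stemmer such as Porter (the stop-word list and stemming rules are illustrative only):

```python
STOP_WORDS = {"the", "is", "and", "a", "of", "for"}  # illustrative subset

def naive_stem(term):
    # Placeholder for a real stemmer (e.g. Porter); handles only a few suffixes.
    for suffix in ("ies", "ing", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)] + ("y" if suffix == "ies" else "")
    return term

def analyze(text):
    """Tokenize, lowercase, drop stop words, and stem."""
    tokens = text.lower().split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(analyze("Caching strategies for running distributed systems"))
# ['cach', 'strategy', 'runn', 'distributed', 'system']
```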
Step 3: Architecture — Indexing and Serving Pipelines
The Indexing Pipeline
New documents are ingested, processed, and added to the inverted index. This pipeline runs continuously in the background.
Data flow: Document source (database CDC, API, crawler) → Message queue (Kafka) → Text processor (tokenize, normalize, stem) → Index writer (builds inverted index segments) → Segment merge (combines small segments into larger ones for query efficiency) → Replicate to serving nodes.
Near-real-time indexing: Elasticsearch achieves near-real-time search through a refresh interval (default: 1 second). New documents are written to an in-memory buffer, then flushed to a searchable segment every refresh interval. This means documents become searchable within 1 second of ingestion—not instantly, but fast enough for most use cases.
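In Elasticsearch the refresh interval is a per-index setting; a quick sketch over the REST API using `requests` (the index name `documents` and the local endpoint are assumptions):

```python
import requests

# Tune how often the in-memory buffer is flushed to a searchable segment.
# "1s" is the Elasticsearch default; raising it (e.g. "30s") trades search
# freshness for higher indexing throughput during bulk loads.
resp = requests.put(
    "http://localhost:9200/documents/_settings",
    json={"index": {"refresh_interval": "1s"}},
)
print(resp.json())
```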
The Serving Pipeline
Receives a query, searches the inverted index, and returns ranked results.
Data flow (scatter-gather pattern):
Query arrives at a coordinator node. The coordinator parses the query, applies filters, and routes the request to all index shards in parallel (scatter). Each shard searches its local inverted index, scores documents using BM25, and returns its local top-K results. The coordinator merges all shards' results into a single sorted list (gather), applies global re-ranking, and returns the final top-K to the client.
Interview application: "The search serving pipeline uses a scatter-gather pattern. The coordinator sends the query to all 50 shards in parallel. Each shard returns its top 100 results scored by BM25. The coordinator merges these 5,000 results, re-ranks with an ML model, and returns the top 10. The entire process completes within 200ms because each shard searches independently and the coordinator only merges pre-sorted lists."
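A minimal coordinator sketch of the scatter-gather merge, assuming each shard object exposes a `search(query, k)` method that returns its local top-K as `(score, doc_id)` pairs sorted by score descending (the shard interface is hypothetical):

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def scatter_gather(shards, query, shard_k=100, final_k=10):
    """Fan the query out to every shard, then merge the pre-sorted local top-Ks."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        # Scatter: each shard scores its own documents independently (e.g. BM25).
        local_results = pool.map(lambda s: s.search(query, shard_k), shards)
        # Gather: merge the already-sorted lists and keep the global top final_k.
        merged = heapq.merge(*local_results, key=lambda hit: hit[0], reverse=True)
        return list(merged)[:final_k]
```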
Step 4: Sharding Strategy
Document Partitioning (Industry Standard)
Shard by document: documents 1–1M on shard 0, documents 1M–2M on shard 1, and so on. Each shard holds a complete inverted index for its subset of documents.
Pros: Each document's data is self-contained on one shard. Adding documents scales by adding shards. This is what Google, Elasticsearch, and Solr use.
Cons: Every query must fan out to all shards (scatter-gather). With 50 shards, each query generates 50 parallel sub-queries.
Term Partitioning (Alternative)
Shard by term: all documents containing "apple" on shard 3, all documents containing "banana" on shard 7. A single-term query hits only one shard.
Pros: Single-term queries are fast (one shard only).
Cons: Multi-term queries require scatter-gather across all relevant shards. Hotspot risk—popular terms create unbalanced shards.
Interview recommendation: "I would use document partitioning. Every query fans out to all shards, but this is parallelized and predictable. Term partitioning creates hotspots on popular terms and still requires scatter-gather for multi-term queries. Document partitioning is the standard approach used by Elasticsearch and Google."
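A sketch of hash-based document routing; Elasticsearch actually hashes the routing key with murmur3, so the md5 used here is only to keep the example self-contained:

```python
import hashlib

NUM_SHARDS = 50

def shard_for(doc_id: str) -> int:
    """Route a document to one shard by hashing its ID, so each write touches a
    single shard while every search still fans out to all NUM_SHARDS shards."""
    digest = hashlib.md5(doc_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for("doc-42"))  # deterministic shard assignment in [0, 49]
```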
Step 5: Ranking and Relevance
BM25 — The Default Ranking Algorithm
BM25 (Best Match 25) scores documents based on term frequency, inverse document frequency, and document length normalization. It is the default ranking algorithm in Elasticsearch and Solr.
Key improvements over TF-IDF:
Term frequency saturation: Additional occurrences of a term give diminishing returns. A document mentioning "caching" 100 times is not 100x more relevant than one mentioning it once—BM25's k1 parameter controls this saturation curve.
Document length normalization: Shorter documents are rewarded over longer ones for the same term frequency. A 500-word article mentioning "caching" 5 times is more focused than a 50,000-word book mentioning it 5 times—BM25's b parameter controls this normalization.
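A sketch of the BM25 formula with the common defaults k1 = 1.2 and b = 0.75 (Lucene's production implementation differs in small details such as IDF smoothing):

```python
import math

def bm25_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one query term against one document.
    tf       -- term frequency in this document
    doc_len  -- length of this document in tokens
    doc_freq -- number of documents in the corpus containing the term
    """
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
    # k1 caps the benefit of repeated terms; b scales length normalization.
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

# A 500-word doc with 5 mentions outscores a 50,000-word doc with 5 mentions.
print(bm25_score(tf=5, doc_len=500, avg_doc_len=1000, n_docs=10_000, doc_freq=200))
print(bm25_score(tf=5, doc_len=50_000, avg_doc_len=1000, n_docs=10_000, doc_freq=200))
```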
Two-Stage Ranking Pipeline
Production search engines use a two-stage ranking approach.
Stage 1 — Initial retrieval (BM25): Score all matching documents cheaply. Return the top 1,000 candidates. This stage runs on every shard and must be fast (under 50ms per shard).
Stage 2 — ML re-ranking: Apply a heavier ML model (BERT for semantic understanding, LambdaMART for learning-to-rank) to the top 1,000 candidates. Re-order based on features like click-through history, freshness, user context, and semantic similarity. Return the top 10–20 to the user.
Interview application: "I would use BM25 for initial scoring on each shard—it is fast and effective for term-matching relevance. The coordinator collects the top 1,000 results across all shards and applies an ML re-ranker using a cross-encoder BERT model that considers semantic similarity, document freshness, and user click history. The re-ranker runs on GPU-backed instances and processes 1,000 candidates in under 50ms."
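A skeleton of that two-stage flow, treating the re-ranker as a hypothetical `rerank_model.score(query, doc)` callable so that any cross-encoder or learning-to-rank model could fill the slot:

```python
def two_stage_rank(query, bm25_candidates, rerank_model, final_k=10):
    """bm25_candidates are stage-1 results (~1,000 (score, doc) pairs).
    Stage 2 re-scores them with the heavier model and keeps the global top."""
    rescored = [
        (rerank_model.score(query, doc), doc)  # hypothetical model interface;
        for _, doc in bm25_candidates          # the BM25 score could also be a feature
    ]
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in rescored[:final_k]]
```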
Step 6: Replication and Fault Tolerance
Each shard has a primary copy and 1–2 replica copies on different nodes. Write operations go to the primary, which replicates to replicas. Read operations can be served by any copy—distributing read load across replicas.
If a node fails, its shard replicas on other nodes continue serving queries. The cluster promotes a replica to primary and creates a new replica on a healthy node. This automatic failover maintains availability during node failures.
Interview application: "Each of our 50 shards has 1 primary and 2 replicas distributed across different nodes—150 shard copies total. Reads are load-balanced across all copies. If a node fails, the cluster promotes replicas and creates new copies automatically. This architecture tolerates the failure of any single node without query interruption."
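In Elasticsearch, shard and replica counts are index settings declared at creation time; a sketch over the REST API (the index name and local endpoint are assumptions):

```python
import requests

# 50 primary shards, each with 2 replicas -> 150 shard copies across the cluster.
resp = requests.put(
    "http://localhost:9200/documents",
    json={"settings": {"number_of_shards": 50, "number_of_replicas": 2}},
)
print(resp.json())
```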
Step 7: Scaling Considerations
Handling 10B documents: At an average of 10 KB per document, 10B documents = 100 TB of raw data. The inverted index adds approximately 30–50% overhead. Total storage: ~150 TB. With 50 shards, each shard stores 3 TB—manageable on modern SSDs with 32–64 GB RAM per node for caching hot segments.
Handling 50,000 QPS: With 50 shards × 3 copies = 150 shard instances serving reads, each instance handles ~333 QPS—well within capacity for Elasticsearch nodes with adequate RAM for caching.
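The back-of-envelope arithmetic as a quick sanity check (all inputs are the assumptions stated above):

```python
docs = 10_000_000_000               # 10B documents
raw_tb = docs * 10_000 / 1e12       # 10 KB average doc size -> 100 TB raw
index_tb = raw_tb * 1.5             # ~50% inverted-index overhead -> ~150 TB
shards = 50
per_shard_tb = index_tb / shards          # ~3 TB per primary shard
qps_per_copy = 50_000 / (shards * 3)      # 150 shard copies -> ~333 QPS each
print(raw_tb, index_tb, per_shard_tb, round(qps_per_copy))
```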
Autocomplete/typeahead: Use a separate index with edge n-gram tokenization. "elasticsearch" is indexed as ["e", "el", "ela", "elas", ...]. Queries match prefix tokens instantly. Serve from a dedicated in-memory index for sub-10ms latency.
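A sketch of such an autocomplete index in Elasticsearch using the built-in edge_ngram tokenizer (the index name, field name, and gram sizes are assumptions chosen to illustrate the pattern):

```python
import requests

autocomplete_index = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "edge_tokenizer": {"type": "edge_ngram", "min_gram": 1, "max_gram": 15}
            },
            "analyzer": {
                # Index-time analyzer emits prefix tokens: "e", "el", "ela", ...
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "edge_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            # Queries use the standard analyzer so the user's input is matched
            # as typed against the stored prefix tokens.
            "title": {
                "type": "text",
                "analyzer": "autocomplete",
                "search_analyzer": "standard",
            }
        }
    },
}

resp = requests.put("http://localhost:9200/suggest", json=autocomplete_index)
print(resp.json())
```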
For structured practice on search engine design and other real-world system design problems, Grokking the System Design Interview covers search architecture as a core design pattern.
For advanced search patterns including semantic search, vector similarity, and production-scale Elasticsearch clusters, Grokking the Advanced System Design Interview builds the depth required for L6+ interviews. The system design interview guide provides the broader framework for approaching any system design problem.
Frequently Asked Questions
What is an inverted index and why is it essential for search?
An inverted index maps every term to the list of documents containing that term. Without it, searching requires scanning every document (O(N) per query). With it, searching is an index lookup (O(1) for term lookup, then intersection of posting lists). Every search engine—Google, Elasticsearch, Solr—is built on inverted indexes.
How does Elasticsearch distribute search across nodes?
Elasticsearch divides each index into shards (document partitions) distributed across nodes. Each shard has primary and replica copies. Queries scatter to all shards in parallel, each shard returns local top-K results, and a coordinator gathers and merges them into the global result set.
What is BM25 and why is it the default ranking algorithm?
BM25 scores documents using term frequency, inverse document frequency, and document length normalization. It improves on TF-IDF by adding term frequency saturation (diminishing returns for repeated terms) and length normalization (shorter documents are rewarded). It is the default in Elasticsearch and Solr because it provides strong relevance without ML overhead.
Should I use document partitioning or term partitioning?
Document partitioning. Shard by document so each shard holds a complete inverted index for its subset. Every query fans out to all shards, but this is parallelized and predictable. Term partitioning creates hotspots on popular terms and is rarely used in production. Google, Elasticsearch, and Solr all use document partitioning.
How does near-real-time search work?
New documents are written to an in-memory buffer and flushed to a searchable index segment at a configurable refresh interval (Elasticsearch default: 1 second). Documents are not instantly searchable but become searchable within the refresh interval. This balances indexing throughput with search freshness.
What is the scatter-gather pattern in distributed search?
The coordinator sends the query to all shards in parallel (scatter). Each shard returns its local top-K results. The coordinator merges all results into a single sorted list (gather) and returns the global top-K. This pattern parallelizes search across hundreds of shards while maintaining consistent result quality.
How do I handle search relevance beyond keyword matching?
Use a two-stage pipeline: BM25 for fast initial retrieval of the top 1,000 candidates, then ML re-ranking (BERT cross-encoder, LambdaMART) for semantic understanding, personalization, and freshness boosting. The ML model runs on the merged candidate set, not on every document, keeping latency under 200ms.
How much RAM does a search cluster need?
Elasticsearch performs best when the most frequently accessed index segments fit in the OS file system cache. A common guideline is 50% of node RAM for Elasticsearch heap (max 32 GB for compressed oops) and 50% for OS cache. A 64 GB RAM node provides 32 GB heap + 32 GB file cache—sufficient for a 3 TB shard.
When should I use Elasticsearch vs PostgreSQL full-text search?
PostgreSQL full-text search is sufficient for datasets under ~10 million documents with simple search requirements. Elasticsearch is needed when you have 100M+ documents, require faceted search, need sub-100ms latency across complex queries, or need near-real-time indexing. The crossover is around 10–50 million documents depending on query complexity.
How do I design autocomplete for a search engine?
Use a separate Elasticsearch index with edge n-gram tokenization. The term "elasticsearch" is indexed as prefix tokens ["e", "el", "ela", ...]. As the user types, prefix queries match against these tokens and return suggestions in under 10ms. Serve from dedicated nodes with high RAM for caching the autocomplete index entirely in memory.
TL;DR
A distributed search engine separates into two pipelines: indexing (ingest, tokenize, build inverted index) and serving (scatter query to all shards, gather and merge results). The inverted index maps terms to document lists—the core data structure that makes search O(1) instead of O(N). Use document partitioning (shard by document, fan out to all shards per query)—this is what Google, Elasticsearch, and Solr use. Rank with BM25 for fast initial scoring, then ML re-ranking (BERT, LambdaMART) for the top 1,000 candidates. Achieve near-real-time search with 1-second refresh intervals. Replicate each shard (primary + 2 replicas) for fault tolerance and read scaling. At 10B documents and 50,000 QPS: 50 shards × 3 copies = 150 shard instances, each handling ~333 QPS with 3 TB of data. Elasticsearch is the standard reference—it handles distributed coordination while Apache Lucene handles the search mechanics underneath.