How to design a distributed search engine for large datasets
A distributed search engine takes a query from a user, searches across billions of documents partitioned across hundreds of nodes, ranks results by relevance, and returns the top matches—all within 200 milliseconds. The core data structure powering this is the inverted index: a mapping from every word to the list of documents containing that word. When you search for "distributed caching strategies," the engine looks up each term in the inverted index, finds the intersection of document lists, scores each document using ranking algorithms like BM25, and returns the highest-scoring results. In system design interviews, the search engine problem tests your understanding of inverted indexes, sharding strategies, the separation of indexing and serving pipelines, ranking algorithms, and the trade-off between indexing latency and search freshness. Elasticsearch—a distributed search engine built on Apache Lucene—is the standard reference architecture for this problem.
Key Takeaways
- The search engine has two separate pipelines: the indexing pipeline (ingest documents, tokenize, build inverted index) and the serving pipeline (receive query, scatter to shards, gather and merge results). These pipelines must be decoupled—indexing runs continuously in the background while serving responds in under 200ms.
- The inverted index is the core data structure. It maps terms to document lists with position and frequency metadata. Building and maintaining this index efficiently at scale is the primary engineering challenge.
- Document partitioning (sharding by document) is the standard strategy. Every query fans out to all shards in parallel, each shard returns its local top-K results, and a coordinator merges them into the global top-K. This is what Elasticsearch, Google, and Solr use.
- BM25 is the default ranking algorithm—it improves on TF-IDF with document length normalization and term frequency saturation. For modern search, a two-stage ranking pipeline combines BM25 for initial scoring with ML re-ranking (BERT, LambdaMART) for final ordering.
- Elasticsearch is a high-level orchestration framework for Apache Lucene. In interviews, reference Elasticsearch for the distributed systems layer (cluster coordination, sharding, replication) and Lucene for the search mechanics (inverted index, tokenization, scoring).
Step 1: Requirements and Scope
Functional requirements:
- Full-text search: Given a text query, return the most relevant documents from a corpus of billions.
- Filtering: Support filtering by metadata (date range, category, author, language).
- Sorting: Sort by relevance score (default), recency, or custom fields.
- Autocomplete/typeahead: Suggest completions as the user types.
- Near-real-time indexing: New documents become searchable within seconds of ingestion.
Non-functional requirements:
- Latency: Search results returned within 200ms (p99).
- Scalability: Support a corpus of 10B+ documents totaling 100 TB+.
- Availability: 99.99% uptime—search is the primary user interaction.
- Freshness: New content searchable within 1–5 seconds of indexing.
- Throughput: Handle 50,000 search queries per second at peak.
Interview tip: Clarify with the interviewer: "Are we designing a web-scale search engine like Google, or a product search engine for an e-commerce platform?" The scope changes dramatically—web search adds crawling, PageRank, and spam detection. Product search focuses on structured filtering and merchandising.
Step 2: The Inverted Index — Core Data Structure
An inverted index maps every term to a posting list containing the documents where that term appears, along with metadata like term frequency and position.
Example: Given three documents:
- Doc 1: "distributed caching strategies"
- Doc 2: "caching invalidation patterns"
- Doc 3: "distributed systems design"
The inverted index:
| Term | Posting List |
|---|---|
| distributed | Doc 1 (pos: 0), Doc 3 (pos: 0) |
| caching | Doc 1 (pos: 1), Doc 2 (pos: 0) |
| strategies | Doc 1 (pos: 2) |
| invalidation | Doc 2 (pos: 1) |
| patterns | Doc 2 (pos: 2) |
| systems | Doc 3 (pos: 1) |
| design | Doc 3 (pos: 2) |
A query for "distributed caching" looks up both terms, finds the intersection (Doc 1 appears in both posting lists), and returns Doc 1 as the most relevant result.
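To make the mechanics concrete, here is a minimal sketch of building the inverted index and intersecting posting lists, using plain Python dictionaries and sets (the function names and toy corpus are illustrative, not tied to any library):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} for the toy corpus above."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def search(index, query):
    """Intersect the posting lists of every query term."""
    postings = [set(index.get(term, {})) for term in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)

docs = {
    1: "distributed caching strategies",
    2: "caching invalidation patterns",
    3: "distributed systems design",
}
index = build_inverted_index(docs)
print(search(index, "distributed caching"))  # {1} -> only Doc 1 contains both terms
```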
Text processing pipeline (before indexing):
- Tokenization: Split text into individual terms ("distributed caching" → ["distributed", "caching"]).
- Lowercasing: Convert to lowercase for case-insensitive search.
- Stop word removal: Remove common words ("the," "is," "and") that add noise.
- Stemming/lemmatization: Reduce words to root form ("running" → "run," "strategies" → "strategy").

This ensures "caching strategies" matches documents containing "cache strategy."
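A hedged sketch of these normalization steps, with a tiny suffix-stripping function standing in for a real stemmer such as Porter (the stop-word list and stemming rules are illustrative only):

```python
STOP_WORDS = {"the", "is", "and", "a", "of", "for"}  # illustrative subset

def naive_stem(term):
    # Placeholder for a real stemmer (e.g. Porter); handles only a few suffixes.
    for suffix in ("ies", "ing", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)] + ("y" if suffix == "ies" else "")
    return term

def analyze(text):
    """Tokenize, lowercase, drop stop words, and stem."""
    tokens = text.lower().split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(analyze("Caching strategies for running distributed systems"))
# ['cach', 'strategy', 'runn', 'distributed', 'system']
```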
Step 3: Architecture — Indexing and Serving Pipelines
The Indexing Pipeline
New documents are ingested, processed, and added to the inverted index. This pipeline runs continuously in the background.
Data flow: Document source (database CDC, API, crawler) → Message queue (Kafka) → Text processor (tokenize, normalize, stem) → Index writer (builds inverted index segments) → Segment merge (combines small segments into larger ones for query efficiency) → Replicate to serving nodes.
Near-real-time indexing: Elasticsearch achieves near-real-time search through a refresh interval (default: 1 second). New documents are written to an in-memory buffer, then flushed to a searchable segment every refresh interval. This means documents become searchable within 1 second of ingestion—not instantly, but fast enough for most use cases.
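In Elasticsearch the refresh interval is a per-index setting; a quick sketch over the REST API using `requests` (the index name `documents` and the local endpoint are assumptions):

```python
import requests

# Tune how often the in-memory buffer is flushed to a searchable segment.
# "1s" is the Elasticsearch default; raising it (e.g. "30s") trades search
# freshness for higher indexing throughput during bulk loads.
resp = requests.put(
    "http://localhost:9200/documents/_settings",
    json={"index": {"refresh_interval": "1s"}},
)
print(resp.json())
```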
The Serving Pipeline
Receives a query, searches the inverted index, and returns ranked results.
Data flow (scatter-gather pattern):
Query arrives at a coordinator node. The coordinator parses the query, applies filters, and routes the request to all index shards in parallel (scatter). Each shard searches its local inverted index, scores documents using BM25, and returns its local top-K results. The coordinator merges all shards' results into a single sorted list (gather), applies global re-ranking, and returns the final top-K to the client.
Interview application: "The search serving pipeline uses a scatter-gather pattern. The coordinator sends the query to all 50 shards in parallel. Each shard returns its top 100 results scored by BM25. The coordinator merges these 5,000 results, re-ranks with an ML model, and returns the top 10. The entire process completes within 200ms because each shard searches independently and the coordinator only merges pre-sorted lists."
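A minimal coordinator sketch of the scatter-gather merge, assuming each shard object exposes a `search(query, k)` method that returns its local top-K as `(score, doc_id)` pairs sorted by score descending (the shard interface is hypothetical):

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def scatter_gather(shards, query, shard_k=100, final_k=10):
    """Fan the query out to every shard, then merge the pre-sorted local top-Ks."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        # Scatter: each shard scores its own documents independently (e.g. BM25).
        local_results = pool.map(lambda s: s.search(query, shard_k), shards)
        # Gather: merge the already-sorted lists and keep the global top final_k.
        merged = heapq.merge(*local_results, key=lambda hit: hit[0], reverse=True)
        return list(merged)[:final_k]
```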
Step 4: Sharding Strategy
Document Partitioning (Industry Standard)
Shard by document: documents 1–1M on shard 0, documents 1M–2M on shard 1, and so on. Each shard holds a complete inverted index for its subset of documents.
Pros: Each document's data is self-contained on one shard. Adding documents scales by adding shards. This is what Google, Elasticsearch, and Solr use.
Cons: Every query must fan out to all shards (scatter-gather). With 50 shards, each query generates 50 parallel sub-queries.
Term Partitioning (Alternative)
Shard by term: all documents containing "apple" on shard 3, all documents containing "banana" on shard 7. A single-term query hits only one shard.
Pros: Single-term queries are fast (one shard only).
Cons: Multi-term queries require scatter-gather across all relevant shards. Hotspot risk—popular terms create unbalanced shards.
Interview recommendation: "I would use document partitioning. Every query fans out to all shards, but this is parallelized and predictable. Term partitioning creates hotspots on popular terms and still requires scatter-gather for multi-term queries. Document partitioning is the standard approach used by Elasticsearch and Google."
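A sketch of hash-based document routing; Elasticsearch actually hashes the routing key with murmur3, so the md5 used here is only to keep the example self-contained:

```python
import hashlib

NUM_SHARDS = 50

def shard_for(doc_id: str) -> int:
    """Route a document to one shard by hashing its ID, so each write touches a
    single shard while every search still fans out to all NUM_SHARDS shards."""
    digest = hashlib.md5(doc_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for("doc-42"))  # deterministic shard assignment in [0, 49]
```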
Step 5: Ranking and Relevance
BM25 — The Default Ranking Algorithm
BM25 (Best Match 25) scores documents based on term frequency, inverse document frequency, and document length normalization. It is the default ranking algorithm in Elasticsearch and Solr.
Key improvements over TF-IDF:
Term frequency saturation: Additional occurrences of a term give diminishing returns. A document mentioning "caching" 100 times is not 100x more relevant than one mentioning it once—BM25's k1 parameter controls this saturation curve.
Document length normalization: Shorter documents are rewarded over longer ones for the same term frequency. A 500-word article mentioning "caching" 5 times is more focused than a 50,000-word book mentioning it 5 times—BM25's b parameter controls this normalization.
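A sketch of the BM25 formula with the common defaults k1 = 1.2 and b = 0.75 (Lucene's production implementation differs in small details such as IDF smoothing):

```python
import math

def bm25_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one query term against one document.
    tf       -- term frequency in this document
    doc_len  -- length of this document in tokens
    doc_freq -- number of documents in the corpus containing the term
    """
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
    # k1 caps the benefit of repeated terms; b scales length normalization.
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

# A 500-word doc with 5 mentions outscores a 50,000-word doc with 5 mentions.
print(bm25_score(tf=5, doc_len=500, avg_doc_len=1000, n_docs=10_000, doc_freq=200))
print(bm25_score(tf=5, doc_len=50_000, avg_doc_len=1000, n_docs=10_000, doc_freq=200))
```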
Two-Stage Ranking Pipeline
Production search engines use a two-stage ranking approach.
Stage 1 — Initial retrieval (BM25): Score all matching documents cheaply. Return the top 1,000 candidates. This stage runs on every shard and must be fast (under 50ms per shard).
Stage 2 — ML re-ranking: Apply a heavier ML model (BERT for semantic understanding, LambdaMART for learning-to-rank) to the top 1,000 candidates. Re-order based on features like click-through history, freshness, user context, and semantic similarity. Return the top 10–20 to the user.
Interview application: "I would use BM25 for initial scoring on each shard—it is fast and effective for term-matching relevance. The coordinator collects the top 1,000 results across all shards and applies an ML re-ranker using a cross-encoder BERT model that considers semantic similarity, document freshness, and user click history. The re-ranker runs on GPU-backed instances and processes 1,000 candidates in under 50ms."
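A skeleton of that two-stage flow, treating the re-ranker as a hypothetical `rerank_model.score(query, doc)` callable so that any cross-encoder or learning-to-rank model could fill the slot:

```python
def two_stage_rank(query, bm25_candidates, rerank_model, final_k=10):
    """bm25_candidates are stage-1 results (~1,000 (score, doc) pairs).
    Stage 2 re-scores them with the heavier model and keeps the global top."""
    rescored = [
        (rerank_model.score(query, doc), doc)  # hypothetical model interface;
        for _, doc in bm25_candidates          # the BM25 score could also be a feature
    ]
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in rescored[:final_k]]
```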
Step 6: Replication and Fault Tolerance
Each shard has a primary copy and 1–2 replica copies on different nodes. Write operations go to the primary, which replicates to replicas. Read operations can be served by any copy—distributing read load across replicas.
If a node fails, its shard replicas on other nodes continue serving queries. The cluster promotes a replica to primary and creates a new replica on a healthy node. This automatic failover maintains availability during node failures.
Interview application: "Each of our 50 shards has 1 primary and 2 replicas distributed across different nodes—150 shard copies total. Reads are load-balanced across all copies. If a node fails, the cluster promotes replicas and creates new copies automatically. This architecture tolerates the failure of any single node without query interruption."
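In Elasticsearch, shard and replica counts are index settings declared at creation time; a sketch over the REST API (the index name and local endpoint are assumptions):

```python
import requests

# 50 primary shards, each with 2 replicas -> 150 shard copies across the cluster.
resp = requests.put(
    "http://localhost:9200/documents",
    json={"settings": {"number_of_shards": 50, "number_of_replicas": 2}},
)
print(resp.json())
```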
Step 7: Scaling Considerations
Handling 10B documents: At an average of 10 KB per document, 10B documents = 100 TB of raw data. The inverted index adds approximately 30–50% overhead. Total storage: ~150 TB. With 50 shards, each shard stores 3 TB—manageable on modern SSDs with 32–64 GB RAM per node for caching hot segments.
Handling 50,000 QPS: With 50 shards × 3 copies = 150 shard instances serving reads, each instance handles ~333 QPS—well within capacity for Elasticsearch nodes with adequate RAM for caching.
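The back-of-envelope arithmetic as a quick sanity check (all inputs are the assumptions stated above):

```python
docs = 10_000_000_000               # 10B documents
raw_tb = docs * 10_000 / 1e12       # 10 KB average doc size -> 100 TB raw
index_tb = raw_tb * 1.5             # ~50% inverted-index overhead -> ~150 TB
shards = 50
per_shard_tb = index_tb / shards          # ~3 TB per primary shard
qps_per_copy = 50_000 / (shards * 3)      # 150 shard copies -> ~333 QPS each
print(raw_tb, index_tb, per_shard_tb, round(qps_per_copy))
```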
Autocomplete/typeahead: Use a separate index with edge n-gram tokenization. "elasticsearch" is indexed as ["e", "el", "ela", "elas", ...]. Queries match prefix tokens instantly. Serve from a dedicated in-memory index for sub-10ms latency.
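A sketch of such an autocomplete index in Elasticsearch using the built-in edge_ngram tokenizer (the index name, field name, and gram sizes are assumptions chosen to illustrate the pattern):

```python
import requests

autocomplete_index = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "edge_tokenizer": {"type": "edge_ngram", "min_gram": 1, "max_gram": 15}
            },
            "analyzer": {
                # Index-time analyzer emits prefix tokens: "e", "el", "ela", ...
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "edge_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            # Queries use the standard analyzer so the user's input is matched
            # as typed against the stored prefix tokens.
            "title": {
                "type": "text",
                "analyzer": "autocomplete",
                "search_analyzer": "standard",
            }
        }
    },
}

resp = requests.put("http://localhost:9200/suggest", json=autocomplete_index)
print(resp.json())
```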
For structured practice on search engine design and other real-world system design problems, Grokking the System Design Interview covers search architecture as a core design pattern.
For advanced search patterns including semantic search, vector similarity, and production-scale Elasticsearch clusters, Grokking the Advanced System Design Interview builds the depth required for L6+ interviews. The system design interview guide provides the broader framework for approaching any system design problem.
Frequently Asked Questions
What is an inverted index and why is it essential for search?
An inverted index maps every term to the list of documents containing that term. Without it, searching requires scanning every document (O(N) per query). With it, searching is an index lookup (O(1) for term lookup, then intersection of posting lists). Every search engine—Google, Elasticsearch, Solr—is built on inverted indexes.
How does Elasticsearch distribute search across nodes?
Elasticsearch divides each index into shards (document partitions) distributed across nodes. Each shard has primary and replica copies. Queries scatter to all shards in parallel, each shard returns local top-K results, and a coordinator gathers and merges them into the global result set.
What is BM25 and why is it the default ranking algorithm?
BM25 scores documents using term frequency, inverse document frequency, and document length normalization. It improves on TF-IDF by adding term frequency saturation (diminishing returns for repeated terms) and length normalization (shorter documents are rewarded). It is the default in Elasticsearch and Solr because it provides strong relevance without ML overhead.
Should I use document partitioning or term partitioning?
Document partitioning. Shard by document so each shard holds a complete inverted index for its subset. Every query fans out to all shards, but this is parallelized and predictable. Term partitioning creates hotspots on popular terms and is rarely used in production. Google, Elasticsearch, and Solr all use document partitioning.
How does near-real-time search work?
New documents are written to an in-memory buffer and flushed to a searchable index segment at a configurable refresh interval (Elasticsearch default: 1 second). Documents are not instantly searchable but become searchable within the refresh interval. This balances indexing throughput with search freshness.
What is the scatter-gather pattern in distributed search?
The coordinator sends the query to all shards in parallel (scatter). Each shard returns its local top-K results. The coordinator merges all results into a single sorted list (gather) and returns the global top-K. This pattern parallelizes search across hundreds of shards while maintaining consistent result quality.
How do I handle search relevance beyond keyword matching?
Use a two-stage pipeline: BM25 for fast initial retrieval of the top 1,000 candidates, then ML re-ranking (BERT cross-encoder, LambdaMART) for semantic understanding, personalization, and freshness boosting. The ML model runs on the merged candidate set, not on every document, keeping latency under 200ms.
How much RAM does a search cluster need?
Elasticsearch performs best when the most frequently accessed index segments fit in the OS file system cache. A common guideline is 50% of node RAM for Elasticsearch heap (max 32 GB for compressed oops) and 50% for OS cache. A 64 GB RAM node provides 32 GB heap + 32 GB file cache—sufficient for a 3 TB shard.
When should I use Elasticsearch vs PostgreSQL full-text search?
PostgreSQL full-text search is sufficient for datasets under ~10 million documents with simple search requirements. Elasticsearch is needed when you have 100M+ documents, require faceted search, need sub-100ms latency across complex queries, or need near-real-time indexing. The crossover is around 10–50 million documents depending on query complexity.
How do I design autocomplete for a search engine?
Use a separate Elasticsearch index with edge n-gram tokenization. The term "elasticsearch" is indexed as prefix tokens ["e", "el", "ela", ...]. As the user types, prefix queries match against these tokens and return suggestions in under 10ms. Serve from dedicated nodes with high RAM for caching the autocomplete index entirely in memory.
TL;DR
A distributed search engine separates into two pipelines: indexing (ingest, tokenize, build inverted index) and serving (scatter query to all shards, gather and merge results). The inverted index maps terms to document lists—the core data structure that makes search O(1) instead of O(N). Use document partitioning (shard by document, fan out to all shards per query)—this is what Google, Elasticsearch, and Solr use. Rank with BM25 for fast initial scoring, then ML re-ranking (BERT, LambdaMART) for the top 1,000 candidates. Achieve near-real-time search with 1-second refresh intervals. Replicate each shard (primary + 2 replicas) for fault tolerance and read scaling. At 10B documents and 50,000 QPS: 50 shards × 3 copies = 150 shard instances, each handling ~333 QPS with 3 TB of data. Elasticsearch is the standard reference—it handles distributed coordination while Apache Lucene handles the search mechanics underneath.