How do you implement ANN (approximate nearest neighbor) for vector search?

Approximate nearest neighbor (ANN) is the family of indexing methods that makes vector search fast at large scale by trading a little accuracy for big wins in latency and cost. Instead of scanning every vector to find the k nearest neighbors, you route the query to promising regions of the space, scan only a small candidate set, then optionally re-rank with exact math. This is the backbone of semantic search, retrieval-augmented generation, recommendation, and deduplication in modern distributed systems and scalable architectures.

Why It Matters

Exact search is linear in both the number of vectors and their dimension, which becomes slow and expensive once you have millions or billions of embeddings. ANN cuts the work by orders of magnitude while maintaining high recall at k. You get user-visible wins: sub-second responses, stable tail latency, and lower memory pressure. In a system design interview, being able to explain when to pick graph-based indexes like HNSW versus inverted files with product quantization, how to size shards, and how to measure recall will set you apart.

How It Works Step by Step

  1. Pick an embedding and distance metric. Text and image models often produce unit-normalized vectors. If you L2-normalize, inner product becomes equivalent to cosine similarity, as the sketch below shows. For dense retrieval tasks, cosine or inner product is typical. For some metric learning cases you may keep raw vectors and use L2 distance.
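
A minimal NumPy sketch of the normalization point, assuming embeddings arrive as float32 rows; after L2 normalization, inner product and cosine similarity rank candidates identically:

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so inner product equals cosine."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)  # guard against zero vectors

rng = np.random.default_rng(0)
docs = l2_normalize(rng.normal(size=(1000, 768)).astype(np.float32))
query = l2_normalize(rng.normal(size=(1, 768)).astype(np.float32))

scores = docs @ query.T                    # inner product == cosine on unit vectors
top10 = np.argsort(-scores.ravel())[:10]   # exact 10 nearest neighbors
```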

  2. Prepare data and training samples. Gather a representative subset to train centroids or codebooks. Remove obvious outliers and consider whitening or normalizing features to reduce skew. Keep a holdout set to measure recall and latency.

  3. Choose an index family:

    • Graph-based routing such as HNSW builds a navigable small-world graph. Queries walk from an entry point toward closer neighbors, expanding a beam of candidates.
    • Coarse quantization such as IVF partitions the space into nlist centroids using k-means. A query probes only the closest lists.
    • Compression such as PQ or OPQ encodes vectors into short codes, so you can scan many candidates from memory with low bandwidth.
    • Hashing such as LSH puts similar vectors into the same buckets. It is less common when high recall matters, but can be useful at very high throughput.
    • Disk-oriented methods such as DiskANN or SPANN stage large graphs or lists on fast SSD and cache the hot parts in memory.
  4. Build the index. For HNSW, tune M and efConstruction, which control connectivity and build time. For IVF, pick nlist from a few hundred to tens of thousands and train centroids. For IVFPQ, also choose the number of sub-quantizers and bits per code. Index building is a batch step, so snapshot the artifact and replicate it; see the build sketch below.
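
A hedged build sketch using FAISS, one common open source choice; the dimensions, sample sizes, and parameter values below are illustrative assumptions, not recommendations:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 768
xb = np.random.rand(200_000, d).astype(np.float32)    # stand-in corpus
xtrain = xb[np.random.choice(len(xb), 100_000, replace=False)]

# HNSW: M controls graph connectivity; efConstruction trades build time for quality.
hnsw = faiss.IndexHNSWFlat(d, 32)        # M = 32
hnsw.hnsw.efConstruction = 200
hnsw.add(xb)                             # HNSW needs no separate training step

# IVFPQ: nlist coarse centroids; m sub-quantizers at 8 bits per code each.
nlist, m = 4096, 64                      # m must divide d (768 / 64 = 12)
quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
ivfpq.train(xtrain)                      # learn centroids and PQ codebooks
ivfpq.add(xb)

faiss.write_index(ivfpq, "ivfpq.index")  # snapshot the batch-built artifact
```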

  5. Serve with a two-stage plan. Stage one retrieves a few hundred to a few thousand candidates using ANN, tuned with query parameters such as efSearch in HNSW or nprobe in IVF. Stage two re-ranks candidates using exact math on the original vectors, which restores quality while keeping latency low; see the sketch below.
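
A sketch of the two-stage plan, assuming an IVF-style FAISS index, unit-normalized vectors scored by inner product, and that the original full-precision vectors are kept in memory for re-ranking:

```python
import numpy as np

def two_stage_search(index, originals, query, k=10, n_candidates=1000):
    """Stage 1: cheap ANN candidate retrieval. Stage 2: exact re-rank."""
    index.nprobe = 64                        # IVF probe budget (hnsw.efSearch for HNSW)
    _, ids = index.search(query, n_candidates)
    ids = ids[0][ids[0] >= 0]                # FAISS pads missing results with -1

    scores = originals[ids] @ query[0]       # exact inner product on raw vectors
    order = np.argsort(-scores)[:k]
    return ids[order], scores[order]
```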

  6. Support metadata filters. Many queries carry constraints such as language equals English or tenant equals A. You can pre-filter by adding attribute-aware shards, or post-filter by scanning extra candidates and discarding mismatches, as sketched below. Bitmap indexes and Bloom filters help keep filtering fast.
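
A post-filtering sketch; the metadata store and the ten-times over-fetch factor are illustrative assumptions:

```python
def filtered_search(index, metadata, query, k=10, tenant="A"):
    """Post-filter: over-fetch candidates, then drop metadata mismatches."""
    budget = 10 * k                          # expanded candidate budget (assumption)
    _, ids = index.search(query, budget)
    hits = [i for i in ids[0] if i >= 0 and metadata[i]["tenant"] == tenant]
    return hits[:k]                          # may under-fill on very selective filters
```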

  7. Handle updates and deletes. HNSW supports online inserts. IVF can assign new vectors to the nearest list and optionally trigger periodic re-clustering. For deletes, mark tombstones and compact during maintenance windows, as sketched below. Keep an append-only log so you can rebuild cleanly.
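
A library-agnostic tombstone sketch; real engines expose their own hooks (for example, hnswlib's mark_deleted), so treat this wrapper as a minimal illustration:

```python
class TombstoneIndex:
    """Wrap an ANN index and hide deleted ids until the next compaction."""

    def __init__(self, index):
        self.index = index
        self.tombstones = set()

    def delete(self, vec_id):
        self.tombstones.add(vec_id)          # logical delete; no index mutation

    def search(self, query, k):
        _, ids = self.index.search(query, 2 * k)   # over-fetch to absorb tombstones
        live = [i for i in ids[0] if i >= 0 and i not in self.tombstones]
        return live[:k]
```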

  8. Scale out. Shard by vector id or by partition id. Each shard holds a subset of vectors and a copy of the trained structures needed for routing. A front end fans the query out to the relevant shards and merges results, as sketched below. Memory-map indexes for fast cold start, and warm the hottest lists or entry points at process start.
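
A scatter-gather sketch, assuming hypothetical shard clients whose search() returns (score, id) pairs with higher scores better:

```python
import heapq

def fanout_search(shards, query, k=10):
    """Scatter the query to every shard, then merge per-shard top-k globally."""
    partials = []
    for shard in shards:                     # in production these are parallel RPCs
        partials.extend(shard.search(query, k))
    return heapq.nlargest(k, partials, key=lambda pair: pair[0])
```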

  9. Measure and tune. Track recall at k on a golden set, average and p95 latency, and tail behavior when filters are used. Sweep efSearch or nprobe to find the best quality-cost curve, as sketched below. Watch cache hit rates for disk-based indexes. Automate canary analysis before shipping new codebooks or centroids.
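
A recall-at-k measurement sketch, assuming exact neighbors for a golden query set were computed offline by brute force:

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the true top-k neighbors the ANN index recovered."""
    hits = sum(len(set(a[:k]) & set(e[:k])) for a, e in zip(approx_ids, exact_ids))
    return hits / (k * len(exact_ids))

def sweep_nprobe(index, queries, exact_ids, k=10):
    """Trace the quality/cost curve by sweeping the IVF probe budget."""
    for nprobe in (1, 4, 16, 64, 256):
        index.nprobe = nprobe
        _, ids = index.search(queries, k)
        print(nprobe, recall_at_k(ids, exact_ids, k))
```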

Real World Example

Imagine an e-commerce search that shows similar products when a user opens an item page. You store one vector per product image and a separate vector per text description. At request time you embed the active product, query an HNSW index to get one thousand image candidates in one to two milliseconds, query an IVF index over the text embeddings for one thousand text candidates, union both sets, apply brand and price filters, then re-rank the remaining few hundred with exact cosine on the original in-memory vectors.

The final top k is ready in under one hundred milliseconds, even with tens of millions of items. The same pipeline works for retrieval-augmented generation, where the query is the user's question and the corpus contains chunk embeddings.

Common Pitfalls and Trade-offs

Using the wrong metric. Cosine and inner product behave differently from L2 when vectors are not normalized. If you forget to normalize embeddings from a model that expects it, recall drops without an obvious cause.

Skipping re-ranking. Pure ANN results can be slightly off. Always re-rank with exact math on the raw vectors, or at least on higher precision residuals, to restore accuracy.

Over-aggressive compression. PQ with too few bits per sub-vector saves memory but can kill quality. Start with more bits, measure recall at k, then dial down.

Unbalanced partitions. IVF centroids trained on a biased sample produce heavily skewed lists. Some lists become hot and blow up tail latency. Retrain on a better sample or use OPQ to rotate the space.

Parameter drift. An efSearch or nprobe value set during load testing may no longer be optimal after the corpus grows. Schedule periodic sweeps and automate guardrails.

Poor filter strategy. If you post-filter after ANN without increasing the candidate budget, you return too few valid results. Either pre-filter with shards or expand the initial candidate count.

Blindness to updates. Indexes that cannot absorb real-time writes need a write buffer and a merge process. Without one, users do not see fresh content and trust erodes.

Lack of ground truth. You need a golden set with exact neighbors to measure recall and regression-test new embeddings or indexes. Without it, you are tuning in the dark.

Interview Tip

A favorite prompt goes like this: you have five hundred million vectors of dimension seven hundred sixty-eight, the target recall at ten is ninety-five percent, p95 latency must stay below one hundred milliseconds, and writes arrive at fifty per second. Walk through your design. A strong answer picks HNSW for high recall and online inserts, or IVFPQ for tighter memory, explains two-stage retrieval and re-ranking, shows how to shard and measure recall, and calls out how to handle metadata filters without breaking latency.

Key Takeaways

  • ANN is routing plus pruning, then optional re-ranking, to approximate k-nearest-neighbor search quickly.

  • Choose among graph-based, inverted-file, and compressed methods based on recall, memory, and update needs.

  • Always measure recall at k on a golden set and tune query parameters like efSearch or nprobe.

  • Plan for filters, online updates, and index rebuilds as first-class citizens.

  • Separate fast candidate retrieval from exact re ranking to get both speed and quality.

Comparison of ANN Methods

| Method | Core Idea | Build Time | Memory | Update Speed | Typical Recall | Best For | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Exact (brute force) | Scan all vectors and compute exact distances | None | High | Instant | 100% | Small corpora, re-ranking | Perfect accuracy but very slow at scale |
| HNSW | Navigable small-world graph search | Medium–High | High | Good for inserts | Very High | Interactive, high-recall search | Fast query time, moderate memory overhead |
| IVF Flat | Partition into centroids and scan top clusters | Moderate | Moderate | Moderate | High | Balanced recall and performance | Great trade-off between speed and simplicity |
| IVFPQ | IVF + product quantization compression | High | Low | Moderate | High (with re-rank) | Billion-scale corpora | Compact and memory-efficient |
| Annoy | Random projection trees for approximate search | Low | Low–Moderate | Fast inserts | Medium–High | Fast builds and lightweight apps | Simpler than HNSW, lower recall ceiling |
| ScaNN | Partitioning with asymmetric hashing | Moderate | Moderate | Batch | High | Text and semantic search | Excellent for transformer-based embeddings |
| DiskANN | Graph search with SSD-backed storage | High | Very Low | Batch | High | Billion-scale on commodity servers | Good cost/scale balance, sensitive to SSD latency |

FAQs

Q1. Which distance metric should I use for text embeddings?

Most text models prefer cosine or inner product on normalized vectors. If you normalize, cosine and inner product are equivalent. Use L2 only if the model or evaluation calls for it.

Q2. How many clusters or lists should I use in IVF?

As a rule of thumb, start with nlist near the square root of the corpus size (about 10,000 lists for 100 million vectors), train the centroids on a representative subset, then tune nprobe to hit the recall target.

Q3. What is a good recall target for interactive apps?

Ninety-five percent recall at ten is a common goal. For strict quality needs, push to ninety-eight percent and rely on re-ranking.

Q4. How do I support metadata filters with ANN?

Sharding by attribute or maintaining attribute bitmaps per list works well. If you post-filter, raise the candidate budget so enough valid results remain after filtering.

Q5. Can I store the index on disk?

Yes. DiskANN and similar methods keep large structures on SSD. You still want a warm cache for entry points and popular lists to keep tail latency low.

Q6. How do I evaluate an index change safely?

Hold out a golden query set. Measure recall at k and latency before and after. Run a canary release with guardrails on recall and tail latency, then promote when stable.

Further Learning

If you want to master large-scale vector search and ANN systems, the highly rated DesignGurus.io system design courses are a natural next step.
