01. Why Vector Databases Are Their Own Thing
Most candidates can name "Pinecone" and stop there. The interview goes deeper: what is an embedding, actually? Why is vector search approximate? What's the recall-vs-latency tradeoff in ANN algorithms, and how do you tune it? When does a dedicated vector store make sense versus pgvector inside Postgres? How does this all compose with keyword search and the rest of the system?
The depth here lives in three places. First, the conceptual foundation: embeddings are how text, images, and other content get translated into a geometric space where similarity becomes distance. Most candidates use this without understanding it. Second, approximation: exact nearest-neighbor search is too slow at scale, so production systems use approximate algorithms (HNSW, IVF) that trade recall for speed. The trade space matters. Third, architectural integration: vector search is rarely standalone; it sits inside a larger retrieval pipeline alongside keyword search, primary stores, and downstream LLM applications.
This page covers all three. Specific products are mentioned but they're not the point. The point is the underlying decisions, because the products will change every year and the decisions won't.
The Senior Move
The senior signal in vector database interviews isn't naming Pinecone. It's recognizing that vector search is approximate by design, that the embedding model choice matters more than the database choice, and that pgvector inside Postgres is the right starting point for most products. Naming these positions explicitly is what separates senior candidates from "we'd add a vector database" candidates.
02. What an Embedding Actually Is
Before talking about vector databases, you need to know what they're storing. A vector database stores embeddings: numerical representations of content (text, images, audio) in a high-dimensional space.
The conceptual leap
Take a sentence: "Tokyo restaurants are great." Run it through an embedding model. The model produces a list of numbers, often 768 or 1024 of them, that represent the sentence's meaning in a learned geometric space. A different sentence with similar meaning, like "Restaurants in Tokyo are excellent," produces a similar list of numbers — close to the first one in the geometric space. A sentence about something different, like "Quantum mechanics is complex," produces numbers far from the Tokyo sentences.
That's the conceptual leap: meaning gets encoded as geometric position. Similar things are close. Different things are far. Once you have that, "find similar content" becomes "find nearby vectors."
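A minimal sketch of that leap in code, using the open-source sentence-transformers library (the model name is just one small example model; any embedding API returns a vector of floats the same way):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # small open-source model, 384 dims
v1, v2, v3 = model.encode([
    "Tokyo restaurants are great.",
    "Restaurants in Tokyo are excellent.",
    "Quantum mechanics is complex.",
])

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(v1, v2))  # high: the two Tokyo sentences sit close together
print(cos(v1, v3))  # low: the physics sentence sits far away
```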
Where embeddings come from
Embedding models are neural networks trained to produce these geometric representations. Modern options:
- OpenAI embeddings. text-embedding-3-small and text-embedding-3-large. Hosted, paid, high quality. The default for most production systems.
- Cohere embeddings. Multilingual support, often outperforms on non-English content. Hosted, paid.
- Open-source models. sentence-transformers (BERT-based), BGE, E5, Nomic Embed. Run them yourself, no per-call cost, full control over the pipeline. Quality has improved dramatically; the gap with paid options is small for most use cases.
- Domain-specific models. Code embeddings (CodeBERT), legal text embeddings, biomedical embeddings. Specialized models that outperform general-purpose ones on their domain.
The choice of embedding model matters more than the choice of vector database. A great vector database with a poor embedding model will retrieve poorly. A modest vector database with great embeddings will retrieve well. Most teams obsess over the database; they should obsess over the embeddings.
What dimensions and distance actually mean
An embedding has dimensions: how many numbers it contains. Common sizes: 384, 768, 1024, 1536, 3072. Higher dimensions can capture more nuance but cost more to store and search. Most production embeddings are 768 or 1536.
Distance between two embeddings measures dissimilarity. Three common functions:
- Cosine similarity. Measures angle between vectors, normalized for magnitude. The default for text embeddings. Range: -1 (opposite) to 1 (identical), with 0 being unrelated.
- Euclidean distance (L2). Straight-line distance. Sometimes used for image embeddings.
- Dot product. Cheaper to compute than cosine. Used when embeddings are pre-normalized; equivalent to cosine in that case.
For most applications, cosine similarity is the right default. The choice rarely matters dramatically; the embedding model matters far more.
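A quick numpy check of that equivalence, on random unit vectors (the 768 is arbitrary):

```python
import numpy as np

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
a = rng.standard_normal(768); a /= np.linalg.norm(a)   # unit-normalize
b = rng.standard_normal(768); b /= np.linalg.norm(b)

# For unit vectors, the dot product equals cosine similarity, and squared
# L2 distance is a monotone function of it: ||a - b||^2 = 2 - 2*(a.b).
assert abs(float(a @ b) - cosine_sim(a, b)) < 1e-9
assert abs(np.sum((a - b) ** 2) - (2 - 2 * float(a @ b))) < 1e-9
```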
The embedding model determines what's similar to what. The vector database stores and searches efficiently. Get the embedding right; the database is downstream of that decision.
03. Vector Similarity and Approximate Nearest Neighbor
Once you have embeddings, the core query is "find the K vectors most similar to this query vector." This is k-nearest-neighbor search. The naive approach: compare the query vector to every vector in the database, return the K closest. This is exact but slow: O(N) per query, where N is the number of vectors. At a million vectors and 768 dimensions, every query performs on the order of 768 million floating-point operations. Too slow at scale.
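The naive exact search in a few lines of numpy, assuming unit-normalized embeddings so a dot product is cosine similarity:

```python
import numpy as np

def exact_top_k(corpus, query, k=10):
    # corpus: (N, d) unit-normalized embeddings; query: (d,) unit-normalized.
    # One dot product per stored vector: O(N*d) work on every query.
    sims = corpus @ query
    top = np.argpartition(-sims, k)[:k]   # k best candidates, unordered
    return top[np.argsort(-sims[top])]    # ordered best-first
```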
[Figure: Vector Similarity in High-Dimensional Space. A 2D projection of vector space. Documents cluster by meaning. The query "good places to eat in japan" lands close to the Tokyo restaurants cluster even though "japan" never appeared in those documents. Semantic similarity, not keyword match.]
Why approximation is necessary
Exact nearest-neighbor search has a real cost. For a corpus of 10 million vectors at 768 dimensions, every query requires roughly 10 million × 768 = 7.7 billion floating-point operations to find the exact top-K. That's hundreds of milliseconds even on fast hardware. At 100 million vectors, it's seconds per query. Production search needs to be tens of milliseconds.
The fix is approximate nearest neighbor (ANN) search: algorithms that return the top-K with high probability without scanning every vector. Modern ANN algorithms can search a 100M-vector corpus in a few milliseconds with 95%+ recall (the fraction of true nearest neighbors actually returned). The ~5% you miss are usually marginal results that wouldn't have changed user experience.
The interview move: when asked "how does vector search work at scale?", name the approximation explicitly. "Exact search is O(N) per query and too slow above a million or so vectors. Production systems use approximate nearest neighbor, typically HNSW, which trades a few percent of recall for 100x to 1000x speedup. The recall is tunable through algorithm parameters."
04. ANN Algorithms: HNSW, IVF, and the Trade Space
Three algorithm families cover almost every production vector store. Each has a different position on the recall-vs-latency-vs-memory trade space. The right choice depends on which dimension matters most for your workload.
HNSW
The modern default
Hierarchical Navigable Small World. Builds a multi-layer graph where vectors are connected to their nearest neighbors. Search traverses the graph from coarse to fine, finding approximate nearest neighbors in logarithmic time. High recall (95-99%) at low latency.
What it trades: memory. Stores the graph structure alongside the vectors, often 1.5-3x the raw embedding storage. The default in pgvector, Pinecone, Weaviate, Qdrant. If you don't have a specific reason to choose otherwise, use HNSW.
IVF
When memory matters
Inverted File. Partitions the vector space into clusters using k-means; at query time, only search the few clusters closest to the query. Lower memory than HNSW because no graph structure is stored. Lower recall at the same latency, but tunable through how many clusters you search.
What it trades: recall vs latency. Searching more clusters increases recall but slows queries. Common in FAISS-based stacks and very large vector stores where the memory savings matter. Often combined with quantization (PQ, SQ) for further memory reduction.
Flat (Brute Force)
Small corpus, perfect recall
No index structure. Compare the query to every vector. 100% recall (it's exact). Linear in corpus size, so only practical up to roughly a million vectors depending on dimensions and hardware.
What it trades: latency at scale. The right choice for small corpora (under ~100K vectors) where the overhead of building an index isn't worth it. Modern hardware (especially with SIMD instructions) makes flat search competitive longer than people expect.
The recall-latency-memory trade space
Three dimensions you're trading off. Pick which two matter most:
- Recall. The fraction of true nearest neighbors actually returned. Higher is better. 95% recall means the system returns 95 of the actual top 100 nearest neighbors; the other 5 are missed.
- Latency. How long a single query takes. Lower is better. Production targets are usually 10-50ms for vector search.
- Memory. How much RAM the index occupies. Lower is better, but most ANN indexes have to fit in memory for fast queries.
The fundamental relationship: you can pick any two. Want high recall and low latency? Pay memory. Want low memory and high recall? Pay latency. Want low memory and low latency? Pay recall. Algorithms differ in how favorably they sit on this trade curve, but the curve itself is real.
Tuning HNSW
Two parameters dominate HNSW behavior:
- ef_construction (build-time): how thoroughly the graph is built. Higher values produce better graphs (better recall at query time) but slower indexing. Common range: 100-400.
- ef_search (query-time): how many candidates to consider during search. Higher values produce better recall but slower queries. Common range: 50-200. Tunable per-query, which is useful: high ef_search for important queries, lower for bulk operations.
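A sketch of those two knobs using FAISS (the faiss-cpu package), with recall measured against exact search. The random corpus and parameter values are illustrative, not recommendations; real tuning needs your own data:

```python
import numpy as np
import faiss

d, n, k = 768, 100_000, 10
rng = np.random.default_rng(0)
corpus = rng.standard_normal((n, d)).astype(np.float32)
queries = rng.standard_normal((100, d)).astype(np.float32)

# Ground truth from exact (flat) search, used to measure recall.
flat = faiss.IndexFlatL2(d)
flat.add(corpus)
_, true_ids = flat.search(queries, k)

# HNSW: efConstruction trades build time for graph quality;
# efSearch trades query latency for recall.
hnsw = faiss.IndexHNSWFlat(d, 32)   # 32 = M, graph neighbors per node
hnsw.hnsw.efConstruction = 200      # must be set before add()
hnsw.add(corpus)
hnsw.hnsw.efSearch = 100            # tunable at query time
_, ann_ids = hnsw.search(queries, k)

recall = np.mean([len(set(a) & set(t)) / k for a, t in zip(ann_ids, true_ids)])
print(f"recall@{k} with efSearch=100: {recall:.3f}")
```

Sweeping efSearch and re-running the last two lines traces out the recall-latency curve for your corpus directly.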
The interview move: when asked about ANN tuning, name the trade space and at least one parameter. "We'd use HNSW with ef_search around 100 as the default; tune up for higher recall on important queries, tune down when latency dominates. The actual values are workload-specific and need empirical evaluation."
Quantization, briefly
Production vector stores often quantize embeddings to reduce memory: storing 8-bit integers instead of 32-bit floats (4x reduction), or using product quantization (PQ) for even more aggressive compression. The recall cost is usually 1-3% in exchange for 4-16x memory reduction. At scale (100M+ vectors), quantization is the difference between affordable and unaffordable. Most prep material doesn't mention this; naming it briefly is a depth signal.
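A sketch of the simplest form, per-dimension scalar quantization to int8, to show where the 4x comes from (product quantization is considerably more involved):

```python
import numpy as np

def quantize_int8(embs):
    # embs: (N, d) float32. Symmetric per-dimension scaling into [-127, 127].
    scale = np.abs(embs).max(axis=0) / 127.0 + 1e-12
    return np.round(embs / scale).astype(np.int8), scale

def dequantize(q, scale):
    # Approximate reconstruction; the rounding error is the recall cost.
    return q.astype(np.float32) * scale

embs = np.random.randn(1000, 768).astype(np.float32)
q, scale = quantize_int8(embs)
print(q.nbytes / embs.nbytes)   # 0.25: the 4x memory reduction
```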
05. The Retrieval Pipeline
Vector search rarely operates in isolation. It sits inside a larger pipeline: embed at ingestion time, store in a vector database, embed queries, search, return top-K. The diagram below shows the full flow.
[Figure: The Retrieval Pipeline, End to End. Ingestion runs offline: chunk documents, generate embeddings, store. Queries run online: embed the query with the same model, ANN search, optionally rerank, return. The model match between the two paths is critical.]
Ingestion: chunking and embedding
Documents go through two steps before they reach the vector store:
- Chunking. Break documents into pieces small enough to embed effectively. Embedding models have input token limits (ranging from 512 tokens for many open-source models to 8K or more for hosted ones); long documents need to be split. Common chunk sizes: 256-1024 tokens with overlap. The chunking strategy matters: too small and you lose context, too large and embeddings dilute.
- Embedding. Run each chunk through the embedding model. Store the resulting vector alongside metadata (document ID, chunk position, source URL, etc.) in the vector store.
Chunking is more art than science. The "right" chunk size depends on the content type, the embedding model, and the queries you expect. Most teams iterate: try a chunking strategy, evaluate retrieval quality, adjust. Recursive chunking (split on paragraph boundaries, then sentence boundaries, falling back to fixed sizes with overlap) is a reasonable default.
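A deliberately naive fixed-size chunker with overlap, counting words as a stand-in for tokens; a production pipeline would count with the embedding model's tokenizer and split on structural boundaries first:

```python
def chunk_words(text, size=512, overlap=64):
    # Each chunk shares `overlap` words with its neighbor so context survives
    # the split point. The final chunk may be shorter than `size`.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]
```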
Query: embed, search, optionally rerank
At query time, the user's query goes through the same embedding model used for ingestion. The vector store does ANN search and returns the top-K candidates. Many production systems then rerank: a more expensive model scores the candidates against the query for final ordering.
The reranking step is increasingly important. Initial retrieval (ANN) optimizes for recall: get the right candidates into the top-K. Reranking optimizes for precision: order the top-K so the best ones are first. Cross-encoder rerankers (like Cohere Rerank) score query-document pairs directly and dramatically improve ranking quality at modest cost.
The model-match invariant
The same embedding model must be used for both ingestion and query. Different models produce different geometric spaces; query embeddings from one model and document embeddings from another will be in incompatible spaces, and the search results will be garbage. Changing your embedding model means re-embedding the entire corpus. This is one of the highest-risk operations in a vector retrieval system; plan for it explicitly.
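One cheap guard is to record the model name alongside the index and assert it on the query path. A sketch assuming the OpenAI Python SDK; the stored-model constant is illustrative (in practice it would live in the index metadata):

```python
INDEX_EMBED_MODEL = "text-embedding-3-small"   # recorded when the corpus was embedded

def embed_query(client, text, model="text-embedding-3-small"):
    if model != INDEX_EMBED_MODEL:
        raise ValueError(
            f"query model {model!r} != index model {INDEX_EMBED_MODEL!r}: "
            "the embeddings would land in an incompatible space; "
            "re-embed the corpus before switching models"
        )
    return client.embeddings.create(model=model, input=text).data[0].embedding
```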
06. pgvector vs Dedicated Vector Stores
The most common architecture decision in vector databases: do you use pgvector inside your existing Postgres instance, or do you stand up a dedicated vector store like Pinecone, Weaviate, or Qdrant? The honest 2026 answer is "pgvector first, dedicated when scale demands."
pgvector
The 2026 default
A Postgres extension that adds vector storage and HNSW indexing. You get embeddings stored next to your relational data, joins between vectors and structured fields, and one less system to operate. The quality of its HNSW implementation has improved substantially; for most workloads under 10M vectors, pgvector is competitive with dedicated stores.
When to use: the default for most products. Up to roughly 10M vectors, pgvector handles it well on a reasonable Postgres instance. The ability to filter on structured fields via SQL is a real advantage over many dedicated stores.
Dedicated Vector Stores
Pinecone, Weaviate, Qdrant, Milvus
Purpose-built systems optimized for vector search at scale. Better performance at very large corpora (100M+ vectors). Specialized features: hybrid scoring, multi-vector documents, advanced filtering. Operationally heavier or, in Pinecone's case, fully managed at higher cost.
When to use: scale beyond what pgvector handles cleanly. Specific feature needs (advanced filtering, multi-vector, sparse-dense hybrid). When the team has the operational capacity for a dedicated system or budget for a managed one.
Why pgvector usually wins for new projects
Three reasons most products should start with pgvector:
- Operational simplicity. One database to operate, one set of backups, one set of credentials, one query interface. Adding a dedicated vector store doubles the operational surface for marginal gain at small to moderate scale.
- Joins with structured data. Most retrieval queries have filters: "find documents similar to this query, but only ones owned by this user, only published in 2026, only in English." Postgres handles this naturally through SQL (see the sketch after this list). Many dedicated vector stores treat filtering as an afterthought, applied post-search, which can hurt result quality.
- Transactional consistency. Inserting a document and its embedding in one transaction is trivial in Postgres. Coordinating two systems (primary store plus dedicated vector store) requires the same patterns as search indexing: CDC pipelines, eventual consistency, and sync drift to manage.
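A minimal sketch of the filtering advantage with pgvector and psycopg 3; the table and column names are made up for illustration, and the query vector is a placeholder:

```python
import psycopg  # psycopg 3; assumes Postgres with the pgvector extension available

conn = psycopg.connect("dbname=app", autocommit=True)
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id        bigserial PRIMARY KEY,
        doc_id    bigint,
        lang      text,
        body      text,
        embedding vector(1536)
    )
""")
conn.execute(
    "CREATE INDEX IF NOT EXISTS chunks_hnsw "
    "ON chunks USING hnsw (embedding vector_cosine_ops)"
)

query_vec = [0.0] * 1536   # stand-in for the embedded user query
# The point: ANN search and structured filtering in one SQL statement.
# <=> is pgvector's cosine-distance operator.
rows = conn.execute(
    """
    SELECT id, body
    FROM chunks
    WHERE lang = %s
    ORDER BY embedding <=> %s::vector
    LIMIT 10
    """,
    ("en", str(query_vec)),
).fetchall()
```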
When dedicated stores actually win
Three scenarios where moving beyond pgvector makes sense:
- Scale. Beyond ~10-50M vectors, pgvector can struggle on a single Postgres instance. Dedicated stores are designed for sharding from day one.
- Specialized features. Multi-vector documents (storing multiple embeddings per document and querying them flexibly), sparse-dense hybrid scoring built-in, learned indexes, GPU acceleration. These come naturally to dedicated stores; pgvector doesn't have them.
- Latency requirements. Dedicated stores often have lower p99 latency at scale because they're not sharing resources with transactional workloads. If your retrieval is in a tight latency budget (sub-20ms), dedicated may be necessary.
The Interview Move
"What vector database would you use?" The strong response: "pgvector by default for new projects. Most workloads under 10M vectors run cleanly in Postgres alongside the primary data, with the bonus of structured filtering via SQL. We'd reach for a dedicated store like Pinecone or Qdrant when scale, specialized features, or strict latency requirements demand it. The honest answer in 2026 is that pgvector handles more cases than the dedicated-store-by-default narrative suggests."
07. RAG: The Canonical Application Pattern
The most common reason to build a vector database in 2026 is RAG: Retrieval-Augmented Generation. The pattern is simple in outline: when a user asks a question, retrieve relevant documents from your corpus, include them as context in the LLM prompt, generate a grounded answer. The vector database is the retrieval engine.
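The outline in code, assuming a retrieve() function built on the retrieval pipeline above and an OpenAI-style chat client; the prompt wording and model name are illustrative:

```python
def answer(client, question, retrieve):
    # retrieve() is the vector (or hybrid) search from earlier sections,
    # assumed to return chunks as dicts with a "body" field.
    chunks = retrieve(question, k=5)
    context = "\n\n".join(c["body"] for c in chunks)
    resp = client.chat.completions.create(
        model="gpt-4o",   # illustrative model name
        messages=[
            {"role": "system",
             "content": "Answer using only the context below and cite it.\n\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```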
Why RAG exists
LLMs have two failure modes that retrieval addresses:
- Knowledge cutoffs. The LLM only knows what was in its training data, which has a cutoff date. Anything newer is invisible. Retrieval brings in current information.
- Hallucination. LLMs confidently make up facts that sound plausible. Grounding the response in retrieved documents (and asking the LLM to cite them) reduces hallucination dramatically.
RAG isn't perfect. It can fail when retrieval misses the relevant document, when the LLM ignores the retrieved context, when the retrieved context is contradictory or low-quality. But it's much better than asking an LLM directly about content it doesn't know.
The pieces of a production RAG system
The basic pattern (retrieve, prompt, generate) is the start. Production RAG systems add several layers:
- Hybrid retrieval. Vector search alone misses exact keyword matches. Production systems combine vector search (semantic) with keyword search (BM25) using reciprocal rank fusion (sketched after this list). Search and indexing covers the hybrid pattern.
- Reranking. Initial retrieval returns top-K candidates; a more expensive cross-encoder reranks them by relevance. The compute cost is justified because reranking quality directly affects answer quality.
- Query rewriting. The user's raw query may not match document phrasing well. Some systems rewrite the query (often via the LLM itself) before retrieval. Multi-query retrieval generates several variations and combines results.
- Citation and grounding. The LLM is prompted to cite retrieved documents. Post-processing verifies the citations are real. Hallucinated citations are a known failure mode worth detecting.
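The fusion step from the hybrid-retrieval bullet is small enough to show whole. This is the standard reciprocal rank fusion formula (k=60 is the conventional constant); the document IDs are made up:

```python
def rrf(result_lists, k=60):
    # Each list is ranked best-first. A document's fused score is the sum
    # of 1/(k + rank) across every list it appears in.
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ids   = ["d3", "d1", "d7"]      # from keyword search
vector_ids = ["d1", "d9", "d3"]      # from ANN search
print(rrf([bm25_ids, vector_ids]))   # d1 and d3 rise to the top
```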
The depth probe
"How would you build a customer support chatbot over our knowledge base?" is a common variant. The strong response covers the full pipeline: embed the knowledge base into pgvector, hybrid retrieval combining BM25 with vector search, rerank the top-K, prompt the LLM with the top results as context, return the answer with citations. Mentioning observability (per-query latency, retrieval recall, LLM token cost) and the model-match invariant is what makes the answer staff-bar.
08. Failure Modes
Failure 01
Embedding model swap without re-indexing
The team upgrades their embedding model from the old version to a new, better one. They update the query path. They forget to re-embed the entire corpus. Document embeddings are now in a different geometric space from query embeddings. Retrieval results become garbage. Users complain. The team realizes the issue and starts a multi-day re-indexing job.
The fix is operational discipline: changing embedding models is a corpus-wide migration, not a config change. Plan for the re-index time and cost up front. Some systems version embeddings explicitly, allowing both old and new to coexist during migration; this adds complexity but prevents the failure entirely.
Failure 02
Chunking that breaks semantic units
Documents get chunked at fixed token boundaries. A code example gets split in the middle. A table gets split between rows. A technical concept gets split between its statement and its explanation. Retrieval returns chunks that look relevant in isolation but lose meaning in context.
The fix is structure-aware chunking: split on natural boundaries (sentences, paragraphs, headers) before falling back to fixed sizes. Use overlap between chunks to preserve context. For structured content (code, tables, lists), keep the unit intact even if it exceeds normal chunk size. This is one of the highest-leverage tuning decisions in RAG systems.
Failure 03
Cardinality and metadata explosion
The team adds fine-grained metadata to every vector: user_id, session_id, exact_timestamp. The metadata index grows huge. Filtering becomes slow because every filter combination is rare. The vector store, optimized for vectors, becomes a slow general-purpose database.
The fix is to think of metadata like you'd think of metrics dimensions (see observability): low cardinality is your friend. Filter on coarse attributes (tenant ID, language, content type), not on per-request identifiers. If you need fine-grained filtering, store the fine-grained data in the primary store and join after retrieval.
Failure 04
Recall drops silently as the corpus grows
The system was tuned for 1M vectors with 95% recall. The corpus grows to 50M. Nobody re-tunes the ANN parameters. Recall silently drops to 80%. Users see worse results but don't know why. The team's monitoring tracks query latency (which is fine) but not recall (which they have no way to measure on production traffic).
The fix is to evaluate recall regularly. Maintain a small held-out evaluation set with known correct answers. Periodically run it against the production index and measure how many true neighbors are returned. When recall drops, retune ANN parameters or repartition. Without this loop, recall regression is invisible until users complain.
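A minimal version of that evaluation loop, assuming search_fn(query_vec, k) returns document IDs and the eval set pairs each query with its known relevant documents:

```python
def recall_at_k(eval_set, search_fn, k=10):
    # eval_set: list of (query_vec, set_of_relevant_doc_ids) pairs.
    hits, total = 0, 0
    for query_vec, relevant_ids in eval_set:
        returned = set(search_fn(query_vec, k))
        hits += len(returned & relevant_ids)
        total += len(relevant_ids)
    return hits / total   # run periodically; alert when this drops
```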
09. How Vector Databases Interact With Other Concepts
- Vector databases × Search and indexing. The hybrid keyword + vector pattern from search and indexing is the dominant 2026 architecture. Keyword search and vector search compose; this page covers the vector half, search-indexing covers how they combine.
- Vector databases × Database selection. Vector store is its own database category in database selection. The choice between pgvector and dedicated stores is a database selection question: scale, features, operational fit. Postgres-as-default applies here too.
- Vector databases × Sharding. Vector indexes shard the same way primary stores do, with the same hot-key risks. Sharding covers the analogous tradeoffs; the partition key for vector data is usually tenant or document type.
- Vector databases × Caching. Hot queries (the same question asked many times) can be cached. Embedding the same query twice produces the same vector; caching the embedding step alone removes most of the per-call LLM cost. Caching covers placement.
- Vector databases × Observability. Vector retrieval has unique metrics: recall (often hard to measure on production traffic), embedding latency, ANN search latency, reranker latency, total token cost for LLM calls. Observability covers the broader pattern; vector workloads need specific instrumentation.
For more cross-concept interactions, see the concepts library hub.
10. Practice Scenarios
Three scenarios. Read the setup. Decide your approach before opening the reveal.
Scenario 01
A startup wants to build a RAG-based customer support chatbot over their 50,000-document knowledge base. They're torn between Pinecone and pgvector. What do you recommend?
Existing stack: Postgres for application data, no current vector infrastructure. Team is small (three engineers). Budget is tight. Documents are technical articles ranging from 200 to 5000 words.
How to think about this
pgvector. The case is overwhelming for this scenario.
Scale. 50K documents at maybe 10 chunks per document averages 500K vectors. pgvector handles this comfortably on a single Postgres instance.
Operational fit. The team already operates Postgres. Adding pgvector is a single extension install; adding Pinecone is a new vendor relationship, new monitoring, new failure modes, new costs.
Cost. pgvector is free (within the existing Postgres). Pinecone is paid per vector and per query; for a startup, this matters.
Feature fit. The team likely wants to filter by document category, date, language. Postgres handles this naturally via SQL on indexed columns. Pinecone supports filtering but it's typically less flexible than SQL.
Strong answer: "pgvector. The scale fits, the operational fit is excellent, structured filtering via SQL is a real advantage for support content, and the cost is essentially zero. Reach for Pinecone only if you outgrow pgvector or need specific features (multi-vector, sparse-dense hybrid) that pgvector doesn't have. The default-Pinecone narrative is wrong for this stage of company."
Scenario 02
Your RAG system's answer quality has been declining for two months. Users report "answers feel less relevant." What do you investigate?
Architecture: pgvector with HNSW, OpenAI embeddings, GPT-4 for generation. Corpus has grown from 100K to 800K documents over six months. No changes to the embedding model or ANN parameters. Query volume has roughly doubled.
How to think about this
Three suspects, in priority order:
1. Recall regression at higher corpus size. ANN parameters tuned for 100K vectors may produce lower recall at 800K. With the same ef_search, the algorithm is exploring a smaller fraction of the graph. Solution: increase ef_search, accept slightly higher latency. Verify with a held-out evaluation set if you have one.
2. Drift in document distribution. The newer documents may cluster differently from older ones. The HNSW graph was built incrementally; if the build parameters were optimized for the early distribution, they may be suboptimal now. Solution: rebuild the index from scratch, possibly with tuned ef_construction.
3. Chunking issues at scale. If newer documents have different structure (longer, different format), the existing chunking strategy may produce worse chunks for them. Audit some failing queries; look at the chunks being retrieved and assess their quality.
The systemic gap. The team has no way to measure retrieval quality on production traffic. Without an evaluation harness, this kind of regression is invisible until users complain. The deeper fix is to set up offline evaluation: a held-out set of queries with known relevant documents, run periodically against the production index.
Strong answer: "Three things to check: ANN recall at the new corpus size, possible HNSW build-quality drift, and chunking issues at the document mix. Increase ef_search as the immediate fix; rebuild the index if drift is real. The deeper problem is no offline evaluation harness — without it, regressions like this are invisible until users complain."
Scenario 03
A team proposes embedding all of their data into a vector database "to make it AI-ready." Should they?
The data: structured records (orders, customers, products), transactional logs, user activity events. The proposal is to embed each record's serialized form and store everything in Pinecone, "so the LLM can search it semantically."
How to think about this
No. The proposal misunderstands what vector search is good for.
Vector search is for unstructured content. Text articles, images, audio, video. Things where "semantic similarity" is the natural query. Embedding a structured record like an order ("user_id: 4729, total: 49.99, status: shipped") into a vector adds no value: the structure is already explicit in the database. SQL handles this kind of data far better than vector search ever could.
"Make it AI-ready" is a fashionable but vague goal. If the actual need is "let the LLM answer questions about our data," the right pattern is RAG over the relevant unstructured content (knowledge base articles, documentation, support tickets) plus tool-calling for structured queries (the LLM generates SQL, executes it against the primary store, gets back results). Embedding everything indiscriminately is expensive, slow, and produces worse results than the targeted approach.
Cost considerations. Embedding hundreds of millions of records is expensive in compute and storage. For data that doesn't benefit from vector search, this is pure cost without payoff.
Strong answer: "Don't do this. Vector search is for unstructured content where semantic similarity is the right query model. Structured data already has the right query model: SQL. The right architecture is RAG over the unstructured content (docs, articles, support content) combined with LLM tool-calling for structured queries (the LLM writes SQL, runs it, uses the results). Embedding everything indiscriminately costs a lot and produces worse results than the targeted approach."
11. Vector Databases FAQ
Pinecone or Weaviate or Qdrant?
Different operational positions. Pinecone is fully managed, easy to start, expensive at scale. Weaviate is open-source with a managed cloud option, supports modular ML integrations (rerankers, embedding modules built in). Qdrant is open-source, written in Rust, often the most performant in benchmarks. For a managed-default with low operational burden, Pinecone. For self-hosted with maximum performance, Qdrant. For a balance with strong feature richness, Weaviate. The choice rarely matters dramatically in interview discussions; the architectural pattern is the same.
What's the difference between embedding dimensions?
Higher dimensions can encode more information but cost more to store and search. Common sizes: 384 (smaller models), 768 (BERT-class), 1024 (larger), 1536 (OpenAI text-embedding-3-small), 3072 (OpenAI text-embedding-3-large). The relationship between dimensions and quality is not linear; a well-trained 768-dim model can outperform a poorly-trained 1536-dim one. Check benchmarks for your domain rather than assuming bigger is better. Many production systems also use truncation or quantization to reduce dimensions post-hoc.
Should I use cosine similarity or Euclidean distance?
For text embeddings, cosine similarity. For images, often Euclidean (L2). The actual right answer depends on how the embedding model was trained: use the distance function the model was optimized for. Most modern text embedding models (OpenAI, Cohere, sentence-transformers) are trained with cosine similarity. If the model documentation says otherwise, follow that. The differences are usually small but consistent.
How do I evaluate vector retrieval quality?
Maintain a held-out evaluation set: a list of (query, expected_relevant_documents) pairs. Run the queries through your retrieval system; measure how many of the expected documents appear in the top-K (this is recall@K). Run this regularly, especially after embedding model changes, ANN parameter changes, or corpus growth. Without an evaluation harness, retrieval regression is invisible until users complain. Tools like RAGAS and Trulens automate parts of this for RAG pipelines.
Do I need to retrain anything when adding documents?
No, but you do need to embed them. Adding documents is a runtime operation: chunk, embed, insert into the vector store. The HNSW or IVF index updates incrementally. The embedding model itself is pre-trained and doesn't change as you add documents. Re-training the embedding model is a separate, much heavier operation that most teams never do (they use a pre-trained model from OpenAI, Cohere, or open-source).
What's the deal with sparse + dense hybrid retrieval?
"Dense" embeddings are the standard vector embeddings discussed throughout this page: every dimension is meaningful, the vector is small (768-3072 dimensions). "Sparse" embeddings are vectors with mostly zero values where each dimension corresponds to a term (like an inverted index, but learned). Sparse embeddings preserve keyword-match behavior; dense embeddings capture semantics. Combining them produces better retrieval than either alone. SPLADE is the canonical sparse model. Some vector stores (Qdrant, Weaviate) support sparse-dense hybrid natively.
How does this work with multimodal content (images, audio, video)?
Same pattern, different embedding model. Multimodal models (CLIP for image-text, ImageBind for many modalities) embed different content types into a shared space. An image and a text caption describing it land near each other. This enables cross-modal retrieval: find images similar to a text query, find videos similar to an image. The vector database doesn't care what the content is; it just stores and searches vectors. The challenge is in the embedding pipeline, not the database.
What about agentic RAG and multi-step retrieval?
Newer pattern that's increasingly important. Instead of single-shot retrieval, the LLM iteratively decides what to retrieve based on partial answers. The agent might search, decide it needs more context, search again with a refined query, then generate the answer. This handles complex queries better than single-shot retrieval but costs more tokens and latency. Tools like LangChain, LlamaIndex, and Haystack provide frameworks for this. Worth knowing the term; the architectural details are usually beyond standard system design interviews.