How long is a system design interview?

Most system design interviews run 45 to 60 minutes. The first 5 to 10 minutes are clarifying requirements, the next 25 to 35 minutes are the actual design and deep-dives, and the last 5 minutes are wrap-up and candidate questions.

How do I prepare for a system design interview in 2026?

Follow a structured approach: learn the fundamentals (caching, sharding, load balancing, replication), master a 4-step framework (clarify, data/API, high-level design, deep dive), practice 8-10 classic problems across four question categories, and do mock interviews with real engineers. Eight weeks of consistent prep is sufficient for most engineers. The 2026 rubric now also requires cost reasoning, operational maturity, and AI-aware design knowledge.

What are the most common system design interview questions in 2026?

Common questions fall into four categories: Classic product designs (Design Twitter, Design YouTube, Design Uber, Design WhatsApp), Infrastructure designs (Rate Limiter, Key-Value Store, URL Shortener, Notification System), AI-adjacent designs (RAG Service, Vector Search, LLM-Powered Recommendation Feed), and Correctness/Operational designs (Payment Processing, Multi-Tenant Analytics, Distributed Job Scheduler).

What do interviewers look for in a system design interview?

Interviewers grade four things: Judgment (clarifying questions, tradeoff defense, committing to choices), Depth (going 3 layers deep on 2-3 components), Operational Maturity (observability, deployment strategy, cost reasoning), and Communication (handling pushback, checking in, treating the interview as a conversation). The weight of each shifts by level: senior loops weight judgment and depth highest; staff and principal shift toward operational maturity and communication.

Do I need distributed-systems experience to pass a system design interview?

No. What you need is conceptual fluency: understanding the patterns, knowing the tradeoffs, and being able to reason through scale and failure modes. Many engineers who have distributed-systems experience still fail because they haven't practiced the interview format specifically. The interview tests your ability to think systematically under time pressure, not your production experience.

What is the best system design interview course or resource?

Start with a free guide like this one for the framework and concepts. For structured practice with worked solutions, Grokking the System Design Interview by DesignGurus.io is a widely recommended course that covers 25+ real interview problems. For deeper reading, Designing Data-Intensive Applications by Martin Kleppmann and System Design Interview by Alex Xu are excellent supplements. The highest-leverage activity, regardless of resources, is mock interviews with engineers who currently interview at your target level.

How long does it take to prepare for a system design interview?

Eight weeks is a reasonable timeline for engineers without prior system design interview experience, dedicating about 1-2 hours per day. Two weeks is enough for engineers who have done system design interviews before and just need a refresher. If you only have a weekend, focus on the framework, the 2026 rubric changes, and one or two practice problems. The single highest-leverage activity in any timeframe is mock interviews.

Are AI and ML system design questions on the interview rubric in 2026?

Yes, increasingly. AI-adjacent questions (vector search, RAG services, LLM-serving infrastructure, recommendation systems with generative components) have moved from niche to mainstream. Even when the question is not AI-specific, surfacing awareness of the AI layer (embeddings, vector databases, semantic caching) is now treated as a seniority signal at many FAANG-tier companies.

LLM & AI Infrastructure: A System Design Concept Deep-Dive

00 Quick Orientation

LLM infrastructure looks like an API client problem and isn't. Calling a hosted model endpoint is the easy part; the hard parts are caching aggressively without serving stale answers, managing prompt templates as versioned artifacts, retrieving the right context to ground responses, budgeting tokens per tenant, and handling provider failures without breaking the experience. Strong system designs treat LLM calls as expensive, slow, and constrained external dependencies, then build the surrounding infrastructure that compensates for those properties.

Three properties of LLM calls dominate the infrastructure decisions. First, latency is high: a typical chat completion takes 1-5 seconds for a meaningful response, and longer generations can take 10-30 seconds. That is roughly three orders of magnitude slower than a database read measured in single-digit milliseconds, which means LLM calls cannot be on the synchronous critical path of latency-sensitive flows without aggressive engineering. Second, cost is non-trivial: at typical scale, LLM spend is a meaningful line item, often the largest variable cost in an AI feature. Cost scales with tokens consumed, which scales with both request volume and prompt size, so prompt engineering has direct margin impact. Third, reliability is provider-dependent and uneven: hosted providers have outages, rate limits, and occasional regressions, so production systems layer fallback rather than trusting any single provider's SLA.

Why this concept earns its own page

LLM infrastructure composes with familiar primitives (caching, rate limiting, message queues, observability) but introduces requirements those primitives weren't designed for: cache hits judged by semantic similarity rather than key equality, rate limits expressed in tokens rather than requests, observability that tracks generation quality alongside latency, and queues that handle multi-second to multi-minute work. The combination is distinctive enough to warrant a dedicated concept page, even though the underlying primitives are not new.

01 Why LLM Infrastructure Is Its Own Category

Before 2024, system design interviews treated AI as an out-of-scope advanced topic, reasonable to mention but not required. By 2026, the expectation has flipped. Even questions that are not nominally about AI ("design a search system," "design a customer support tool," "design a content moderation pipeline") increasingly expect the candidate to surface relevant LLM considerations. Surfacing them well demonstrates current-era awareness; ignoring them entirely signals stale knowledge.

The shift has three drivers. The first is product surface area: features that used to require dedicated ML teams (semantic search, content classification, summarization, translation, generation) are now reachable with a few API calls to a hosted model. Many product organizations now ship LLM features as routine work rather than research projects, which means the infrastructure to serve those features is everyday engineering. The second driver is cost pressure: at scale, naive LLM usage produces alarming bills. The infrastructure decisions that control cost (caching, prompt design, model selection, request batching) become primary engineering work rather than optimizations. The third driver is reliability: as products bet on LLM features for revenue-critical flows, the question of "what happens when the provider is down" becomes a board-level concern, and the infrastructure to handle provider failures becomes load-bearing.

The interview implication: LLM infrastructure shows up as a sub-topic in patterns that previously had no AI dimension. A few examples worth keeping in mind:

Search. Semantic search uses embeddings to retrieve documents by meaning rather than keyword overlap. The query understanding step often calls an LLM. The ranking step may use an LLM as a re-ranker. None of this existed in keyword-search system designs five years ago; it is standard now.
Content moderation. Classification of user content (spam, abuse, policy violations) increasingly uses LLM-based classifiers alongside or instead of traditional ML models. The infrastructure to call these classifiers at content-creation rates is non-trivial.
Customer support. Auto-response generation, ticket triage, knowledge base retrieval, and agent-assist features all rely on LLM calls. Production support tools include the LLM infrastructure as a co-equal architectural component.
Recommendations and personalization. Embedding-based recommendation systems (where user preferences are represented as embeddings, candidate items are pre-embedded, and matching is similarity search) are increasingly common alongside collaborative-filtering classics.
Code and document workflows. Autocomplete, summarization, generation, and editing features in productivity tools are LLM-backed by default in 2026.

The senior-level expectation is not deep ML knowledge. It is the ability to recognize when an LLM is the right tool, to surface the infrastructure implications (latency, cost, reliability, observability), and to design the surrounding system to compensate for the LLM's weaknesses without depending on its perfection.

02 LLM Gateway: The Provider Abstraction Layer

The LLM gateway is the consolidation point for all LLM calls across the application. It sits between application services and the actual model providers (OpenAI, Anthropic, Google, self-hosted models on inference platforms). Treating it as a first-class component, similar to an API gateway in front of microservices, is what turns LLM usage from a collection of ad-hoc API calls into managed infrastructure.

LLM Gateway Architecture

Application services call into a unified LLM gateway rather than each owning their own provider integration. The gateway runs four sequential stages: semantic cache lookup, prompt template resolution, provider and model selection, and the actual call with retry and fallback logic. Side stores supply cache state, versioned prompts, usage metering, and quality logs. The gateway routes to multiple providers with a policy that prefers primary, falls back to secondary on failure, and uses self-hosted models as a last resort for cost or availability reasons.
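The four stages described above can be sketched as a single method. This is a minimal illustration, not a production gateway: the `Gateway` class, its field names, and `fake_provider` are all hypothetical, an exact-match dictionary stands in for the semantic cache, and retry and fallback are left out of stage four.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Gateway:
    templates: Dict[str, str]                    # template_id -> prompt body
    routes: Dict[str, str]                       # template_id -> provider name
    providers: Dict[str, Callable[[str], str]]   # provider name -> call function
    cache: Dict[str, str] = field(default_factory=dict)
    call_log: List[dict] = field(default_factory=list)

    def complete(self, template_id: str, variables: Dict[str, str]) -> str:
        # Stage 1: cache lookup. An exact-match dict stands in for the semantic cache.
        key = template_id + "|" + "|".join(f"{k}={v}" for k, v in sorted(variables.items()))
        if key in self.cache:
            self.call_log.append({"template": template_id, "cache_hit": True})
            return self.cache[key]

        # Stage 2: prompt template resolution.
        prompt = self.templates[template_id].format(**variables)

        # Stage 3: provider selection by routing policy.
        provider_name = self.routes.get(template_id, "default")

        # Stage 4: the call itself (retry and fallback omitted in this sketch).
        response = self.providers[provider_name](prompt)

        self.cache[key] = response
        self.call_log.append(
            {"template": template_id, "cache_hit": False, "provider": provider_name}
        )
        return response


def fake_provider(prompt: str) -> str:
    """Hypothetical stand-in for a hosted model call."""
    return f"echo: {prompt}"


gateway = Gateway(
    templates={"greet": "Say hello to {name}."},
    routes={"greet": "default"},
    providers={"default": fake_provider},
)
first = gateway.complete("greet", {"name": "Ada"})
second = gateway.complete("greet", {"name": "Ada"})  # served from the cache
```

The point of the shape is that application code only ever sees `complete(template_id, variables)`; which provider answered, and whether the cache did, is recorded in the log rather than leaked to the caller.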

What the gateway centralizes

The gateway pattern centralizes concerns that would otherwise duplicate across every application service that calls an LLM:

Provider abstraction. Application services don't know whether the call went to OpenAI, Anthropic, or a self-hosted endpoint. The gateway hides provider-specific request and response formats behind a unified interface, which lets you change providers without touching application code.
Semantic caching. Many LLM responses can be reused for similar (not just identical) queries. The gateway is the natural place to look up the semantic cache before incurring an LLM call.
Prompt template management. Prompts are versioned artifacts. The gateway resolves template references, fills in variables, and tracks which version was used for which response.
Token budgeting and rate limiting. Per-tenant token quotas and per-model rate limits live at the gateway. Without this, runaway costs and provider rate-limit collisions become application bugs that are hard to diagnose.
Observability. Every LLM call generates a record: which template, which model, how many tokens, how long, what the response was, whether the cache hit. The gateway is the consolidation point for this data, which feeds quality monitoring and cost analysis.
Fallback handling. When the primary provider fails, the gateway retries against alternates. Application services see a unified outcome rather than provider-specific failures.

The pattern is closely analogous to an API gateway in front of microservices, with one important difference: an API gateway routes based on URL paths and forwards mostly opaque payloads, while an LLM gateway transforms requests substantially (prompt assembly, model parameters, response parsing) and looks deeply into both request and response. The gateway is more involved in the call than a traditional API gateway, which is why "gateway" is sometimes a slight understatement of what it does.

Build vs buy

Several open-source and managed LLM gateway options exist (LiteLLM, Portkey, Helicone, OpenRouter, Anthropic's Agent SDK abstractions, and others). For early-stage products, these are the right starting point. They handle provider abstraction, basic caching, retry, and observability out of the box. As the product scales and the LLM features become revenue-critical, most teams build in-house gateways for the same reasons they build in-house API gateways: tighter integration with proprietary auth, custom rate-limit policies, deep integration with billing systems, and the ability to evolve the abstraction as new providers and models emerge.

03 Semantic Caching: Hit Rates and Quality

Naive caching for LLM responses uses an exact string match on the prompt as the cache key. This catches the case where the same prompt is sent twice but misses the much more common case where two prompts differ slightly (whitespace, punctuation, paraphrase) but should produce the same answer. Semantic caching solves this by keying the cache on the embedding of the prompt rather than the prompt text, then matching by similarity.

The architectural pattern is multi-tier: try cheap caches first, fall through to the expensive LLM call only when nothing matches. Each tier has different hit rates, different latency, and different quality risks.

Multi-Tier Semantic Cache

A request first checks Tier 1 (exact prompt hash); if missed, it computes the prompt embedding and checks Tier 2 (semantic similarity above threshold); if missed, it falls through to Tier 3, the actual LLM call. The Tier 3 response populates both cache tiers for future requests. The share of traffic resolved at each tier depends heavily on use case but typically splits as 10-20% / 20-40% / 40-70% across Tiers 1, 2, and 3. The Tier 2 similarity threshold is the key tunable: too tight and the cache is rarely useful; too loose and stale or wrong answers leak through.
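The tiered lookup can be sketched in a few lines. Everything named here is an assumption for illustration: `TieredCache`, the letter-frequency `letter_counts` function (a toy stand-in for a real embedding model), the linear scan (a stand-in for a vector index), and the 0.95 threshold.

```python
import hashlib
import math
from typing import Callable, Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TieredCache:
    def __init__(self, embed: Callable[[str], List[float]],
                 llm: Callable[[str], str], threshold: float = 0.95):
        self.embed, self.llm, self.threshold = embed, llm, threshold
        self.exact: Dict[str, str] = {}                      # Tier 1: prompt hash -> response
        self.semantic: List[Tuple[List[float], str]] = []    # Tier 2: (embedding, response)

    def get(self, prompt: str) -> Tuple[str, str]:
        # Tier 1: exact prompt hash.
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self.exact:
            return self.exact[key], "tier1"

        # Tier 2: nearest stored embedding, accepted only above the threshold.
        vector = self.embed(prompt)
        best_score, best_response = 0.0, None
        for stored_vector, stored_response in self.semantic:
            score = cosine(vector, stored_vector)
            if score > best_score:
                best_score, best_response = score, stored_response
        if best_response is not None and best_score >= self.threshold:
            return best_response, "tier2"

        # Tier 3: the actual LLM call, which then populates both cache tiers.
        response = self.llm(prompt)
        self.exact[key] = response
        self.semantic.append((vector, response))
        return response, "tier3"


def letter_counts(text: str) -> List[float]:
    """Toy embedding: counts of the letters a-z. Not a real embedding model."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


llm_calls: List[str] = []

def fake_llm(prompt: str) -> str:
    llm_calls.append(prompt)
    return f"answer #{len(llm_calls)}"


cache = TieredCache(embed=letter_counts, llm=fake_llm)
tiers = [
    cache.get("How do I reset my password?")[1],   # first time: falls through to the LLM
    cache.get("how do i reset my password")[1],    # punctuation and case differ: Tier 2
    cache.get("How do I reset my password?")[1],   # identical text: Tier 1
    cache.get("What is your refund policy?")[1],   # unrelated question: LLM again
]
```

Raising `threshold` toward 1.0 pushes the second request down to Tier 3; lowering it far enough would eventually let the refund question match the password answer, which is the failure mode the threshold exists to prevent.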

Tuning the similarity threshold

Tier 2 introduces a tradeoff that exact-match caching does not have: the threshold for what counts as "similar enough." Set it too tight (cosine similarity > 0.99) and Tier 2 rarely hits; the cache is mostly useless. Set it too loose (similarity > 0.85) and prompts that should produce different answers get the same cached response. The right threshold is workload-specific.

Some workloads tolerate loose thresholds. Customer support tools, where many users ask roughly the same questions, can use thresholds around 0.92-0.95 with good results. Some workloads require tight thresholds. Code generation, where small differences in the prompt can produce dramatically different correct answers, often runs around 0.98-0.99 if it uses semantic caching at all. Some workloads should skip Tier 2 entirely. Highly personalized responses (where the answer depends on user-specific context) can't share cache entries across users without violating privacy or correctness; Tier 1 with per-user keys is fine, Tier 2 across users is not.

Production systems often run multiple semantic cache namespaces, each with its own threshold tuned to its workload. Quality monitoring tracks cache-hit responses against fresh-LLM responses for a sampled fraction of requests, so threshold drift gets caught before it harms users.

04 Prompt and Template Management

Prompts are not strings; they are versioned artifacts. Production LLM systems treat prompts the way they treat database schema: changes go through review, versions are tagged, deployments are tracked, and rollback is supported. Without this discipline, prompts evolve through ad-hoc edits in application code, which makes it impossible to attribute quality regressions to specific changes.

The architectural pattern: a prompt store separate from application code. Each prompt has a stable identifier, a current production version, optionally one or more A/B testing variants, and a history of previous versions. Application services request prompts by identifier; the store returns the appropriate version based on routing rules.

What the prompt store holds

Template body. The prompt text with variable placeholders for dynamic content (user query, retrieved context, examples).
Model parameters. Default temperature, max tokens, top-p, stop sequences. Can be overridden per call but the template carries reasonable defaults.
Variable schema. What variables the template requires, their types, validation rules. Catches "you forgot to pass user_query" at template resolution time rather than producing nonsense from the LLM.
Version metadata. Author, timestamp, change description, evaluation scores against a regression suite, deployment history.
Routing rules. Which version is in production, what fraction goes to which variant for A/B tests, which version each tenant pins to (when relevant).
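A rough sketch of such a store follows, assuming an in-memory dictionary in place of a real database. `PromptStore`, `PromptVersion`, and their method names are hypothetical; version metadata and A/B routing are omitted to keep the example short.

```python
from dataclasses import dataclass
from string import Formatter
from typing import Dict, FrozenSet, Optional


@dataclass
class PromptVersion:
    version: int
    body: str
    required_vars: FrozenSet[str]   # variable schema, derived from the template body
    params: dict                    # default model parameters for this template

    def render(self, **variables: str) -> str:
        # Catch missing variables at resolution time, before any model call.
        missing = self.required_vars - set(variables)
        if missing:
            raise ValueError(f"missing template variables: {sorted(missing)}")
        return self.body.format(**variables)


class PromptStore:
    def __init__(self) -> None:
        self._versions: Dict[str, Dict[int, PromptVersion]] = {}
        self._production: Dict[str, int] = {}   # routing rule: prompt_id -> live version

    def publish(self, prompt_id: str, body: str, params: dict) -> PromptVersion:
        versions = self._versions.setdefault(prompt_id, {})
        required = frozenset(name for _, name, _, _ in Formatter().parse(body) if name)
        published = PromptVersion(len(versions) + 1, body, required, params)
        versions[published.version] = published
        self._production[prompt_id] = published.version
        return published

    def resolve(self, prompt_id: str, pinned_version: Optional[int] = None) -> PromptVersion:
        version = pinned_version or self._production[prompt_id]
        return self._versions[prompt_id][version]

    def rollback(self, prompt_id: str, version: int) -> None:
        if version not in self._versions[prompt_id]:
            raise KeyError(f"unknown version {version} for {prompt_id}")
        self._production[prompt_id] = version


store = PromptStore()
store.publish("support_answer", "Answer the question: {user_query}", {"temperature": 0.2})
store.publish("support_answer", "Context: {context}\nQuestion: {user_query}", {"temperature": 0.2})

live = store.resolve("support_answer")           # version 2 is in production
try:
    live.render(user_query="How do I reset my password?")
    missing_error = None
except ValueError as exc:
    missing_error = str(exc)                     # context was not supplied

store.rollback("support_answer", 1)              # previous versions stay addressable
rolled_back = store.resolve("support_answer")
```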

Why this matters for system design

Two properties of LLM systems make prompt management more important than it sounds. First, small prompt changes can produce large quality changes. A reordered example, a different system message, a new instruction, can shift response quality measurably. Without versioning, regressions look like provider issues. Second, prompts often grow over time. A prompt that started as a 50-token instruction becomes a 1500-token system message over months as edge cases are addressed. This drift has cost implications (tokens are billed) that are invisible without prompt-version tracking.

The interview implication: when LLM features are in scope, name prompt management as a first-class component. The interviewer is looking for awareness that prompts are not static strings; they are configuration that changes over time and needs the same discipline as other configuration.

05 Retrieval-Augmented Generation Flows

LLMs trained on a fixed dataset don't know about your specific data: your users, your content, your documentation, your inventory. Retrieval-Augmented Generation (RAG) is the architectural pattern for grounding LLM responses in your data without retraining the model. The flow is: take the user's query, retrieve relevant context from your data, augment the prompt with that context, ask the LLM to generate a response constrained by the retrieved material.

RAG is no longer a research pattern; it is the default architectural answer for any product feature that needs LLMs to know proprietary information. Customer support tools retrieve from the support knowledge base. Internal company assistants retrieve from internal docs. E-commerce shopping assistants retrieve from the product catalog. The pattern is the same; only the corpus differs.

The pipeline

Five stages, each with its own engineering concerns:

Embed the query. Convert the user's query into a vector using an embedding model. This is itself a model call (cheaper than a full LLM call but not free) and adds latency on the order of 50-100ms.
Retrieve candidates. Search a vector database for items similar to the query embedding. Returns the top-k most similar candidates, typically k=5 to k=50. The vector databases concept page covers the index types and recall-latency tradeoffs.
Re-rank. Optionally re-rank the candidates using a more expensive scoring method (cross-encoder, LLM-as-judge, business rules). This often improves quality enough to justify the latency.
Assemble the prompt. Combine the user's query, the retrieved context, and the prompt template into the final input for the generation call. Context window management matters here: too much retrieved content blows the budget and can degrade response quality.
Generate. The actual LLM call that produces the user-visible response. Streamed back to the client to reduce perceived latency.

Each stage has its own caching, observability, and failure-handling needs. A common architectural mistake is treating RAG as a single black-box call, which conflates the failure modes and obscures cost. Each stage should be its own service or component with its own metrics.
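One way to keep the stages separable is to pass each one in as its own function and time it individually. The sketch below assumes toy stand-ins throughout: `run_rag`, the word-overlap "embedding", the three-document `CORPUS`, and a `toy_generate` that simply returns its prompt are all hypothetical.

```python
import time
from typing import Callable, Dict, List, Set, Tuple

CORPUS = [
    "Reset your password from the account settings page.",
    "Refunds are issued within five business days.",
    "Password resets expire after one hour.",
]
TEMPLATE = "Context:\n{context}\n\nQuestion: {query}"


def run_rag(query: str, embed: Callable, retrieve: Callable, rerank: Callable,
            generate: Callable, top_k: int = 3, keep: int = 2) -> Tuple[str, Dict[str, float]]:
    timings: Dict[str, float] = {}

    def timed(stage: str, fn: Callable, *args):
        # Per-stage timing so latency is attributable to one stage, not "RAG" as a whole.
        start = time.perf_counter()
        result = fn(*args)
        timings[stage] = time.perf_counter() - start
        return result

    vector = timed("embed", embed, query)
    candidates = timed("retrieve", retrieve, vector, top_k)
    ranked = timed("rerank", rerank, query, candidates)[:keep]
    prompt = timed("assemble", lambda: TEMPLATE.format(context="\n".join(ranked), query=query))
    answer = timed("generate", generate, prompt)
    return answer, timings


def toy_embed(text: str) -> Set[str]:
    """Toy 'embedding': the set of lowercase words. Not a real embedding model."""
    return set(text.lower().replace(".", "").replace("?", "").split())


def toy_retrieve(vector: Set[str], top_k: int) -> List[str]:
    # Word overlap stands in for vector similarity search.
    return sorted(CORPUS, key=lambda doc: len(vector & toy_embed(doc)), reverse=True)[:top_k]


def toy_rerank(query: str, candidates: List[str]) -> List[str]:
    return candidates   # placeholder for a cross-encoder or other scorer


def toy_generate(prompt: str) -> str:
    return prompt       # echo the prompt so the assembled context is visible


answer, timings = run_rag("How do I reset my password?", toy_embed, toy_retrieve,
                          toy_rerank, toy_generate)
```

Because `keep` is smaller than `top_k`, the refund document is retrieved but never reaches the prompt, which is the "tighter top-k after re-ranking" behavior in miniature.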

Common RAG pitfalls worth naming

Chunking strategy matters. The corpus has to be split into retrievable chunks before embedding. Chunks too small lose context; chunks too large dilute relevance. Production systems often run multiple chunking strategies in parallel (semantic chunks, fixed-size chunks, hierarchical chunks) and merge results.
Embedding model choice locks you in. Once a corpus is embedded with a particular model, switching models requires re-embedding everything. Production systems often run with two embedding models in parallel during transitions, similar to how database migrations work.
Retrieval quality is the ceiling on response quality. The LLM can only ground its response in what was retrieved. If the right document isn't in the top-k, no amount of prompt engineering recovers from it. Investing in retrieval quality (better embeddings, hybrid search combining keyword and vector, re-ranking) often beats investing in prompt engineering.
Context window pollution. Stuffing too much retrieved context into the prompt degrades response quality. The LLM gets distracted by irrelevant material. Tighter top-k with better re-ranking generally beats looser top-k with more context.
Stale corpora. The vector database is a derived view of source content. When source content changes, the embeddings must update, or RAG returns outdated information. Production systems pipeline corpus updates through embedding regeneration on change.
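As a concrete reference point for the chunking tradeoff, here is a minimal fixed-size chunker. The function name `chunk_words` is hypothetical, and the `overlap` parameter is an assumption added for illustration (a common way to soften context loss at chunk boundaries), not something the pitfalls above prescribe.

```python
from typing import List


def chunk_words(text: str, size: int, overlap: int = 0) -> List[str]:
    """Split text into chunks of at most `size` words, sharing `overlap` words between neighbors."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be in [0, size)")
    words = text.split()
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break   # the last chunk already reaches the end of the text
    return chunks


chunks = chunk_words("a b c d e f g h i j", size=4, overlap=1)
# chunks == ["a b c d", "d e f g", "g h i j"]
```

Smaller `size` values give more precise retrieval units with less surrounding context; larger values do the opposite, which is why the right setting is usually found by evaluation rather than chosen up front.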

For a full RAG-pattern walkthrough including the swimlane diagram of the request lifecycle, see the AI-augmented apps walkthrough.

06 Embedding Pipelines

Embeddings are the connective tissue of modern AI infrastructure. They power semantic search, RAG retrieval, recommendation similarity, semantic caching, content clustering, anomaly detection, and many other features. Production systems generate embeddings at two points: when documents enter the corpus (batch-embed everything) and when queries arrive (embed the query for retrieval).

Batch embedding for corpus content

The flow: when content is added or updated in the source system (a new support article, a new product, a new document), an event flows to an embedding pipeline. The pipeline embeds the content, writes the embedding to the vector database, and acknowledges back to the source. The pipeline handles failures, retries, and re-embeds when the embedding model changes.

Architecturally, this is an event-driven workflow on top of message queues. Source systems emit "content_changed" events. The embedding pipeline consumes them, calls the embedding model, and writes to the vector store. The pipeline's own compute typically isn't the bottleneck; the embedding model's rate limits and the vector database's write capacity are. Batch embedding jobs (bulk re-embed when changing models) can use higher-parallelism workers to push through faster.
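A sketch of the consumer side, under simplifying assumptions: an in-process `queue.Queue` stands in for a durable message queue, a dictionary stands in for the vector database, and the `drain` function and event fields (`doc_id`, `text`) are hypothetical names.

```python
import queue
from typing import Callable, Dict, List


def drain(events: "queue.Queue[dict]", embed: Callable[[str], List[float]],
          vector_store: Dict[str, dict], model_version: str,
          max_attempts: int = 3) -> List[str]:
    """Consume content_changed events; return the doc_ids that failed every attempt."""
    failed: List[str] = []
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            break
        for attempt in range(max_attempts):
            try:
                # Upsert keyed by doc_id, so a redelivered event overwrites rather than duplicates.
                vector_store[event["doc_id"]] = {
                    "vector": embed(event["text"]),
                    "model": model_version,   # recorded so a model change can trigger re-embedding
                }
                break
            except Exception:
                if attempt == max_attempts - 1:
                    failed.append(event["doc_id"])
    return failed


events: "queue.Queue[dict]" = queue.Queue()
events.put({"doc_id": "a1", "text": "new support article"})
events.put({"doc_id": "a1", "text": "edited support article"})

store: Dict[str, dict] = {}
failed = drain(events, embed=lambda text: [float(len(text))],
               vector_store=store, model_version="toy-v1")
```

Storing the model version next to each vector is one simple way to find stale entries during a model migration; a real pipeline would also need backoff between attempts and a dead-letter path for the failures.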

Online embedding for queries

Query embedding is on the synchronous critical path of search and RAG. The latency budget is tight: typical embedding model calls take 30-80ms, which is significant when the total response budget is a few seconds. Production systems often co-locate the embedding model with the application (running on local GPUs or fast inference services) to eliminate round-trip overhead. Some systems also batch query embeddings across concurrent users to amortize the cost, accepting a small latency increase in exchange for higher throughput.

Embedding cache

For commonly-asked queries, the query embedding can be cached. This avoids re-embedding the same text repeatedly. The cache is keyed on the query string (or its hash) and stores the embedding vector. Hit rates depend on query distribution; head-heavy distributions (a few popular queries dominate) see high hit rates and meaningful latency wins.
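A minimal version of that cache might look like the following. `EmbeddingCache` is a hypothetical name, and the normalization step (trimming whitespace and lowercasing before hashing) is an assumption that only suits embedding models where case does not matter.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    def __init__(self, embed: Callable[[str], List[float]]):
        self._embed = embed
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, query: str) -> List[float]:
        # Assumption: case and surrounding whitespace do not change the embedding we want.
        normalized = query.strip().lower()
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        vector = self._embed(normalized)
        self._store[key] = vector
        return vector


embedding_cache = EmbeddingCache(embed=lambda text: [float(len(text))])  # toy embedding
v1 = embedding_cache.get("reset password")
v2 = embedding_cache.get("  Reset Password ")   # same key after normalization
```

This sketch never evicts; a production cache would bound its size and would need to be flushed when the embedding model changes.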

Cost asymmetry

Embedding model calls are dramatically cheaper than generation calls. A typical embedding costs a small fraction of a cent per call; a full LLM generation might cost orders of magnitude more. This asymmetry shapes architectural decisions: it's almost always worth spending more on the retrieval side (more candidates, more re-ranking) if it improves the quality of the eventual generation. The generation is where the cost concentrates; embeddings are cheap insurance.

07 Token Budgets and Rate Limiting

Traditional rate limiting counts requests per second. LLM rate limiting must count tokens per minute or hour. The reasons: token counts are what providers actually bill on, generation latency scales with output token count, and the natural unit of "expensive call" is "produces lots of tokens" not "lots of requests."

The architecture: per-tenant token budgets enforced at the gateway. Each request consumes tokens proportional to its prompt size plus its generated response. The gateway maintains running counters per tenant, per model, and per time window, rejecting requests that would exceed quotas. This is a specialization of the standard rate limiting pattern with token counts as the unit instead of request counts.

What needs token budgets

Per-tenant token quotas. Tier-based limits on total tokens per hour, day, or month. Free tier might allow 10K tokens per day, paid tiers 100K to 10M. Crossing the limit returns HTTP 429 with a header indicating quota and reset time.
Per-model rate limits. Provider-imposed rate limits on tokens per minute per model. The gateway enforces these to avoid hitting provider 429s, which would cascade to user-visible failures.
Per-request token caps. Maximum tokens any single request can produce. Prevents pathological prompts from generating runaway responses.
Per-conversation budgets. For multi-turn flows, total tokens across the conversation. Stops conversations from accumulating runaway context.

Why this is harder than request-based rate limiting

Two properties make token-based rate limiting more nuanced than request-based. First, you don't know the cost until after the call completes. Generation token counts depend on what the model produces, which you can't fully predict. Production systems use estimated cost (prompt tokens + max_tokens parameter) for the pre-call check and reconcile actual cost after the response. Over-budget reconciliations either bill the overage or pause future calls until the budget recovers. Second, fairness across requests is more complex. A small prompt requesting a long response and a large prompt requesting a short response have different ratios of input to output tokens, which providers price differently. Per-tenant budgets can be expressed in either dollars or tokens; dollars are more accurate but require pricing tables that change with provider updates.
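The estimate-then-reconcile flow can be sketched as a small reservation ledger. `TokenBudget` and its method names are hypothetical, the budget covers a single tenant and a single time window, and window resets and overage billing are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenBudget:
    limit: int          # tokens allowed in the current window
    used: int = 0       # actual tokens consumed by completed calls
    reserved: int = 0   # worst-case tokens held by in-flight calls

    def try_reserve(self, prompt_tokens: int, max_tokens: int) -> Optional[int]:
        """Pre-call check using the worst-case estimate; None means reject (e.g. HTTP 429)."""
        estimate = prompt_tokens + max_tokens
        if self.used + self.reserved + estimate > self.limit:
            return None
        self.reserved += estimate
        return estimate

    def reconcile(self, estimate: int, prompt_tokens: int, completion_tokens: int) -> None:
        """Post-call: release the reservation and charge what was actually consumed."""
        self.reserved -= estimate
        self.used += prompt_tokens + completion_tokens


budget = TokenBudget(limit=1000)
hold = budget.try_reserve(prompt_tokens=200, max_tokens=500)       # reserves 700
rejected = budget.try_reserve(prompt_tokens=200, max_tokens=500)   # would exceed 1000
budget.reconcile(hold, prompt_tokens=200, completion_tokens=120)   # actual cost: 320
retry = budget.try_reserve(prompt_tokens=200, max_tokens=400)      # now fits: 320 + 600
```

Reserving the worst case keeps concurrent requests from jointly overshooting the quota, at the price of rejecting some requests that would have fit once their actual output length was known.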

Cost monitoring as a primary product feature

For products where customers see their own usage (B2B SaaS with API access, anything with usage-based billing), token budgets feed directly into customer-visible dashboards. The accuracy of these dashboards depends on the gateway logging every token correctly; lost or duplicated logging shows up as billing disputes. Production systems treat token logging with the same discipline as financial transaction logging: durable, auditable, reconcilable against the provider's own usage reports.

08 Latency Strategies and Fallback Handling

LLM calls dominate p99 latency in any flow that includes them. A typical generation runs 1-5 seconds; longer outputs run 10-30 seconds; some agentic flows run minutes. Compared to a database read at 1-5 milliseconds, this is a different latency regime that requires different architectural tools.

Streaming as the default user-facing pattern

The single most impactful UX move: stream the response token by token to the client rather than waiting for the full generation to complete. The user sees the first words within hundreds of milliseconds even when the full response takes seconds. Perceived latency drops dramatically; actual end-to-end time is the same or marginally worse.

Streaming requires plumbing that traditional request-response APIs don't have. The gateway holds the connection open between client and provider, forwarding tokens as they arrive. WebSocket or Server-Sent Events on the client side; streaming HTTP from the provider. This adds connection-management complexity (similar to chat messaging) but the UX win justifies it for any user-facing LLM feature.
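On the Server-Sent Events side, forwarding amounts to wrapping each token in an event-stream frame as it arrives. The sketch below assumes a hypothetical `sse_frames` helper and a JSON payload shape of my choosing; the closing `[DONE]` sentinel is a convention some providers use, not part of the SSE format itself.

```python
import json
from typing import Iterable, Iterator


def sse_frames(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each token in a Server-Sent Events frame as soon as it is available."""
    for token in tokens:
        # JSON-encode the token so embedded newlines cannot break the frame format.
        yield f"data: {json.dumps({'token': token})}\n\n"
    # Assumed end-of-stream marker so the client knows the generation finished cleanly.
    yield "data: [DONE]\n\n"


def fake_provider_stream() -> Iterator[str]:
    """Hypothetical stand-in for a provider's streaming response."""
    yield from ["Hel", "lo", "\nworld"]


frames = list(sse_frames(fake_provider_stream()))
```

Because `sse_frames` is a generator, a web framework that accepts an iterator as a response body can flush each frame to the client without buffering the full generation.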

Async patterns for long-running work

For generations longer than a few seconds (large documents, complex agentic workflows, batch processing), synchronous request-response breaks down. The standard pattern: accept the request, return a job ID immediately, process asynchronously, deliver the result via webhook or polling. This mirrors the long-running operation pattern from API platforms. The architectural payoff: client connections are short, retries are clean, work survives client disconnects.
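The submit-then-poll shape can be illustrated with a thread pool. This is a sketch under strong simplifications: `JobRunner` is a hypothetical name, and in-memory futures stand in for the durable queue and job table that would be needed for work to survive a process restart.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


class JobRunner:
    def __init__(self, generate: Callable[[str], str], workers: int = 2):
        self._generate = generate
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._jobs: Dict[str, Future] = {}

    def submit(self, prompt: str) -> str:
        """Accept the request and return a job ID immediately."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(self._generate, prompt)
        return job_id

    def poll(self, job_id: str) -> dict:
        """Report job state; the client calls this until the status is terminal."""
        future = self._jobs[job_id]
        if not future.done():
            return {"status": "running"}
        error = future.exception()
        if error is not None:
            return {"status": "failed", "error": str(error)}
        return {"status": "succeeded", "result": future.result()}


def slow_generate(prompt: str) -> str:
    """Hypothetical stand-in for a long-running generation."""
    return f"report for: {prompt}"


runner = JobRunner(slow_generate)
job_id = runner.submit("Q3 support tickets")
```

A webhook variant would replace polling with a callback fired when the future completes; the job ID plays the same role in both.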

Fallback hierarchy

Production systems plan for provider failures rather than hoping they don't happen. The fallback hierarchy typically has four or five levels:

Primary provider, primary model. The default choice. Best quality, current pricing, expected SLA.
Primary provider, fallback model. Same provider but a smaller, faster, cheaper model. Used when the primary model is rate-limited or experiencing latency spikes.
Secondary provider. A different provider's equivalent model. Used when the primary provider has an outage. Requires compatible API formats (the gateway abstracts these) and acceptable quality across providers.
Self-hosted fallback. A locally-hosted model on owned infrastructure. Quality is typically lower than hosted frontier models but availability is independent of any external provider. Used as the last generation layer when external providers are all unavailable.
Cached fallback. If all generation paths fail, serve a cached or pre-computed response with a degraded-mode notice. Better than a hard error for many features (recommendation, summary, suggestion features can degrade gracefully).
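The hierarchy above reduces to an ordered walk over generation paths with a cached response as the final net. The function `generate_with_fallback`, the level names, and the result fields are hypothetical choices for this sketch.

```python
from typing import Callable, List, Optional, Tuple


def generate_with_fallback(prompt: str,
                           levels: List[Tuple[str, Callable[[str], str]]],
                           cached_fallback: Optional[str] = None) -> dict:
    """Try each generation path in order; fall back to a cached response if all fail."""
    errors = []
    for name, call in levels:
        try:
            return {"text": call(prompt), "served_by": name, "degraded": False}
        except Exception as exc:
            errors.append((name, repr(exc)))   # keep per-level failures for observability
    if cached_fallback is not None:
        # Every generation path failed: serve the cached answer and flag degraded mode.
        return {"text": cached_fallback, "served_by": "cache", "degraded": True}
    raise RuntimeError(f"all generation paths failed: {errors}")


def down(prompt: str) -> str:
    raise ConnectionError("provider outage")   # simulated failure


def self_hosted(prompt: str) -> str:
    return "short answer"                      # simulated lower-quality local model


levels = [
    ("primary/large", down),
    ("primary/small", down),
    ("secondary", down),
    ("self-hosted", self_hosted),
]
served = generate_with_fallback("Summarize the ticket", levels)
degraded = generate_with_fallback("Summarize the ticket", levels[:3],
                                  cached_fallback="Summary unavailable right now.")
```

Returning `served_by` and `degraded` alongside the text lets the caller show a degraded-mode notice and lets dashboards track how often each level is actually carrying traffic.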

The retry budget

Naive retry on LLM failures can amplify problems. If a provider is rate-limiting, retrying immediately makes it worse. If a generation is slow because the prompt is pathological, retrying produces the same slow result. Production retry policies use exponential backoff, cap retries at 2-3 attempts, and switch to fallback providers rather than retrying the same provider repeatedly.
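A capped exponential-backoff policy is short enough to show in full. The helper name `call_with_retry` and its defaults are assumptions; the `sleep` parameter is injectable only so the example can record delays instead of waiting, and jitter is omitted for brevity.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_retry(call: Callable[[], T], max_attempts: int = 3,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry with exponential backoff; re-raise after the final attempt so the caller can fall back."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                  # budget exhausted: hand off to fallback
            sleep(base_delay * (2 ** attempt))         # 0.5s, 1.0s, 2.0s, ...
    raise ValueError("max_attempts must be at least 1")


delays: List[float] = []
attempts = {"count": 0}


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("provider timeout")   # simulated transient failure
    return "ok"


result = call_with_retry(flaky, sleep=delays.append)
# Two failures, two backoff delays (0.5 then 1.0), then success on the third attempt.
```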

For user-facing flows with tight latency budgets, retries may not be possible at all. If the original call took 4 seconds and failed, a retry would push total latency to 8+ seconds, past most user tolerance. Better to fall back to a cached or pre-computed response than to retry. Production systems often have a "fail fast and degrade" policy for user-facing flows and a more aggressive retry policy for background work where latency is less critical.

Reliability through layering, not provider SLAs

Without fallback, the reliability of an LLM-backed feature is capped by its single provider. If you depend on a single provider with 99.9% uptime, your feature has 99.9% uptime at best, minus whatever your own infrastructure subtracts. Layered fallback (multiple providers, self-hosted, cached) is how production systems achieve effectively higher reliability than any single provider offers, because the feature fails only when every layer fails at once. The interview move: name layered fallback as the architectural answer to provider reliability concerns. Single-provider designs are visibly fragile to anyone who's operated LLM features at scale.
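The arithmetic behind layering is worth seeing once. The calculation below assumes the layers fail independently, which is optimistic: real outages are often correlated (shared cloud regions, shared networks), and the gateway itself remains a serial dependency. The function name and the availability figures are illustrative only.

```python
from typing import Iterable


def combined_availability(layer_availabilities: Iterable[float]) -> float:
    """Availability of a fallback chain, assuming layers fail independently."""
    probability_all_fail = 1.0
    for availability in layer_availabilities:
        probability_all_fail *= (1.0 - availability)
    return 1.0 - probability_all_fail


single = combined_availability([0.999])          # one provider at 99.9%
dual = combined_availability([0.999, 0.999])     # two independent providers at 99.9% each
```

Under the independence assumption, two 99.9% providers fail together about one time in a million, so the chain reaches roughly 99.9999%. Treat that as an upper bound rather than a forecast.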

09 How This Concept Composes with Patterns

LLM infrastructure shows up across many of the canonical system design patterns, often as a sub-component rather than the headline feature. Recognizing where it composes is part of the senior-level expectation.

AI-augmented apps. The pattern walkthrough where LLM infrastructure is the headline. Multi-tier caching, RAG flow, token rate limiting, fallback handling are all explicit architectural components. The walkthrough composes this concept directly.
Search. Modern search systems use embeddings for semantic retrieval and increasingly use LLMs for query understanding (rephrasing ambiguous queries) and re-ranking (LLM-as-judge for top results). The LLM gateway and embedding pipelines compose with the search pattern's inverted index primitive.
Social feed. Modern feed ranking uses embedding-based candidate generation alongside collaborative filtering. Content moderation increasingly relies on LLM classifiers. Both compose this concept's gateway and embedding infrastructure.
E-commerce. Product search, recommendation, and shopping-assistant features all use LLMs in 2026 systems. The catalog itself often has embedded representations of products for similarity-based recommendation.
Notification systems. Subject line and body generation, content personalization, and engagement prediction increasingly use LLM calls. The notification pipeline composes with this concept at the content-generation stage.
Collaborative editing. Autocomplete, summarization, and inline suggestion features in productivity tools are LLM-backed. The collaborative editing pipeline composes with this concept at the suggestion-generation layer.
Vector databases. The storage primitive used by RAG and semantic caching. Vector databases and LLM infrastructure compose tightly: the gateway calls the embedding model, queries the vector database, then calls the generation model.
Caching. Semantic caching is a specialization of the caching pattern with similarity-based key matching instead of exact equality. The cache invalidation, sizing, and hit-rate analysis from the caching concept all apply.
Rate limiting. Token-based rate limiting is a specialization of request-based rate limiting with tokens as the unit. The multi-dimensional rate limit pattern (per-tenant, per-model, per-endpoint) carries directly over.
Observability. LLM features add quality observability alongside operational observability. Beyond latency and error rate, production systems track response quality (via sampled human review, automated eval against test sets, or LLM-as-judge), prompt drift over time, and cache-hit response quality vs fresh-generation response quality.
Message queues. Embedding pipelines, batch generation jobs, and async LLM workflows all use queues for durability and decoupling. The webhook and event-source patterns from the queues concept apply directly.
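
To make the rate-limiting entry above concrete, here is a minimal token-bucket sketch where the unit is LLM tokens rather than requests, keyed per tenant and per model. TokenRateLimiter and its parameters are hypothetical names, and a production limiter would typically keep its state in a shared store rather than in process memory.

```python
import time


class TokenRateLimiter:
    """Token bucket whose unit is LLM tokens, with one bucket per (tenant, model)."""

    def __init__(self, capacity, refill_per_s, clock=time.monotonic):
        self.capacity = capacity          # max LLM tokens a bucket can hold
        self.refill_per_s = refill_per_s  # LLM tokens restored per second
        self.clock = clock
        self._buckets = {}                # (tenant, model) -> (available, last_ts)

    def allow(self, tenant, model, llm_tokens):
        """Spend llm_tokens from the bucket if available; return whether allowed."""
        now = self.clock()
        key = (tenant, model)
        available, last = self._buckets.get(key, (self.capacity, now))
        # Refill for the elapsed time, never exceeding capacity.
        available = min(self.capacity, available + (now - last) * self.refill_per_s)
        allowed = llm_tokens <= available
        if allowed:
            available -= llm_tokens
        self._buckets[key] = (available, now)
        return allowed
```

A caller would estimate the prompt plus expected completion tokens before the call and pass that estimate as llm_tokens. Adding an endpoint dimension only means widening the key.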

10 Common Pitfalls and Interview Moves

The patterns that distinguish strong from weak treatment of LLM infrastructure in system design interviews.

Pitfalls to avoid

Treating LLM calls as equivalent to ordinary API calls. They are 100-1000x slower and cost more by similar margins. Designing as if they were free, fast operations produces unworkable systems. Always surface the latency and cost asymmetry early.
Single-provider dependency. Hosted providers fail. Designs that have no fallback path are visibly fragile. Layered fallback (multiple providers, self-hosted, cached) is the production answer.
Skipping caching. Without semantic caching, every request pays the full generation cost, so spend grows in lockstep with traffic. Production systems need 30-60% combined cache hit rates to be economically viable at scale. Mention caching as a primary architectural component, not an optimization.
Conflating retrieval quality with generation quality. RAG systems are bottlenecked by retrieval. The LLM can only ground in what was retrieved. Investing in better retrieval often beats investing in prompt engineering.
Ignoring observability. LLM features regress invisibly without quality monitoring. Latency and error rate aren't enough; sampled response review, eval suites, and prompt-version tracking are required.
Treating prompts as code constants. Prompts are versioned configuration that changes frequently. Without prompt management infrastructure, regressions look like provider issues and root cause analysis becomes guesswork.
Synchronous-only design for long generations. Generations longer than a few seconds need async patterns. Hard-coding synchronous flows breaks for batch work, complex agents, and long-form generation.
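
As an illustration of the semantic caching named in the pitfalls above, the sketch below matches on cosine similarity against stored embeddings instead of exact key equality. It assumes the caller supplies embeddings from some embedding model. SemanticCache, the 0.95 threshold, and the linear scan are illustrative choices; at scale a vector database would stand in for the list.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Cache keyed by embedding similarity rather than exact equality."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self._entries = []  # list of (embedding, response)

    def put(self, embedding, response):
        self._entries.append((list(embedding), response))

    def get(self, embedding):
        """Return the closest stored response, or None below the threshold."""
        best_score, best_response = 0.0, None
        for stored, response in self._entries:
            score = cosine(embedding, stored)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

The threshold is the main tuning knob: set too low, the cache returns answers to questions that were not actually asked, which is why cache-hit response quality deserves its own monitoring.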

Interview moves that signal current-era awareness

Surface LLM considerations even on non-AI questions. Many questions in 2026 have an AI dimension whether or not the question states it. "Design a customer support tool" benefits from naming RAG and LLM infrastructure. "Design a content moderation system" benefits from naming LLM classifiers alongside traditional ML. Surfacing this signals current awareness; ignoring it signals stale knowledge.
Name the cost dimension explicitly. "An LLM call here costs about a cent; at our scale that's $X per day" is a specific, current-era observation. Generic "we'll cache responses" is weaker; specific cost framing is stronger.
Treat the LLM as an unreliable, slow, expensive dependency. What distinguishes strong designs from weak ones is recognizing that the LLM is the constrained external dependency around which everything else is engineered. Caching, fallback, async patterns, and streaming all follow from this framing.
Reference specific patterns from this concept page. "I'd put an LLM gateway in front of our calls so we can manage caching, prompt versioning, and provider routing centrally" is a specific architectural recommendation. Vague "we'll use AI" answers signal surface-level awareness.
Acknowledge what you don't know. LLM infrastructure is evolving fast. The right answer in 2026 may be different from the right answer in 2025 or 2027. Strong candidates name the current production patterns and acknowledge the rate of change rather than claiming permanent expertise.
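
The cost framing in the moves above reduces to simple arithmetic. The sketch below assumes a flat per-call cost and treats cache hits as free, which ignores embedding and lookup costs. The function name and the example traffic figures are illustrative, not measurements.

```python
def daily_llm_cost(requests_per_day, cost_per_call_usd, cache_hit_rate=0.0):
    """Estimated daily spend, counting only requests that miss the cache."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    paid_calls = requests_per_day * (1.0 - cache_hit_rate)
    return paid_calls * cost_per_call_usd


# Illustrative: 1M requests/day at about one cent per call.
no_cache = daily_llm_cost(1_000_000, 0.01)
with_cache = daily_llm_cost(1_000_000, 0.01, cache_hit_rate=0.4)
print(f"${no_cache:,.0f}/day uncached, ${with_cache:,.0f}/day at a 40% hit rate")
```

Stating both numbers in an interview ties the caching decision directly to the budget rather than presenting it as a generic optimization.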

The infrastructure for serving LLMs is not new infrastructure invented from scratch. It is familiar primitives (gateways, caches, queues, rate limits) composed in ways that compensate for the LLM's specific weaknesses. Strong designs make the composition explicit; weak designs treat LLM calls as if they had the latency, cost, and reliability of database reads. The senior-level move is recognizing the LLM as a particular kind of expensive constrained dependency and engineering the surroundings accordingly.
