
Caching for System Design Interviews

Where caches sit, what to cache, the four canonical patterns, eviction policies, the invalidation problem, stampedes, and what breaks when the cache fails. Organized around the depth probes interviewers actually ask.

By Arslan Ahmad · Last updated May 2026 · Reading time ~25 min

01 · Why Caching Is the Highest-Leverage Concept

Caching shows up in nearly every system design interview. For read-heavy products it's usually the first deep-dive area. For write-heavy products it's usually the second. Even infrastructure questions like "design a key-value store" almost always include caching layers in the discussion.

The reason is simple: caching is the single highest-leverage optimization in distributed systems. A well-placed cache turns a 50ms database query into a 1ms in-memory lookup; for request paths bottlenecked on that query, that's roughly 50x more requests per server. The math is so favorable that nearly every production system at scale relies on caching as a first-class architectural element.

Which is also why caching is where most candidates get exposed. The naive answer ("add Redis in front of the database") is the right starting point and the wrong stopping point. The interesting questions are not "do you cache" but "what gets cached, where, with what invalidation strategy, and what happens when the cache fails." That is what this page covers.

What This Page Covers

This is a deep-dive page. It assumes you've read the main guide's concept overview and want depth. If you're just starting to learn the concept families, the concept library hub is a better entry point. This page is organized around what interviewers probe on, not around the textbook taxonomy of caching.

02 · Where Do Caches Sit?

The first decision in any caching discussion is placement. Different cache layers solve different problems. A senior candidate names the layer they're putting a cache at and explains why. A junior candidate says "we add a cache" without specifying where.

There are four common cache placements. You'll see all four in production systems, often together.

Where Caches Sit in a Typical Architecture

[Diagram: the four cache placements in a typical request path — CLIENT → CDN (edge cache) → APP SERVER (local in-process cache) → DISTRIBUTED cache (Redis/Memcached, consulted on miss) → DB (+ buffer pool). Placement 1: edge/CDN (static, geo-close). Placement 2: application-local (fastest, smallest). Placement 3: distributed (shared, biggest). Placement 4: database tier (buffer pool).]

The four common cache placements. Most production systems use all four together. The rightward direction is from the user toward the data, with each cache catching what the previous one missed.

1. CDN (edge cache)

The cache sits at the network edge, geographically close to the user. CDN caches handle static assets (images, JavaScript, CSS), increasingly handle full HTML pages for static or semi-static content, and in 2026 are starting to handle some API responses through edge compute.

What lives here: anything that's the same for many users and changes infrequently. Marketing pages, product images, JavaScript bundles, CSS. Increasingly: HTML responses with personalization injected via edge functions.

What does not live here: anything per-user (user feeds, session data, account information) and anything that changes frequently.

2. Application-local (in-process) cache

The cache sits inside the application server's memory. Hits are nanoseconds because there's no network call. Common implementations: a simple in-memory dictionary, Caffeine in Java, or a process-local LRU.

What lives here: small reference data (configuration, feature flags, lookup tables), request-scoped memoization, and rarely-changing data that's expensive to compute.

What does not live here: anything that needs to be consistent across servers, anything large enough to blow up the heap, and anything user-scoped if you have many servers (each server would need its own copy of every user's data, which doesn't scale).
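To make the in-process option concrete, here's a minimal sketch of a process-local LRU cache in Python using only the standard library. Production code would reach for a library (Caffeine in Java, cachetools in Python) and add thread-safety; this just shows the shape:

```python
from collections import OrderedDict

class LocalLRUCache:
    """Minimal in-process LRU cache. Hits never leave the process.
    Not thread-safe; wrap operations in a lock if shared across threads."""

    def __init__(self, max_items=1024):
        self.max_items = max_items
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None                     # miss: fall back to the next layer
        self._data.move_to_end(key)         # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_items:
            self._data.popitem(last=False)  # evict the least recently used

# Typical contents: small, rarely-changing reference data.
flags = LocalLRUCache(max_items=256)
flags.put("checkout_v2_enabled", True)
```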

3. Distributed cache

The cache sits as its own service, shared across application servers. Redis and Memcached are the canonical examples. This is what most candidates think of when they hear "cache" in an interview context.

What lives here: session data, rendered fragments, computed views (like Twitter timelines), API response cache, anything that needs to be consistent across the application tier and is too big or per-user for in-process caches.

What does not live here: anything that needs network-zero latency (use in-process for that), and anything you can't tolerate losing if the cache cluster fails.

4. Database tier (buffer pool)

Every modern database has internal caching. Postgres has its buffer pool. MySQL has the InnoDB buffer pool. Cassandra has its row cache and key cache. This is caching you mostly don't have to think about, but you should know it exists.

What lives here: recently-accessed data pages and indexes. The cache is managed by the database engine, sized by configuration. The candidate signal here is knowing when to push memory toward the buffer pool versus toward an external cache layer. The right answer depends on access patterns: highly-skewed reads benefit from a dedicated cache; uniform-random reads benefit from a larger buffer pool.

The interview move

When an interviewer asks "where would you cache this?" the strong answer names a layer and explains why. "I'd put it in the distributed cache because the data is per-user and we have many app servers; the in-process cache wouldn't share state across servers and the CDN can't see per-user data." Three sentences. One choice. Defended.

03 · What Gets Cached, and What Shouldn't

Not every piece of data benefits from caching. The two questions to ask: is it expensive to recompute, and is it accessed often relative to how often it changes. If both yes, cache it. If both no, don't.

Three categories worth caching almost always:

  • Read-heavy, slowly-changing data. User profiles, product catalogs, configuration. Read 1000x more than written. Cache lifetimes can be long. This is the canonical caching win.
  • Computed views. Pre-assembled timelines, recommendation rankings, aggregate counters. The computation is expensive; caching the result is high-leverage.
  • External API responses. Calls to third-party services that have rate limits or cost per call. Cache the response for some sensible TTL even if it's just a few seconds.

Three categories where caching is dangerous:

  • Frequently-changing data. If a value changes every read, the cache hit rate is zero and you've added complexity for no benefit. Worse, you've added a stale-data risk.
  • Strongly-consistent data. If the application logic depends on always seeing the latest value (financial balances, inventory counts during checkout), caching introduces correctness risk that needs careful handling.
  • Per-request unique data. If the cache key is unique per request, you'll never get a hit. Don't cache it.

The Cache-Hit-Rate Test

Before adding a cache, ask: what's the expected hit rate? If you can't construct an argument that it'll clear 50% (and ideally 80%+), the cache isn't worth adding. Strong candidates name an expected hit rate when proposing a cache. Weak candidates add caches without thinking about the rate.

04 · Cache Patterns: The Four Canonical Approaches

How does data get into the cache, and how does it get back out? There are four canonical patterns. Each has specific use cases. Knowing when to use each is one of the standard depth probes.

Cache-aside

Most common · Use when in doubt

Application checks the cache first. On cache miss, application reads from the database, then writes the result back to the cache. On writes, the application writes to the database and either invalidates the cache or updates it directly.

The cache and database are independent. The application is responsible for keeping them coordinated.

Simple, flexible, default choice. Risk: temporary inconsistency between database and cache after a write.
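A minimal cache-aside sketch in Python with redis-py. The `db` object and its `fetch_user`/`update_user` methods are placeholders for whatever data-access layer you have, not a real library:

```python
import json
import redis

r = redis.Redis()   # the distributed cache
TTL_SECONDS = 300

def get_user(user_id, db):
    key = f"user:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)                 # cache hit
    user = db.fetch_user(user_id)                 # miss: read source of truth
    r.setex(key, TTL_SECONDS, json.dumps(user))   # populate for next time
    return user

def update_user(user_id, fields, db):
    db.update_user(user_id, fields)   # write the database first
    r.delete(f"user:{user_id}")       # then invalidate; next read refills
```

Deleting rather than updating the cached value on write avoids one class of write/refresh races, at the cost of one extra miss per write.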

Read-through

When you want the cache to abstract the database

Application reads from the cache. The cache itself, on miss, fetches from the database, caches the result, and returns it. The application never talks to the database directly for reads.

The cache is the read-side interface to the data layer.

Cleaner application code; cache handles loading logic. Risk: cache becomes critical-path infrastructure; if it fails, reads fail.
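Read-through is usually a feature of the caching layer itself rather than application code, but a small wrapper shows the shape of the interface. The `loader` callable is an assumed stand-in for the database read:

```python
import json

class ReadThroughCache:
    """The application only ever calls get(); the cache owns the
    load-on-miss logic."""

    def __init__(self, redis_client, loader, ttl=300):
        self.r = redis_client
        self.loader = loader          # e.g. lambda key: db.fetch(key)
        self.ttl = ttl

    def get(self, key):
        cached = self.r.get(key)
        if cached is not None:
            return json.loads(cached)
        value = self.loader(key)      # the cache, not the app, hits the DB
        self.r.setex(key, self.ttl, json.dumps(value))
        return value
```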

Write-through

When you need cache and DB consistency on every write

Every write goes to the cache and to the database synchronously. Both are updated before the write is acknowledged. Reads always come from the cache, which is guaranteed to have current data.

The cache and the database are kept in lockstep.

Strong consistency between cache and DB. Risk: writes are slower because they wait for both stores; cache failure can block writes.
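A write-through sketch under the same assumptions (placeholder `db`, redis-py client). The ordering matters: the durable store is written first, so a cache write failure can never leave the cache ahead of the database:

```python
import json

def write_through(key, value, db, r, ttl=3600):
    db.write(key, value)                   # 1. durable store first
    r.setex(key, ttl, json.dumps(value))   # 2. then the cache
    # 3. only now acknowledge the write; cache reads are guaranteed
    #    to see this value until the TTL expires
```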

Write-behind (write-back)

When write throughput matters more than durability

Writes go to the cache only. The cache flushes to the database asynchronously, in batches. Reads come from the cache. The database is the eventual store.

The cache is the primary store; the database is the durable backup.

Fastest writes, batched DB load. Risk: cache failure means writes can be lost. Use only when this loss is acceptable or when the cache itself is durable.
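A write-behind sketch: acknowledge after the cache write, and let a background thread batch writes into the database. `db.bulk_write` is an assumed batch API; a production version needs the retry, ordering, and durability handling this omits:

```python
import json
import queue
import threading
import time

write_q = queue.Queue()

def write_behind(key, value, r):
    r.set(key, json.dumps(value))   # acknowledged after the cache write
    write_q.put((key, value))       # the DB write is deferred

def flusher(db, batch_size=100, interval=1.0):
    """Background thread: drain queued writes to the DB in batches."""
    while True:
        batch = []
        try:
            while len(batch) < batch_size:
                batch.append(write_q.get_nowait())
        except queue.Empty:
            pass
        if batch:
            db.bulk_write(batch)    # one batched round-trip instead of many
        time.sleep(interval)

# threading.Thread(target=flusher, args=(db,), daemon=True).start()
```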

Refresh-ahead (a less common but useful fifth)

Some caching implementations support refresh-ahead: when a cached value is approaching its TTL, the cache proactively refreshes it before expiration so that subsequent reads always hit fresh data. This is a useful technique for highly-accessed data where you want to avoid the latency hit of an occasional cache miss.

It's worth knowing the name. Most candidates won't bring it up. Volunteering it in the right context is a depth signal.

The interview move

When asked which pattern you'd use, the strong answer names cache-aside as the default and explains the choice. "I'd use cache-aside because it's the most flexible and we don't need the strict consistency guarantees of write-through. The application owns the cache update logic, which gives us more control if we need different invalidation strategies for different data types."

05 · Eviction Policies: What Gets Removed and When

Caches are bounded. When the cache fills up, something has to leave. The eviction policy decides what.

Four policies cover almost all production cases.

Policy | How it works | When to use
LRU | Evict the least-recently-used item. | The default. Works well for general workloads with temporal locality. Most cache systems default to LRU or an LRU approximation.
LFU | Evict the least-frequently-used item. | When some items are perpetually hot. LRU would evict a hot item that hasn't been accessed in the last N seconds, even though it's accessed millions of times daily. LFU keeps it.
FIFO | Evict the oldest item. | Rare in practice. Used when access patterns are uniform and the staleness of a cached item is the main concern.
TTL | Evict items past their time-to-live regardless of access pattern. | Often combined with one of the above. TTLs are the primary mechanism for bounding staleness; LRU and LFU bound size.

Production caches usually combine policies. Redis, for instance, pairs TTL expiration with a maxmemory eviction policy such as volatile-lru (under memory pressure, evict the least-recently-used among keys that have a TTL set) or allkeys-lru (consider every key).

The depth probe: when does LRU fail?

A common interviewer follow-up: "you've chosen LRU, but what if your workload has a one-time scan that touches every item once? What happens?" The answer: LRU evicts the entire hot working set in favor of one-time-touched items. This is called cache pollution. The fix is either a more sophisticated policy like LRU-K (which only counts an item as recently-used after K accesses) or admission control (don't add items to the cache unless they've been accessed multiple times).

Knowing this failure mode signals that you understand LRU isn't a silver bullet, which is what the probe tests.

06 · The Cache-Miss Path

Designing for cache hits is easy. Designing for cache misses is where systems break. The miss path needs to be handled deliberately, because under load it's what determines whether the system stays up or melts down.

What happens on a miss?

On a cache miss, the application falls back to the source of truth (usually the database), reads the value, returns it to the user, and writes it back to the cache for next time. This is straightforward in steady state. The complications appear under load.

The thundering herd problem

Imagine a cache key that's accessed 10,000 times per second. The cache entry expires. Suddenly all 10,000 requests in that second miss the cache. They all fall through to the database simultaneously. The database, expecting near-zero queries on this key (because the cache was handling them), now gets a 10,000 QPS spike. If the database can't handle that spike, the database crashes, and now no requests on this key can be served from anywhere.

This is the thundering herd, and it's one of the most-tested failure modes in caching deep dives.

Three solutions:

  • Single-flight (request coalescing). The first request that misses takes a lock and fetches from the database. Subsequent requests for the same key wait for the in-flight fetch instead of issuing their own. Once the fetch completes, all waiting requests get the answer. Most modern caching libraries support this (a minimal sketch follows this list).
  • Probabilistic early expiration. Instead of expiring exactly at TTL, items have a small chance of being treated as expired before the TTL. The first request to "see" the early expiration refreshes the cache. Most other requests still get the cached value. Smooths out the herd.
  • Stale-while-revalidate. When an item is expired, serve the stale value to the user while triggering a background refresh. The user never sees a miss latency. This pattern is increasingly the default for read-heavy systems where slightly-stale data is acceptable.
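A single-process single-flight sketch in Python. One lock per key ensures only one thread fetches on a miss; the double-check after acquiring the lock catches fetches that completed while waiting. Coalescing across servers needs a distributed lock (e.g., Redis SET NX with an expiry), which this omits:

```python
import json
import threading

_locks = {}                         # one lock per cache key
_locks_guard = threading.Lock()     # guards the _locks dict itself

def get_single_flight(key, r, load_from_db, ttl=300):
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    with _locks_guard:
        lock = _locks.setdefault(key, threading.Lock())
    with lock:                        # only one fetch in flight per key
        cached = r.get(key)           # double-check: another thread may
        if cached is not None:        # have refilled the key while we
            return json.loads(cached) # waited on the lock
        value = load_from_db(key)
        r.setex(key, ttl, json.dumps(value))
        return value
    # (a production version would also evict stale per-key locks)
```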

Cache stampedes (the related but distinct cousin)

A cache stampede is what happens when many keys expire simultaneously and all miss together. This is often caused by careless TTL choices, like setting all keys to expire at the same wall-clock time, or by a cache restart that wipes everything at once.

The fix is TTL jitter: instead of TTL=300s exactly, use TTL=300s plus a random offset of 0-30s. The expirations spread across a window instead of clustering at one moment.

Cache Stampede: With and Without TTL Jitter

[Chart: database QPS over time, synchronized vs. jittered TTL expiration. Without jitter, all keys expire at t=300s and DB QPS spikes sharply. With TTL=300s ± 30s of jitter, expirations spread across a window and DB QPS shows only a smooth bump around t=300s.]

Without jitter, every cache key expires at the same instant, producing a database load spike that can crash the system. With a small randomized TTL offset, expirations spread over a window and database load stays manageable.
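Jitter is a one-liner; the only decision is the size of the spread. A sketch with redis-py:

```python
import json
import random
import redis

r = redis.Redis()
BASE_TTL = 300   # seconds
JITTER = 30      # spread expirations across a 30-second window

def set_with_jitter(key, value):
    ttl = BASE_TTL + random.randint(0, JITTER)
    r.setex(key, ttl, json.dumps(value))
```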

07 · The Invalidation Problem (The Hard One)

"There are only two hard things in computer science: cache invalidation and naming things." The line (usually attributed to Phil Karlton) is overused, but it's stuck around because it's true. Cache invalidation is genuinely hard, and it's where most cache-related bugs in production come from.

The problem: when the underlying data changes, how does the cache learn about it? Three approaches, each with tradeoffs.

1. TTL-based expiration (the simplest)

Set a time-to-live on every cached item. When the TTL expires, the cache treats the item as missing and refetches on next access.

Pros: Trivially simple. Self-healing. No coordination required between writers and the cache.

Cons: You're guaranteed to serve stale data for up to the TTL window. Choosing the TTL is a tradeoff between staleness and database load: shorter TTLs mean fresher data and more cache misses; longer TTLs mean less load and more staleness.

When to use: when slight staleness is acceptable. Most read-heavy product designs.

2. Explicit invalidation (the most accurate)

When data changes, the writer (or a cleanup process) explicitly removes the affected cache entries. Subsequent reads will miss and refetch.

Pros: Cache is consistent with the underlying data within seconds of the write.

Cons: The writer has to know which cache keys are affected by its write. This is harder than it sounds: a single write can affect many cache keys (a tweet write affects the author's profile cache, every follower's timeline cache, the global trending cache, and so on). Track the cache dependencies wrong, and you serve stale data anyway.

When to use: when staleness is unacceptable and you can enumerate all affected cache keys.

3. Write-through or write-behind (the structural fix)

Use one of the cache patterns from Section 4 that updates the cache on every write. The cache and database stay in sync because every write touches both.

Pros: No invalidation problem because there's no stale state.

Cons: Slower writes (write-through) or risk of data loss (write-behind). The structural simplicity comes with operational tradeoffs.

The interaction nobody thinks about: distributed cache invalidation

If you have a single cache instance, invalidation is straightforward. If you have a distributed cache cluster, you need to invalidate the right node, which requires consistent hashing or routing. If you have multiple caches in front of one database (CDN + distributed cache + local cache), invalidating one doesn't invalidate the others, and you'll need a coordinated invalidation mechanism (cache purge APIs at the CDN layer, pub/sub for cache invalidation messages across local caches).
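One common shape for the cross-layer piece: a write deletes the distributed-cache entry, then broadcasts the key over Redis pub/sub so every app server drops its in-process copy (the CDN layer is purged separately through its purge API, not shown). A sketch assuming the local cache is a plain dict:

```python
import redis

CHANNEL = "cache-invalidation"

def invalidate_everywhere(key, r):
    r.delete(key)              # distributed cache
    r.publish(CHANNEL, key)    # notify every app server's local cache

def invalidation_listener(r, local_cache):
    """Runs in a background thread on each app server."""
    p = r.pubsub()
    p.subscribe(CHANNEL)
    for message in p.listen():
        if message["type"] == "message":
            local_cache.pop(message["data"].decode(), None)
```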

This layered invalidation is where production systems get gnarly. The interviewer rarely pushes here in a 45-minute loop, but at staff level it's fair game.

Cache invalidation is genuinely hard. The TTL approach trades correctness for simplicity; the explicit approach trades simplicity for correctness; everything else is some negotiation between the two.

08 · What Breaks When the Cache Fails

This is the operational maturity probe. Most candidates have a design that works when the cache works. Senior candidates have a design that survives when the cache doesn't.

Three failure scenarios to reason about.

Scenario 1: cache cluster goes down completely

Every read becomes a cache miss. Every miss falls through to the database. The database, sized for occasional miss traffic, suddenly receives full read traffic. If the database can't handle it (it almost certainly can't; that's why you have a cache), the database crashes too. Now nothing serves reads.

Mitigations:

  • Don't fall through to the database under cache failure. Return a degraded response (cached defaults, stale data from local memory, an error with retry-after). Better to fail fast than to take down the database.
  • Use circuit breakers. If the cache health check fails, stop sending requests to the database for some categories of reads. Serve stale or default values (a minimal breaker sketch follows this list).
  • Run multi-region cache. If one region's cache fails, traffic routes to another. Adds cost; reduces blast radius.
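A minimal sketch of the circuit-breaker idea, under the assumption that serving a stale or default value beats cascading into the database. Production systems would use a library (e.g., resilience4j on the JVM) rather than hand-rolling this:

```python
import json
import time
import redis

class CacheBreaker:
    """Trips open after consecutive cache failures; while open,
    callers get a degraded default and the database is left alone."""

    def __init__(self, threshold=5, reset_after=30.0):
        self.threshold, self.reset_after = threshold, reset_after
        self.failures, self.opened_at = 0, None

    def is_open(self):
        if self.opened_at is None:
            return False
        if time.time() - self.opened_at < self.reset_after:
            return True
        self.opened_at, self.failures = None, 0  # cool-down over: try again
        return False

    def record(self, ok):
        self.failures = 0 if ok else self.failures + 1
        if self.failures >= self.threshold:
            self.opened_at = time.time()

breaker = CacheBreaker()

def get_profile(user_id, r, default):
    if breaker.is_open():
        return default                      # degraded response, DB untouched
    try:
        cached = r.get(f"user:{user_id}")
        breaker.record(ok=True)
        # on a plain miss this sketch also serves the default; whether a
        # *miss* (vs. a cache *failure*) falls through to the DB is the
        # policy decision the text describes
        return json.loads(cached) if cached else default
    except redis.ConnectionError:
        breaker.record(ok=False)
        return default
```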

Scenario 2: cache cold-start

You've just deployed a new cache cluster, or restarted an existing one. Every key is missing. The database gets hammered until the cache fills up. This is the cache stampede problem in its most extreme form.

Mitigations:

  • Cache warming. Before sending real traffic to a new cache, run a script that pre-populates it with the most-accessed keys. Trades complexity for a smoother startup (a sketch follows this list).
  • Gradual rollout. Send only 10% of traffic to the new cache initially, ramping up as the cache fills. The database sees a graceful ramp instead of a step function.
  • Persistent cache. Use a cache that survives restarts (Redis with persistence, or a hybrid disk-backed cache). Not always feasible but a useful tool.
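A warm-up sketch. `db.hot_keys` stands in for whatever tells you which keys matter (an analytics query, yesterday's access log); it's an assumption, not a real API:

```python
import json

def warm_cache(r, db, ttl=300, top_n=10_000):
    """Pre-populate the most-accessed keys before the cache takes traffic."""
    for key in db.hot_keys(limit=top_n):
        value = db.fetch(key)
        r.setex(key, ttl, json.dumps(value))
```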

Scenario 3: cache inconsistency

The cache has stale data because invalidation failed, or because of a race condition between a write and a refresh. The application serves wrong data to users.

This is harder to detect than the previous two failures because the system is technically up. The wrong-data path can persist for hours before someone notices.

Mitigations:

  • Aggressive TTLs. Even if invalidation fails, the data self-heals after the TTL.
  • Versioned cache keys. Include a version or timestamp in the cache key. When the underlying data changes, the cache key changes, and old keys become naturally unreachable (a sketch follows this list).
  • Out-of-band reconciliation. Periodic background jobs that compare cache values to source-of-truth values and re-cache mismatches.
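A versioned-key sketch. `get_row_version` is an assumed helper; the catch is that looking up the version can itself cost a query, so in practice the version usually rides along in something the caller already holds (a session, an event, a parent record):

```python
def profile_cache_key(user_id, db):
    # The version bumps on every write to the row, so a write changes
    # the key itself; stale entries are never read again and simply
    # age out under LRU/TTL instead of needing explicit deletion.
    version = db.get_row_version("users", user_id)  # assumed helper
    return f"user:{user_id}:v{version}"
```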

The 3am Probe

The senior-level operational probe is "what does the on-call engineer see at 3am when this cache breaks, and what's the runbook?" If you can answer that for each of the three scenarios above, you've cleared the operational maturity bar that Section 9 of the main guide covers in detail.

09 · Cost Reasoning

2026 senior loops grade cost reasoning explicitly. For caching, the cost story has two halves: the cost of the cache itself, and the cost the cache saves you on the database.

Cache instances are typically priced on RAM. A managed Redis cluster with 100GB of memory across primaries and replicas runs roughly $1-3K/month at AWS list price as of 2026. Self-managed is cheaper but adds operational burden.

The savings come from reduced database load. If your cache hit rate is 95%, you've reduced database read load by 95%. If your database was on the verge of needing to be horizontally scaled (sharded, replicated, or both), the cache can defer that scaling effort. The deferred-cost benefit is often larger than the cache cost itself.
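A back-of-envelope version with illustrative (not sourced) numbers: at 100K read QPS and a 95% hit rate, the database sees 5K QPS instead of 100K. If a single well-provisioned primary tops out around 10K QPS, the cache is the difference between one database and a sharded fleet, so a roughly $2K/month cache cluster defers a database build-out costing several times that in instances alone, before counting the engineering time to shard.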

The interview move on cost

State the cost direction in one sentence. "Adding the cache costs roughly X but saves us from sharding the database, which would cost roughly Y. The cache is the cheaper path, and it's faster to ship." That's the level of cost reasoning that clears the senior bar. Exact numbers are not required; defensible orders of magnitude are.

If the interviewer pushes deeper on cost, the levers to discuss:

  • Cache size vs. hit rate. Bigger cache, higher hit rate, more cost. There's a knee in the curve where doubling the cache size only marginally improves hit rate.
  • TTL length. Longer TTL means lower miss rate but more stale data. The tradeoff is operational, not just cost.
  • Cache placement. CDN caches reduce origin egress costs (often a meaningful line item). Distributed caches reduce database costs. Local caches reduce distributed cache traffic.

10 · How Caching Interacts With Other Concepts

Caching doesn't live alone. The interesting depth probes often cross concept boundaries.

  • Caching × Database selection. A read-replica-heavy SQL database needs different caching than a Cassandra cluster with eventual consistency. The cache strategy follows the database choice.
  • Caching × Replication. If your database has read replicas with replication lag, your cache might be fresher than the replica it would fall through to. Inconsistencies show up at the boundary.
  • Caching × Load balancing. Consistent hashing in the load balancer can route requests to app servers with warm local caches for that key, improving hit rates dramatically. Round-robin defeats local caching entirely.
  • Caching × Vector databases. Vector search results are expensive to compute. Caching embeddings and search results is increasingly common in 2026 RAG and recommendation pipelines.

For more on these cross-concept interactions, see the concepts library.

11 · Practice Scenarios

Three short scenarios. Read the setup. Decide what you'd do before opening the reveal. The reveal is not the only correct answer; it's a defensible answer that demonstrates the kind of reasoning interviewers grade.

Scenario 01

A user-profile API serves 100K requests per second. The database is feeling the load. Where do you cache?

Profile data changes occasionally (when a user edits their bio) but is read constantly. Cache hit rate would clearly be high. The question is the layer.

How to think about this

Distributed cache (Redis or Memcached) is the right answer. The data is per-user, so CDN won't work (per-user data isn't shareable across users). In-process won't work either, because each app server would need a copy of every user's profile, which doesn't fit. The buffer pool of the database is helpful but not enough at 100K QPS.

Cache-aside pattern with a TTL of a few minutes balances staleness against load. Explicit invalidation on profile updates handles the freshness when users edit their data. The cache hit rate should approach 99% given the access pattern.

Scenario 02

Your cache cluster restarts. The database immediately gets crushed. What went wrong, and how do you fix it?

After the restart, every read missed the cache and fell through to the database. The database wasn't sized for full read load. Within seconds, query times spiked, then the database started timing out, then the application started failing.

How to think about this

This is a cold-start cache stampede. Three layered fixes:

Short-term: Don't fall through to the database under cache failure. Use a circuit breaker that returns a degraded response (cached defaults, stale data, or 503 with retry) when the cache is unhealthy. Better to fail some requests gracefully than to crash the whole stack.

Medium-term: Cache warming. Before sending real traffic to the new cache, run a warm-up script that pre-populates the most-accessed keys. Combined with a gradual traffic ramp, the database sees load grow slowly instead of in a step.

Long-term: Consider Redis with persistence so the cache survives restarts, or run multiple cache clusters with traffic-routing failover so one cluster's restart doesn't drain the cache entirely.

Scenario 03

Your team adds a "user follow count" feature. The count must be accurate. Do you cache it? If so, how?

Follow counts are read frequently (shown on every profile page) and updated on every follow action. Some users have millions of followers and the count changes hundreds of times per minute.

How to think about this

Yes, you cache it, but with care. The naive approach (cache the count, invalidate on every follow change) doesn't work at this scale: the cache invalidates faster than it gets used, defeating the purpose.

Two reasonable strategies:

1. Approximate counts. If exact accuracy isn't required (and for "follow count displayed on a profile" it usually isn't), cache the count with a short TTL (say 60 seconds) and accept that the count is up to a minute stale. Most users won't notice. Some social products explicitly round counts ("1.2M followers") which builds in tolerance.

2. Counter-specific data structures. Use a Redis counter (atomic increment/decrement) as the source of truth, with the database as the eventual durable store via write-behind. The counter is fast to read and write; the database catches up asynchronously. This is more complex but accurate and high-throughput.
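A sketch of the counter approach with redis-py. INCR/DECR are atomic on the Redis server, so concurrent follows don't race; the write-behind flush to the database is assumed to run separately:

```python
import redis

r = redis.Redis()

def on_follow(followee_id):
    r.incr(f"followers:{followee_id}")    # atomic server-side increment

def on_unfollow(followee_id):
    r.decr(f"followers:{followee_id}")

def follower_count(followee_id):
    count = r.get(f"followers:{followee_id}")
    return int(count) if count is not None else 0

# A background job periodically persists hot counters to the database
# (the write-behind half), so the DB converges without per-follow writes.
```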

The wrong answer is to invalidate on every write while keeping a regular cache key. That gives you the worst of both worlds: cache complexity with database load.

12 · Caching FAQ

Should I always reach for Redis?

Redis is the right default for distributed caching, but not always the right choice. Memcached is simpler and faster for pure key-value caching with no advanced data structures. Application-local caches are dramatically faster than Redis when the data fits and per-server consistency is acceptable. CDN caches handle a class of static traffic Redis shouldn't see. The right answer is "Redis is my default, but I'd use X here because Y."

What's a good cache hit rate to target?

Above 80% to be useful. Above 95% to be transformative. Below 50% suggests the data isn't a good fit for caching, or your TTLs are too short, or your cache is too small. Always state your expected hit rate when proposing a cache; "80%+" is a reasonable handwave, and more specific numbers are better when you can back them up.

How do I choose a TTL?

Start from staleness tolerance. How long can the data be wrong before it causes a real problem? That's your upper bound. Then trade staleness against database load: shorter TTL means fresher data and more misses. A common starting point for read-heavy product data is 5 to 60 minutes. For session data, hours to days. For static reference data, hours or longer. Add jitter to all TTLs to avoid stampedes.

When should I cache at multiple layers?

When the layers serve different purposes. CDN caches handle geographic latency for static and semi-static content. Application-local caches handle hot data with zero network cost. Distributed caches handle shared per-user state. Each layer catches what the previous layer missed. In production systems serving real traffic, multi-layer caching is the norm, not the exception.

Is cache invalidation really that hard?

Yes, especially when a single write affects many cache keys, or when caches are layered, or when you need consistency across regions. Most production cache bugs come from invalidation. The honest senior answer is "I'd start with TTL-based invalidation because it's simple and self-healing, and add explicit invalidation only where the staleness window matters." That's a defensible position. "I'll just invalidate everything that's relevant" is the answer that gets you in trouble.

What about cache eviction storms?

This is what happens when an eviction policy cascades into a system-wide problem. Example: you set a memory limit on Redis. Memory pressure starts evicting LRU keys. The evicted keys turn into cache misses. The misses generate database load. The database load slows everything down, including the refills that would repopulate the cache. Now keys are being evicted faster than they're written back. The fix is usually capacity-based: size the cache so that under normal load, eviction is rare. Don't run caches at 100% capacity.

How do I cache for AI workloads?

The 2026 patterns: cache LLM responses (semantic caching, often using embeddings to detect similar prompts that should share a response), cache embeddings themselves (computing them is expensive), cache vector-search results, and use response caching aggressively for any API that calls a model. Token costs and latency are the dominant constraints in AI workloads, and caching addresses both. The AI infrastructure deep-dive covers this in more detail.

Continue

Database Selection →

The next concept on the recommended learning path. Choosing the right store for the data shape, not the brand name. Postgres, DynamoDB, Cassandra, Spanner, and how to defend the choice in three sentences.
