Caching strategies and distributed caching solutions for high-traffic web services
Distributed caching is a technique where frequently accessed data is stored in a shared, in-memory layer spread across multiple nodes, sitting between your application servers and your database. It reduces read latency from tens of milliseconds to sub-millisecond, often cuts database load by 80–99%, and is one of the most impactful scaling levers for high-traffic web services.
If you're preparing for a system design interview or building production systems, caching is foundational.
Key Takeaways
- Cache-aside (lazy loading) is the most common distributed caching strategy and the default answer in interviews unless the problem demands something else.
- Cache invalidation — not caching itself — is the hard problem. Choose between TTL-based expiration, event-driven invalidation, and delete-on-write based on your consistency requirements.
- Redis is the default choice for most teams in 2026. Memcached still wins for simple, high-throughput string caching at massive scale.
- Consistent hashing distributes keys across cache nodes and minimizes reshuffling when nodes join or leave the cluster.
- A multi-tier cache (local L1 + distributed L2) gives you microsecond-level hot-path reads with cluster-wide consistency for the broader working set.
- In interviews, always discuss eviction policies, cache stampede mitigation, and failure modes — these separate strong candidates from average ones.
Why Distributed Caching Matters for High-Traffic Systems
A single database read to PostgreSQL or MySQL takes 1–10ms. A Redis or Memcached read takes 0.1–0.5ms. At 100,000 requests per second, that difference is the gap between a responsive application and a crashed database.
Every major tech company relies on distributed caching at scale. Meta's TAO cache serves billions of social graph reads per second. Netflix caches personalization data in EVCache (built on Memcached) across multiple AWS regions to serve 260+ million subscribers. Twitter caches timelines in Redis to serve 500 million tweets per day.
The value is simple: memory is faster than disk, and a shared cache avoids the inconsistency of local per-server caches. When your application scales beyond a single server, a distributed cache becomes the coordination point for fast, consistent reads.
If you're building this foundation from scratch, the Grokking System Design Fundamentals course walks through caching alongside other core building blocks like load balancing and database replication.
The Four Core Caching Strategies
Every caching pattern answers one question: who keeps the cache and database in sync?
1. Cache-Aside (Lazy Loading)
The application owns all cache logic. On a read, the app checks the cache first. On a miss, it queries the database, writes the result to the cache, and returns. On a write, the app updates the database and either invalidates or ignores the cache entry.
```python
def get_user(user_id):
    # Step 1: Check cache
    cached = redis.get(f"user:{user_id}")
    if cached:
        return deserialize(cached)
    # Step 2: Cache miss — query database
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)
    # Step 3: Populate cache with TTL
    redis.setex(f"user:{user_id}", 3600, serialize(user))
    return user
```
When to use it: Read-heavy workloads where stale data is tolerable for short windows. This is the default pattern for most web applications.
Trade-off: The first request after a miss or expiration is slow (cold start penalty). Data can be stale between the write to the database and the next cache refresh.
2. Write-Through Cache
Every write goes to the cache and the database synchronously before the write is acknowledged. The cache always contains the latest data.
When to use it: Systems where read-after-write consistency is critical, like shopping carts or account balances.
Trade-off: Higher write latency because every write hits two systems. Cache fills with data that may never be read.
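The write-through flow can be sketched in a few lines. This is a minimal illustration, not a production client: plain dicts stand in for the real cache and database, and the function names are invented for the example.

```python
# Write-through sketch: dicts stand in for a real cache and database.
cache = {}
db = {}

def write_through(key, value):
    """Write the database and the cache synchronously before acknowledging."""
    db[key] = value          # durable store first
    cache[key] = value       # cache updated in the same operation
    return True              # acknowledged only after both writes succeed

def read(key):
    """Reads are always served from the cache in a write-through design."""
    return cache.get(key)
```

Because both stores are updated before the caller sees an acknowledgment, a read immediately after a write always returns the new value.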
3. Write-Behind (Write-Back) Cache
Writes go to the cache immediately and are asynchronously flushed to the database in batches. The application gets fast write acknowledgment.
When to use it: Write-heavy workloads like analytics event ingestion or activity logging where brief data loss on cache failure is acceptable.
Trade-off: Risk of data loss if the cache node crashes before flushing. Adds complexity with background flush workers and retry logic.
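The write-behind pattern can be sketched with a dirty-key queue and a flush worker. This is an illustrative toy (dicts and a deque stand in for the cache, database, and background flusher), not a real implementation:

```python
from collections import deque

cache = {}
db = {}
dirty = deque()              # keys written to cache but not yet flushed to DB

def write_behind(key, value):
    """Acknowledge after the cache write; the DB write happens later."""
    cache[key] = value
    dirty.append(key)

def flush(batch_size=100):
    """Background worker: drain dirty keys to the database in batches."""
    for _ in range(min(batch_size, len(dirty))):
        key = dirty.popleft()
        db[key] = cache[key]
```

The gap between `write_behind` returning and `flush` running is exactly the data-loss window the trade-off above describes: anything still in `dirty` when the cache node dies never reaches the database.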
4. Read-Through Cache
The cache itself is responsible for loading data on a miss — the application never talks to the database directly for reads. The cache acts as the primary read interface.
When to use it: When you want to decouple the application from data-fetching logic, common in CDN architectures and some ORM frameworks.
Trade-off: Tightly couples your caching layer to your data model. Harder to debug because the fetch logic lives inside the cache abstraction.
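One way to see the coupling is that a read-through cache is constructed with the fetch logic baked in. A minimal sketch, assuming an in-process dict as the store and a caller-supplied loader function:

```python
class ReadThroughCache:
    """The cache owns the fetch logic: callers never query the DB directly."""

    def __init__(self, loader):
        self._store = {}
        self._loader = loader    # e.g. a function that queries the database

    def get(self, key):
        if key not in self._store:
            self._store[key] = self._loader(key)   # transparent load on miss
        return self._store[key]
```

The application only ever calls `get`; the database access is hidden inside the cache abstraction, which is precisely why it is harder to debug.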
Caching Strategies Comparison Table
| Strategy | Write Path | Read Path | Consistency | Write Latency | Best For |
|---|---|---|---|---|---|
| Cache-Aside | App writes to DB, invalidates cache | App checks cache → DB on miss | Eventual (TTL-bound) | Low (DB only) | General-purpose read-heavy apps |
| Write-Through | App writes to cache + DB synchronously | Always served from cache | Strong | High (two writes) | Read-after-write critical paths |
| Write-Behind | App writes to cache; async flush to DB | Always served from cache | Weak (async lag) | Very low (cache only) | High-throughput write-heavy systems |
| Read-Through | App writes to DB; cache loads on next read | Cache auto-fetches on miss | Eventual | Low (DB only) | CDN, transparent caching layers |
Cache Invalidation: The Hard Problem
Phil Karlton famously said there are only two hard things in computer science: cache invalidation and naming things. He was right about the first one.
Invalidation means ensuring the cache reflects reality. Get it wrong and users see stale prices, phantom inventory, or outdated permissions. There are three primary approaches.
TTL-Based Expiration
Every cache entry gets a time-to-live. After expiration, the next read fetches fresh data. This works well when you can tolerate bounded staleness. Common pattern: 60-second TTL for user profiles, 5-minute TTL for product listings, 24-hour TTL for static configuration.
Event-Driven Invalidation
When data changes, a change event (via a message queue, CDC stream, or database trigger) notifies the cache to delete or refresh the affected keys. Netflix uses CDC from their databases and an internal event bus to propagate invalidation events to EVCache clusters, keeping data fresh within seconds.
Delete-on-Write (Idempotent Invalidation)
On any write, delete the cache key rather than trying to update it. The next read triggers a cache-aside refill. Meta recommends this for large-scale caching because it eliminates race conditions: two concurrent writes can't leave the cache inconsistent if both simply delete the key.
Interview tip: Lead with delete-on-write plus TTL as a safety net. It's the simplest correct answer for most scenarios.
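The delete-on-write pattern pairs naturally with a cache-aside read path. A minimal sketch with dicts standing in for the cache and database (the helper names are illustrative):

```python
cache = {}
db = {}

def update_user(user_id, fields):
    """Write the database, then delete (not update) the cache key."""
    db[user_id] = {**db.get(user_id, {}), **fields}
    cache.pop(f"user:{user_id}", None)   # idempotent: safe under concurrency

def get_user(user_id):
    """Cache-aside read refills the key on the next access."""
    key = f"user:{user_id}"
    if key not in cache:
        cache[key] = db.get(user_id)
    return cache[key]
```

Deleting instead of updating is what makes the invalidation idempotent: two concurrent writers can both delete the key in either order and the next read still refills it from the source of truth.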
Cache Eviction Policies
When cache memory is full, the eviction policy decides which entries to discard.
| Policy | How It Works | Best For | Used By |
|---|---|---|---|
| LRU (Least Recently Used) | Evicts the entry not accessed for the longest time | General-purpose workloads | Redis (approximated LRU), Memcached |
| LFU (Least Frequently Used) | Evicts the entry with the fewest accesses | Skewed popularity distributions | Redis 4.0+ |
| TTL | Evicts entries after a fixed time-to-live | Time-sensitive data (sessions, tokens) | All major caches |
| Random | Evicts a random entry | Uniform access patterns, simplicity | Redis (optional policy) |
| ARC (Adaptive Replacement Cache) | Dynamically balances recency and frequency | Workloads with shifting access patterns | ZFS, IBM DS8000 |
Redis uses an approximated LRU by default. It samples a configurable number of keys (default 5) and evicts the least recently used among the sample. Starting with Redis 4.0, you can switch to LFU, which is better for workloads where a few keys dominate access (a Zipfian distribution, which describes most web traffic).
Distributed Caching Architecture: Consistent Hashing
A single cache node is a single point of failure and a capacity ceiling. Distributed caches partition data across multiple nodes. The standard technique is consistent hashing.
In consistent hashing, both cache nodes and keys are mapped onto a hash ring. A key is stored on the first node encountered clockwise from the key's position on the ring. When a node is added or removed, only ~1/N of keys need to be remapped (where N is the number of nodes), compared to a naive modular hash where nearly all keys shift.
Virtual nodes solve the problem of uneven distribution. Instead of placing each physical server at one point on the ring, you place 100–200 virtual nodes per server. Amazon's 2007 Dynamo paper popularized this approach, and Cassandra adopted it for its partitioning layer. Redis Cluster uses a fixed set of 16,384 hash slots distributed across nodes — a similar concept with a different implementation.
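A hash ring with virtual nodes fits in a few lines of Python. This is a teaching sketch (MD5 and 100 virtual nodes are arbitrary illustrative choices), not how Redis Cluster or Dynamo implement it:

```python
import hashlib
from bisect import bisect

class HashRing:
    """Consistent hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self.ring = {}               # ring position -> physical node
        for node in nodes:
            for i in range(vnodes):
                self.ring[self._hash(f"{node}#{i}")] = node
        self.sorted_keys = sorted(self.ring)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        """First node encountered clockwise from the key's ring position."""
        idx = bisect(self.sorted_keys, self._hash(key)) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]
```

Removing a node from the constructor's list only remaps the keys that lived on that node's virtual points; every other key still lands on the same ring position, which is the ~1/N remapping property described above.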
Redis vs Memcached: Choosing Your Cache Engine
This is one of the most common system design interview questions, and a real architectural decision teams face. Here's how the two compare in 2026.
| Feature | Redis / Valkey | Memcached |
|---|---|---|
| Data structures | Strings, lists, sets, sorted sets, hashes, streams, bitmaps | Strings only |
| Persistence | RDB snapshots + AOF (append-only file) | None — purely volatile |
| Replication | Built-in primary-replica replication | Application-level only |
| Clustering | Redis Cluster with automatic sharding (16,384 slots) | Client-side consistent hashing |
| Threading | Single-threaded command execution (threaded I/O since 6.0) | Multi-threaded, leverages multiple cores |
| Pub/Sub | Built-in | Not supported |
| Licensing (2026) | AGPLv3 (Redis 8.0+); Valkey is BSD-licensed fork | BSD — fully open source |
| Best for | Complex data models, pub/sub, persistence needs | Simple key-value caching at extreme throughput |
The Valkey factor: In 2024, Redis (the company, formerly Redis Labs) moved the project off its BSD license, eventually landing on AGPLv3 in 2025. The Linux Foundation forked Redis 7.2.4 as Valkey, a production-ready, BSD-licensed, drop-in replacement that AWS ElastiCache now offers as a first-class engine.
When to pick Memcached: Pure key-value string caching (session tokens, HTML fragments, API response blobs) at extreme throughput. Netflix's EVCache is built on Memcached for this reason.
When to pick Redis/Valkey: Any use case requiring sorted sets, pub/sub, persistence, or data structures beyond strings. For most teams and most interview answers, Redis is the default.
Multi-Tier Caching: L1 + L2 Architecture
Production systems at scale rarely use a single cache tier. The standard architecture uses two layers:
L1: Local in-memory cache (per application server). Libraries like Caffeine (Java) or simple LRU dictionaries. Response time: microseconds. Capacity: a few hundred MB to a few GB per server.
L2: Distributed cache (Redis/Memcached cluster). Shared across all application servers. Response time: sub-millisecond over the network. Capacity: tens to hundreds of GB across the cluster.
The read path: check L1 first → on L1 miss, check L2 → on L2 miss, query the database and populate both L1 and L2. Writes invalidate both layers. The L1 absorbs repeated reads for hot data (e.g., trending content), while the L2 provides cluster-wide consistency for the broader working set.
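The read path above can be sketched end to end. In this toy version, dicts stand in for the local L1 (e.g. Caffeine), the distributed L2 (e.g. Redis), and the database:

```python
l1 = {}                          # per-process local cache (stand-in)
l2 = {}                          # shared distributed cache (stand-in)
db = {"product:1": "widget"}     # source of truth

def get(key):
    """Read path: L1 -> L2 -> database, populating both tiers on the way back."""
    if key in l1:
        return l1[key]
    if key in l2:
        l1[key] = l2[key]        # promote into the local tier
        return l1[key]
    value = db.get(key)          # miss on both tiers
    l2[key] = value
    l1[key] = value
    return value

def invalidate(key):
    """Writes must clear both layers, or stale L1 entries outlive the L2 fix."""
    l1.pop(key, None)
    l2.pop(key, None)
```

Note that in a real deployment `invalidate` must reach every server's L1 (typically via a pub/sub broadcast), since each application process holds its own local copy.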
The Cache Stampede Problem (Thundering Herd)
When a popular cache key expires, hundreds of concurrent requests simultaneously find a cache miss and all query the database at once. This cache stampede can overload your database and cascade into a full outage. Three proven mitigations:
1. Locking (Mutex): The first request acquires a distributed lock and fetches from the database. Others wait or receive a stale value. Redis's SET key value NX EX 5 is commonly used as the lock.
2. Probabilistic Early Expiration (XFetch): Each request has a small random probability of proactively refreshing the cache before TTL actually expires, increasing as TTL approaches zero.
3. Request Coalescing: Multiple concurrent misses for the same key collapse into a single database query. Go's singleflight package is a common implementation.
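The locking mitigation can be sketched as follows. A dict with expiry timestamps stands in for the Redis `SET ... NX EX` lock, and the function names are illustrative:

```python
import time

cache = {}
locks = {}                       # lock key -> expiry timestamp

def acquire_lock(key, ttl=5):
    """Stand-in for Redis SET key token NX EX ttl: first caller wins."""
    now = time.monotonic()
    expiry = locks.get(key)
    if expiry is None or expiry < now:
        locks[key] = now + ttl
        return True
    return False

def get_with_lock(key, fetch):
    """Mutex mitigation: only one caller recomputes an expired key."""
    if key in cache:
        return cache[key]
    if acquire_lock(f"lock:{key}"):
        cache[key] = fetch()             # this caller pays the database cost
        locks.pop(f"lock:{key}", None)   # release the lock after the refill
        return cache[key]
    return None                          # losers back off or serve stale data
```

The lock TTL matters: it bounds how long the key stays unprotected if the winning caller crashes mid-fetch, after which the lock self-expires and another caller can take over.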
Mentioning cache stampede mitigation unprompted in interviews is a strong signal that you think about failure modes, not just the happy path.
Caching in System Design Interviews
Caching questions appear in nearly every system design interview. For a structured walkthrough with dozens of practice problems, the Grokking the System Design Interview course covers caching in the context of real designs like URL shorteners, Twitter feeds, and chat applications.
Sample Interview Dialog
Interviewer: "You're designing a news feed for 500 million users. How would you handle caching?"
Strong answer: "I'd use a multi-tier cache. L1 in-process cache on each feed server stores feeds for the most active users (top 1% by request frequency). L2 is a Redis Cluster holding precomputed feeds for all active users. For invalidation, when a user publishes a post, I'd fan-out-on-write to update follower caches asynchronously, with a 5-minute TTL safety net. For celebrity users with millions of followers, I'd switch to fan-out-on-read. LRU eviction works well since engagement follows a power law."
Interviewer follow-up: "What happens if the Redis cluster goes down?"
Strong answer: "The L1 caches absorb a portion of read traffic. For the rest, I'd serve stale data from a backup tier or CDN-cached version. The database is protected by a circuit breaker limiting concurrent reads. Recovery means warming the cache from the database in priority order — most active users first. I'd alert if cache hit ratio drops below 95%."
Common Follow-Up Questions Interviewers Ask
- "How do you decide what to cache?" — Cache data that is read frequently, expensive to compute, and tolerant of short staleness windows.
- "How do you size the cache?" — Use the 80/20 rule: 20% of your data typically serves 80% of reads. Estimate the working set size and provision 1.5–2x that for headroom.
- "What's your cache hit ratio target?" — 95%+ for read-heavy services. Below 90% usually means your eviction policy or key design needs work.
For advanced distributed systems topics like multi-region caching, geo-replicated consistency, and cache coherence protocols, the Grokking the Advanced System Design Interview goes deeper into the patterns used by systems like DynamoDB, Cassandra, and Google Spanner.
Frequently Asked Questions About Distributed Caching
What is a distributed cache and how does it work?
A distributed cache is a shared, in-memory data store spread across multiple nodes on a network. Data is partitioned across nodes using consistent hashing, and replication provides fault tolerance. Redis Cluster and Memcached are the two most widely used implementations.
What is the difference between cache-aside and read-through caching?
In cache-aside, the application manages checking the cache and querying the database on a miss. In read-through, the cache itself handles the database fetch transparently. Cache-aside gives more control; read-through gives cleaner application code.
When should I use Redis vs Memcached in 2026?
Use Redis (or its BSD-licensed fork, Valkey) when you need data structures beyond strings, persistence, pub/sub, or built-in clustering. Use Memcached for pure key-value string caching at maximum multi-threaded throughput. Most teams default to Redis for its versatility.
How do you handle cache invalidation in a microservices architecture?
Use delete-on-write: when any service writes to the database, it deletes the corresponding cache key. Combine this with a TTL safety net so that even if an invalidation message is lost, the cache self-heals. For cross-service invalidation, publish events to a message broker (Kafka, RabbitMQ) that consuming services subscribe to.
What is a cache stampede and how do you prevent it?
A cache stampede occurs when a popular key expires and hundreds of concurrent requests simultaneously hit the database. Prevent it with distributed locking, probabilistic early expiration (refresh before TTL hits), or request coalescing (collapse duplicate misses into one query).
How do you decide what data to cache?
Cache data that is read far more often than written, expensive to fetch, and tolerant of short staleness. Common examples: user sessions, product catalog entries, API responses, and computed aggregations.
What is consistent hashing and why is it used in distributed caches?
Consistent hashing maps keys and nodes onto a circular hash space. Each key is assigned to the nearest node clockwise on the ring. When nodes are added or removed, only ~1/N of keys are remapped, minimizing disruption during scaling events.
How large should a distributed cache be?
Size your cache to hold the working set. The 80/20 rule is a good heuristic: 20% of your data typically serves 80% of requests. If your cache hit ratio is consistently below 95% and eviction rates are high, add more capacity.
Can a cache replace a database?
No. A cache is an acceleration layer, not a persistence layer. Even Redis with persistence is not a substitute for a database. Caches optimize for read speed; databases optimize for durability, transactions, and complex queries.
TL;DR
Distributed caching stores frequently accessed data in shared, in-memory nodes to reduce database load and serve reads in sub-millisecond time. The four core strategies are cache-aside, write-through, write-behind, and read-through — cache-aside is the default. Invalidation is the hardest part: use delete-on-write plus TTL safety nets. Redis/Valkey is the standard choice in 2026; Memcached wins for pure key-value throughput.
In interviews, discuss eviction policies, consistent hashing, cache stampede prevention, and multi-tier architecture. Target a 95%+ hit ratio.