How do you mitigate hot keys in a sharded cache?

A hot key is a cache entry that receives outsized traffic compared to the average key. In a sharded cache, consistent hashing places that key on a single shard, so all requests pile onto that one node and create an imbalance. The result is uneven utilization, long queues on one node, and cascading failures that ripple into downstream services and databases. This guide explains practical ways to mitigate hot keys that work in both interviews and production.

Why It Matters

Hot keys show up in real systems whenever you have celebrities, trending items, popular home pages, or global feature flags. The problem appears at any scale once traffic is skewed. Interviewers love this topic because it tests your ability to reason about uneven load, distributed systems, and scalable architecture beyond the happy path. A strong answer proves you can diagnose heavy hitters, select the right mitigation, and discuss the cost of each choice.

How It Works Step by Step

  1. Measure the skew

    Start with visibility. Track per-key request counts and per-shard queue depth. To keep cost low, use approximate heavy-hitter detection such as a count-min sketch or top-K with sampling. Add alarms for any single key that exceeds a set percentage of total QPS.
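
Below is a minimal sketch of approximate heavy-hitter detection. The count-min sketch, its width and depth, and the one-percent alert threshold are illustrative assumptions to tune for your traffic, not a prescribed configuration.

```python
import random

class CountMinSketch:
    """Approximate per-key counters in fixed memory (estimates may overcount)."""
    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.tables = [[0] * width for _ in range(depth)]
        self.seeds = [random.randrange(1 << 31) for _ in range(depth)]  # one hash seed per row

    def add(self, key, count=1):
        """Record hits for key and return its current (over)estimate."""
        estimate = float("inf")
        for row, seed in enumerate(self.seeds):
            idx = hash((seed, key)) % self.width
            self.tables[row][idx] += count
            estimate = min(estimate, self.tables[row][idx])
        return estimate

sketch = CountMinSketch()
total_requests = 0
suspected_hot = {}  # key -> latest estimate; feed this into alarms

def record(key, alert_fraction=0.01):
    """Flag any key estimated above 1% of total QPS (threshold is an assumption)."""
    global total_requests
    total_requests += 1
    estimate = sketch.add(key)
    if estimate / total_requests > alert_fraction:
        suspected_hot[key] = estimate
```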

  2. Add a small near cache on each app node

    Local per-process (L1) caching absorbs repeated reads for hot items without extra cross-network hops. Keep TTLs short and add random jitter so keys do not expire together. This alone flattens many spikes.
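
A sketch of such a near cache, assuming a `loader` callback that falls through to the shared cache; the TTL and jitter values are placeholders to tune.

```python
import random
import time

class NearCache:
    """Tiny per-process cache with short, jittered TTLs."""
    def __init__(self, base_ttl=2.0, jitter=0.5):
        self.store = {}  # key -> (value, expires_at)
        self.base_ttl, self.jitter = base_ttl, jitter

    def get(self, key, loader):
        entry = self.store.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]                      # fresh local hit, no network hop
        value = loader(key)                      # fall through to the shared cache
        ttl = self.base_ttl + random.uniform(0, self.jitter)  # jitter avoids sync expiry
        self.store[key] = (value, time.monotonic() + ttl)
        return value

near = NearCache()
# profile = near.get("user:123:profile", load_from_shared_cache)
```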

  3. Collapse duplicate requests

    Request coalescing means only one thread fetches a missing key while the others wait for that result. Use a single-flight style lock keyed by the cache key. This removes the stampede on a miss or at expiry.
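
A minimal single-flight sketch using only the standard library; the timeout value is an assumption, and a production version would also add circuit breaking.

```python
import threading
from concurrent.futures import Future

class SingleFlight:
    """First caller for a key fetches; concurrent callers wait on the same Future."""
    def __init__(self):
        self.lock = threading.Lock()
        self.inflight = {}  # key -> Future

    def do(self, key, fetch, timeout=5.0):
        with self.lock:
            fut = self.inflight.get(key)
            leader = fut is None
            if leader:
                fut = Future()
                self.inflight[key] = fut
        if leader:
            try:
                fut.set_result(fetch(key))       # one fetch serves everyone
            except Exception as exc:
                fut.set_exception(exc)           # waiters see the same failure
            finally:
                with self.lock:
                    self.inflight.pop(key, None)
        return fut.result(timeout=timeout)       # waiters time out rather than hang
```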

  4. Stripe the hot key across replicas

    Write the same value under N salted keys, e.g. user:123:profile:0 through user:123:profile:N-1. On read, pick one replica at random or round-robin. On write, update all replicas in parallel or via a background fanout. This spreads load across shards at the cost of write amplification and more memory.
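
A sketch of read/write striping, assuming a hypothetical `cache` client with get/set methods; the replica count of 32 matches the example later in this guide.

```python
import random

N_REPLICAS = 32  # only for a small top-K set; 32x write amplification

def replica_keys(key):
    return [f"{key}:{i}" for i in range(N_REPLICAS)]

def read_striped(cache, key):
    # e.g. "user:123:profile" -> "user:123:profile:17", chosen at random
    return cache.get(f"{key}:{random.randrange(N_REPLICAS)}")

def write_striped(cache, key, value, ttl=30):
    # Update all replicas; in production, run these in parallel or via a
    # background fanout to bound write latency.
    for rk in replica_keys(key):
        cache.set(rk, value, ex=ttl)
```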

  5. Refresh ahead with serve stale

    Do not let a hot key expire in the middle of peak traffic. Refresh it before the TTL elapses using a background task. If a miss happens anyway, serve stale for a short window while refreshing in the background. Add TTL jitter to avoid synchronized expiry.
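
A sketch of refresh-ahead with serve-stale semantics: each entry carries a soft (refresh) deadline and a hard TTL, and past the soft deadline we return the stale value while refreshing behind the scenes. The TTLs and jitter are assumed values.

```python
import random
import threading
import time

class RefreshAheadCache:
    def __init__(self, soft_ttl=10.0, hard_ttl=60.0, jitter=3.0):
        self.store = {}        # key -> (value, soft_deadline, hard_deadline)
        self.soft_ttl, self.hard_ttl, self.jitter = soft_ttl, hard_ttl, jitter
        self.refreshing = set()

    def _put(self, key, value):
        now = time.monotonic()
        soft = now + self.soft_ttl + random.uniform(0, self.jitter)  # jittered refresh
        self.store[key] = (value, soft, now + self.hard_ttl)

    def get(self, key, loader):
        now = time.monotonic()
        entry = self.store.get(key)
        if entry is None or entry[2] <= now:      # true miss or hard-expired
            self._put(key, loader(key))
            return self.store[key][0]
        value, soft, _ = entry
        if soft <= now and key not in self.refreshing:
            self.refreshing.add(key)              # serve stale, refresh in background
            def refresh():
                try:
                    self._put(key, loader(key))
                finally:
                    self.refreshing.discard(key)
            threading.Thread(target=refresh, daemon=True).start()
        return value
```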

  6. Use read replicas or multi cluster reads

    Some caches support replica reads. For extreme hot spots, replicate the shard and allow reads from followers. You trade strict freshness for higher read capacity.
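
For example, redis-py's cluster client can opt into replica reads; the host below is a placeholder, and you should confirm the exact option name and behavior for your client version.

```python
from redis.cluster import RedisCluster

# read_from_replicas routes read commands to follower nodes as well,
# trading strict freshness for extra read capacity.
rc = RedisCluster(host="cache.example.internal", port=6379,
                  read_from_replicas=True)
profile = rc.get("user:123:profile")  # may return a slightly stale value
```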

  7. Isolate the key to a dedicated tier

    Move a few top hot keys to a separate small pool with more resources. This is heavyweight but effective when a handful of keys dominate traffic.
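
The routing itself can be a small lookup against a config-driven hot set; the key names here are illustrative.

```python
# Keep this set tiny, sourced from config, and reviewed regularly.
HOT_KEYS = {"user:123:profile", "flags:global"}

def pool_for(key, main_pool, hot_pool):
    # The dedicated tier gets more CPU/memory and its own alarms and on-call docs.
    return hot_pool if key in HOT_KEYS else main_pool
```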

  8. Edge cache where possible

    For content served over HTTP, place a CDN or regional cache in front. Push invalidations on updates. This removes load before it reaches your cache shards.
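
The origin can signal this with standard HTTP caching headers; the values below are illustrative, not a recommendation.

```python
# s-maxage controls shared caches (CDN); stale-while-revalidate smooths expiry.
# Pair these with explicit purges on update so the CDN never serves old data long.
headers = {
    "Cache-Control": "public, s-maxage=30, stale-while-revalidate=15",
    "ETag": '"profile-v42"',  # lets the CDN revalidate cheaply instead of refetching
}
```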

  9. Change the data shape

    If one key aggregates everything, split the value into smaller parts or paginate. Many small keys fetched in parallel are easier to scale than one giant hot key.
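
A sketch of splitting one aggregate into chunked keys fetched in parallel; the chunk names and the `cache` client are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNKS = ["header", "posts:page:0", "counters", "pinned"]

def read_profile(cache, user_id):
    # Fetch all chunks concurrently; each small key hashes to its own shard,
    # so no single shard owns the whole aggregate.
    keys = [f"user:{user_id}:{chunk}" for chunk in CHUNKS]
    with ThreadPoolExecutor(max_workers=len(keys)) as pool:
        values = list(pool.map(cache.get, keys))
    return dict(zip(CHUNKS, values))
```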

  10. Rate limit and degrade gracefully

    When all else fails, throttle requests for a short period or serve a simplified response. Protect the system first, then recover.
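
A simple per-key token bucket sketch with a degraded fallback; the rate and burst numbers are assumptions to size against shard capacity.

```python
import time

class TokenBucket:
    """Refills `rate` tokens per second up to `burst`; each request spends one."""
    def __init__(self, rate=1000.0, burst=200.0):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket()

def get_profile(key, fetch, fallback):
    # Protect the shard first; restore full fidelity once load drops.
    return fetch(key) if bucket.allow() else fallback(key)
```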

Real World Example

Consider a social app's profile page for a global celebrity. The profile object is cached and lives on a single shard. A live event starts. Traffic spikes for that one profile and overwhelms its shard while other shards sit idle. A practical fix is to replicate the profile cache value into thirty-two salted keys. The app reads a random replica, and writes fan out to all replicas on update. A background job refreshes ahead every few seconds with jitter. Each app node keeps a tiny near cache so most reads never leave the process. If the user posts a new update, the write path bumps the version and triggers a fast refresh, while stale-while-revalidate serves the old value for a brief window. The shard imbalance disappears and the database load stays flat.

Common Pitfalls and Trade-offs

  • Replication multiplies memory and write cost. Use it only for a small top-K set.

  • Coalescing requires careful locks. If the worker fails, all waiters might time out. Add timeouts and circuit breakers.

  • Refresh ahead without jitter creates synchronized storms. Always randomize.

  • Multi-cluster reads relax consistency. Know your freshness needs and service-level goals.

  • Edge caching adds another invalidation path. Keep a single source of truth for versioning and purge rules.

  • Isolating keys in a special tier increases operational complexity. Document ownership and on call plans.

Interview Tip

Expect a follow-up like this: a single cache shard is at ninety percent CPU due to one hot key, and you can add only one new technique before the next peak. What do you choose, and why? A strong answer picks request coalescing or read striping with replicas, explains the trade-off, and shows how you would verify the fix with metrics and a rollback plan.

Key Takeaways

  • Hot keys are a skew problem, not a global capacity problem.

  • Start with visibility, then remove stampedes with a near cache and coalescing.

  • Stripe reads across replicas for a tiny top-K set and refresh ahead with jitter.

  • Use edge caching and isolation when traffic is extreme.

  • Always measure tail latency and shard balance before and after any change.

Comparison of Strategies

| Strategy | Best For | Core Idea | Pros | Cons |
| --- | --- | --- | --- | --- |
| Near cache on app nodes | Repeated reads from same node | Small in-memory cache with short TTL and jitter | Reduces network hops and absorbs spikes | Risk of stale data and added memory use |
| Request coalescing | Miss storms or expiry stampedes | One worker fetches value, others wait | Eliminates duplicate requests | Needs careful lock and timeout handling |
| Striped replicas of hot key | Read-heavy hot spots | Write same value to multiple salted keys | Balances load across shards | Write amplification and memory overhead |
| Refresh ahead + serve stale | Predictable expiry patterns | Refresh before TTL and serve stale window | Prevents thundering herd | Requires background jobs and version control |
| Replica reads / multi-cluster | Extreme read hot spots | Allow reads from replica shards | High read capacity, better availability | Lower consistency and higher cost |
| Key isolation tier | One or two dominant keys | Move to separate high-capacity tier | Limits blast radius | Adds operational complexity |
| Edge cache / CDN | HTTP content or global reads | Cache near user edge with invalidations | Offloads load early | Complex purge logic |
| Data model split | Large monolithic keys | Break data into smaller chunks | Parallelism and efficient caching | Increases application logic complexity |

FAQs

Q1. What is a hot key in a cache?

A hot key is a cache entry that gets a disproportionately large share of requests compared to others, overloading the shard that owns it.

Q2. How do I detect hot keys quickly?

Collect per-key request counts with sampling and use heavy-hitter sketches to track the top K without full cardinality scans.

Q3. Is replication of hot keys always the best fix?

No. Replication helps read-heavy keys but increases memory and write cost. Start with a near cache and coalescing first.

Q4. How does serve stale help with hot keys?

It avoids herd effects at expiry. Clients receive a slightly old value while a background task refreshes the key.

Q5. Can a CDN solve all hot key issues?

A CDN helps for HTTP content but you still need proper invalidation and an origin cache plan for dynamic data.

Q6. When should I isolate a key to its own tier?

When a single key dominates traffic for long periods and simple methods fail. Isolation caps blast radius while you redesign.
