How do you mitigate the thundering herd problem?
The thundering herd problem appears when many clients or workers wake up at the same moment and hit the same resource. A cache entry expires, a feature flag flips, a service restarts, or a popular page is invalidated, and suddenly thousands of requests arrive together. The result is a spike in latency, dropped connections, and a feedback loop that makes recovery hard. In interviews and in production you will see this pattern often. The good news is that it can be controlled with a few proven design moves.
Why It Matters
A single herd can take down an otherwise healthy stack. Databases receive bursty read traffic, memory pools thrash, autoscaling lags behind demand, and error rates climb. Tail latency worsens for every user, not only the ones requesting the hot key. Costs also jump because a brief spike can trigger extra capacity that sits idle afterward. Hiring teams ask about this scenario because it tests your ability to connect caching, concurrency, retries, queueing, and back pressure into one coherent plan for scalable architecture.
How It Works, Step by Step
- A shared trigger occurs. This could be a cache TTL hitting zero, a cold start after a deploy, or a synchronized cron job.
- Thousands of clients notice the same condition and send requests at once.
- The upstream bottleneck gets hammered. This is usually a database, a search cluster, an external API, or a microservice with limited concurrency.
- Latency rises, clients time out, and naive retry logic fires. Retries compound the surge and push the system into a failure spiral.
- Even after the hot key is regenerated, the queue of pending work leads to long tails and slow recovery, especially if worker pools are saturated.
Mitigation is about breaking at least one of those links. You can desynchronize clients so they do not move together. You can ensure only one worker rebuilds expensive results while others wait or serve stale data. You can shape traffic with queues and back pressure so bursts are smoothed rather than amplified. Most robust designs combine several of these ideas.
Real World Example
Consider a streaming home page that shows personalized rows. Each user has a cache entry for the home layout. During a peak evening window a change in ranking logic invalidates a large set of keys. Many users need a fresh layout at the same instant. Without protection, the recommendation service queries the database at a huge rate, saturates CPU, and causes a storm of timeouts. The fix is layered. The cache uses soft TTL with background refresh so most users can read a slightly old value while a small number of workers refresh. A single flight guard per key ensures only one worker runs the expensive computation. Requests pass through a queue with a target concurrency per shard. Clients use jittered retries with back off. The result is a smooth curve instead of a cliff.
Common Pitfalls and Trade-offs
- Expiring everything at the same time. Uniform TTLs cause synchronized refresh. Prefer randomized TTLs with a small jitter window (see the sketch after this list).
- Letting every worker rebuild the same hot key. Use request coalescing so one worker computes while others wait.
- Treating errors as signals to retry right away. Retries without jitter and back off turn small issues into large ones.
- Locking on a global key. Per key locks reduce contention while a global lock can become its own bottleneck.
- Serving only fresh data. Stale while revalidate keeps the user journey fast while the system heals.
- Infinite queues. A queue without back pressure simply moves the problem downstream. Set admission limits and time based deadlines.
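To make the first pitfall concrete, here is a minimal sketch of writing cache entries with a jittered TTL. It assumes redis-py, and the base TTL and 10 percent jitter window are illustrative values, not recommendations.

```python
# A minimal sketch of randomized TTLs, assuming redis-py; base_ttl and the
# 10 percent jitter window are illustrative values, not recommendations.
import random

import redis

r = redis.Redis()


def set_with_jitter(key, value, base_ttl=300, jitter_fraction=0.10):
    # Spread expirations across +/- 10 percent of the base TTL so keys
    # written in the same batch do not all expire in the same second.
    jitter = base_ttl * jitter_fraction
    ttl = int(base_ttl + random.uniform(-jitter, jitter))
    r.set(key, value, ex=ttl)
```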
Interview Tip
Interviewers often ask for a precise plan for the hot key case. A strong answer mentions at least three moves that work together. For example, say you would add soft TTL plus background refresh, a single flight per key using Redis set with NX and a short expiry to avoid deadlocks, and client side retries with full jitter and a cap. If time allows, add rate limiting at the edge and a small queue in front of the database tier with a fixed worker pool.
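If the interviewer asks you to make the retry piece concrete, a small sketch helps. This one assumes a generic call() function, and the attempt cap, base delay, and backoff ceiling are illustrative values.

```python
# A minimal sketch of client-side retries with exponential backoff, full
# jitter, and a cap; max_attempts, base, and cap are illustrative values.
import random
import time


def call_with_retries(call, max_attempts=5, base=0.1, cap=5.0):
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount between zero and the capped
            # exponential backoff so clients do not retry in lockstep.
            backoff = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, backoff))
```

The key property is that two clients that fail at the same moment almost never retry at the same moment.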
Key Takeaways
- The herd is triggered by a shared event that synchronizes many clients. Break that synchronization.
- Serve stale data on purpose while a small set of workers refresh the truth.
- Coalesce identical work so one compute serves many waiters.
- Shape traffic with queues, rate limits, and back pressure to protect bottlenecks.
- Always add jitter to TTLs and retries to prevent alignment.
Comparison Table
| Technique | Main Idea | Best Use | What to Watch |
|---|---|---|---|
| Request Coalescing (Single Flight) | Only one worker regenerates a value while others wait | Hot keys and expensive recompute | Use per-key scope and short lock TTLs |
| Stale-While-Revalidate | Serve cached value while refreshing in background | User-facing endpoints tolerant to minor staleness | Define freshness windows carefully |
| Randomized TTLs | Add jitter to spread expirations over time | Large caches with many keys | Avoid large jitter to prevent stale entries |
| Client Back-off with Jitter | Exponential wait with random delay on retries | Retry storms and transient failures | Bound retries and total wait time |
| Token Bucket Rate Limiting | Control request rate per user/key | Protect shared services from bursts | Choose fair limits and clear error responses |
| Queue with Fixed Workers | Buffer and process requests steadily | Expensive compute or fan-out tasks | Drop old jobs and avoid unbounded queues |
| Pre-Warm and Lazy Warm | Warm popular keys in stages | Deploy rollouts or new clusters | Throttle warm-up and limit concurrency |
| Circuit Breaker / Load Shedding | Fail fast or degrade gracefully under pressure | Backend protection and graceful fallback | Define trip thresholds and recovery policies |
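Of the techniques above, token bucket rate limiting is compact enough to sketch directly. This is a minimal in-process version with illustrative names and numbers; a production limiter usually keeps bucket state in a shared store such as Redis or enforces the limit at the edge.

```python
# A minimal in-process token bucket sketch; the rate and capacity a caller
# passes in are policy choices, and real deployments often keep bucket state
# in a shared store rather than per process.
import time


class TokenBucket:
    def __init__(self, rate_per_second: float, capacity: float):
        self.rate = rate_per_second      # steady-state refill rate
        self.capacity = capacity         # maximum burst size
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill according to elapsed time, but never beyond capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                     # reject or shed this request
```

A bucket per caller or per hot key, rather than one global bucket, keeps a single noisy client from starving everyone else.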
Practical mitigation patterns
This section gathers the core patterns you can quickly apply in design sessions and in production.
- Per key mutex or lease: Use a short lived lock in Redis with SET NX and a small TTL. If a worker holds the lease, it recomputes and writes the result. Others either wait with a bounded timeout or serve stale data. A library pattern often called single flight provides the same idea inside a process (see the first sketch after this list).
- Soft TTL plus hard TTL: Give each cache entry two timers. Before the soft TTL, serve it as fresh. After the soft TTL and before the hard TTL, serve it but trigger a refresh in the background. After the hard TTL, avoid serving unless the resource is critical and you have a safe stale window (see the second sketch after this list).
- Jitter everywhere: Spread out client behavior with randomness. Add a small random offset to TTLs. Add full jitter to retries so clients do not align. Add jitter to periodic jobs and warmers.
- Queue and shape: When bursts are inevitable, accept at the edge, enqueue, and drain with a fixed worker pool. Attach a deadline to each unit of work and drop expired items. This turns rate spikes into controlled latency rather than collapse (see the third sketch after this list).
- Protect dependencies: Use per caller and per key rate limits. Apply circuit breakers that trip on error rate and latency. When tripped, return a graceful fallback, such as a cached variant or a simple default.
- Observe and alert: Track hot keys, lock contention, queue depth, worker pool saturation, and the ratio of stale to fresh serves. Watch p95 and p99 latency during cache invalidations and deploys. Alerts should target the herd pattern, not just average latency.
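The sketches below show what three of these patterns can look like in code. They are minimal illustrations under stated assumptions, not drop-in implementations. First, a per key lease built on Redis SET NX, assuming redis-py; recompute_fn, the lock TTL, and the wait timeout are placeholders.

```python
# A minimal sketch of a per-key lease with Redis SET NX, assuming redis-py
# and a synchronous call path; key names, timeouts, and recompute_fn are
# illustrative placeholders.
import time

import redis

r = redis.Redis()

LOCK_TTL_SECONDS = 10      # short lease so a crashed worker cannot block forever
WAIT_TIMEOUT_SECONDS = 2   # how long waiters poll before falling back


def get_with_single_flight(key, recompute_fn, cache_ttl=300):
    value = r.get(key)
    if value is not None:
        return value

    lock_key = f"lock:{key}"
    # SET with NX and EX: only one caller acquires the lease, and it expires
    # on its own if the holder dies.
    got_lock = r.set(lock_key, "1", nx=True, ex=LOCK_TTL_SECONDS)
    if got_lock:
        try:
            value = recompute_fn()
            r.set(key, value, ex=cache_ttl)
            return value
        finally:
            # Best effort release; a production version would delete only if
            # it still owns the lease (for example by comparing a token).
            r.delete(lock_key)

    # Everyone else waits briefly for the winner, then gives up gracefully.
    deadline = time.monotonic() + WAIT_TIMEOUT_SECONDS
    while time.monotonic() < deadline:
        value = r.get(key)
        if value is not None:
            return value
        time.sleep(0.05)
    return None  # caller decides: serve a stale copy, a default, or an error
```

Second, a soft TTL plus hard TTL cache kept in process, assuming a single Python service; the Entry shape, TTL values, and threading model are illustrative.

```python
# A minimal in-process sketch of soft TTL plus hard TTL with a single
# background refresh per key; the Entry shape and TTLs are illustrative,
# and error handling in the refresh path is elided.
import threading
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Entry:
    value: Any
    soft_expiry: float   # after this, serve but refresh in the background
    hard_expiry: float   # after this, do not serve without a recompute
    refreshing: bool = False


class SoftTTLCache:
    def __init__(self, soft_ttl: float = 60, hard_ttl: float = 300):
        self._store: Dict[str, Entry] = {}
        self._lock = threading.Lock()
        self.soft_ttl = soft_ttl
        self.hard_ttl = hard_ttl

    def get(self, key: str, recompute: Callable[[], Any]) -> Any:
        now = time.monotonic()
        with self._lock:
            entry = self._store.get(key)
            if entry and now < entry.soft_expiry:
                return entry.value                     # fresh
            if entry and now < entry.hard_expiry:
                if not entry.refreshing:               # one background refresh
                    entry.refreshing = True
                    threading.Thread(
                        target=self._refresh, args=(key, recompute), daemon=True
                    ).start()
                return entry.value                     # stale but acceptable
        # Missing or hard-expired: recompute on the caller's thread. A real
        # build would also coalesce this miss path (see the lease sketch above).
        value = recompute()
        self._put(key, value)
        return value

    def _refresh(self, key: str, recompute: Callable[[], Any]) -> None:
        self._put(key, recompute())

    def _put(self, key: str, value: Any) -> None:
        now = time.monotonic()
        with self._lock:
            self._store[key] = Entry(value, now + self.soft_ttl, now + self.hard_ttl)
```

Third, a bounded queue drained by a fixed worker pool with a deadline on each job; the queue size, worker count, and handle() function are assumptions.

```python
# A minimal sketch of a bounded queue with fixed workers and per-job
# deadlines; MAX_QUEUE, WORKERS, DEADLINE_SECONDS, and handle() are
# illustrative placeholders.
import queue
import threading
import time

MAX_QUEUE = 1000      # admission limit: reject instead of buffering forever
WORKERS = 8           # fixed concurrency toward the expensive dependency
DEADLINE_SECONDS = 5  # work older than this is dropped, not executed

jobs = queue.Queue(maxsize=MAX_QUEUE)


def submit(payload) -> bool:
    try:
        jobs.put_nowait((time.monotonic(), payload))
        return True                      # accepted
    except queue.Full:
        return False                     # shed load at the edge


def worker(handle) -> None:
    while True:
        enqueued_at, payload = jobs.get()
        if time.monotonic() - enqueued_at > DEADLINE_SECONDS:
            jobs.task_done()             # expired: drop instead of piling on
            continue
        try:
            handle(payload)
        finally:
            jobs.task_done()


def start(handle) -> None:
    for _ in range(WORKERS):
        threading.Thread(target=worker, args=(handle,), daemon=True).start()
```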
When each choice fits
- Real time read paths with strict freshness: Prefer request coalescing and rate limits. Use short soft TTL windows and very short waits on the coalesced request.
- User experience paths where slight staleness is fine: Stale while revalidate gives excellent perceived performance.
- Expensive batch or compute intensive tasks: A queue with fixed concurrency and deadlines is a strong default.
- External API with hard quotas: Strong rate limits and back off with jitter prevent quota burn and ban waves.
Frequently asked questions
Q1. What is the difference between thundering herd and cache stampede?
A cache stampede is the herd pattern applied to cache refresh: many clients miss the same expired key and try to rebuild it at once. The general herd pattern also includes retry storms, synchronized cron jobs, and cold starts that hit any shared resource.
Q2. How do I detect that a herd is happening?
Look for synchronized spikes on a few hot keys, a rise in p99 latency, increased retry counts, and lock contention. Correlate with TTL timers, deploy times, or config changes.
Q3. Is stale while revalidate safe for financial or critical data?
Usually you avoid stale reads for strong consistency workflows. For those, use request coalescing plus rate limits and circuit breakers, and keep stale reads off.
Q4. Do serverless functions change the picture?
Serverless can make cold starts align across many workers, which amplifies the herd. The same mitigations apply, plus pre-warming, staged rollouts, and regional staggering.
Q5. Which libraries help with single flight?
Many languages offer a single flight utility or a cache library with per key locking. If you roll your own with Redis, use set NX with a short TTL and always handle lock loss safely.
Q6. Should I prefer queues or rate limits?
Use both. Rate limits protect the service boundary. Queues smooth the work that you agree to accept. Together they prevent overload and provide predictable latency.
Further Learning
To master traffic-control patterns like these and practice solving cache stampedes under interview pressure, check out Grokking the System Design Interview.
For an in-depth understanding of scalability principles, back pressure, and rate-limiting design, explore Grokking Scalable Systems for Interviews.
Final checklist for your design session
- State the trigger that creates alignment across clients.
- Choose at least three complementary mitigations.
- Explain how you will keep user experience fast with stale reads or graceful fallbacks.
- Describe observability signals and rollback steps.
- Clarify limits, deadlines, and fairness choices.
Use this checklist to structure your answer in any system design interview that mentions a surge, a hot key, or a cache refresh window.