How do you implement graceful degradation and load shedding?
Graceful degradation keeps your product useful under stress by dimming or turning off non-critical features while protecting core flows and SLOs. Load shedding is the protective gate that refuses new work when the system is near saturation so that admitted requests finish fast. Used together, they keep latency predictable, prevent cascading failures across distributed systems, and make your system design interview answers sound practical and senior-level.
Why It Matters
Traffic is bursty, tail latency dominates user experience, and every dependency can fail at the worst time. Graceful degradation gives you a plan to serve something useful when parts of the stack are slow or unavailable. Load shedding protects the system by saying no early instead of timing out late. Together they improve reliability, maintain SLOs under overload, and reduce pager noise. Interviewers listen for this playbook because real systems win by protecting the most important work, not by trying to do everything during a spike.
How It Works Step by Step
1. Map critical journeys and optional features. List what must never break (checkout, sign in, video start from click) and what can degrade (recommendations, avatars, high-resolution images, live counters).
2. Choose overload signals. Pick fast, local signals that predict pain: queue depth per worker, concurrent requests, CPU saturation, memory pressure, and user-facing latency at P95 or P99. Avoid averages.
3. Add admission control at the edge. Reject early at the CDN or API gateway using a token bucket or leaky bucket. Prefer fixed concurrency limits per route. Return a short 503 with Retry-After, or 429 for client-controlled pacing. A minimal token-bucket sketch follows this list.
4. Classify traffic into priority classes. Gold for core flows, silver for important but deferrable work, bronze for background or decorative features. Give each class its own quota and queue limits. Never allow bronze to starve gold; see the per-class quota sketch below.
5. Implement graceful policies feature by feature. Lower video bitrate, shrink image sizes, hide non-essential widgets, collapse expensive joins, switch to cached or slightly stale data, and reduce time windows or result counts. Make each policy a flag that can be flipped automatically.
6. Add a brownout controller. Tie feature flags to health metrics. As P99 rises or queue depth grows, progressively disable optional features. When the system recovers, restore features smoothly to avoid oscillation. A brownout controller sketch appears after this list.
7. Set strict timeouts, cancellation, and fallbacks. Use short server-side timeouts with request cancellation. Return partial responses without blocking on slow subcalls. Provide safe defaults when a dependency is unavailable, as in the timeout-with-fallback sketch below.
8. Shed load where it is cheapest. Discard at the door rather than mid-pipeline. Use head-based admission, drop the oldest entries in low-priority queues, or apply random early drop to bronze traffic. Log a compact event for later analysis.
9. Propagate backpressure across services. When a downstream is hot, upstreams slow down. Convert 429 or 503 responses into smaller retry budgets and exponential backoff with jitter. Keep idempotency keys to avoid duplicate effects. A backoff-with-jitter sketch closes out the examples below.
10. Plan for capacity and autoscaling, but do not rely on it. Autoscaling helps but reacts after the spike. Degradation and shedding keep the system safe until new capacity arrives.
11. Observe and alert on user-centric metrics. Track P95 and P99 latency, error rate, queue depth, shed rate, brownout level, and SLO compliance. Alert on trends before hard limits. Export clear events when a policy triggers.
12. Test with game days and failure injection. Run synthetic spikes and dependency slowdowns. Verify that policies trigger, that gold requests stay fast, and that recovery is smooth.
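To make step 3 concrete, here is a minimal sketch of token-bucket admission at the edge. The route names, capacities, and refill rates are illustrative assumptions, not values from any specific gateway; in practice this logic lives in the CDN or API gateway.

```python
import time

class TokenBucket:
    """Per-route token bucket: refill continuously, admit while tokens remain."""

    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity
        self.last = time.monotonic()

    def try_admit(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Hypothetical per-route buckets so one chatty route cannot exhaust another's budget.
buckets = {"/checkout": TokenBucket(200, 100), "/recommendations": TokenBucket(50, 20)}

def handle(route: str):
    bucket = buckets.get(route)
    if bucket and not bucket.try_admit():
        # Short rejection with a pacing hint; keep the body tiny.
        return 429, {"Retry-After": "1"}, "rate limited"
    return 200, {}, "ok"

print(handle("/recommendations"))
```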
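For step 4, a sketch of per-class concurrency quotas, assuming the gold, silver, and bronze classes named above. The quota numbers are hypothetical and would be tuned from measured capacity; the point is that each class sheds within its own budget, so bronze can never crowd out gold.

```python
import threading

QUOTAS = {"gold": 300, "silver": 100, "bronze": 25}  # illustrative limits

class ClassAdmission:
    """Track in-flight requests per priority class and reject beyond each quota."""

    def __init__(self, quotas: dict):
        self.quotas = quotas
        self.in_flight = {cls: 0 for cls in quotas}
        self.lock = threading.Lock()

    def try_enter(self, cls: str) -> bool:
        with self.lock:
            if self.in_flight[cls] >= self.quotas[cls]:
                return False  # shed within this class only; gold is unaffected
            self.in_flight[cls] += 1
            return True

    def leave(self, cls: str) -> None:
        with self.lock:
            self.in_flight[cls] -= 1

admission = ClassAdmission(QUOTAS)
if not admission.try_enter("bronze"):
    pass  # return 503 with Retry-After instead of queueing behind gold work
```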
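One way to express the brownout controller from step 6, with hypothetical feature names and thresholds. The hysteresis band (recover only when comfortably below the limits, one feature per tick) is what keeps flags from flapping.

```python
# Features ordered from "turn off first" to "turn off last"; names are placeholders.
BROWNOUT_STEPS = ["live_counters", "recommendations", "high_res_images"]

class BrownoutController:
    """Raise or lower the brownout level based on tail latency and queue depth."""

    def __init__(self, p99_limit_ms: float, queue_limit: int):
        self.p99_limit_ms = p99_limit_ms
        self.queue_limit = queue_limit
        self.level = 0  # number of optional features currently disabled

    def update(self, p99_ms: float, queue_depth: int) -> None:
        overloaded = p99_ms > self.p99_limit_ms or queue_depth > self.queue_limit
        # Restore only when comfortably below the limits (hysteresis).
        recovered = p99_ms < 0.8 * self.p99_limit_ms and queue_depth < 0.8 * self.queue_limit
        if overloaded and self.level < len(BROWNOUT_STEPS):
            self.level += 1          # dim one more feature
        elif recovered and self.level > 0:
            self.level -= 1          # restore one feature per tick to avoid oscillation

    def is_enabled(self, feature: str) -> bool:
        return feature not in BROWNOUT_STEPS[: self.level]

controller = BrownoutController(p99_limit_ms=400, queue_limit=500)
controller.update(p99_ms=620, queue_depth=350)   # P99 over the limit
print(controller.is_enabled("live_counters"))    # False: first feature dimmed
```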
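A sketch of the timeout-and-fallback idea from step 7 using Python's asyncio. The get_recommendations coroutine and the 150 ms budget are stand-ins for a real dependency call; the page still renders when the subcall misses its budget.

```python
import asyncio

async def get_recommendations(user_id: str) -> list:
    await asyncio.sleep(2)          # simulate a slow dependency
    return ["item-1", "item-2"]

async def product_page(user_id: str) -> dict:
    try:
        # Cancel the subcall if it cannot answer within the latency budget.
        recs = await asyncio.wait_for(get_recommendations(user_id), timeout=0.15)
    except asyncio.TimeoutError:
        recs = []                   # partial response: page renders without recommendations
    return {"user": user_id, "recommendations": recs}

print(asyncio.run(product_page("u42")))
```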
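Finally, a sketch of the client-side behavior from step 9: a small retry budget with exponential backoff and full jitter. The call_downstream callable is a placeholder for your real client; pair retries with idempotency keys so a retried write is not applied twice.

```python
import random
import time

def call_with_backoff(call_downstream, max_retries: int = 3, base_delay: float = 0.1):
    """Retry only on 429/503, within a small budget, backing off with full jitter."""
    for attempt in range(max_retries + 1):
        status = call_downstream()
        if status not in (429, 503):
            return status                         # success or non-retryable error
        if attempt == max_retries:
            break                                 # retry budget exhausted; surface the error
        # Full jitter: sleep a random amount up to the exponential cap.
        delay = random.uniform(0, base_delay * (2 ** attempt))
        time.sleep(delay)
    return status

# Example: a downstream that sheds the first two attempts, then recovers.
responses = iter([503, 429, 200])
print(call_with_backoff(lambda: next(responses)))  # 200 after two backoffs
```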
Real World Example
Prime Day-style traffic hits a retail site. The API gateway caps per-tenant concurrency and returns 429 to chatty clients. The product page serves cached recommendations and hides real-time inventory counters. The image service switches to medium resolution. Search suggestions are paused. Checkout, sign-in, and payment remain top priority and stay within SLO. Orders complete, some non-essential sugar is missing, and the system never tips into a meltdown.
Common Pitfalls
Shedding too late in the call chain. Rejecting after three downstream hops wastes work and still times out the user. Shed at the edge or at the first hop.
Using averages instead of tail metrics. Averages hide pain. Trigger on queue depth, concurrency, and P95 or P99.
All-or-nothing brownouts. A binary on or off creates a jarring user experience. Prefer progressive steps and per-feature flags.
Unfair policies across tenants. Global drops hurt small tenants while large tenants blast through. Use per-tenant quotas and per-route limits.
Retry storms from clients. Aggressive retries multiply load. Publish client rules with exponential backoff and jitter, plus Retry-After hints.
Over-relying on autoscaling. Scaling lags the spike. You still need admission control and brownouts to bridge the gap.
No visibility or poor messaging. Silent degradation confuses teams and users. Emit events and show subtle in-product hints when a feature is temporarily simplified.
Interview Tip
When asked how to keep latency under a flash crowd, say you will protect the critical path with priority classes, shed at the gateway using token bucket with conservative retry budgets, and downgrade optional features via a brownout controller tied to P99 and queue depth. Then sketch two or three concrete degradations for that product, for example hide social widgets, serve cached search suggestions, and reduce image sizes. Specifics beat theory in a system design interview.
Key Takeaways
- Load shedding rejects early to keep admitted work fast and predictable
- Graceful degradation keeps the product useful by simplifying non-essential features
- Priorities, quotas, and per-feature flags turn strategy into code
- Trigger policies from tail latency, queue depth, and concurrency, not averages
- Test routinely with game days to verify recovery and fairness
Comparison Table
| Approach | Primary trigger | What it does | Best use | Main risk |
|---|---|---|---|---|
| Graceful degradation | Rising latency or partial dependency failure | Serves simpler results or hides optional features | Keep core journeys usable during partial outages | Overly aggressive degradation can hide important signals or reduce conversion |
| Load shedding | Saturated concurrency or deep queues | Refuses new work with a quick error | Protect latency and stability during spikes | Poor fairness can starve small tenants |
| Backpressure | Downstream signals of overload | Slows or pauses upstream senders | Prevent cascades across microservices | Can propagate slowness system-wide if not bounded |
| Rate limiting | Policy or contract thresholds | Caps request rate per client or route | Multi-tenant fairness and abuse control | Static limits can underutilize capacity |
FAQs
Q1. What is the difference between graceful degradation and load shedding?
Graceful degradation keeps serving by simplifying non essential parts. Load shedding says no to some requests so that accepted ones meet SLO.
Q2. When should I prefer backpressure over load shedding?
Use backpressure when producers can naturally slow down, for example internal services or streams. Use shedding at the edge, or for untrusted clients that will not slow down.
Q3. Which algorithms help with load shedding?
Token bucket and leaky bucket for admission, per-class queues with head-based admission, and tail drop or random early drop for low-priority traffic.
Q4. How do I test my degradation plan?
Run synthetic spikes, inject latency in dependencies, and watch brownout levels and shed rate. Verify that P95 recovers quickly once load falls.
Q5. What status codes should clients see during shedding?
Use 429 for client rate limits and 503 with Retry-After for server overload. Keep bodies short and cacheable where safe.
Q6. Does autoscaling remove the need for these tactics?
No. Autoscaling reacts after the spike. Degradation and shedding protect the system during the spike and during dependency slowdowns.
Further Learning
- Level up your overload playbook with the course Grokking Scalable Systems for Interviews at DesignGurus.io. It covers admission control, queues, and capacity planning in depth. Enroll in Grokking Scalable Systems for Interviews
- If you want a structured interview toolkit that ties these ideas to whiteboard friendly patterns, study Grokking the System Design Interview. Start Grokking the System Design Interview