How do you run reliability reviews pre‑launch (design/SLOs/capacity)?

Before any major system launch, reliability reviews serve as a final safety net. They combine technical validation, capacity planning, and operational readiness into one structured process. This review ensures the system can handle real-world stress and meet service level objectives (SLOs). It’s not just a checklist but a mindset that balances reliability, performance, and scalability.

Why It Matters

A pre-launch reliability review builds confidence in your distributed system before customers touch it. In interviews, it signals that you think in terms of end-to-end system reliability, capacity forecasting, and operational sustainability (traits that separate junior engineers from system design experts). It’s where engineering precision meets business reliability goals.

How It Works (Step-by-Step)

  1. Define the critical user journeys. Identify the most vital flows that must never break. For each, map every dependent service, datastore, and external API. Knowing what’s critical sets the scope of your review.

  2. Set SLIs and SLOs. Define measurable indicators (SLIs) like latency or error rate and corresponding SLOs such as 99.9% availability or 250 ms p95 latency. Translate these goals into an error budget to monitor reliability over time.

  3. Forecast capacity. Convert expected user behavior into quantitative load metrics. Formula:

    Average QPS = (DAU × sessions per user × requests per session) / 86,400
    Peak QPS = Average QPS × peak factor (4–6x typical)
    

    Add 30–50% headroom for failures, growth, and sudden traffic surges (a worked capacity sketch follows this list).

  4. Evaluate resource limits. Validate that every dependency (databases, caches, queues, and external APIs) can handle the peak load with headroom. Address bottlenecks by scaling horizontally, adding caches, or improving query efficiency.

  5. Engineer resilience. Build reliability into the design: implement timeouts shorter than client limits, retries with exponential backoff, circuit breakers, and fallback responses for degraded states (see the resilience sketch after this list).

  6. Perform tests and validation. Run load tests to peak and beyond, simulate failures (e.g., pod kills, cache outages), and validate that dashboards and alerts behave correctly. Each test should map back to an SLO.

  7. Set up observability and alerting. Monitor the golden signals: latency, traffic, errors, and saturation. Use SLO burn-rate alerts to cut noise and catch meaningful issues early (a burn-rate sketch follows this list).

  8. Prepare operational readiness. Create runbooks, define on-call procedures, and perform tabletop exercises. This ensures incident response teams know how to act when things go wrong.

  9. Establish rollout gates. Use staged rollouts (canary, 25%, 50%, full) and clear rollback triggers, and launch only if metrics stay within defined thresholds (see the rollout-gate sketch after this list).

  10. Document and decide. Summarize results, unresolved risks, and rollback plans. Approve launch only when all key owners agree that reliability objectives are achievable.
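
To make steps 2 and 3 concrete, here is a minimal Python sketch of the capacity math and the error-budget conversion. The input figures (DAU, sessions, peak factor, headroom) are illustrative placeholders, not recommendations.

    # Capacity forecast and error budget -- all numbers are illustrative.
    SECONDS_PER_DAY = 86_400

    def average_qps(dau: int, sessions_per_user: float, requests_per_session: float) -> float:
        """Average QPS = (DAU x sessions per user x requests per session) / 86,400."""
        return dau * sessions_per_user * requests_per_session / SECONDS_PER_DAY

    def planned_qps(avg_qps: float, peak_factor: float = 5.0, headroom: float = 0.3) -> float:
        """Peak QPS plus 30-50% headroom for failures, growth, and surges."""
        return avg_qps * peak_factor * (1 + headroom)

    def monthly_error_budget_minutes(slo: float, days: int = 30) -> float:
        """Downtime allowed per month by an availability SLO (the error budget)."""
        return (1 - slo) * days * 24 * 60

    avg = average_qps(dau=1_000_000, sessions_per_user=2, requests_per_session=5)
    print(f"average QPS ~= {avg:,.0f}")                    # ~116
    print(f"plan for ~= {planned_qps(avg):,.0f} QPS")      # ~752
    print(f"99.9% SLO => {monthly_error_budget_minutes(0.999):.0f} min/month of budget")  # ~43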
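
For step 5, the sketch below shows the shape of these patterns: a retry helper with capped attempts, exponential backoff, and jitter, plus a heavily simplified circuit breaker. It assumes the wrapped callable accepts a timeout argument; in production you would normally reach for a proven resilience library rather than hand-rolling this logic.

    import random
    import time

    def call_with_retries(operation, max_attempts=3, base_delay=0.1, timeout=1.0):
        """Retry with exponential backoff and jitter, capping total attempts."""
        for attempt in range(max_attempts):
            try:
                # Keep the per-call timeout shorter than the client's own deadline.
                return operation(timeout=timeout)
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # retry budget exhausted; surface the failure
                backoff = base_delay * (2 ** attempt)
                time.sleep(backoff + random.uniform(0, backoff))  # jitter prevents retry storms

    class CircuitBreaker:
        """Trip open after consecutive failures and serve a fallback while open."""

        def __init__(self, failure_threshold=5, reset_after=30.0):
            self.failure_threshold = failure_threshold
            self.reset_after = reset_after
            self.failures = 0
            self.opened_at = None

        def call(self, operation, fallback):
            if self.opened_at and time.monotonic() - self.opened_at < self.reset_after:
                return fallback()  # degraded response while the circuit is open
            try:
                result = operation()
                self.failures, self.opened_at = 0, None  # a healthy call resets the breaker
                return result
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()
                return fallback()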
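
For step 7, burn-rate alerting compares how fast errors are eating the error budget. The arithmetic is sketched below; the two-window check and the 14.4x threshold are commonly cited defaults, and your own windows and thresholds may well differ.

    def burn_rate(error_rate: float, slo: float) -> float:
        """How many times faster than sustainable the error budget is burning.
        A burn rate of 1.0 exhausts the budget exactly at the end of the SLO window."""
        return error_rate / (1 - slo)   # e.g. 1 - 0.999 = 0.001 budget

    def should_page(short_window_error_rate: float, long_window_error_rate: float,
                    slo: float = 0.999, threshold: float = 14.4) -> bool:
        """Page only when both a short and a long window burn too fast,
        which filters out brief spikes and reduces alert fatigue."""
        return (burn_rate(short_window_error_rate, slo) >= threshold and
                burn_rate(long_window_error_rate, slo) >= threshold)

    # 2% errors over the last 5 minutes and 1.6% over the last hour
    print(should_page(0.02, 0.016))     # True: the budget is burning 16-20x too fast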
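
For step 9, rollout gates can be expressed as data: each stage carries the thresholds that must hold before promotion, and any breach maps to a rollback. The stage percentages and thresholds below are illustrative only.

    ROLLOUT_STAGES = [
        {"name": "canary", "traffic_pct": 1,   "max_error_rate": 0.001, "max_p95_ms": 250},
        {"name": "25%",    "traffic_pct": 25,  "max_error_rate": 0.001, "max_p95_ms": 250},
        {"name": "50%",    "traffic_pct": 50,  "max_error_rate": 0.001, "max_p95_ms": 250},
        {"name": "full",   "traffic_pct": 100, "max_error_rate": 0.001, "max_p95_ms": 250},
    ]

    def gate_decision(stage: dict, observed_error_rate: float, observed_p95_ms: float) -> str:
        """Promote to the next stage only if observed metrics stay inside the thresholds."""
        if (observed_error_rate > stage["max_error_rate"]
                or observed_p95_ms > stage["max_p95_ms"]):
            return "rollback"
        return "promote"

    print(gate_decision(ROLLOUT_STAGES[0], observed_error_rate=0.0004, observed_p95_ms=180))  # promote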

Real-World Example

Take a Netflix-style recommendation service launch (the numbers here are illustrative). The team projected 10 million DAUs, each making 10 requests per day. That’s 100 million daily calls, or around 1,157 QPS on average. Applying a 5x peak factor with a 30% buffer means planning for ~7,500 QPS. Load tests confirmed each instance could handle 400 QPS, so 20+ instances were deployed with autoscaling. Circuit breakers, Redis caching, and fallback lists for cold starts ensured that even during failures, the service still delivered recommendations reliably.
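
Plugging these (illustrative) numbers into the capacity sketch above reproduces the arithmetic:

    avg_qps = 10_000_000 * 10 / 86_400    # ~1,157 QPS average
    plan_qps = avg_qps * 5 * 1.3          # 5x peak factor + 30% buffer ~= 7,523 QPS
    instances = plan_qps / 400            # ~19 instances at 400 QPS each -> provision 20+ with autoscaling
    print(round(avg_qps), round(plan_qps), round(instances))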

Common Pitfalls or Trade-offs

  • Unrealistic SLOs. Choosing targets like 99.99% availability without real data creates false confidence. Start with 99.9% and iterate based on telemetry.

  • Unbounded retries. Retry storms amplify outages. Always cap attempts and add jitter.

  • Ignoring third-party dependencies. Overlooking API rate limits or quota constraints can lead to unpredictable downtime.

  • Over-alerting. Triggering alerts on minor spikes creates alert fatigue. Tie alerts to SLO burn rates.

  • Optimizing averages, not tails. Users notice 95th- and 99th-percentile latency more than average performance.

  • Underestimating write amplification. Ignoring replication and indexing costs leads to premature capacity saturation.

Interview Tip

In a system design interview, talk about reliability reviews using a structured checklist. Begin with SLOs, estimate QPS, discuss resilience mechanisms, and mention how you’d validate them pre-launch. Bonus points if you mention calculating an error budget and using it to decide rollout policies.

Key Takeaways

  • Reliability reviews validate system health before exposure to real users.
  • Always align design with measurable SLOs and capacity evidence.
  • Add resilience patterns early rather than patching failures later.
  • Observability and clear rollout gates turn reliability into a measurable discipline.
  • In interviews, combine SLO logic, QPS math, and failure simulation to show maturity.

Comparison Table

| Aspect | Reliability Review (Pre-launch) | Code Review | Load Test | Chaos Test |
| --- | --- | --- | --- | --- |
| Goal | Validate reliability, SLOs, and capacity readiness | Ensure correctness and maintainability | Verify performance under load | Test resilience under failure |
| Timing | Before major launch | During development | Before release | After stability is achieved |
| Metrics | SLO compliance, error budget | Code quality, bugs | Throughput, latency | Recovery time, blast radius |
| Participants | SREs, product engineers, operations | Developers | QA, SREs | DevOps, SREs |
| Outcome | Go/No-Go decision and rollback plan | Approved code | Performance certification | Verified failover readiness |

FAQs

Q1. What is a reliability review before launch?

It’s a structured evaluation ensuring the system design, capacity plan, and SLOs can sustain expected traffic and failure scenarios before going live.

Q2. How do you pick SLO targets for a new system?

Base them on user expectations, business impact, and achievable metrics from similar services. Start moderate (like 99.9%) and adjust post-launch.

Q3. What capacity buffer should I plan for?

A common rule is 30–50% above peak traffic to absorb growth, failures, and sudden bursts.

Q4. What’s the difference between SLO and SLA?

SLOs are internal reliability goals, while SLAs are external, contractual promises to customers.

Q5. What teams should participate in reliability reviews?

Typically engineering, SRE, product management, and operations. Each brings visibility into different risk areas.

Q6. What evidence is required to pass the review?

Load test data, failure test logs, capacity models, observability dashboards, and a documented rollback plan.
