Key performance indicators for evaluating system design scalability
System design scalability metrics are quantifiable indicators that measure how well a distributed system handles increasing load while maintaining acceptable performance. These include latency (how fast), throughput (how much), availability (how reliable), error rate (how correct), and resource utilization (how efficient). In system design interviews, candidates who quantify their design—"This architecture handles 50,000 QPS with p99 latency under 200ms at 99.99% availability"—score significantly higher than candidates who describe systems in qualitative terms like "fast" or "highly available." Numbers transform vague claims into engineering commitments, and interviewers reward the precision.
Key Takeaways
- Every system design answer should include specific metrics. "High throughput" is vague. "50,000 queries per second" is engineering. Interviewers use your metric choices and target values to assess whether you understand the system's real constraints.
- The seven core scalability metrics are: latency (p50, p95, p99), throughput (QPS/RPS), availability (nines), error rate, resource utilization, cache hit ratio, and data growth rate.
- Metrics work in tension. Optimizing for latency often reduces throughput. Maximizing availability increases cost. Understanding these trade-offs—and articulating them with numbers—is what separates senior candidates from mid-level ones.
- SLOs (Service Level Objectives) and SLIs (Service Level Indicators) are the frameworks production systems use to define and measure scalability targets. Mentioning them in interviews signals production-grade thinking.
- Back-of-envelope estimation—converting user counts to QPS, storage, and bandwidth—is how you derive the metrics your design must achieve. This estimation step is mandatory at Google and heavily weighted at every FAANG company.
Why Metrics Matter in System Design
A system design without metrics is a sketch, not an architecture. Metrics anchor every design decision in reality.
Without metrics: "I would add a cache to improve performance."
With metrics: "At 50,000 QPS, the database handles reads at 15ms average but degrades to 500ms at p99 due to lock contention. Adding a Redis cache with a 95% hit ratio reduces database load to 2,500 QPS, bringing p99 read latency below 50ms."
The second answer demonstrates three things interviewers evaluate: you understand the problem quantitatively, you can reason about the impact of architectural decisions, and you can set measurable targets that define success. This is why back-of-envelope estimation—often the first 5 minutes of a system design interview—matters so much. The numbers you calculate in that phase become the constraints your entire design must satisfy.
The Seven Core Scalability Metrics
1. Latency
What it measures: The time between a client sending a request and receiving a response.
Why it matters: Latency directly affects user experience. Amazon found that every 100ms of added latency cost them 1% in sales. Google found that a 500ms increase in search latency reduced traffic by 20%.
How to discuss it:
| Percentile | What It Means | Typical Target | When to Use |
|---|---|---|---|
| p50 (median) | Half of requests are faster than this | 50–100ms for web APIs | General performance baseline |
| p95 | 95% of requests are faster | 200–500ms | Standard performance target |
| p99 | 99% of requests are faster | 500ms–1s | Tail latency; affects heaviest users |
| p99.9 | 99.9% of requests are faster | 1–2s | Critical for payment/checkout flows |
Interview application: "I would set an SLO of p99 latency under 200ms for the feed service. Our estimation shows 10,000 reads per second. With a Redis cache achieving 95% hit ratio, cached reads return in 2ms. The remaining 5% of requests hit PostgreSQL at 15ms average. The p99 latency is driven by cache misses during cold starts and database lock contention—I would mitigate this with connection pooling and read replicas."
Critical insight: Always discuss tail latency (p99, p99.9), not just averages. Average latency hides the worst-case experience. A system with 50ms average latency but 5-second p99 latency is broken for 1% of users—often your highest-value users making complex requests.
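A minimal sketch of how these percentiles come out of raw request timings. The sample data is synthetic, chosen only to show how the average hides the tail:

```python
import random
import statistics

# Synthetic latency samples (ms): mostly fast requests plus a 2% slow tail,
# standing in for what a real metrics pipeline would collect.
random.seed(42)
samples = ([random.gauss(50, 10) for _ in range(9_800)]
           + [random.uniform(500, 2_000) for _ in range(200)])

def percentile(values, pct):
    """Nearest-rank percentile: the latency that pct% of requests beat."""
    ordered = sorted(values)
    index = max(0, int(len(ordered) * pct / 100) - 1)
    return ordered[index]

print(f"average: {statistics.mean(samples):7.1f} ms")  # looks healthy
print(f"p50:     {percentile(samples, 50):7.1f} ms")
print(f"p95:     {percentile(samples, 95):7.1f} ms")
print(f"p99:     {percentile(samples, 99):7.1f} ms")   # exposes the slow tail
```

With these synthetic samples the average stays well under 100ms while p99 lands above one second, which is exactly the gap the insight above warns about.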
2. Throughput
What it measures: The number of operations a system can handle per unit of time. Measured as queries per second (QPS), requests per second (RPS), or transactions per second (TPS).
Why it matters: Throughput defines whether your architecture can handle the expected load. If your system needs to serve 100,000 QPS and your database maxes out at 10,000 QPS, you have an architectural problem, not a tuning problem.
How to derive throughput from user counts:
Daily active users (DAU) → requests per day → requests per second (QPS).
Example: 10M DAU × 20 requests per user per day = 200M requests/day. 200M / 86,400 seconds = ~2,300 QPS average. Peak traffic is typically 2–3x average = ~7,000 QPS peak.
Interview application: "Based on 10M DAU with 20 requests per user per day, we need to handle ~2,300 QPS average and ~7,000 QPS at peak. A single PostgreSQL instance handles approximately 5,000–10,000 simple read QPS. At peak, we are at the upper boundary. I would add 2 read replicas and a Redis cache to ensure headroom. With the cache absorbing 90% of reads, the database sees only ~700 QPS—well within capacity."
3. Availability
What it measures: The percentage of time the system is operational and serving correct responses.
Why it matters: Availability targets drive architectural complexity and cost. Each additional "nine" requires significantly more redundancy.
| Target | Annual Downtime | Architectural Requirements |
|---|---|---|
| 99% | 3.65 days | Basic redundancy |
| 99.9% | 8.76 hours | Multi-AZ deployment, health checks |
| 99.99% | 52.6 minutes | Automated failover, no single points of failure |
| 99.999% | 5.26 minutes | Multi-region active-active, distributed consensus |
Interview application: "For a payment system, I would target 99.99% availability—52 minutes of annual downtime maximum. This requires multi-AZ deployment for every component, automated database failover within 30 seconds, and no single points of failure in the critical payment path. Achieving 99.999% would require active-active multi-region, which adds cross-region replication complexity and approximately doubles infrastructure cost. For our use case, four nines is the right balance."
4. Error Rate
What it measures: The percentage of requests that result in errors (5xx server errors, timeout errors, incorrect responses).
Why it matters: A system can be technically "available" while returning errors to a significant percentage of users. Error rate captures quality of service beyond simple uptime.
Typical targets: Less than 0.1% error rate for production services. Payment systems target less than 0.01%. An error budget of 0.01% on 100M daily requests means no more than 10,000 errors per day.
Interview application: "I would set an error rate SLO of 0.1% for the notification service. With 50M notifications per day, that allows 50,000 failed deliveries daily. I would implement a dead letter queue for failed notifications with automatic retry at exponential backoff. If the error rate exceeds 0.1% for more than 15 minutes, an alarm fires and new deployments are automatically blocked until the rate recovers."
5. Resource Utilization
What it measures: How efficiently the system uses its provisioned resources—CPU, memory, disk I/O, network bandwidth.
Why it matters: Under-utilization wastes money. Over-utilization creates latency spikes and risk of outages. The sweet spot is typically 60–75% CPU utilization for production services.
Key utilization metrics:
- CPU utilization: Target 60–75% average with headroom for spikes.
- Memory utilization: Monitor for memory leaks; target below 80% to prevent OOM kills.
- Disk I/O: Monitor IOPS and queue depth; high queue depth indicates an I/O bottleneck.
- Network bandwidth: Measure against provisioned capacity; auto-scale when utilization exceeds 70%.
Interview application: "I would configure auto-scaling to add instances when average CPU exceeds 70% and remove instances when it drops below 40%. This maintains headroom for traffic spikes while avoiding over-provisioning during off-peak hours. For the database, I would monitor connection pool utilization—if it consistently exceeds 80%, that signals we need read replicas or connection pooling optimization."
6. Cache Hit Ratio
What it measures: The percentage of data requests served from cache rather than the origin database.
Why it matters: Cache hit ratio directly determines how much load reaches your database. A 90% hit ratio means the database sees 10x less traffic. A 99% hit ratio means 100x less.
Typical targets: 90–95% for general web applications. 99%+ for read-heavy systems with hot data (social media feeds, product catalogs). Below 80% suggests the caching strategy needs redesign.
Interview application: "The feed service is read-heavy with a 50:1 read-to-write ratio. I would cache the precomputed feed in Redis with a 5-minute TTL. Based on our access pattern analysis, 95% of feed reads are for the top 10% of active users whose feeds are in cache. This gives us a 95% cache hit ratio, reducing database read QPS from 50,000 to 2,500."
7. Data Growth Rate
What it measures: How fast the system's data volume increases over time, driving storage, indexing, and query performance decisions.
Why it matters: A system that works perfectly at 1 TB can break at 10 TB if the data model does not account for growth. Storage cost, query performance, and backup duration all scale with data volume.
How to estimate: Users × data per user per day × retention period.
Example: 10M users × 5 KB per request × 20 requests/day = 1 TB/day. With 1-year retention: 365 TB. This volume requires sharding, tiered storage, and data lifecycle policies.
Interview application: "At 1 TB per day of new data with 1-year retention, we will store 365 TB. I would shard the database by user_id across 50 partitions to keep individual partition sizes manageable. Data older than 90 days moves to cold storage via lifecycle policy. This keeps the hot dataset under 90 TB—within the performance envelope of our sharded PostgreSQL cluster."
For structured practice incorporating these metrics into complete system design solutions, Grokking the System Design Interview teaches back-of-envelope estimation and metric-driven design across 18 real-world problems.
SLOs, SLIs, and Error Budgets: The Production Framework
In production systems, metrics are organized into a framework popularized by Google's SRE practices.
SLI (Service Level Indicator): A specific metric that measures one aspect of service quality. Example: p99 latency of the /api/feed endpoint.
SLO (Service Level Objective): A target value for an SLI. Example: p99 latency of the /api/feed endpoint shall be below 200ms.
SLA (Service Level Agreement): A contract with consequences if the SLO is violated. Example: If p99 latency exceeds 200ms for more than 0.01% of requests in a month, the customer receives a service credit.
Error budget: The amount of unreliability your SLO allows. An availability SLO of 99.99% gives you an error budget of 52.6 minutes of downtime per year. When the error budget is consumed, new feature deployments halt until reliability improves.
Interview application: "I would define three SLOs for the notification service: p99 delivery latency under 5 seconds, availability of 99.99%, and error rate below 0.1%. The error budget for availability is 52.6 minutes per year. I would track error budget consumption on a 30-day rolling window. When 75% of the monthly error budget is consumed, we shift engineering effort from features to reliability work."
How to Use Metrics in Each Interview Phase
During estimation (first 5 minutes): Calculate QPS, storage, and bandwidth from user counts. These numbers become the constraints your design must satisfy. "10M DAU × 20 requests/day = ~2,300 QPS average, ~7,000 QPS peak."
During high-level design: Reference your metrics when choosing components. "At 7,000 QPS peak, a single database instance is insufficient. I would add read replicas and a cache layer."
During deep dive: Set specific SLOs for critical components. "The payment service targets p99 latency under 100ms and 99.99% availability."
During trade-offs: Frame trade-offs in metric terms. "Adding a second region improves availability from 99.99% to 99.999% but doubles infrastructure cost and adds 50ms of cross-region replication latency."
Frequently Asked Questions
What metrics should I mention in a system design interview?
The essential seven: latency (p50/p95/p99), throughput (QPS), availability (nines), error rate, resource utilization, cache hit ratio, and data growth rate. Choose the 3–4 most relevant to your specific design and quantify them with specific targets.
Why is p99 latency more important than average latency?
Average latency hides worst-case experience. A system with 50ms average but 5-second p99 has 1% of users experiencing terrible performance—often your highest-value users making complex requests. Amazon and Google both found that tail latency directly correlates with revenue loss. Always discuss p99 in interviews.
How do I estimate QPS from user counts?
DAU × requests per user per day ÷ 86,400 seconds = average QPS. Peak QPS is typically 2–3x average. Example: 10M DAU × 20 requests = 200M daily requests ÷ 86,400 = ~2,300 QPS average, ~7,000 peak.
What is a good cache hit ratio?
90–95% for general web applications. 99%+ for read-heavy systems with concentrated access patterns. Below 80% suggests the caching strategy needs redesign—likely a mismatch between what is cached and what is requested.
How do I choose between 99.9% and 99.99% availability?
99.9% allows 8.76 hours of annual downtime and requires multi-AZ deployment. 99.99% allows 52.6 minutes and requires automated failover with no single points of failure. 99.999% allows 5.26 minutes and requires multi-region active-active. Choose based on the business impact of downtime and the engineering cost of each additional nine.
What are SLOs and SLIs?
SLIs (Service Level Indicators) are specific metrics measuring service quality—like p99 latency or error rate. SLOs (Service Level Objectives) are target values for SLIs—like "p99 latency under 200ms." Error budgets are the allowed unreliability—99.99% availability allows 52.6 minutes of annual downtime.
How do I calculate storage requirements?
Users × data per user per day × retention period. Example: 10M users × 5 KB per request × 20 requests/day × 365 days = 365 TB per year. This drives decisions about sharding, tiered storage, and data lifecycle policies.
Should I mention resource utilization in system design interviews?
Yes, when discussing auto-scaling and cost efficiency. "Auto-scaling triggers at 70% CPU utilization and scales down at 40%" shows you understand capacity planning. Resource utilization connects system design to operational reality and cost.
What is an error budget?
The amount of unreliability your availability SLO permits. At 99.99% availability, your error budget is 52.6 minutes of downtime per year. When the budget is consumed, new deployments halt until reliability improves. This framework, popularized by Google SRE, balances feature velocity against system reliability.
How many metrics should I define for a system design interview?
Three to five SLOs for the overall system is appropriate. Define the most critical metric for each major component (latency for the API, availability for the database, error rate for the message queue). Too many metrics signals over-engineering; too few signals insufficient rigor.
TL;DR
System design scalability metrics transform vague descriptions into engineering commitments. The seven core metrics are: latency (always discuss p99, not just average—Amazon found 100ms of added latency costs 1% in sales), throughput (derive QPS from DAU × requests/user ÷ 86,400), availability (each additional nine demands significantly more redundancy and cost), error rate (target <0.1% with dead letter queues for retries), resource utilization (auto-scale at 70% CPU, scale down at 40%), cache hit ratio (95% reduces database load 20x), and data growth rate (users × data/user × retention = storage requirements). Organize metrics using the SLO/SLI/error budget framework from Google SRE. In interviews, calculate metrics during estimation, reference them when choosing components, set SLOs during deep dives, and frame trade-offs in metric terms. Quantified answers consistently score higher than qualitative ones.