Key performance indicators for evaluating system design scalability
System design scalability metrics are quantifiable indicators that measure how well a distributed system handles increasing load while maintaining acceptable performance. These include latency (how fast), throughput (how much), availability (how reliable), error rate (how correct), and resource utilization (how efficient). In system design interviews, candidates who quantify their design—"This architecture handles 50,000 QPS with p99 latency under 200ms at 99.99% availability"—score significantly higher than candidates who describe systems in qualitative terms like "fast" or "highly available." Numbers transform vague claims into engineering commitments, and interviewers reward the precision.
Key Takeaways
- Every system design answer should include specific metrics. "High throughput" is vague. "50,000 queries per second" is engineering. Interviewers use your metric choices and target values to assess whether you understand the system's real constraints.
- The seven core scalability metrics are: latency (p50, p95, p99), throughput (QPS/RPS), availability (nines), error rate, resource utilization, cache hit ratio, and data growth rate.
- Metrics work in tension. Optimizing for latency often reduces throughput. Maximizing availability increases cost. Understanding these trade-offs—and articulating them with numbers—is what separates senior candidates from mid-level ones.
- SLOs (Service Level Objectives) and SLIs (Service Level Indicators) are the frameworks production systems use to define and measure scalability targets. Mentioning them in interviews signals production-grade thinking.
- Back-of-envelope estimation—converting user counts to QPS, storage, and bandwidth—is how you derive the metrics your design must achieve. This estimation step is mandatory at Google and heavily weighted at every FAANG company.
Why Metrics Matter in System Design
A system design without metrics is a sketch, not an architecture. Metrics anchor every design decision in reality.
Without metrics: "I would add a cache to improve performance."
With metrics: "At 50,000 QPS, the database handles reads at 15ms average but degrades to 500ms at p99 due to lock contention. Adding a Redis cache with a 95% hit ratio reduces database load to 2,500 QPS, bringing p99 read latency below 50ms."
The second answer demonstrates three things interviewers evaluate: you understand the problem quantitatively, you can reason about the impact of architectural decisions, and you can set measurable targets that define success. This is why back-of-envelope estimation—often the first 5 minutes of a system design interview—matters so much. The numbers you calculate in that phase become the constraints your entire design must satisfy.
The Seven Core Scalability Metrics
1. Latency
What it measures: The time between a client sending a request and receiving a response.
Why it matters: Latency directly affects user experience. Amazon found that every 100ms of added latency cost them 1% in sales. Google found that a 500ms increase in search latency reduced traffic by 20%.
How to discuss it:
| Percentile | What It Means | Typical Target | When to Use |
|---|---|---|---|
| p50 (median) | Half of requests are faster than this | 50–100ms for web APIs | General performance baseline |
| p95 | 95% of requests are faster | 200–500ms | Standard performance target |
| p99 | 99% of requests are faster | 500ms–1s | Tail latency; affects heaviest users |
| p99.9 | 99.9% of requests are faster | 1–2s | Critical for payment/checkout flows |
Interview application: "I would set an SLO of p99 latency under 200ms for the feed service. Our estimation shows 10,000 reads per second. With a Redis cache achieving 95% hit ratio, cached reads return in 2ms. The remaining 5% of requests hit PostgreSQL at 15ms average. The p99 latency is driven by cache misses during cold starts and database lock contention—I would mitigate this with connection pooling and read replicas."
Critical insight: Always discuss tail latency (p99, p99.9), not just averages. Average latency hides the worst-case experience. A system with 50ms average latency but 5-second p99 latency is broken for 1% of users—often your highest-value users making complex requests.
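A minimal sketch of how these percentiles come out of raw request timings. The sample data is synthetic, chosen only to show how the average hides the tail:

```python
import random
import statistics

# Synthetic latency samples (ms): mostly fast requests plus a 2% slow tail,
# standing in for what a real metrics pipeline would collect.
random.seed(42)
samples = ([random.gauss(50, 10) for _ in range(9_800)]
           + [random.uniform(500, 2_000) for _ in range(200)])

def percentile(values, pct):
    """Nearest-rank percentile: the latency that pct% of requests beat."""
    ordered = sorted(values)
    index = max(0, int(len(ordered) * pct / 100) - 1)
    return ordered[index]

print(f"average: {statistics.mean(samples):7.1f} ms")  # looks healthy
print(f"p50:     {percentile(samples, 50):7.1f} ms")
print(f"p95:     {percentile(samples, 95):7.1f} ms")
print(f"p99:     {percentile(samples, 99):7.1f} ms")   # exposes the slow tail
```

With these synthetic samples the average stays well under 100ms while p99 lands above one second, which is exactly the gap the insight above warns about.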
2. Throughput
What it measures: The number of operations a system can handle per unit of time. Measured as queries per second (QPS), requests per second (RPS), or transactions per second (TPS).
Why it matters: Throughput defines whether your architecture can handle the expected load. If your system needs to serve 100,000 QPS and your database maxes out at 10,000 QPS, you have an architectural problem, not a tuning problem.
How to derive throughput from user counts:
Daily active users (DAU) → requests per day → requests per second (QPS).
Example: 10M DAU × 20 requests per user per day = 200M requests/day. 200M / 86,400 seconds = ~2,300 QPS average. Peak traffic is typically 2–3x average = ~7,000 QPS peak.
Interview application: "Based on 10M DAU with 20 requests per user per day, we need to handle ~2,300 QPS average and ~7,000 QPS at peak. A single PostgreSQL instance handles approximately 5,000–10,000 simple read QPS. At peak, we are at the upper boundary. I would add 2 read replicas and a Redis cache to ensure headroom. With the cache absorbing 90% of reads, the database sees only ~700 QPS—well within capacity."
3. Availability
What it measures: The percentage of time the system is operational and serving correct responses.
Why it matters: Availability targets drive architectural complexity and cost. Each additional "nine" requires significantly more redundancy.
| Target | Annual Downtime | Architectural Requirements |
|---|---|---|
| 99% | 3.65 days | Basic redundancy |
| 99.9% | 8.76 hours | Multi-AZ deployment, health checks |
| 99.99% | 52.6 minutes | Automated failover, no single points of failure |
| 99.999% | 5.26 minutes | Multi-region active-active, distributed consensus |
Interview application: "For a payment system, I would target 99.99% availability—52 minutes of annual downtime maximum. This requires multi-AZ deployment for every component, automated database failover within 30 seconds, and no single points of failure in the critical payment path. Achieving 99.999% would require active-active multi-region, which adds cross-region replication complexity and approximately doubles infrastructure cost. For our use case, four nines is the right balance."
4. Error Rate
What it measures: The percentage of requests that result in errors (5xx server errors, timeout errors, incorrect responses).
Why it matters: A system can be technically "available" while returning errors to a significant percentage of users. Error rate captures quality of service beyond simple uptime.
Typical targets: Less than 0.1% error rate for production services. Payment systems target less than 0.01%. An error budget of 0.01% on 100M daily requests means no more than 10,000 errors per day.
Interview application: "I would set an error rate SLO of 0.1% for the notification service. With 50M notifications per day, that allows 50,000 failed deliveries daily. I would implement a dead letter queue for failed notifications with automatic retry at exponential backoff. If the error rate exceeds 0.1% for more than 15 minutes, an alarm fires and new deployments are automatically blocked until the rate recovers."
5. Resource Utilization
What it measures: How efficiently the system uses its provisioned resources—CPU, memory, disk I/O, network bandwidth.
Why it matters: Under-utilization wastes money. Over-utilization creates latency spikes and risk of outages. The sweet spot is typically 60–75% CPU utilization for production services.
Key utilization metrics:
- CPU utilization: Target 60–75% average with headroom for spikes.
- Memory utilization: Monitor for memory leaks; target below 80% to prevent OOM kills.
- Disk I/O: Monitor IOPS and queue depth; high queue depth indicates an I/O bottleneck.
- Network bandwidth: Measure against provisioned capacity; auto-scale when utilization exceeds 70%.
Interview application: "I would configure auto-scaling to add instances when average CPU exceeds 70% and remove instances when it drops below 40%. This maintains headroom for traffic spikes while avoiding over-provisioning during off-peak hours. For the database, I would monitor connection pool utilization—if it consistently exceeds 80%, that signals we need read replicas or connection pooling optimization."
6. Cache Hit Ratio
What it measures: The percentage of data requests served from cache rather than the origin database.
Why it matters: Cache hit ratio directly determines how much load reaches your database. A 90% hit ratio means the database sees 10x less traffic. A 99% hit ratio means 100x less.
Typical targets: 90–95% for general web applications. 99%+ for read-heavy systems with hot data (social media feeds, product catalogs). Below 80% suggests the caching strategy needs redesign.
Interview application: "The feed service is read-heavy with a 50:1 read-to-write ratio. I would cache the precomputed feed in Redis with a 5-minute TTL. Based on our access pattern analysis, 95% of feed reads are for the top 10% of active users whose feeds are in cache. This gives us a 95% cache hit ratio, reducing database read QPS from 50,000 to 2,500."
7. Data Growth Rate
What it measures: How fast the system's data volume increases over time, driving storage, indexing, and query performance decisions.
Why it matters: A system that works perfectly at 1 TB can break at 10 TB if the data model does not account for growth. Storage cost, query performance, and backup duration all scale with data volume.
How to estimate: Users × data per user per day × retention period.
Example: 10M users × 5 KB per request × 20 requests/day = 1 TB/day. With 1-year retention: 365 TB. This volume requires sharding, tiered storage, and data lifecycle policies.
Interview application: "At 1 TB per day of new data with 1-year retention, we will store 365 TB. I would shard the database by user_id across 50 partitions to keep individual partition sizes manageable. Data older than 90 days moves to cold storage via lifecycle policy. This keeps the hot dataset under 90 TB—within the performance envelope of our sharded PostgreSQL cluster."
For structured practice incorporating these metrics into complete system design solutions, Grokking the System Design Interview teaches back-of-envelope estimation and metric-driven design across 18 real-world problems.
SLOs, SLIs, and Error Budgets: The Production Framework
In production systems, metrics are organized into a framework popularized by Google's SRE practices.
SLI (Service Level Indicator): A specific metric that measures one aspect of service quality. Example: p99 latency of the /api/feed endpoint.
SLO (Service Level Objective): A target value for an SLI. Example: p99 latency of the /api/feed endpoint shall be below 200ms.
SLA (Service Level Agreement): A contract with consequences if the SLO is violated. Example: If p99 latency exceeds 200ms for more than 0.01% of requests in a month, the customer receives a service credit.
Error budget: The amount of unreliability your SLO allows. An availability SLO of 99.99% gives you an error budget of 52.6 minutes of downtime per year. When the error budget is consumed, new feature deployments halt until reliability improves.
Interview application: "I would define three SLOs for the notification service: p99 delivery latency under 5 seconds, availability of 99.99%, and error rate below 0.1%. The error budget for availability is 52.6 minutes per year. I would track error budget consumption on a 30-day rolling window. When 75% of the monthly error budget is consumed, we shift engineering effort from features to reliability work."
How to Use Metrics in Each Interview Phase
During estimation (first 5 minutes): Calculate QPS, storage, and bandwidth from user counts. These numbers become the constraints your design must satisfy. "10M DAU × 20 requests/day = ~2,300 QPS average, ~7,000 QPS peak."
During high-level design: Reference your metrics when choosing components. "At 7,000 QPS peak, a single database instance is insufficient. I would add read replicas and a cache layer."
During deep dive: Set specific SLOs for critical components. "The payment service targets p99 latency under 100ms and 99.99% availability."
During trade-offs: Frame trade-offs in metric terms. "Adding a second region improves availability from 99.99% to 99.999% but doubles infrastructure cost and adds 50ms of cross-region replication latency."
Frequently Asked Questions
What metrics should I mention in a system design interview?
The essential seven: latency (p50/p95/p99), throughput (QPS), availability (nines), error rate, resource utilization, cache hit ratio, and data growth rate. Choose the 3–4 most relevant to your specific design and quantify them with specific targets.
Why is p99 latency more important than average latency?
Average latency hides worst-case experience. A system with 50ms average but 5-second p99 has 1% of users experiencing terrible performance—often your highest-value users making complex requests. Amazon and Google both found that tail latency directly correlates with revenue loss. Always discuss p99 in interviews.
How do I estimate QPS from user counts?
DAU × requests per user per day ÷ 86,400 seconds = average QPS. Peak QPS is typically 2–3x average. Example: 10M DAU × 20 requests = 200M daily requests ÷ 86,400 = ~2,300 QPS average, ~7,000 peak.
What is a good cache hit ratio?
90–95% for general web applications. 99%+ for read-heavy systems with concentrated access patterns. Below 80% suggests the caching strategy needs redesign—likely a mismatch between what is cached and what is requested.
How do I choose between 99.9% and 99.99% availability?
99.9% allows 8.76 hours of annual downtime and requires multi-AZ deployment. 99.99% allows 52.6 minutes and requires automated failover with no single points of failure. 99.999% allows 5.26 minutes and requires multi-region active-active. Choose based on the business impact of downtime and the engineering cost of each additional nine.
What are SLOs and SLIs?
SLIs (Service Level Indicators) are specific metrics measuring service quality—like p99 latency or error rate. SLOs (Service Level Objectives) are target values for SLIs—like "p99 latency under 200ms." Error budgets are the allowed unreliability—99.99% availability allows 52.6 minutes of annual downtime.
How do I calculate storage requirements?
Users × data per user per day × retention period. Example: 10M users × 5 KB per request × 20 requests/day × 365 days = 365 TB per year. This drives decisions about sharding, tiered storage, and data lifecycle policies.
Should I mention resource utilization in system design interviews?
Yes, when discussing auto-scaling and cost efficiency. "Auto-scaling triggers at 70% CPU utilization and scales down at 40%" shows you understand capacity planning. Resource utilization connects system design to operational reality and cost.
What is an error budget?
The amount of unreliability your availability SLO permits. At 99.99% availability, your error budget is 52.6 minutes of downtime per year. When the budget is consumed, new deployments halt until reliability improves. This framework, popularized by Google SRE, balances feature velocity against system reliability.
How many metrics should I define for a system design interview?
Three to five SLOs for the overall system is appropriate. Define the most critical metric for each major component (latency for the API, availability for the database, error rate for the message queue). Too many metrics signals over-engineering; too few signals insufficient rigor.
TL;DR
System design scalability metrics transform vague descriptions into engineering commitments. The seven core metrics are: latency (always discuss p99, not just average—Amazon found 100ms of added latency costs 1% in sales), throughput (derive QPS from DAU × requests/user ÷ 86,400), availability (each additional nine demands significantly more redundancy and cost), error rate (target <0.1% with dead letter queues for retries), resource utilization (auto-scale at 70% CPU, scale down at 40%), cache hit ratio (95% reduces database load 20x), and data growth rate (users × data/user × retention = storage requirements). Organize metrics using the SLO/SLI/error budget framework from Google SRE. In interviews, calculate metrics during estimation, reference them when choosing components, set SLOs during deep dives, and frame trade-offs in metric terms. Quantified answers consistently score higher than qualitative ones.