Best practices for designing scalable backend services
Scalable backend services are server-side components designed to handle increasing traffic, data volume, and user concurrency without degradation in latency, throughput, or availability. Scalability is not a single technique—it is a collection of design decisions that compound: stateless services enable horizontal scaling, connection pooling prevents database exhaustion, caching reduces redundant computation, asynchronous processing decouples request handling from heavy work, and sharding distributes data beyond single-node limits. In system design interviews, every architecture you propose is a backend service architecture. When the interviewer asks "How does this handle 10x traffic?", they are testing whether you have embedded scalability into the design from the start or bolted it on as an afterthought. The best answers demonstrate that scalability is not a phase—it is a property of well-designed services.
Key Takeaways
- Stateless services are the foundation of horizontal scaling. If a service stores no session data locally, any instance can handle any request. Adding instances doubles capacity. Removing instances loses nothing.
- The database is almost always the bottleneck. Application servers scale horizontally with ease. Databases do not. Every scalability strategy—caching, read replicas, sharding, CQRS—exists because the database hits its limit before anything else.
- Design every write operation as idempotent. In distributed systems, retries are inevitable. An idempotent operation produces the same result whether executed once or ten times. This prevents duplicate orders, double charges, and corrupted state.
- Asynchronous processing decouples request latency from work duration. A user does not need to wait for an email to send, a thumbnail to generate, or analytics to record. Push heavy work to background queues.
- Observability is not optional. You cannot scale what you cannot measure. Instrument every service with metrics (latency, error rate, throughput), logs (structured, searchable), and traces (request flow across services).
Practice 1: Design Stateless Services
A stateless service stores no client-specific data between requests. Every request contains all the information needed to process it—typically via a JWT, API key, or session identifier that maps to an external store (Redis, DynamoDB).
Why it matters: Stateless services scale horizontally by adding instances behind a load balancer. Any instance can handle any request because no instance holds unique state. Auto-scaling adds instances during traffic spikes and removes them during lulls—without state migration or session draining.
The statefulness trap: A service that stores user sessions in local memory becomes stateful. If that instance fails, sessions are lost. If the load balancer routes a user to a different instance, their session is missing. Sticky sessions (routing users to the same instance) are a band-aid that prevents horizontal scaling.
Interview application: "All services are stateless. User sessions are stored in Redis with a 30-minute TTL. JWTs contain the user_id and role, validated on every request without a database lookup. This means I can auto-scale from 3 to 30 instances during a traffic spike with zero session management complexity."
Practice 2: Scale the Database Separately
Application servers are cheap to scale—add more containers. Databases are expensive to scale—adding capacity requires read replicas, connection pooling, caching, or sharding, each with significant trade-offs.
Read Replicas
Create 1–3 read-only copies of the primary database. Route read traffic to replicas and write traffic to the primary. For read-heavy workloads (90% reads), this multiplies read capacity by roughly 3–4x.
Trade-off: Replicas may serve slightly stale data due to replication lag (typically milliseconds, but can spike to seconds under heavy write load). Acceptable for feeds and dashboards; not acceptable for payment balances.
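A minimal routing sketch, assuming psycopg2 and placeholder connection strings: writes always go to the primary, reads go to a randomly chosen replica (and may therefore be slightly stale).

```python
# Read/write routing sketch. The DSNs are placeholders; a real deployment would
# usually put this behind a proxy or the ORM's replica-routing support.
import random
import psycopg2

PRIMARY_DSN = "host=db-primary dbname=app"
REPLICA_DSNS = ["host=db-replica-1 dbname=app", "host=db-replica-2 dbname=app"]

def get_connection(readonly: bool):
    # Reads tolerate replication lag; writes must hit the primary.
    dsn = random.choice(REPLICA_DSNS) if readonly else PRIMARY_DSN
    return psycopg2.connect(dsn)
```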
Connection Pooling
Every database connection consumes memory and file descriptors. Without pooling, each request opens a new connection—and at 10,000 concurrent requests, the database runs out of connections before it runs out of CPU. Connection pools (PgBouncer for PostgreSQL, HikariCP for Java) maintain a fixed set of reusable connections, serving thousands of requests through dozens of connections.
Interview application: "At 10,000 QPS, direct connections would exhaust PostgreSQL's default 100-connection limit. I would deploy PgBouncer in transaction mode with a pool of 50 connections. Each request borrows a connection for the duration of its query and returns it immediately. This serves 10,000 QPS through 50 connections."
Caching
A Redis cache with 90–95% hit ratio reduces database load by 10–20x. Cache frequently accessed data with appropriate TTLs. Use cache-aside (lazy loading) for read-heavy patterns: check cache first, query database on miss, populate cache.
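A cache-aside sketch with redis-py; the key format, the 5-minute TTL, and the `fetch_profile_from_db` helper are placeholders for whatever your data access layer provides:

```python
# Cache-aside (lazy loading): check Redis first, fall back to the database on a
# miss, then populate the cache with a TTL.
import json
import redis

cache = redis.Redis(host="cache", port=6379)

def get_profile(user_id: int, db) -> dict:
    key = f"profile:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                  # cache hit: no database query
    profile = fetch_profile_from_db(db, user_id)   # cache miss: go to the database (placeholder helper)
    cache.setex(key, 300, json.dumps(profile))     # populate with a 5-minute TTL
    return profile
```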
Sharding
When a single database node cannot handle the write throughput or storage capacity, shard the data across multiple nodes. Hash-based sharding on a unique key (user_id) distributes data evenly. This is a last resort—sharding adds complexity for cross-shard queries and distributed transactions.
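A minimal shard-selection sketch; the shard count is illustrative, and note that changing it remaps existing keys, which is part of why resharding is painful:

```python
# Hash-based shard selection: the same user_id always maps to the same shard,
# and the hash spreads keys evenly across shards.
import hashlib

NUM_SHARDS = 8

def shard_for(user_id: str) -> int:
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```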
Practice 3: Make Every Write Idempotent
In distributed systems, networks fail, clients retry, and messages are delivered more than once. An idempotent operation produces the same result regardless of how many times it executes.
Implementation patterns:
Unique request identifiers: Every write request includes a client-generated UUID. The server checks if this UUID has been processed before executing. If already processed, return the stored result.
Database constraints: Use UPSERT (INSERT ... ON CONFLICT DO UPDATE) instead of plain INSERT to handle duplicate writes gracefully. The database enforces uniqueness rather than application code.
Natural idempotency: GET, PUT, and DELETE are naturally idempotent by design. POST is not—add idempotency keys to all POST endpoints that modify state.
Interview application: "Every POST endpoint accepts an Idempotency-Key header. The server stores key-to-result mappings in Redis with a 24-hour TTL. Duplicate requests return the stored result without re-execution. This prevents double-charging when clients retry after network timeouts."
Practice 4: Process Heavy Work Asynchronously
A user clicking "Place Order" should not wait for the confirmation email, inventory update, fraud scan, and analytics event to complete. These operations happen asynchronously after the order is persisted.
Pattern: The API handler persists the order to the database, publishes an OrderPlaced event to Kafka, and returns a 202 Accepted response immediately. Downstream consumers—email service, inventory service, fraud detection—process the event independently at their own pace.
Benefits: Request latency drops from the sum of all downstream operations (500ms+) to just the database write (10ms). Each downstream service scales independently. A slow fraud scan does not block order confirmation.
Interview application: "The order API writes the order to PostgreSQL and publishes to Kafka, returning 202 in under 50ms. The email consumer, inventory consumer, and analytics consumer each subscribe to the order-events topic independently. If the email service goes down, messages accumulate in Kafka and are processed when the service recovers—no orders are lost."
Practice 5: Implement Rate Limiting and Backpressure
Without rate limiting, a single misbehaving client can overwhelm the entire service. Without backpressure, a traffic spike cascades through every downstream dependency.
Rate limiting: Enforce per-client limits at the API gateway using the token bucket algorithm. Return 429 Too Many Requests with a Retry-After header. Different client tiers receive different limits (free: 100/min, pro: 1,000/min).
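A minimal in-process token bucket sketch; the capacity and refill rate are illustrative, and a real deployment would enforce this at the gateway or in Redis so every instance shares one view of the client's budget:

```python
# Token bucket: capacity is the burst size, refill_rate is the sustained
# requests-per-second. Each allowed request consumes one token.
import time

class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller should respond 429 with a Retry-After header
```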
Backpressure: When a downstream service is slow, the upstream service must not keep sending requests. Circuit breakers stop calls to failing dependencies. Message queues buffer traffic spikes. Auto-scaling responds to increased load—but has a startup delay that queues must absorb.
Practice 6: Design for Failure
Every component will fail. Scalable services handle failure gracefully instead of catastrophically.
Circuit breakers: Stop calling a failing downstream service after consecutive failures. The circuit opens (requests rejected immediately), then half-opens (test requests probe recovery), then closes (normal operation resumes). Prevents cascading failures across the service mesh.
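A minimal single-threaded sketch of those state transitions; the threshold and reset timeout are illustrative, and production code would typically use a library such as resilience4j or pybreaker:

```python
# Circuit breaker: after `threshold` consecutive failures the circuit opens and
# calls fail fast; after `reset_timeout` seconds one probe call is let through
# (half-open), and a success closes the circuit again.
import time

class CircuitBreaker:
    def __init__(self, threshold: int = 5, reset_timeout: float = 30.0):
        self.threshold = threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # half-open: fall through and let one probe call test recovery
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.opened_at = None   # success closes the circuit
        return result
```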
Retries with exponential backoff: Retry transient failures (network blips, temporary overload) with increasing delays: 100ms, 200ms, 400ms, 800ms. Add jitter (randomization) to prevent thundering herd—thousands of clients retrying simultaneously at the same interval.
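A sketch of that retry schedule with full jitter; attempt count and base delay are illustrative, and in practice you would only retry errors that are plausibly transient:

```python
# Exponential backoff plus full jitter: delays grow 100ms, 200ms, 400ms, ...,
# and each wait is randomized so thousands of clients do not retry in lockstep.
import random
import time

def retry(fn, attempts: int = 4, base_delay: float = 0.1):
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise                                  # out of retries: surface the error
            delay = base_delay * (2 ** attempt)        # 0.1s, 0.2s, 0.4s, ...
            time.sleep(random.uniform(0, delay))       # full jitter
```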
Graceful degradation: When a non-critical dependency fails, serve a reduced experience instead of an error. If the recommendation service is down, show trending items. If the personalization service is slow, show generic content. The core service continues operating.
Health checks: Every service exposes a /health endpoint. The load balancer polls it every 10 seconds. Unhealthy instances are removed from rotation. Kubernetes liveness and readiness probes automate this entirely.
Timeouts: Set explicit timeouts on every outbound call—database queries, HTTP requests to other services, cache lookups. A missing timeout means a slow dependency blocks your thread pool indefinitely, cascading failure across the entire service. Default to aggressive timeouts (500ms–2s for inter-service calls) and tune based on observed p99 latency. A request waiting 30 seconds for a response that will never come is worse than failing fast and retrying.
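For example, with the requests library an outbound call can carry explicit connect and read timeouts (the endpoint and values below are illustrative):

```python
# Without a timeout, a slow dependency can hold this thread indefinitely.
import requests

resp = requests.get(
    "http://inventory-service/items/42",   # hypothetical internal endpoint
    timeout=(0.5, 2.0),                    # 500ms to connect, 2s to read; tune against p99
)
```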
Practice 7: Build Observability From Day One
Scalability decisions require data. Without metrics, you are guessing at bottlenecks.
The three pillars:
Metrics (Prometheus + Grafana): p50/p95/p99 latency, error rate, throughput, CPU/memory utilization, cache hit ratio, database connection pool usage, consumer lag. Set alerts on SLO breaches.
Logs (ELK stack or Datadog): Structured JSON logs with request_id, user_id, service name, and duration. Centralized log aggregation enables searching across all services.
Traces (OpenTelemetry + Jaeger): Follow a single request across 5–15 microservices. Identify which service contributes the most latency. Without distributed tracing, debugging a slow request in a microservices architecture is guesswork.
Interview application: "Every service exposes Prometheus metrics on /metrics. Grafana dashboards show p99 latency, error rate, and throughput per service. Alerts fire to PagerDuty when p99 exceeds 500ms or error rate exceeds 0.5% for 5 consecutive minutes. OpenTelemetry traces propagate a trace_id across all services for end-to-end latency analysis."
Practice 8: Auto-Scale Based on the Right Metrics
Auto-scaling adds and removes instances based on metrics. The choice of metric determines how responsive and cost-effective the scaling is.
CPU-based scaling: Scale when average CPU exceeds 70%. Simple and widely used. Limitation: CPU may not correlate with request latency—a service can have low CPU but high latency due to database contention.
Request-rate-based scaling: Scale based on requests per second per instance. More directly tied to user experience. Better for API services where each request has predictable CPU cost.
Queue-depth-based scaling: For worker services consuming from Kafka or SQS, scale based on queue depth or consumer lag. If messages accumulate faster than they are processed, add workers.
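A back-of-the-envelope version of that scaling rule; the per-worker throughput, drain target, and minimum replica count are illustrative assumptions:

```python
# Desired worker count from consumer lag: if each worker drains ~200 messages/sec
# and we want the backlog cleared within 60 seconds, the replica count follows.
import math

def desired_workers(consumer_lag: int, per_worker_rate: float = 200.0,
                    drain_target_seconds: float = 60.0, min_workers: int = 2) -> int:
    return max(min_workers, math.ceil(consumer_lag / (per_worker_rate * drain_target_seconds)))

# e.g. a lag of 600,000 messages -> ceil(600000 / 12000) = 50 workers
```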
For structured practice applying these scalability best practices across complete system design problems, Grokking the System Design Interview covers backend architecture decisions in every design solution.
For advanced patterns including multi-region service deployment, distributed consensus, and production-scale auto-scaling strategies, Grokking the Advanced System Design Interview builds the depth required for L6+ interviews.
Frequently Asked Questions
What is the most important scalability best practice?
Designing stateless services. Statelessness enables horizontal scaling—the simplest and most effective way to increase capacity. Store all session and user-specific data externally (Redis, DynamoDB) so any instance can handle any request.
Why is the database usually the bottleneck?
Application servers are stateless and cheap to replicate. Databases hold state, enforce consistency, and manage disk I/O—all of which have physical limits. Read replicas, caching, connection pooling, and sharding exist because the database hits its capacity ceiling before anything else in the stack.
What is connection pooling and why does it matter?
Connection pooling maintains a fixed set of reusable database connections shared across requests. Without pooling, each request opens a new connection—and at high concurrency, the database runs out of connection slots. PgBouncer (PostgreSQL) and HikariCP (Java) are standard implementations.
How do I make a service idempotent?
Require client-generated idempotency keys on all state-changing requests. Check if the key has been processed before executing. Store key-to-result mappings with a TTL (24–72 hours). Use database UPSERT operations for natural idempotency. This prevents duplicate operations during retries.
When should I use asynchronous processing?
When the user does not need to wait for the result: sending emails, generating thumbnails, updating analytics, processing notifications. Publish events to Kafka or SQS and let consumers process them at their own pace. Keep the synchronous request path fast (under 100ms).
What is a circuit breaker and when do I need one?
A circuit breaker stops calling a failing downstream service after consecutive failures, preventing cascading failures. It transitions through three states: closed (normal), open (requests rejected), half-open (testing recovery). Use circuit breakers for every inter-service dependency in a microservices architecture.
How do I choose the right auto-scaling metric?
CPU utilization for compute-bound services. Request rate for API services with predictable per-request cost. Queue depth or consumer lag for worker services processing messages from Kafka or SQS. The metric should correlate with the user experience you are optimizing.
What is graceful degradation?
Serving reduced functionality when a non-critical dependency fails, instead of returning an error. If the recommendation engine is down, show trending items. If personalization fails, show generic content. The core service continues operating while non-critical features are temporarily disabled.
How many read replicas should I use?
Start with 1–2 for development and 2–3 for production. Each replica adds roughly one node's worth of read capacity (minus replication overhead). For a 90% read / 10% write workload, a primary plus 3 replicas handles approximately 4x the read load of the primary alone.
What observability tools should I mention in a system design interview?
Prometheus + Grafana for metrics, ELK stack or Datadog for centralized logging, OpenTelemetry + Jaeger for distributed tracing. Mention specific metrics you would track: p99 latency, error rate, cache hit ratio, connection pool utilization, and consumer lag. Set alerts on SLO breaches.
TL;DR
Scalable backend services are built on eight core practices:
- Stateless services: store no local state; scale horizontally by adding instances.
- Database scaling: read replicas, connection pooling, caching, sharding—the database is always the bottleneck.
- Idempotent writes: client-generated keys prevent duplicates during retries.
- Asynchronous processing: push heavy work to Kafka/SQS; keep API latency under 100ms.
- Rate limiting and backpressure: protect services from overload with token buckets and circuit breakers.
- Design for failure: circuit breakers, retries with backoff and jitter, graceful degradation.
- Observability: Prometheus metrics, structured logs, OpenTelemetry traces—you cannot scale what you cannot measure.
- Auto-scaling on the right metric: CPU for compute-bound services, request rate for APIs, queue depth for workers.
The database scales last and hardest—invest in caching and read replicas before reaching for sharding. In interviews, demonstrate that scalability is embedded in every design decision, not added as an afterthought.