Best practices for designing scalable backend services
Scalable backend services are server-side components designed to handle increasing traffic, data volume, and user concurrency without degradation in latency, throughput, or availability. Scalability is not a single technique—it is a collection of design decisions that compound: stateless services enable horizontal scaling, connection pooling prevents database exhaustion, caching reduces redundant computation, asynchronous processing decouples request handling from heavy work, and sharding distributes data beyond single-node limits. In system design interviews, every architecture you propose is a backend service architecture. When the interviewer asks "How does this handle 10x traffic?", they are testing whether you have embedded scalability into the design from the start or bolted it on as an afterthought. The best answers demonstrate that scalability is not a phase—it is a property of well-designed services.
Key Takeaways
- Stateless services are the foundation of horizontal scaling. If a service stores no session data locally, any instance can handle any request. Adding instances doubles capacity. Removing instances loses nothing.
- The database is almost always the bottleneck. Application servers scale horizontally with ease. Databases do not. Every scalability strategy—caching, read replicas, sharding, CQRS—exists because the database hits its limit before anything else.
- Design every write operation as idempotent. In distributed systems, retries are inevitable. An idempotent operation produces the same result whether executed once or ten times. This prevents duplicate orders, double charges, and corrupted state.
- Asynchronous processing decouples request latency from work duration. A user does not need to wait for an email to send, a thumbnail to generate, or analytics to record. Push heavy work to background queues.
- Observability is not optional. You cannot scale what you cannot measure. Instrument every service with metrics (latency, error rate, throughput), logs (structured, searchable), and traces (request flow across services).
Practice 1: Design Stateless Services
A stateless service stores no client-specific data between requests. Every request contains all the information needed to process it—typically via a JWT, API key, or session identifier that maps to an external store (Redis, DynamoDB).
Why it matters: Stateless services scale horizontally by adding instances behind a load balancer. Any instance can handle any request because no instance holds unique state. Auto-scaling adds instances during traffic spikes and removes them during lulls—without state migration or session draining.
The statefulness trap: A service that stores user sessions in local memory becomes stateful. If that instance fails, sessions are lost. If the load balancer routes a user to a different instance, their session is missing. Sticky sessions (routing users to the same instance) are a band-aid that prevents horizontal scaling.
Interview application: "All services are stateless. User sessions are stored in Redis with a 30-minute TTL. JWTs contain the user_id and role, validated on every request without a database lookup. This means I can auto-scale from 3 to 30 instances during a traffic spike with zero session management complexity."
Practice 2: Scale the Database Separately
Application servers are cheap to scale—add more containers. Databases are expensive to scale—adding capacity requires read replicas, connection pooling, caching, or sharding, each with significant trade-offs.
Read Replicas
Create 1–3 read-only copies of the primary database. Route read traffic to replicas and write traffic to the primary. For read-heavy workloads (90% reads), this multiplies read capacity by roughly 3–4x.
Trade-off: Replicas may serve slightly stale data due to replication lag (typically milliseconds, but can spike to seconds under heavy write load). Acceptable for feeds and dashboards; not acceptable for payment balances.
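A minimal routing sketch, assuming psycopg2 and placeholder connection strings: writes always go to the primary, reads go to a randomly chosen replica (and may therefore be slightly stale).

```python
# Read/write routing sketch. The DSNs are placeholders; a real deployment would
# usually put this behind a proxy or the ORM's replica-routing support.
import random
import psycopg2

PRIMARY_DSN = "host=db-primary dbname=app"
REPLICA_DSNS = ["host=db-replica-1 dbname=app", "host=db-replica-2 dbname=app"]

def get_connection(readonly: bool):
    # Reads tolerate replication lag; writes must hit the primary.
    dsn = random.choice(REPLICA_DSNS) if readonly else PRIMARY_DSN
    return psycopg2.connect(dsn)
```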
Connection Pooling
Every database connection consumes memory and file descriptors. Without pooling, each request opens a new connection—and at 10,000 concurrent requests, the database runs out of connections before it runs out of CPU. Connection pools (PgBouncer for PostgreSQL, HikariCP for Java) maintain a fixed set of reusable connections, serving thousands of requests through dozens of connections.
Interview application: "At 10,000 QPS, direct connections would exhaust PostgreSQL's default 100-connection limit. I would deploy PgBouncer in transaction mode with a pool of 50 connections. Each request borrows a connection for the duration of its query and returns it immediately. This serves 10,000 QPS through 50 connections."
Caching
A Redis cache with 90–95% hit ratio reduces database load by 10–20x. Cache frequently accessed data with appropriate TTLs. Use cache-aside (lazy loading) for read-heavy patterns: check cache first, query database on miss, populate cache.
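A cache-aside sketch with redis-py; the key format, the 5-minute TTL, and the `fetch_profile_from_db` helper are placeholders for whatever your data access layer provides:

```python
# Cache-aside (lazy loading): check Redis first, fall back to the database on a
# miss, then populate the cache with a TTL.
import json
import redis

cache = redis.Redis(host="cache", port=6379)

def get_profile(user_id: int, db) -> dict:
    key = f"profile:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                  # cache hit: no database query
    profile = fetch_profile_from_db(db, user_id)   # cache miss: go to the database (placeholder helper)
    cache.setex(key, 300, json.dumps(profile))     # populate with a 5-minute TTL
    return profile
```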
Sharding
When a single database node cannot handle the write throughput or storage capacity, shard the data across multiple nodes. Hash-based sharding on a unique key (user_id) distributes data evenly. This is a last resort—sharding adds complexity for cross-shard queries and distributed transactions.
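A minimal shard-selection sketch; the shard count is illustrative, and note that changing it remaps existing keys, which is part of why resharding is painful:

```python
# Hash-based shard selection: the same user_id always maps to the same shard,
# and the hash spreads keys evenly across shards.
import hashlib

NUM_SHARDS = 8

def shard_for(user_id: str) -> int:
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```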
Practice 3: Make Every Write Idempotent
In distributed systems, networks fail, clients retry, and messages are delivered more than once. An idempotent operation produces the same result regardless of how many times it executes.
Implementation patterns:
Unique request identifiers: Every write request includes a client-generated UUID. The server checks if this UUID has been processed before executing. If already processed, return the stored result.
Database constraints: Use UPSERT (INSERT ... ON CONFLICT DO UPDATE) instead of plain INSERT to handle duplicate writes gracefully. The database enforces uniqueness rather than application code.
Natural idempotency: GET, PUT, and DELETE are naturally idempotent by design. POST is not—add idempotency keys to all POST endpoints that modify state.
Interview application: "Every POST endpoint accepts an Idempotency-Key header. The server stores key-to-result mappings in Redis with a 24-hour TTL. Duplicate requests return the stored result without re-execution. This prevents double-charging when clients retry after network timeouts."
Practice 4: Process Heavy Work Asynchronously
A user clicking "Place Order" should not wait for the confirmation email, inventory update, fraud scan, and analytics event to complete. These operations happen asynchronously after the order is persisted.
Pattern: The API handler persists the order to the database, publishes an OrderPlaced event to Kafka, and returns a 202 Accepted response immediately. Downstream consumers—email service, inventory service, fraud detection—process the event independently at their own pace.
Benefits: Request latency drops from the sum of all downstream operations (500ms+) to just the database write (10ms). Each downstream service scales independently. A slow fraud scan does not block order confirmation.
Interview application: "The order API writes the order to PostgreSQL and publishes to Kafka, returning 202 in under 50ms. The email consumer, inventory consumer, and analytics consumer each subscribe to the order-events topic independently. If the email service goes down, messages accumulate in Kafka and are processed when the service recovers—no orders are lost."
Practice 5: Implement Rate Limiting and Backpressure
Without rate limiting, a single misbehaving client can overwhelm the entire service. Without backpressure, a traffic spike cascades through every downstream dependency.
Rate limiting: Enforce per-client limits at the API gateway using the token bucket algorithm. Return 429 Too Many Requests with a Retry-After header. Different client tiers receive different limits (free: 100/min, pro: 1,000/min).
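A minimal in-process token bucket sketch; the capacity and refill rate are illustrative, and a real deployment would enforce this at the gateway or in Redis so every instance shares one view of the client's budget:

```python
# Token bucket: capacity is the burst size, refill_rate is the sustained
# requests-per-second. Each allowed request consumes one token.
import time

class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller should respond 429 with a Retry-After header
```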
Backpressure: When a downstream service is slow, the upstream service must not keep sending requests. Circuit breakers stop calls to failing dependencies. Message queues buffer traffic spikes. Auto-scaling responds to increased load—but has a startup delay that queues must absorb.
Practice 6: Design for Failure
Every component will fail. Scalable services handle failure gracefully instead of catastrophically.
Circuit breakers: Stop calling a failing downstream service after consecutive failures. The circuit opens (requests rejected immediately), then half-opens (test requests probe recovery), then closes (normal operation resumes). Prevents cascading failures across the service mesh.
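A minimal single-threaded sketch of those state transitions; the threshold and reset timeout are illustrative, and production code would typically use a library such as resilience4j or pybreaker:

```python
# Circuit breaker: after `threshold` consecutive failures the circuit opens and
# calls fail fast; after `reset_timeout` seconds one probe call is let through
# (half-open), and a success closes the circuit again.
import time

class CircuitBreaker:
    def __init__(self, threshold: int = 5, reset_timeout: float = 30.0):
        self.threshold = threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # half-open: fall through and let one probe call test recovery
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.opened_at = None   # success closes the circuit
        return result
```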
Retries with exponential backoff: Retry transient failures (network blips, temporary overload) with increasing delays: 100ms, 200ms, 400ms, 800ms. Add jitter (randomization) to prevent thundering herd—thousands of clients retrying simultaneously at the same interval.
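A sketch of that retry schedule with full jitter; attempt count and base delay are illustrative, and in practice you would only retry errors that are plausibly transient:

```python
# Exponential backoff plus full jitter: delays grow 100ms, 200ms, 400ms, ...,
# and each wait is randomized so thousands of clients do not retry in lockstep.
import random
import time

def retry(fn, attempts: int = 4, base_delay: float = 0.1):
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise                                  # out of retries: surface the error
            delay = base_delay * (2 ** attempt)        # 0.1s, 0.2s, 0.4s, ...
            time.sleep(random.uniform(0, delay))       # full jitter
```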
Graceful degradation: When a non-critical dependency fails, serve a reduced experience instead of an error. If the recommendation service is down, show trending items. If the personalization service is slow, show generic content. The core service continues operating.
Health checks: Every service exposes a /health endpoint. The load balancer polls it every 10 seconds. Unhealthy instances are removed from rotation. Kubernetes liveness and readiness probes automate this entirely.
Timeouts: Set explicit timeouts on every outbound call—database queries, HTTP requests to other services, cache lookups. A missing timeout means a slow dependency blocks your thread pool indefinitely, cascading failure across the entire service. Default to aggressive timeouts (500ms–2s for inter-service calls) and tune based on observed p99 latency. A request waiting 30 seconds for a response that will never come is worse than failing fast and retrying.
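For example, with the requests library an outbound call can carry explicit connect and read timeouts (the endpoint and values below are illustrative):

```python
# Without a timeout, a slow dependency can hold this thread indefinitely.
import requests

resp = requests.get(
    "http://inventory-service/items/42",   # hypothetical internal endpoint
    timeout=(0.5, 2.0),                    # 500ms to connect, 2s to read; tune against p99
)
```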
Practice 7: Build Observability From Day One
Scalability decisions require data. Without metrics, you are guessing at bottlenecks.
The three pillars:
Metrics (Prometheus + Grafana): p50/p95/p99 latency, error rate, throughput, CPU/memory utilization, cache hit ratio, database connection pool usage, consumer lag. Set alerts on SLO breaches.
Logs (ELK stack or Datadog): Structured JSON logs with request_id, user_id, service name, and duration. Centralized log aggregation enables searching across all services.
Traces (OpenTelemetry + Jaeger): Follow a single request across 5–15 microservices. Identify which service contributes the most latency. Without distributed tracing, debugging a slow request in a microservices architecture is guesswork.
Interview application: "Every service exposes Prometheus metrics on /metrics. Grafana dashboards show p99 latency, error rate, and throughput per service. Alerts fire to PagerDuty when p99 exceeds 500ms or error rate exceeds 0.5% for 5 consecutive minutes. OpenTelemetry traces propagate a trace_id across all services for end-to-end latency analysis."
Practice 8: Auto-Scale Based on the Right Metrics
Auto-scaling adds and removes instances based on metrics. The choice of metric determines how responsive and cost-effective the scaling is.
CPU-based scaling: Scale when average CPU exceeds 70%. Simple and widely used. Limitation: CPU may not correlate with request latency—a service can have low CPU but high latency due to database contention.
Request-rate-based scaling: Scale based on requests per second per instance. More directly tied to user experience. Better for API services where each request has predictable CPU cost.
Queue-depth-based scaling: For worker services consuming from Kafka or SQS, scale based on queue depth or consumer lag. If messages accumulate faster than they are processed, add workers.
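A back-of-the-envelope version of that scaling rule; the per-worker throughput, drain target, and minimum replica count are illustrative assumptions:

```python
# Desired worker count from consumer lag: if each worker drains ~200 messages/sec
# and we want the backlog cleared within 60 seconds, the replica count follows.
import math

def desired_workers(consumer_lag: int, per_worker_rate: float = 200.0,
                    drain_target_seconds: float = 60.0, min_workers: int = 2) -> int:
    return max(min_workers, math.ceil(consumer_lag / (per_worker_rate * drain_target_seconds)))

# e.g. a lag of 600,000 messages -> ceil(600000 / 12000) = 50 workers
```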
For structured practice applying these scalability best practices across complete system design problems, Grokking the System Design Interview covers backend architecture decisions in every design solution.
For advanced patterns including multi-region service deployment, distributed consensus, and production-scale auto-scaling strategies, Grokking the Advanced System Design Interview builds the depth required for L6+ interviews.
Frequently Asked Questions
What is the most important scalability best practice?
Designing stateless services. Statelessness enables horizontal scaling—the simplest and most effective way to increase capacity. Store all session and user-specific data externally (Redis, DynamoDB) so any instance can handle any request.
Why is the database usually the bottleneck?
Application servers are stateless and cheap to replicate. Databases hold state, enforce consistency, and manage disk I/O—all of which have physical limits. Read replicas, caching, connection pooling, and sharding exist because the database hits its capacity ceiling before anything else in the stack.
What is connection pooling and why does it matter?
Connection pooling maintains a fixed set of reusable database connections shared across requests. Without pooling, each request opens a new connection—and at high concurrency, the database runs out of connection slots. PgBouncer (PostgreSQL) and HikariCP (Java) are standard implementations.
How do I make a service idempotent?
Require client-generated idempotency keys on all state-changing requests. Check if the key has been processed before executing. Store key-to-result mappings with a TTL (24–72 hours). Use database UPSERT operations for natural idempotency. This prevents duplicate operations during retries.
When should I use asynchronous processing?
When the user does not need to wait for the result: sending emails, generating thumbnails, updating analytics, processing notifications. Publish events to Kafka or SQS and let consumers process them at their own pace. Keep the synchronous request path fast (under 100ms).
What is a circuit breaker and when do I need one?
A circuit breaker stops calling a failing downstream service after consecutive failures, preventing cascading failures. It transitions through three states: closed (normal), open (requests rejected), half-open (testing recovery). Use circuit breakers for every inter-service dependency in a microservices architecture.
How do I choose the right auto-scaling metric?
CPU utilization for compute-bound services. Request rate for API services with predictable per-request cost. Queue depth or consumer lag for worker services processing messages from Kafka or SQS. The metric should correlate with the user experience you are optimizing.
What is graceful degradation?
Serving reduced functionality when a non-critical dependency fails, instead of returning an error. If the recommendation engine is down, show trending items. If personalization fails, show generic content. The core service continues operating while non-critical features are temporarily disabled.
How many read replicas should I use?
Start with 1–2 for development and 2–3 for production. Each replica adds roughly one node's worth of read capacity (minus replication overhead). For a 90% read / 10% write workload, a primary plus 3 replicas handles approximately 4x the read load of the primary alone.
What observability tools should I mention in a system design interview?
Prometheus + Grafana for metrics, ELK stack or Datadog for centralized logging, OpenTelemetry + Jaeger for distributed tracing. Mention specific metrics you would track: p99 latency, error rate, cache hit ratio, connection pool utilization, and consumer lag. Set alerts on SLO breaches.
TL;DR
Scalable backend services are built on eight core practices:
- Stateless services: store no local state; scale horizontally by adding instances.
- Database scaling: read replicas, connection pooling, caching, sharding—the database is always the bottleneck.
- Idempotent writes: client-generated keys prevent duplicates during retries.
- Asynchronous processing: push heavy work to Kafka/SQS; keep API latency under 100ms.
- Rate limiting and backpressure: protect services from overload with token buckets and circuit breakers.
- Design for failure: circuit breakers, retries with backoff and jitter, graceful degradation.
- Observability: Prometheus metrics, structured logs, OpenTelemetry traces—you cannot scale what you cannot measure.
- Auto-scaling on the right metric: CPU for compute-bound services, request rate for APIs, queue depth for workers.
The database scales last and hardest—invest in caching and read replicas before reaching for sharding. In interviews, demonstrate that scalability is embedded in every design decision, not added as an afterthought.