Concept Deep-Dive · Traffic Tier

Load Balancing for System Design Interviews

The layered LB architecture, algorithm tradeoffs, what health checks miss, why session affinity is a tax, and how tail latency amplifies through the LB. The depth most candidates skip until the interviewer probes.

By Arslan Ahmad · Last updated May 2026 · Reading time ~22 min

01 · Why Load Balancing Has More Depth Than You Think

Most candidates treat load balancing as a checkbox: "we'd put a load balancer in front." Then they move on. The interviewer lets them, until the deep-dive phase, when the probes start: "What algorithm? Why? What happens during a deploy when half the backends are slow? How do health checks work, and what failure modes do they miss? Why does p99 latency look bad even though average latency looks fine?"

That's where mid-level candidates get exposed. The topic is genuinely broader than the algorithm choice. Production load balancing is a system: layers of LBs at different scopes, health checks that probe specific things and miss others, session affinity decisions that have real costs, and tail latency dynamics that show up at scale.

This page covers the depth most candidates skip. The algorithm choices are here, but they're not the point. The point is the architecture around the LB and the failure modes that interviewers probe.

The Senior Move

The senior signal in load balancing isn't naming round-robin. It's recognizing that production systems have multiple LBs at different layers (DNS, regional, internal mesh), each solving a different problem. Naming the layered architecture, even briefly, distinguishes you from candidates who think "load balancer" means one box in front of the API tier.

02 · What Load Balancing Actually Does

Load balancing distributes incoming traffic across multiple backend servers. That's the simple version. The longer version: load balancing solves four problems at once, and the choice of LB and configuration determines which it solves well.

  1. Capacity scaling. No single backend can handle all traffic. The LB lets you add backends and serve more requests proportionally.
  2. Availability. If a backend fails, the LB routes around it. A single server's availability is capped by its own uptime; a pool behind an LB survives individual failures.
  3. Traffic shaping. The LB can do TLS termination, request inspection, geographic routing, and rate limiting. It's a control point for traffic before it reaches application servers.
  4. Deployment safety. Rolling deploys, canary releases, blue-green deployments. The LB is the mechanism that shifts traffic between versions of the application.

What load balancing does not do, despite what some interview prep material implies:

  • Reduce latency by itself. The LB adds a hop. The benefit comes from balanced backend utilization, not from the LB being faster than the backend.
  • Fix overloaded backends. If every backend is at 95% CPU, distributing traffic evenly across them just makes them all 95% loaded. LB is not capacity; it's distribution.
  • Replace observability. The LB sees request counts and rough latency, but it doesn't know what's slow inside the backend. You still need application metrics, traces, and logs.

03 · The Layered LB Architecture

Production systems at scale don't have one load balancer; they have several, at different layers. Each layer solves a different problem. Naming the layers explicitly is the move that separates senior candidates from mid-level ones.

The Three Layers of Load Balancing

[Diagram: three layers of load balancing in a production system. Layer 1, DNS / global LB (Route 53, Cloudflare): geographic routing from the client. Layer 2, regional / L7 LBs (ALB, NLB, Envoy, Nginx): TLS and routing in front of each region's API instances. Layer 3, service mesh / internal LB (Envoy, Linkerd, Istio): load balancing at every service-to-service hop. Each layer solves a different problem: DNS routes geographically, the regional LB does L7, the mesh handles internal service-to-service traffic.]

Production systems at scale have load balancers at three layers: DNS for global routing, regional L7 LBs for the front door, and a service mesh for internal service-to-service traffic. Each layer has its own algorithm choices and failure modes.

Layer 1: DNS / Global LB

The first layer of load balancing happens at DNS resolution. When a user types your URL, the DNS resolver returns an IP address. Geographic DNS (Route 53, Cloudflare, NS1) returns different IPs to different users based on geography, latency, or regional health. The user is routed toward a regional LB before any TCP connection is established.

Properties: very coarse-grained (DNS is cached for seconds to minutes), unaware of individual request shape, can't make request-level decisions. The LB choice happens once per DNS lookup, not once per request.

What it solves: geographic routing, regional failover (if a whole region is down, DNS stops returning its IP), and the first level of load distribution at the global scale.

Layer 2: Regional / L7 LB

Once a user reaches a region, they hit a regional load balancer. This is the LB most candidates think of: Application Load Balancer (AWS ALB), Google Cloud LB, Envoy, Nginx, HAProxy. It does TLS termination, sees the HTTP request, and routes based on path, host, header, or method.

Properties: per-request decisions, full visibility into request metadata, can do sophisticated routing rules. Latency overhead is single-digit milliseconds. This is where most of the algorithm choices covered in Section 04 apply.

What it solves: per-request distribution to the API tier, TLS termination so backends don't need certificates, request-based routing (api.example.com vs admin.example.com to different backend pools), and the integration point for rate limiting and request shaping.

Layer 3: Service mesh / internal LB

Once a request enters the application tier, modern systems run a service mesh: load balancing happens between every internal service, not just at the edge. Each service-to-service call goes through a local sidecar proxy (Envoy is the canonical example) that handles load balancing, retries, circuit breaking, and observability.

Properties: the LB lives next to each service instance, so the network hop is minimal. The mesh has full visibility into every internal request, which is also what makes service-to-service observability tractable at scale. Tools: Envoy, Linkerd, Istio, Consul Connect, AWS App Mesh.

What it solves: internal service-to-service traffic distribution, retries with backoff, circuit breaking when a downstream service degrades, and observability of every internal request without instrumenting the application code.

The interview move

When the interviewer asks "what kind of load balancer would you use?", the strong response acknowledges the layers. "We'd have DNS-based geographic routing at the edge, an L7 regional LB for the API tier, and probably a service mesh for internal traffic if we have enough services. The algorithm question is mostly about the regional LB; the others have different concerns." That's the senior framing.

04 · The Algorithms (Framed by Tradeoffs)

The algorithm is what most candidates focus on. It's the smallest part of the topic, but it's still worth knowing five algorithms and what they trade away. The ones below cover almost every production case.

Round-Robin

The default · simple workloads

The LB cycles through backends in order: request 1 to backend A, request 2 to backend B, request 3 to backend C, then back to A. Stateless. Trivial to implement. The default for most cloud LBs and Nginx.

What it trades: Assumes all requests cost the same and all backends have the same capacity. Falls apart when backends are heterogeneous (some bigger, some smaller) or when request costs vary widely.
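
The rotation logic itself is a few lines. A minimal sketch, with placeholder backend addresses:

```python
from itertools import cycle

class RoundRobin:
    """Cycle through backends in a fixed order; no per-backend state."""
    def __init__(self, backends):
        self._ring = cycle(backends)

    def pick(self):
        return next(self._ring)

lb = RoundRobin(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print([lb.pick() for _ in range(6)])  # each backend appears twice, in order
```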

Least-Connections

When request times vary

The LB tracks how many active connections each backend has and routes to the backend with the fewest. Naturally adjusts to backends that are slow or stuck: their connection count grows, so the LB stops sending them traffic.

What it trades: Requires the LB to maintain per-backend state. Slightly more expensive than round-robin. Doesn't help if connections are short-lived (the count never gets high enough to differentiate).
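
A minimal sketch of the bookkeeping, assuming the caller reports back when each request finishes:

```python
class LeastConnections:
    """Route to the backend with the fewest in-flight requests."""
    def __init__(self, backends):
        self.active = {b: 0 for b in backends}

    def acquire(self):
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        self.active[backend] -= 1

lb = LeastConnections(["a", "b", "c"])
first = lb.acquire()   # all tied at zero; one backend is chosen
second = lb.acquire()  # avoids the backend still serving `first`
lb.release(first)      # request finished; that backend is eligible again
```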

Consistent Hashing

Cache affinity, sticky routing

The LB hashes the request (by user ID, session ID, or some other key) and routes consistently: requests with the same key always land on the same backend. Used heavily for cache locality (the backend's local cache stays warm for that key).

What it trades: Distribution can be uneven if some keys are much more active than others. Adding a backend remaps roughly 1/N of keys, which is much better than naive hash-mod-N. Sharding uses the same idea.
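
A sketch of the hash ring, using virtual nodes to smooth distribution; the vnode count and MD5 are illustrative choices, not a standard:

```python
import bisect
import hashlib

class HashRing:
    """Consistent hashing: the same key always maps to the same backend."""
    def __init__(self, backends, vnodes=100):
        # Each backend gets many points on the ring so load evens out.
        self._ring = sorted(
            (self._hash(f"{b}#{i}"), b)
            for b in backends for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def pick(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["api-1", "api-2", "api-3"])
assert ring.pick("user:42") == ring.pick("user:42")  # stable routing
```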

Weighted

Heterogeneous backend capacity

Each backend gets a weight reflecting its capacity. A backend with weight 4 receives 4x more traffic than a backend with weight 1. Works with any base algorithm (weighted round-robin, weighted least-connections, etc.).

What it trades: Operators have to set the weights and keep them updated as backends change. In autoscaling environments, this is friction; the weight has to track the instance type, which the LB doesn't always know.
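
The selection mechanics in sketch form, here as weighted random choice (weighted round-robin would interleave deterministically instead); the weights are illustrative:

```python
import random
from collections import Counter

# Illustrative weights: one large instance, one small.
WEIGHTS = {"big-box": 4, "small-box": 1}

def pick():
    return random.choices(list(WEIGHTS), weights=list(WEIGHTS.values()))[0]

tally = Counter(pick() for _ in range(10_000))
# big-box should land near 8,000 picks: 4x the traffic of small-box.
print(tally)
```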

Least-Response-Time

Latency-sensitive workloads

The LB tracks response time per backend and routes to the fastest. This combines availability detection with natural adaptation: a backend that is stressed or degraded gets less traffic automatically.

What it trades: Requires latency measurement, which adds operational complexity. Can produce oscillation (the slow backend gets less traffic, recovers, gets more traffic, slows down again) if the LB updates aggressively. Most production LBs use exponential moving averages to smooth this.
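
A sketch of that smoothing; alpha is an illustrative factor, and real LBs tune it carefully:

```python
class LeastResponseTime:
    """Route to the backend with the lowest smoothed latency."""
    def __init__(self, backends, alpha=0.1):
        self.alpha = alpha
        self.ema_ms = {b: 0.0 for b in backends}

    def pick(self):
        return min(self.ema_ms, key=self.ema_ms.get)

    def observe(self, backend, latency_ms):
        # EMA: each new sample nudges the estimate instead of replacing
        # it, so one slow response doesn't swing routing decisions.
        prev = self.ema_ms[backend]
        self.ema_ms[backend] = self.alpha * latency_ms + (1 - self.alpha) * prev

lb = LeastResponseTime(["a", "b"])
lb.observe("a", 120.0)
lb.observe("b", 40.0)
assert lb.pick() == "b"  # the faster backend wins
```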

The interview move on algorithms

"Which algorithm would you use?" The strong response picks one and ties it to the workload. "Round-robin for stateless API traffic where requests are roughly uniform. Consistent hashing for our cache layer, where session affinity gives us better cache hit rates. Least-connections for the database connection pool, where some queries are slow and we want to drain traffic from a stuck backend." Three sentences, three workloads, three algorithms with reasons.

L4 vs L7

You'll see "Layer 4 (TCP)" and "Layer 7 (HTTP)" load balancing distinguished in older guides. The distinction is real but matters less in 2026 than it used to. L4 is faster but blind to request content; L7 sees the request and can route on it. Almost all modern application LBs are L7. L4 LBs (AWS NLB, classic load balancers) survive for raw TCP traffic and for the highest-throughput cases. If asked, mention both, but the default for application traffic is L7.

05 · Health Checks: The Probe and the Failure Modes It Misses

The LB knows which backends are healthy through health checks: it pings each backend periodically, and routes traffic only to backends that respond healthy. This sounds simple. The depth probe is what the health check actually measures and what failure modes it misses.

What a health check actually measures

Three increasingly thorough health check styles:

  • TCP-level. Can the LB open a TCP connection to the backend? If yes, healthy. Cheap and fast. Misses application-level failures: the process is up, the port is open, but the application is wedged or returning errors to every request.
  • HTTP-level (shallow). The LB requests a specific path (often /health) and looks for a 200 OK response. Better than TCP-level. Misses any failure mode the health check endpoint doesn't exercise: the backend can return 200 to /health while every other path returns 500.
  • HTTP-level (deep). The health check endpoint actually exercises critical dependencies (database connection, cache reachability, dependent services). Returns 200 only if everything works. More accurate but more expensive, and the health check itself can take down the system if the dependency check is too aggressive.

Most production systems use HTTP-level shallow checks at the LB (cheap, fast, frequent) plus deeper checks at lower frequency from a separate monitoring system. The combination gives quick failover without overloading dependencies with health-check traffic.
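
The shallow/deep distinction in sketch form; the dependency probes are hypothetical stand-ins for whatever the backend actually depends on:

```python
def db_ping() -> bool:
    return True   # hypothetical stand-in: ping the backend's database

def cache_ping() -> bool:
    return True   # hypothetical stand-in: ping the backend's cache

def shallow_health() -> int:
    # Proves the process is up and serving HTTP, nothing more.
    # Cheap enough for the LB to hit every few seconds.
    return 200

def deep_health() -> int:
    # Exercises critical dependencies; fails if any are unreachable.
    # Run this at low frequency from monitoring, not from the LB's
    # fast path, or the checks themselves become dependency load.
    return 200 if db_ping() and cache_ping() else 503
```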

Failure modes that health checks miss

Failure 01

Partial failure (the backend is degraded, not dead)

The backend responds 200 to health checks but takes 5 seconds per real request because of a slow database query, a memory leak, or a downstream dependency. The LB happily sends it traffic. Users see slow responses. The health check passes the entire time.

The fix is response-time-based health checks or active load shedding: if a backend's p99 latency exceeds a threshold, drain traffic from it even if it's "healthy." Most modern LBs (Envoy, AWS ALB with target group health) support outlier detection that does this.

Failure 02

Health-check storm during incident

A real incident happens. The LB starts marking backends unhealthy. Traffic concentrates on the remaining backends, which now overload, fail their own health checks, and drop out. The system death-spirals as healthy backends are marked unhealthy in turn.

The fix is panic mode: if more than a threshold percentage of backends are marked unhealthy (say, 50%), the LB sends traffic to all of them anyway, accepting some failed requests rather than concentrating all load on a few survivors. Envoy's "panic mode" is the canonical implementation.
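
The panic-mode decision in sketch form; the 50% threshold mirrors Envoy's default, but treat the numbers as illustrative:

```python
import random

PANIC_THRESHOLD = 0.5  # fraction of hosts that must look healthy

def routable(all_backends, healthy):
    """If too many backends look unhealthy, stop trusting the health
    checks: route across everyone rather than crushing the survivors."""
    if len(healthy) / len(all_backends) < PANIC_THRESHOLD:
        return all_backends  # panic mode: accept some failed requests
    return healthy

pool = ["a", "b", "c", "d"]
target = random.choice(routable(pool, healthy=["a"]))  # 1/4 healthy -> panic
```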

Failure 03

Asymmetric failure (the backend can read but not write)

The backend's read path works (the health check passes by reading something). The write path is broken (a permission error, a misconfigured connection). Read traffic succeeds; every write returns 500. The health check tests one path; the application has multiple.

The fix is operation-specific health checks: a read health check and a write health check, with the LB routing reads and writes through different rules. Or accept that some failure modes won't be caught by health checks alone, and rely on application-level metrics to detect them.

Failure 04

Slow-start traffic spikes after deploy

A new backend instance comes up, passes the health check immediately, and gets a full slice of traffic. The instance hasn't warmed its caches, prepared connection pools, or JIT-compiled hot paths. The first thousand requests are all slow or error. By the time the instance is actually ready, it has already failed a bunch of users.

The fix is slow-start mode: when a new backend joins the pool, give it a fraction of normal traffic for some warm-up window. Most modern LBs support this. Combine with deeper readiness checks that confirm the instance is actually ready, not just running.
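
The ramp reduces to a weight multiplier. A sketch, with an illustrative 60-second window:

```python
import time

WARMUP_SECONDS = 60.0

def effective_weight(base_weight, joined_at, now=None):
    """Scale a new backend's weight linearly over the warm-up window."""
    age = (now or time.time()) - joined_at
    if age >= WARMUP_SECONDS:
        return base_weight
    # Small floor so the instance warms up on a trickle of real traffic.
    return base_weight * max(age / WARMUP_SECONDS, 0.05)

# A backend that joined 15 seconds ago carries ~25% of its eventual share.
print(effective_weight(1.0, joined_at=time.time() - 15))
```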

The interview move

"How do health checks work?" is a question with a real depth. The strong response covers what the health check probes, the partial-failure modes it misses, and at least one mitigation (outlier detection, panic mode, slow start). The weak response says "the LB pings each backend and removes unhealthy ones from rotation." Same topic, dramatically different signal.

06 · Session Affinity (and Why It's a Tax)

Session affinity (also called sticky sessions) means the LB sends a given user's requests to the same backend instance every time. This is convenient for stateful applications: the backend can keep session state in local memory because subsequent requests will come back to it. It is also a tax that most modern systems try to avoid.

When you actually need it

Three legitimate cases:

  • Cache locality. The backend's local cache stays warm for a user if they keep coming back to the same backend. Cache hit rates go up. The caching deep-dive covers this dynamic.
  • Stateful protocols. WebSockets, long-polling connections, server-sent events. Once a connection is established, it has to keep going to the same backend until it ends.
  • Per-user expensive setup. Some applications do expensive per-user initialization (loading per-user models, building large in-memory data structures). Sticking a user to one backend amortizes the cost.

Why it's a tax

Session affinity prevents the LB from doing its job. The whole point of load balancing is to distribute traffic; affinity counteracts that. Specific costs:

  • Uneven load. If users are unevenly active, the affinity routing concentrates traffic on backends with active users. Some backends are overloaded; others are idle.
  • Failover loses state. If a backend fails, all the users stuck to it lose their session-local state. The LB has to either rebuild the session elsewhere or fail the user's session.
  • Deploys are slower. Rolling deploys have to drain affinity sessions before terminating an instance. Either the deploy is slow (waits for sessions to expire) or the deploy disrupts users (kills active sessions).
  • Cache-coupling becomes app-coupling. Once your application relies on local cache and affinity, it's harder to add capacity by spinning up more instances. The new instances have cold caches; user routing has to migrate.

The modern approach: stateless backends

Production systems in 2026 try to be stateless at the application tier. State lives in the database or in a shared cache (Redis, Memcached). Backends are interchangeable; any backend can serve any user. The LB can do pure load distribution without affinity considerations.

For genuinely stateful workloads (WebSockets), the affinity is acceptable because the alternative is much worse. For session caching that "could" use affinity, the move is to put session data in Redis instead and let any backend serve any user.
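
What "session state in Redis" looks like in sketch form, using redis-py; the host name and TTL are placeholders:

```python
import json
import redis  # assumes a reachable Redis; host and TTL are placeholders

r = redis.Redis(host="sessions.internal", port=6379)
SESSION_TTL_SECONDS = 30 * 60

def save_session(session_id: str, data: dict) -> None:
    # Any backend can write the session...
    r.setex(f"session:{session_id}", SESSION_TTL_SECONDS, json.dumps(data))

def load_session(session_id: str) -> dict | None:
    # ...and any backend can read it, so the LB needs no affinity.
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```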

The interview move

"Would you use sticky sessions?" The strong response defaults to no and explains why. "I'd avoid affinity by default. Session state goes in Redis. Backends are interchangeable. We'd use affinity only for WebSocket connections where the protocol requires it. The cost of affinity outweighs the convenience for most application traffic."

07 · Tail Latency Amplification: The Staff-Level Probe

This is the load-balancing topic most likely to come up at staff and above and most likely to expose mid-level candidates. The idea: when the LB fans out a request to multiple backends and waits for all of them, the slowest one determines the latency. The system's p99 is dominated by tail latencies of individual backends, not by their averages.

The math

Suppose each backend has a p99 of 100ms and you make a single request. Your p99 is 100ms. Now suppose your application calls 10 backends in parallel and waits for all of them. The probability that any single backend lands in its slowest 1% is 1%, but the probability that at least one of the 10 does is much higher: roughly 1 - 0.99^10 ≈ 9.6%. The backend's p99 has effectively become your application's p90; your application-level p99 now reflects the backend's p99.9, which is typically far worse than 100ms.

This effect compounds with fan-out. A request that touches 100 backends in parallel will see at least one in its 99th percentile on most requests. The system's effective latency is the slow tail of any individual backend, amplified by the fan-out.
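
The arithmetic is worth being able to reproduce on the spot:

```python
# P(at least one of n parallel calls lands in the backend's slowest 1%)
for n in (1, 10, 100):
    print(f"fan-out {n:>3}: {1 - 0.99 ** n:.1%}")
# fan-out   1: 1.0%
# fan-out  10: 9.6%
# fan-out 100: 63.4%
```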

Why this matters for load balancing

Load balancing affects tail latency in two ways. First, the algorithm choice matters: round-robin can route to a slow backend even if other backends are idle, because round-robin doesn't see latency. Least-response-time avoids this. Second, the LB is the only place where tail latency dynamics can be measured and shaped: backend code can't see the cross-backend distribution; only the LB can.

Three mitigations

  • Hedged requests. If a request hasn't returned within some short timeout (say, 95th percentile latency), send a second request to a different backend. Take whichever returns first. The cost is roughly 5% extra request volume; the benefit is that p99 drops to roughly the p95 of the underlying backends. (A sketch follows this list.)
  • Tied requests. A more sophisticated variant where the second request can cancel the first if the second responds first, reducing the duplicate work.
  • Outlier detection. The LB tracks per-backend latency distribution. Backends that are unusually slow are temporarily removed from rotation. This drains traffic from a degraded backend before it pollutes the system's tail latency.
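
A sketch of hedging with asyncio, including the tied-request cancel. `call` is an assumed coroutine that issues the request to one backend, and the 95ms hedge delay is illustrative:

```python
import asyncio

async def hedged(call, primary, fallback, hedge_after=0.095):
    """Send to `primary`; if no response by ~p95 latency, also send to
    `fallback` and take whichever answers first, cancelling the loser."""
    first = asyncio.ensure_future(call(primary))
    done, _ = await asyncio.wait({first}, timeout=hedge_after)
    if done:
        return first.result()  # fast path: no hedge needed (~95% of requests)
    second = asyncio.ensure_future(call(fallback))
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()          # the tied-request part: stop duplicate work
    return done.pop().result()
```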

The interview move

This concept comes up when the interviewer asks "why is your p99 so much worse than your average?" or "what would you do if one backend was slow?" The strong response names tail latency amplification as the underlying dynamic, then proposes one of the mitigations. Naming the math (the 1 - 0.99^N formula) is a depth signal at staff and above.

The system's effective latency is the slow tail of any individual backend, amplified by the fan-out. Load balancing is the only layer where this can be measured and shaped.

08 · Geographic Routing

Most production systems serve users globally. Geographic load balancing routes users to the region closest to them, reducing the speed-of-light cost of every request.

Three patterns to know:

DNS-based geographic routing

The most common pattern. The DNS layer (Layer 1 from the architecture diagram) returns different IPs to different users based on geographic IP databases or DNS resolver location. A user in Tokyo gets the IP of the APAC region's regional LB; a user in London gets the EU region's IP.

Properties: coarse (DNS is cached and resolves once per domain lookup), simple to operate, works without any application awareness. The default approach for global content delivery.

Anycast routing

Multiple regions advertise the same IP address to the internet. BGP routing automatically sends each user's traffic to the topologically nearest region. The user doesn't see different IPs; the network handles the routing.

Properties: fast failover (when a region goes down, BGP withdraws the advertisement and routes shift), no DNS caching to invalidate. Used heavily by CDNs (Cloudflare, Fastly) and by some cloud LBs (Google Cloud Global Load Balancer).

Application-level region selection

The application explicitly chooses which region to route to, often based on user attributes (their home region, where their data lives, sovereignty requirements). The LB at the chosen region's edge takes it from there.

Properties: fine-grained, can honor per-user requirements like data sovereignty. Adds application complexity. Used in multi-tenant systems where different tenants have different region requirements.

2026 Reality Check

Most products start with DNS-based geographic routing because it's simple and works at scale. Anycast is the next step when you need faster regional failover. Application-level region selection appears when you have specific sovereignty or data-locality requirements, often forced by GDPR or similar regulations. The choice cascades through your entire architecture, including replication.

09 · How Load Balancing Interacts With Other Concepts

  • Load balancing × Caching. Consistent hashing in the LB sends the same user to the same backend, which improves cache hit rates dramatically. Round-robin defeats this. The choice of LB algorithm directly affects cache effectiveness. The caching deep-dive covers cache placement strategies.
  • Load balancing × Sharding. The LB needs to know which backend (or shard) owns which key. With consistent hashing, this knowledge can live in the LB itself. Sharding uses the same consistent-hashing primitive at the data layer.
  • Load balancing × Replication. Read traffic can be load-balanced across replicas; write traffic has to go to the primary. The LB is where this routing decision happens. Replication and consistency covers the broader topic.
  • Load balancing × Rate limiting. The LB is a natural enforcement point for rate limits because it sees every request. Limit at the LB; the backends are spared from processing rejected requests. The dedicated rate limiting deep-dive covers this in detail.
  • Load balancing × Observability. The LB sees every request. It's where most production traffic metrics come from. The choice of LB and how it exports metrics shapes how observable your system is.

For more cross-concept interactions, see the concepts library hub.

10 · Practice Scenarios

Three scenarios. Read the setup. Decide your load balancing approach before opening the reveal.

Scenario 01

After a deploy, p99 latency triples even though average latency stays roughly the same. What's happening?

The deploy added 20% capacity (new instances) using the same regional LB with round-robin. Average latency is unchanged. p99 climbs from 200ms to 600ms within minutes of the deploy.

How to think about this

The new instances are routed to immediately by round-robin, but they haven't warmed up. Caches are cold, JIT compilation hasn't happened, connection pools haven't filled. The new instances respond slowly for the first minutes after they join. The LB doesn't see this as "unhealthy" because the health check passes; it sees normal-looking responses that happen to take 600ms.

The fix is slow-start mode at the LB: new backends receive a fraction of normal traffic for a warm-up window (say 60 seconds), ramping up linearly. During that window, the new instance has time to warm up without being slammed by traffic. The result: p99 stays near 200ms during the deploy because new instances don't take full traffic until they're actually warm.

Strong answer: "Tail latency from cold-cache new instances. Slow-start mode at the LB to ramp new backends gradually. Combine with deeper readiness checks that confirm the instance is genuinely ready, not just running."

Scenario 02

A real-time chat application uses WebSocket connections. How do you load balance them?

Users connect via WebSocket and stay connected for hours. Messages flow over the connection in both directions. There are 100K concurrent connections per region, balanced across 50 backends.

How to think about this

WebSockets are a stateful protocol. Once established, the connection has to keep going to the same backend. This forces session affinity at the LB.

The choice of how to route the initial connection still matters. Round-robin works but ignores backend load. Least-connections is better because it actually balances. Consistent hashing on user ID can give cache locality, but the affinity is forced by the protocol regardless.

The harder part is what happens during deploys and partial failures. A backend that goes down kicks all its WebSocket connections; clients have to reconnect. The reconnect storm can overload the remaining backends. Mitigations: drain connections gradually before terminating a backend (don't kill mid-session), use rate-limited reconnection on the client side, and overprovision capacity so the remaining backends can absorb a wave of reconnects.
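
The client-side half of that mitigation in sketch form; `connect` is an assumed callable that raises ConnectionError on failure:

```python
import random
import time

def reconnect_with_backoff(connect, max_delay=30.0):
    """Exponential backoff with full jitter, so a backend failure doesn't
    turn into a synchronized reconnect storm against the survivors."""
    delay = 0.5
    while True:
        try:
            return connect()
        except ConnectionError:
            time.sleep(random.uniform(0, delay))  # jitter spreads the wave
            delay = min(delay * 2, max_delay)
```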

Strong answer: "Layer 4 LB (NLB or equivalent) with least-connections for the initial connection routing, then session affinity for the lifetime of the WebSocket because the protocol requires it. Connection draining on deploys, client-side reconnect with exponential backoff, and capacity headroom to absorb reconnect storms."

Scenario 03

During an incident, half your backends start returning 500 errors but pass health checks. What happens, and how do you mitigate?

A bad deploy introduces a bug that causes 30% of requests to error. The bug only triggers on real user traffic, not on the health-check endpoint. The LB sees all backends as healthy because health checks pass. Users see 30% error rate.

How to think about this

This is the asymmetric failure mode. Health checks aren't catching the actual user-facing failures because they don't exercise the failing code path.

Three layered mitigations:

1. Outlier detection in the LB. The LB tracks success rates per backend in real time. Backends with elevated 5xx rates get drained from rotation, even if health checks pass. Envoy's outlier detection does this; AWS ALB has similar behavior with target group health. (A sketch follows this list.)

2. Faster rollback path. When error rates spike, the deploy system should automatically revert. The detection is in metrics, not in health checks. Connect alerts on error rate to deploy automation.

3. Better health checks. The health check endpoint should exercise the same code path that real requests do. If real requests hit the database, so should the health check. Generic /health endpoints that just return 200 catch processes that are dead, not bugs that affect specific paths.
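
What mitigation 1 tracks, as a sketch; the window size and 20% threshold are illustrative, and real implementations add ejection timeouts and a max-ejection percentage:

```python
from collections import deque

WINDOW = 100           # recent responses tracked per backend
EJECT_THRESHOLD = 0.2  # drain a backend whose 5xx rate exceeds 20%

class OutlierDetector:
    """Eject backends by observed error rate, independent of health checks."""
    def __init__(self, backends):
        self.recent = {b: deque(maxlen=WINDOW) for b in backends}

    def record(self, backend, status_code):
        self.recent[backend].append(status_code >= 500)

    def in_rotation(self, backend):
        window = self.recent[backend]
        if len(window) < WINDOW:
            return True  # not enough signal yet; keep serving
        return sum(window) / len(window) < EJECT_THRESHOLD
```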

Strong answer: "Outlier detection at the LB to drain failing backends based on error rate, automated rollback wired to error-rate alerts, and richer health checks that exercise real code paths. The health check alone won't catch this; it has to be the LB plus metrics plus deploy automation."

11 · Load Balancing FAQ

L4 or L7?

For application traffic in 2026, L7 is the default. You get request-level visibility, can route on path or header, and can do TLS termination. L4 survives for cases where you need raw TCP throughput (very high QPS pure data plane), or for protocols where the LB doesn't need to inspect the payload (some database protocols, custom binary protocols). Don't say "L4" without a reason; say "L7" by default.

How many load balancers do I actually need?

For a global product, three layers: DNS for geographic routing, regional L7 LB for the API tier, and possibly a service mesh for internal traffic. For a single-region product, two layers: DNS (or just a static hostname) plus regional LB. For a single-server product, often zero (the server handles incoming traffic directly). Match the number of layers to the actual complexity of your traffic.

Should I run my own load balancer or use a managed one?

Managed unless you have a specific reason. AWS ALB/NLB, Google Cloud LB, Cloudflare, and similar services handle the operational burden. Self-managed (Nginx, HAProxy, Envoy) is appropriate when you need control the managed services don't offer (very specific routing logic, on-prem deployments, custom protocols). Most products should default to managed.

What's the difference between a load balancer and an API gateway?

Overlap is real. Load balancers focus on traffic distribution and availability. API gateways add concerns like authentication, rate limiting, request transformation, and API versioning. Many systems run both: a regional LB in front of an API gateway, which then routes to backend services. Some systems collapse them: the API gateway handles load balancing too. The naming matters less than what concerns the layer is solving.

How do load balancers handle TLS?

Two modes. TLS termination: the LB decrypts traffic, sees the request in plaintext, can route on it, then sends plaintext to the backend (or re-encrypts if backend traffic must be encrypted). The default for most products. TLS passthrough: the LB doesn't decrypt; it just routes encrypted bytes. Used when the backend must terminate TLS (compliance, end-to-end encryption requirements). Performance is similar; the difference is what the LB can see.

What does "anycast" mean and when does it matter?

Anycast is when multiple servers in different regions advertise the same IP address. Network routing automatically sends each user to the topologically nearest server. Used heavily by CDNs and global load balancers (Cloudflare, Google Cloud's global LB). It matters when you need faster regional failover than DNS can provide, or when you don't want clients to cache geographic IPs in DNS. For most products, DNS-based geographic routing is sufficient.

What about gRPC and HTTP/2?

HTTP/2 multiplexes many requests over a single TCP connection. This breaks naive connection-level load balancing: ten requests over one connection all go to the same backend. To balance HTTP/2 traffic correctly, the LB needs to be HTTP/2-aware (most modern L7 LBs are) and balance at the request level, not the connection level. gRPC has the same dynamic; balancing it correctly requires an LB that understands the protocol. Envoy and the cloud L7 LBs handle this; older Nginx versions did not. If you're designing a system with heavy gRPC, mention this explicitly.

How does autoscaling interact with load balancing?

The LB needs to discover new backends as autoscaling adds them and remove them as autoscaling drains them. In managed cloud LBs, this happens through target groups: the autoscaler updates the target group, the LB picks up the change. The new backend should not receive full traffic immediately (slow-start mode) and a draining backend should finish in-flight requests before terminating (connection draining). Both behaviors are configurable in modern LBs.

What's the right way to think about p99 vs average latency?

Average is a measure of typical performance; p99 is a measure of worst-case user experience. Most users see something like the average. The unlucky 1% see something like the p99. At scale, "1% of requests" is a lot of users; an unhealthy p99 means real complaints. Always look at both. The interview signal: candidates who only quote averages are missing the tail-latency dynamics. Candidates who quote both, and explain why p99 might differ, signal real production experience.

Continue

Message Queues and Event Buses →

The next concept on the recommended learning path. Decoupling producers from consumers, ordering guarantees, durability, exactly-once vs at-least-once delivery, and the depth probes Kafka and SQS produce in interviews.
