
Defensive Coding: Strategies for Handling Network Timeouts and Errors

Arslan Ahmad
Go beyond simple coding tutorials. This article explains the architectural patterns required to keep large-scale systems running when hardware or networks fail.

The Illusion of Uptime

Cascading Failures

The Mechanism of Propagation

The Solution: Timeouts and Deadlines

The Circuit Breaker Pattern

Circuit Breaker States

The Thundering Herd Problem

Retry Storms

Mitigation: Exponential Backoff and Jitter

When the Load Balancer Fails

Active-Passive Redundancy

Health Checks

Idempotency: Making Retries Safe

Conclusion

Software engineering education often focuses on ideal scenarios.

You write a function, provide it with valid input, and assert that it returns the correct output. When the tests pass on your local machine, the feature is considered complete.

This flawless execution flow, where networks are fast, databases are online, and resources are infinite, is known as the Happy Path.

However, the reality of deploying software to a production environment is starkly different.

In modern distributed systems, applications are split into multiple services running on different physical or virtual machines. These components communicate over a network that is inherently unreliable.

If you design your system solely for the Happy Path, you are building a fragile architecture. A single timeout in a minor service can trigger a total system collapse.

To become a senior engineer or a software architect, you must shift your mindset. You must stop hoping that failure won't happen and start designing specifically for when it does.

This article provides a deep dive into fault tolerance and system resilience. We will move beyond the basics of "adding a load balancer" to explore the technical mechanics of cascading failures, retry storms, and the architectural patterns used to mitigate them.

The Illusion of Uptime

The first step in designing for failure is acknowledging that 100% uptime for any single component is impossible.

In a large-scale system with thousands of servers, hard drives fail every day.

When a Junior Developer encounters a failed network request, the instinct is often to catch the error and log it, or perhaps retry the request immediately.

While this works for minor blips, it creates dangerous instabilities under load.

We need to understand Resource Exhaustion.

Every request that enters your system consumes resources: memory (RAM), CPU cycles, and network sockets.

Most importantly, synchronous requests consume execution threads.

If a service receives a request and has to wait for a database to respond, that thread sits idle, holding onto its memory, unable to do any other work.

This technical constraint is the foundation of most large-scale system failures.

Cascading Failures

A Cascading Failure is a failure in one component of a distributed system that triggers the failure of other components, eventually bringing down the entire system. It is the architectural equivalent of a chain reaction.

The Mechanism of Propagation

Consider a standard architecture where Service A (Frontend) calls Service B (API), which calls Service C (Database).

  1. The Trigger: Service C experiences a minor slowdown. Instead of responding in 10 milliseconds, it takes 2 seconds.

  2. The Block: Service B sends a request to Service C. The thread in Service B must wait 2 seconds for the response. During this time, the thread is blocked.

  3. The Pile-up: As user traffic continues to arrive, more threads in Service B get stuck waiting for Service C.

  4. The Exhaustion: Service B has a fixed thread pool (e.g., 200 threads). Within seconds, all 200 threads are blocked waiting for Service C. Service B can no longer accept new requests, even if they don't depend on Service C. Service B becomes unresponsive.

  5. The Propagation: Service A is now trying to call Service B. Because Service B is unresponsive, Service A's threads begin to hang. Service A's thread pool fills up. Service A crashes.

The root cause was a slow database, but the result is that the entire application is offline.
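The pile-up in steps 2 through 4 can be demonstrated with a toy simulation (the pool size and latency are scaled-down assumptions, not real service numbers): once every slot in the fixed pool is blocked waiting on the slow dependency, new requests are turned away even though the service process itself is healthy.

```python
import threading
import time

POOL_SIZE = 4  # stand-in for Service B's fixed pool of 200 threads
pool = threading.Semaphore(POOL_SIZE)
results = []

def handle_request():
    # Each in-flight request holds one pool slot for the full downstream wait.
    if not pool.acquire(blocking=False):
        results.append("rejected")  # pool exhausted: Service B is unresponsive
        return
    try:
        time.sleep(0.3)             # thread blocked waiting on slow Service C
        results.append("ok")
    finally:
        pool.release()

threads = [threading.Thread(target=handle_request) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With 8 simultaneous requests and only 4 slots, half of the traffic is rejected outright, even though nothing in Service B itself is broken.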


The Solution: Timeouts and Deadlines

The most fundamental defense against cascading failures is the correct application of Network Timeouts.

A timeout puts a hard limit on how long a service will wait for a response.

If Service B is configured with a 500-millisecond timeout for calls to Service C, it will abort the request if Service C is too slow.

By aborting the request, Service B frees up the execution thread. It returns an error to the user, but the service itself remains healthy and ready to handle other requests. It is better to fail a small percentage of requests fast than to crash the entire server by waiting forever.
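As a sketch of this idea (the downstream call and the timing values below are hypothetical), a caller can bound its wait with an explicit deadline and map a timeout to a fast error:

```python
import concurrent.futures
import time

# Hypothetical stand-in for a call to Service C; in production this would
# be an HTTP/RPC client call, which typically exposes its own timeout knob.
def call_service_c(delay_s: float) -> str:
    time.sleep(delay_s)
    return "payload"

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def call_with_timeout(fn, timeout_s, *args):
    """Wait at most timeout_s for fn; on timeout, return None so the
    caller can fail fast instead of blocking its thread indefinitely."""
    future = _pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return None

fast = call_with_timeout(call_service_c, 0.5, 0.01)  # finishes in time
slow = call_with_timeout(call_service_c, 0.1, 0.5)   # aborted after 100 ms
```

Real HTTP clients expose this directly as a timeout parameter; the point is that every cross-service call should carry an explicit deadline rather than waiting forever.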

The Circuit Breaker Pattern

While timeouts prevent threads from hanging indefinitely, they still allow the system to attempt requests that are likely to fail. If Service C is completely down, Service B might still send thousands of requests per second, all of which will time out.

This is wasteful. It consumes network bandwidth and CPU cycles on Service B (to set up the connection and timer). Worse, it keeps hitting Service C, potentially preventing it from recovering.

To solve this, we implement the Circuit Breaker pattern. This is a state-machine based software component that sits between two services. It monitors the failure rate of outgoing requests.

Circuit Breaker States

The Circuit Breaker operates in three distinct states:

1. Closed (Normal Operation): In the default state, the circuit is "Closed." Traffic flows freely between the services. The circuit breaker counts the number of successful and failed requests. If the error rate stays below a configured threshold (e.g., 5% failure rate), it remains Closed.

2. Open (Failure Detected): If the failure rate exceeds the threshold, the breaker "trips" and moves to the "Open" state. In this state, the breaker immediately fails all outgoing requests without sending them to the destination.


This provides two critical benefits:

  • Fail Fast: The calling service gets an immediate error, avoiding thread blocking and resource exhaustion.
  • Recovery Time: The failing service stops receiving traffic, giving it a quiet period to reboot or recover without load.

3. Half-Open (Recovery Testing): After a defined "sleep window" (e.g., 30 seconds), the breaker moves to the "Half-Open" state. It allows a limited number of test requests to pass through to the destination.

  • If these requests succeed, the breaker assumes the destination is healthy and resets to Closed.
  • If these requests fail, the breaker assumes the destination is still down and reverts to Open.
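The three states above can be captured in a small state machine. This is a minimal sketch under simplifying assumptions (consecutive-failure counting rather than a rolling error rate, a single probe in Half-Open), not a production implementation:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: trips OPEN after `failure_threshold`
    consecutive failures, then probes the dependency via HALF_OPEN
    once `sleep_window_s` has elapsed."""

    def __init__(self, failure_threshold=5, sleep_window_s=30.0):
        self.failure_threshold = failure_threshold
        self.sleep_window_s = sleep_window_s
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.sleep_window_s:
                self.state = "HALF_OPEN"  # sleep window elapsed: allow a probe
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed probe, or too many consecutive failures, opens the circuit.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = time.monotonic()

    def _on_success(self):
        self.failures = 0
        self.state = "CLOSED"
```

Production libraries (Resilience4j, Polly, and similar) track error rates over sliding windows and handle concurrency, but the state transitions are the same three described above.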

The Thundering Herd Problem

One of the most difficult challenges in system design is handling the traffic load immediately after a failure.

Imagine a scenario where a high-traffic system relies on a caching layer (like Redis) to store expensive database query results. If the cache crashes, thousands of incoming requests will miss the cache and hit the database directly.

The database is usually not provisioned to handle the full traffic load of the application. It will immediately spike to 100% CPU utilization and become unresponsive. The system administrators might restart the database, but as soon as it comes online, the backlog of pending requests hits it again, crashing it instantly.

This phenomenon, where a massive spike of synchronized traffic overwhelms a service, is called the Thundering Herd.

Retry Storms

A specific variation of this is the Retry Storm. When a service times out, clients (mobile apps or other microservices) are often programmed to retry the request.

If 10,000 users fail to load their feed at the same time, and their apps all retry exactly 1 second later, the server is hit with 10,000 requests simultaneously. If it fails again, they retry again. This keeps the server in a permanent state of failure.

Mitigation: Exponential Backoff and Jitter

To prevent Thundering Herds and Retry Storms, we must alter how clients retry requests. We use two mathematical techniques:

1. Exponential Backoff Instead of retrying at fixed intervals (every 1 second), the client increases the wait time after each failure.

  • Attempt 1 fails: Wait 1 second.
  • Attempt 2 fails: Wait 2 seconds.
  • Attempt 3 fails: Wait 4 seconds.
  • Attempt 4 fails: Wait 8 seconds.

This significantly reduces the load on the server over time.

2. Jitter Even with exponential backoff, if a service outage affects all users at once, they will all start their backoff timers at the same time. They will still hit the server in synchronized waves.

Jitter adds a random variation to the wait time. Instead of waiting exactly 2 seconds, Client A waits 1.9 seconds, and Client B waits 2.2 seconds. This desynchronizes the traffic, spreading the load out over time and giving the server "breathing room" to recover.
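Both techniques combine into a short retry loop. The sketch below uses the "full jitter" variant, where each wait is drawn uniformly from zero up to the exponential ceiling; the function and parameter names are illustrative:

```python
import random
import time

def retry_with_backoff(fn, max_attempts=4, base_s=1.0, cap_s=30.0,
                       sleep=time.sleep, rng=random.uniform):
    """Retry fn with exponential backoff and full jitter: the wait before
    retry n is uniform in [0, min(cap, base * 2**n)], which desynchronizes
    clients that all failed at the same moment."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(rng(0, min(cap_s, base_s * 2 ** attempt)))
```

Capping the delay (`cap_s`) matters: without it, a long outage would push clients into waits of minutes or hours, making recovery after the outage needlessly slow.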

When the Load Balancer Fails

In system design interviews, "Add a Load Balancer" is a common answer to scalability questions. A Load Balancer (LB) distributes traffic across multiple application servers.

But from a reliability standpoint, a single load balancer represents a Single Point of Failure (SPOF). If the load balancer crashes, the IP address associated with your website stops responding. It does not matter if you have 500 healthy application servers behind it; no traffic can reach them.

Active-Passive Redundancy

To eliminate this SPOF, we typically use an Active-Passive high availability configuration.

We provision two load balancers:

  • Active Node: Handles 100% of the traffic.
  • Passive Node: Sits idle, doing nothing but monitoring the Active Node.

These two devices are connected via a dedicated heartbeat network connection. They constantly exchange signals to verify health.

If the Active Node experiences a hardware failure or power loss, the heartbeat stops. The Passive Node detects this loss of signal. It immediately triggers a Failover.

During failover, the Passive Node broadcasts a network update (typically using a protocol like VRRP, announcing the change with a gratuitous ARP) to claim the Virtual IP address (VIP) of the cluster.


Upstream switches and routers update their forwarding and ARP tables, and traffic is re-routed to the Passive Node within seconds. To the end-user, this transition is often seamless.
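The passive node's side of this arrangement is essentially a watchdog on the heartbeat. The sketch below uses placeholder timing and stubs out the VIP takeover (real deployments delegate this to a VRRP implementation such as keepalived):

```python
import time

class FailoverMonitor:
    """Runs on the Passive Node: if no heartbeat arrives within
    timeout_s, declare the Active Node dead and take over."""

    def __init__(self, timeout_s=3.0, now=time.monotonic):
        self.timeout_s = timeout_s
        self.now = now              # injectable clock, useful for testing
        self.last_beat = now()
        self.is_active = False      # this node starts as passive

    def on_heartbeat(self):
        self.last_beat = self.now()  # active node is still alive

    def check(self):
        if not self.is_active and self.now() - self.last_beat > self.timeout_s:
            self.is_active = True   # failover: claim the VIP here (e.g., via VRRP)
        return self.is_active
```

The timeout is a trade-off: too short and a brief network hiccup triggers a spurious failover (and possibly a "split brain" with two active nodes); too long and users see an extended outage before the takeover.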

Health Checks

The load balancer is also responsible for monitoring the application servers behind it. It performs Health Checks to ensure it never sends traffic to a dead server.

A simple health check might verify that the server responds to a "ping." However, a Deep Health Check is more effective. This involves the load balancer requesting a specific endpoint (e.g., /health) where the application performs internal checks, verifying it can connect to the database and cache before returning a "200 OK" status.
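The handler behind such an endpoint reduces to aggregating dependency probes into a status code. In this sketch the probe callables are hypothetical; real probes would run a trivial query against the database or a PING against the cache:

```python
def deep_health_check(dependency_checks):
    """Aggregate dependency probes into an HTTP-style status: 200 only if
    every dependency is reachable, otherwise 503 so the load balancer
    stops routing traffic to this server."""
    results = {name: bool(check()) for name, check in dependency_checks.items()}
    status = 200 if all(results.values()) else 503
    return status, results

# Hypothetical probes standing in for real connectivity checks.
status, results = deep_health_check({
    "database": lambda: True,
    "cache": lambda: True,
})
```

One caution: if a shared dependency (say, the database) goes down, a naive deep check marks every application server unhealthy at once, and the load balancer ends up with nothing to route to; many teams therefore report shared-dependency failures as degraded rather than dead.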

Idempotency: Making Retries Safe

We have discussed the importance of retrying requests when networks fail. But retries introduce a data integrity risk.

Suppose a user makes a payment. The request reaches the server, and the server charges the credit card. However, just as the server is about to send the "Success" response, the network cuts out.

The user sees an error message: "Connection Failed." Naturally, the user clicks "Pay" again.

If the system processes this second request as a new transaction, the user will be charged twice. This is a critical failure of system logic.

To solve this, we design operations to be Idempotent. An idempotent operation is one that can be applied multiple times without changing the result beyond the initial application.

In the payment example, the client generates a unique Idempotency Key (like a UUID) for the transaction before sending it.

  1. Request 1: Server receives Key abc-123. Charges card. Saves abc-123 in a database as "Processed". Crashes before response.

  2. Request 2 (Retry): Server receives Key abc-123. Checks database. Sees abc-123 is already processed. Returns "Success" immediately without charging the card again.
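The two requests above can be sketched as follows. The in-memory dict stands in for a durable database table, and `charge_card` is a hypothetical payment call:

```python
processed = {}  # idempotency key -> stored response (a DB table in production)

def handle_payment(idempotency_key, amount_cents, charge_card):
    """Charge at most once per key; replay the stored response on retries."""
    if idempotency_key in processed:
        return processed[idempotency_key]  # Request 2: replay, no second charge
    charge_card(amount_cents)              # Request 1: side effect runs once
    response = {"status": "success", "amount": amount_cents}
    processed[idempotency_key] = response
    return response

charges = []
first = handle_payment("abc-123", 5000, charges.append)
retry = handle_payment("abc-123", 5000, charges.append)  # client retry
```

A real implementation must also make the check-charge-store sequence atomic (for example, with a unique constraint on the key column) so that two concurrent retries cannot both pass the lookup before either has recorded the charge.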

Designing for idempotency allows us to use aggressive retry strategies (like Exponential Backoff) without fear of corrupting the system's data.

Conclusion

Designing for failure distinguishes an experienced system architect from a junior developer. It requires looking beyond the code to the infrastructure it runs on. It means accepting that in a distributed environment, chaos is the norm, not the exception.

By moving beyond the Happy Path, you ensure that your systems are robust, self-healing, and reliable.

Key Takeaways:

  • Assume Failure: Build systems that expect components to break.

  • Kill Cascading Failures: Use Timeouts to stop slow dependencies from exhausting your resources.

  • Fail Fast: Implement Circuit Breakers to stop traffic to dead services and allow them to recover.

  • Prevent Storms: Use Exponential Backoff and Jitter to manage retries and avoid Thundering Herds.

  • Eliminate SPOFs: Use Active-Passive redundancy for load balancers and critical infrastructure.

  • Ensure Safety: Implement Idempotency so that network errors and retries do not duplicate data.
