Strategies for designing highly available and fault-tolerant systems

High availability is a system's ability to remain accessible and operational for a target percentage of time, typically measured in "nines" (99.9%, 99.99%, 99.999%). Fault tolerance is a system's ability to continue functioning correctly when individual components fail. These are distinct but complementary concepts: high availability asks "How do we stay accessible when things break?" while fault tolerance asks "How do we keep working correctly when things break?" In system design interviews, discussing both proactively—without waiting to be asked—is one of the clearest signals of senior-level thinking. Interviewers at every FAANG company evaluate whether candidates design for failure from the start or treat reliability as an afterthought.

Key Takeaways

  • Design for failure, not for perfection. In distributed systems, failures are not exceptional events—they are routine. Networks partition, disks corrupt, servers crash, and deployments introduce bugs. Every component in your design should have a failure plan.
  • High availability is achieved through redundancy (eliminating single points of failure), replication (keeping multiple copies of data), load balancing (distributing traffic), and automated failover (switching to healthy components without human intervention).
  • Fault tolerance is achieved through graceful degradation (serving reduced functionality instead of failing completely), circuit breakers (preventing cascading failures), retries with backoff (handling transient errors), and idempotent operations (safe to retry); a short retry sketch follows this list.
  • In interviews, quantify availability using SLA targets. "I would design for 99.99% availability, which allows approximately 52 minutes of downtime per year" is the kind of precise statement interviewers reward.
  • Netflix's architecture is the canonical example: Chaos Monkey intentionally kills production servers to verify that fault tolerance mechanisms work. Their system continues streaming to 200M+ subscribers even when individual components fail.
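
The last bullet's retry-with-backoff and idempotency ideas are easy to make concrete. Below is a minimal Python sketch, assuming a hypothetical HTTP endpoint that honors an Idempotency-Key header: each attempt waits exponentially longer with random jitter, and the same key is reused so a retried request cannot be applied twice.

```python
import random
import time
import uuid

import requests


def post_with_retries(url, payload, max_attempts=5, base_delay=0.5):
    """POST with exponential backoff, jitter, and an idempotency key.

    The idempotency key makes retries safe: the server can recognize that a
    retried request duplicates one it already processed. (Hypothetical
    endpoint and header name; adapt to your provider.)
    """
    idempotency_key = str(uuid.uuid4())  # same key reused on every attempt
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.post(
                url,
                json=payload,
                headers={"Idempotency-Key": idempotency_key},
                timeout=2,
            )
            if resp.status_code < 500:
                return resp  # success, or a client error not worth retrying
        except requests.RequestException:
            pass  # transient network failure: fall through to backoff
        if attempt == max_attempts:
            raise RuntimeError("request failed after retries")
        # Exponential backoff with full jitter: ~0.5s, 1s, 2s, 4s ... capped at 30s
        time.sleep(random.uniform(0, min(30, base_delay * 2 ** (attempt - 1))))
```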

High Availability vs Fault Tolerance: The Distinction That Matters

Many candidates use these terms interchangeably. Interviewers notice.

High availability is about uptime. A system that is available 99.99% of the time is down for only 52 minutes per year. High availability is achieved by eliminating single points of failure—running multiple instances of every component across multiple availability zones or regions so that if one fails, another handles the traffic.

Fault tolerance is about correctness under failure. A fault-tolerant system does not just stay online when a component fails—it continues producing correct results. A database that loses a replica and silently serves stale data is available but not fault-tolerant. A database that detects the replica loss, promotes a follower with current data, and continues serving consistent reads is both available and fault-tolerant.

| Dimension | High Availability | Fault Tolerance |
| --- | --- | --- |
| Goal | Minimize downtime | Maintain correct behavior during failures |
| Metric | Uptime percentage (nines) | Recovery time, data consistency |
| Achieved through | Redundancy, load balancing, failover | Graceful degradation, circuit breakers, retries |
| Scope | Across components (if one fails, another takes over) | Within components (the system handles the failure intelligently) |
| Example | Two load balancers in active-passive; if the primary fails, the secondary handles traffic | Netflix's recommendation service returns trending content when the personalization engine fails |

Interview tip: When the interviewer asks "How do you make this system highly available?", address both: "I would achieve high availability through multi-AZ deployment with automated failover, and fault tolerance through circuit breakers and graceful degradation so the system continues serving useful results even when individual services are degraded."

Availability Targets: Speaking the Language of Nines

Interviewers expect you to quantify availability using SLA targets, not vague phrases like "highly available."

| Availability | Annual Downtime | Common Use Cases |
| --- | --- | --- |
| 99% (two nines) | 3.65 days | Internal tools, batch processing |
| 99.9% (three nines) | 8.76 hours | Standard SaaS applications |
| 99.99% (four nines) | 52.6 minutes | E-commerce, payment systems |
| 99.999% (five nines) | 5.26 minutes | Critical infrastructure, telecom, healthcare |

Interview application: "For this payment processing system, I would target 99.99% availability—52 minutes of downtime per year maximum. Achieving this requires multi-AZ deployment, automated failover with sub-30-second detection, and no single points of failure in the write path."

Each additional nine dramatically increases engineering complexity and cost. Designing for 99.999% requires multi-region active-active deployment, which introduces distributed consensus challenges. Knowing when to target three nines versus five nines—and explaining the cost trade-off—demonstrates mature engineering judgment.
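
The downtime figures in the table above are just arithmetic on the availability percentage; a quick Python sketch of the conversion:

```python
def annual_downtime_minutes(availability_pct):
    """Return the allowed downtime per year, in minutes, for a given availability."""
    minutes_per_year = 365.25 * 24 * 60  # ~525,960 minutes
    return minutes_per_year * (1 - availability_pct / 100)

for nines in (99.0, 99.9, 99.99, 99.999):
    print(f"{nines}% -> {annual_downtime_minutes(nines):.1f} minutes/year")
# 99.0%   -> ~5259.6 minutes (~3.65 days)
# 99.9%   -> ~526.0 minutes  (~8.8 hours)
# 99.99%  -> ~52.6 minutes
# 99.999% -> ~5.3 minutes
```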

The Seven Core Strategies

1. Redundancy: Eliminating Single Points of Failure

Every component in your architecture should run as multiple instances. If any single instance fails, the system continues operating through the remaining instances.

Load balancers: Deploy in active-passive or active-active pairs. If the primary load balancer fails, the secondary takes over. AWS ALB and NLB provide this natively across availability zones.

Application servers: Run a minimum of 3 instances behind a load balancer with auto-scaling. If one server crashes, the load balancer routes traffic to the remaining servers and auto-scaling launches a replacement.

Databases: Use leader-follower replication with automatic failover. Aurora PostgreSQL supports up to 15 read replicas with automatic promotion of a replica to primary if the leader fails. DynamoDB provides built-in multi-AZ replication.
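
At the application layer, a common way to use the leader/replica split is to send writes to the leader endpoint and reads to a replica. A minimal sketch using psycopg2, with placeholder hostnames (with Aurora these would be the cluster writer endpoint and the reader endpoint):

```python
import random

import psycopg2

# Placeholder DSNs: substitute your writer and reader endpoints.
WRITER_DSN = "host=primary.db.internal dbname=app user=app password=secret"
READER_DSNS = [
    "host=replica-1.db.internal dbname=app user=app password=secret",
    "host=replica-2.db.internal dbname=app user=app password=secret",
]


def get_connection(readonly=False):
    """Route reads to a randomly chosen replica and writes to the leader."""
    dsn = random.choice(READER_DSNS) if readonly else WRITER_DSN
    return psycopg2.connect(dsn)
```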

Caches: Run Redis with Sentinel-managed replicas. If the primary Redis node fails, Sentinel promotes a replica to primary within seconds.
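
With the redis-py client, Sentinel-aware connections look roughly like this; the Sentinel hostnames and the `mymaster` service name are placeholders for your deployment:

```python
from redis.sentinel import Sentinel

# Sentinel nodes monitor the Redis primary and its replicas and handle promotion.
sentinel = Sentinel(
    [("sentinel-1.cache.internal", 26379),
     ("sentinel-2.cache.internal", 26379),
     ("sentinel-3.cache.internal", 26379)],
    socket_timeout=0.5,
)

primary = sentinel.master_for("mymaster", socket_timeout=0.5)   # writes
replica = sentinel.slave_for("mymaster", socket_timeout=0.5)    # reads

primary.set("session:123", "active")
print(replica.get("session:123"))
```

After a failover, the same `master_for` handle transparently resolves to the newly promoted primary.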

Interview phrasing: "I have identified three single points of failure in this design: the load balancer, the primary database, and the cache. I would eliminate each by deploying redundant instances with automated failover."

2. Replication: Keeping Multiple Copies of Data

Replication ensures that data survives hardware failures and remains accessible from multiple locations.

Leader-follower (master-slave): One node handles writes; followers replicate data and handle reads. Simple but creates a write bottleneck. PostgreSQL, MySQL, and MongoDB support this natively.

Multi-leader (multi-master): Multiple nodes accept writes independently. Higher write throughput but introduces conflict resolution challenges. DynamoDB Global Tables use this model, as do multi-primary MySQL and CouchDB deployments.

Leaderless: Any node can accept reads or writes. Quorum-based consistency ensures correctness. Cassandra supports tunable consistency with quorum reads/writes.
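
For the leaderless model, the standard rule of thumb is that a read quorum R and a write quorum W over N replicas always overlap, so every read contacts at least one replica holding the latest acknowledged write, when R + W > N. A tiny sketch of that check:

```python
def quorum_is_strongly_consistent(n, r, w):
    """Return True if every read quorum overlaps every write quorum.

    n: total replicas, r: replicas contacted per read, w: acks required per write.
    """
    return r + w > n

# Common Cassandra-style configuration: N=3 with QUORUM reads and writes (2 each).
print(quorum_is_strongly_consistent(3, 2, 2))  # True  -> quorums overlap
print(quorum_is_strongly_consistent(3, 1, 1))  # False -> reads may be stale
```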

Interview trade-off: "I chose leader-follower replication because our system is read-heavy (10:1 ratio). The leader handles all writes, and 3 read replicas distribute the read load. The trade-off is that followers may serve slightly stale data during replication lag—acceptable for our social media feed but not for the payment service."

3. Failover: Automatic Recovery From Failures

Failover is the process of automatically switching to a redundant component when the primary fails.

Active-passive failover: A standby component waits idle and takes over when the primary fails. Simple but wastes resources during normal operation. Used for databases (Aurora failover) and load balancers.

Active-active failover: Multiple components share the load simultaneously. If one fails, the others absorb the additional traffic. More efficient but requires all components to stay in sync. Used for application servers and CDN edge locations.

Failover detection: Health checks ping each component at regular intervals (typically every 5–30 seconds). If a component fails multiple consecutive checks, it is removed from the pool and failover begins. Route 53 health checks, ELB health checks, and Redis sentinel all implement this pattern.
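
The detection logic itself is simple: probe each instance on an interval, count consecutive failures, and eject an instance once it crosses a threshold. A stripped-down sketch (URLs, thresholds, and the /healthz path are illustrative assumptions, not any particular load balancer's implementation):

```python
import time

import requests

HEALTH_URLS = {
    "app-1": "http://app-1.internal:8080/healthz",
    "app-2": "http://app-2.internal:8080/healthz",
}
FAILURE_THRESHOLD = 3   # consecutive failures before ejection
CHECK_INTERVAL = 10     # seconds between probe rounds

failures = {name: 0 for name in HEALTH_URLS}
healthy_pool = set(HEALTH_URLS)

while True:
    for name, url in HEALTH_URLS.items():
        try:
            ok = requests.get(url, timeout=2).status_code == 200
        except requests.RequestException:
            ok = False
        failures[name] = 0 if ok else failures[name] + 1
        if failures[name] >= FAILURE_THRESHOLD:
            healthy_pool.discard(name)   # stop routing traffic to this instance
        elif ok:
            healthy_pool.add(name)       # instance recovered; add it back
    time.sleep(CHECK_INTERVAL)
```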

4. Circuit Breakers: Preventing Cascading Failures

When a downstream service fails, a circuit breaker prevents upstream services from repeatedly calling it—which would waste resources, increase latency, and potentially cause cascading failures across the entire system.

The circuit breaker has three states: closed (requests flow normally), open (requests are rejected immediately without calling the failing service), and half-open (a limited number of test requests check if the service has recovered).

Real-world example: Netflix built Hystrix, a circuit breaker library that prevents cascading failures across their 1,000+ microservices. When the recommendation service fails, Hystrix opens the circuit and the UI shows trending content instead—the user experience degrades gracefully rather than the entire page failing.

Interview phrasing: "Between the notification service and the email provider, I would implement a circuit breaker. If the email provider fails 5 consecutive requests within 30 seconds, the circuit opens. Notifications are queued for retry instead of being dropped. After 60 seconds, the circuit enters half-open state and tests with a single request."

5. Graceful Degradation: Serving Reduced Functionality

A fault-tolerant system does not crash when a non-critical component fails. It continues serving core functionality with reduced features.

Examples from production systems:

Netflix: If the recommendation engine fails, users see trending content instead of personalized suggestions. The core service (video streaming) continues uninterrupted.

Twitter: If the "who to follow" service fails, the sidebar shows a generic prompt. The core service (timeline) continues uninterrupted.

Amazon: If the review service fails, product pages display without reviews. The core service (search and purchase) continues uninterrupted.

Interview application: "I would classify each service as critical or non-critical. The payment service is critical—if it fails, we show an error. The recommendation service is non-critical—if it fails, we show popular items instead of personalized ones. This graceful degradation preserves the core user experience."

6. Multi-Region Deployment: Surviving Regional Outages

A single data center or availability zone can suffer complete outages from power failures, natural disasters, or network partitions. Multi-region deployment distributes the system across geographically separated regions.

Active-passive multi-region: One region handles all traffic; the other is a standby that receives replicated data. If the primary region fails, DNS (Route 53) redirects traffic to the secondary. Simpler but wastes standby resources and has higher failover time.

Active-active multi-region: Both regions handle traffic simultaneously. Users are routed to the nearest region via latency-based DNS. If one region fails, the other absorbs all traffic. More complex—requires conflict resolution for cross-region writes—but provides lower latency and faster failover.

Interview phrasing: "For four-nines availability, I would deploy active-active across us-east-1 and eu-west-1. Route 53 with latency-based routing directs users to the nearest region. DynamoDB Global Tables replicate data bidirectionally with single-digit millisecond lag. If us-east-1 fails completely, eu-west-1 handles 100% of traffic within 60 seconds."

7. Monitoring and Observability: Detecting Failures Before Users Do

High availability is meaningless without the ability to detect and respond to failures quickly. Monitoring provides the visibility that makes all other strategies effective.

The three pillars of observability: Metrics (quantitative measurements like latency, error rate, throughput), Logs (detailed records of individual events), and Traces (end-to-end path of a request through distributed services).

SLOs and error budgets: An SLO (Service Level Objective) of 99.99% availability means your error budget is 52 minutes of downtime per year. When the error budget is consumed, new deployments stop until reliability improves. Google popularized this approach through their SRE practices.
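
Error budgets are straightforward arithmetic over the SLO. A sketch of computing a rolling 30-day budget and how much of it remains (the consumed figure is an example value):

```python
def error_budget_minutes(slo_pct, window_days=30):
    """Allowed downtime (minutes) within a rolling window for a given SLO."""
    return window_days * 24 * 60 * (1 - slo_pct / 100)

budget = error_budget_minutes(99.99)   # ~4.3 minutes per 30 days
consumed = 3.0                         # minutes of downtime so far (example value)
remaining = budget - consumed
print(f"budget={budget:.1f} min, remaining={remaining:.1f} min")
if remaining <= 0:
    print("Error budget exhausted: pause feature deployments, focus on reliability")
```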

Interview phrasing: "I would set up CloudWatch alarms on p99 latency (threshold: 500ms), error rate (threshold: 1%), and consumer lag (threshold: 10,000 messages). If any alarm fires, the on-call engineer receives a PagerDuty notification within 60 seconds."

For structured practice incorporating high availability and fault tolerance into complete system design solutions, Grokking the System Design Interview includes reliability strategies in every design problem. For advanced patterns including multi-region active-active deployment, consensus protocols, and chaos engineering, Grokking the Advanced System Design Interview covers production-scale reliability architectures.

How to Discuss Availability and Fault Tolerance in Interviews

During requirements: "What is the availability target? Is this a system where 52 minutes of annual downtime is acceptable (four nines) or do we need five nines?" This question demonstrates that you think about reliability from the start.

During high-level design: "I am designing with no single points of failure. Every component runs as multiple instances across at least two availability zones." Point to each component on your diagram and confirm redundancy.

During deep dive: "The database is the most critical component. I would use Aurora with 3 read replicas and automated failover. If the primary fails, Aurora promotes a replica within 30 seconds. During failover, writes are briefly unavailable but reads continue from replicas."

During trade-offs: "Achieving four nines with multi-AZ deployment is relatively straightforward. Achieving five nines requires active-active multi-region, which introduces cross-region replication lag and conflict resolution complexity. For this use case, four nines is sufficient and the operational simplicity is worth the trade-off."

Frequently Asked Questions

What is the difference between high availability and fault tolerance?

High availability minimizes downtime by running redundant components so the system stays accessible. Fault tolerance ensures the system continues functioning correctly when components fail, through graceful degradation, circuit breakers, and intelligent error handling. A system can be highly available but not fault-tolerant if it serves incorrect results during failures.

How do I discuss availability targets in a system design interview?

Quantify using "nines": 99.9% (8.76 hours/year downtime), 99.99% (52 minutes/year), 99.999% (5.26 minutes/year). State the target appropriate for the use case, explain what strategies achieve it, and discuss the cost trade-off of adding more nines.

What are the most common high availability patterns?

Redundancy (multiple instances of every component), replication (multiple copies of data), load balancing (distributing traffic), automated failover (switching to healthy components), and multi-region deployment (surviving regional outages).

What is a circuit breaker in system design?

A pattern that prevents cascading failures by stopping requests to a failing downstream service. It has three states: closed (normal flow), open (requests rejected immediately), and half-open (testing if the service recovered). Netflix's Hystrix is the most famous implementation.

How does Netflix achieve high availability?

Netflix runs 1,000+ microservices on AWS across multiple regions. They use Chaos Monkey to intentionally kill production instances, circuit breakers (Hystrix) to prevent cascading failures, a custom CDN (Open Connect) with ISP-level caching, and graceful degradation (trending content replaces failed personalization). This architecture serves 200M+ subscribers with minimal perceived downtime.

What is graceful degradation?

A fault tolerance strategy where non-critical features are disabled or replaced with simpler alternatives when their backing services fail, while core functionality continues. Example: Amazon shows product pages without reviews when the review service fails. The core purchase flow remains unaffected.

How many replicas should a database have for high availability?

Three replicas across separate availability zones is the standard recommendation. This tolerates the failure of any single replica while maintaining both read capacity and data durability. For multi-region availability, add replicas in a second region.

What is the relationship between CAP theorem and high availability?

The CAP theorem states that during a network partition, you must choose between consistency and availability. Systems that prioritize availability (Cassandra, DynamoDB in default mode) serve requests even when some replicas are unreachable, potentially returning stale data. Systems that prioritize consistency (PostgreSQL, Spanner) reject requests during partitions to prevent stale reads.

When should I discuss availability in a system design interview?

During requirements gathering, ask about the availability target. During high-level design, identify and eliminate single points of failure. During the deep dive, explain failover mechanisms for critical components. During trade-offs, discuss the cost of achieving higher availability levels. Proactively raising availability at each phase signals senior-level thinking.

How do I test fault tolerance in production?

Chaos engineering: intentionally inject failures (kill servers, introduce network latency, corrupt data) in a controlled manner and verify the system recovers correctly. Netflix's Simian Army, Gremlin, and AWS Fault Injection Simulator are common tools. Mentioning chaos engineering in an interview signals production-grade thinking.

TL;DR

High availability (staying accessible) and fault tolerance (maintaining correctness during failures) are distinct but complementary system design concepts. Achieve high availability through redundancy (no single points of failure), replication (multiple data copies), load balancing (traffic distribution), and automated failover. Achieve fault tolerance through graceful degradation (reduced functionality instead of crash), circuit breakers (preventing cascading failures), retries with backoff, and idempotent operations. Quantify availability using nines: 99.99% allows 52 minutes of downtime annually. In interviews, discuss availability during every phase—requirements, design, deep dive, and trade-offs. Netflix's architecture is the canonical example: Chaos Monkey verifies fault tolerance by intentionally killing production servers, circuit breakers prevent cascading failures, and graceful degradation ensures users always see content even when personalization services fail.
