Strategies for designing highly available and fault-tolerant systems

High availability is a system's ability to remain accessible and operational for a target percentage of time, typically measured in "nines" (99.9%, 99.99%, 99.999%). Fault tolerance is a system's ability to continue functioning correctly when individual components fail. These are distinct but complementary concepts: high availability asks "How do we stay accessible when things break?" while fault tolerance asks "How do we keep working correctly when things break?" In system design interviews, discussing both proactively—without waiting to be asked—is one of the clearest signals of senior-level thinking. Interviewers at every FAANG company evaluate whether candidates design for failure from the start or treat reliability as an afterthought.

Key Takeaways

  • Design for failure, not for perfection. In distributed systems, failures are not exceptional events—they are routine. Networks partition, disks corrupt, servers crash, and deployments introduce bugs. Every component in your design should have a failure plan.
  • High availability is achieved through redundancy (eliminating single points of failure), replication (keeping multiple copies of data), load balancing (distributing traffic), and automated failover (switching to healthy components without human intervention).
  • Fault tolerance is achieved through graceful degradation (serving reduced functionality instead of failing completely), circuit breakers (preventing cascading failures), retries with backoff (handling transient errors), and idempotent operations (safe to retry); a short retry sketch follows this list.
  • In interviews, quantify availability using SLA targets. "I would design for 99.99% availability, which allows approximately 52 minutes of downtime per year" is the kind of precise statement interviewers reward.
  • Netflix's architecture is the canonical example: Chaos Monkey intentionally kills production servers to verify that fault tolerance mechanisms work. Their system continues streaming to 200M+ subscribers even when individual components fail.
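
The last bullet's retry-with-backoff and idempotency ideas are easy to make concrete. Below is a minimal Python sketch, assuming a hypothetical HTTP endpoint that honors an Idempotency-Key header: each attempt waits exponentially longer with random jitter, and the same key is reused so a retried request cannot be applied twice.

```python
import random
import time
import uuid

import requests


def post_with_retries(url, payload, max_attempts=5, base_delay=0.5):
    """POST with exponential backoff, jitter, and an idempotency key.

    The idempotency key makes retries safe: the server can recognize that a
    retried request duplicates one it already processed. (Hypothetical
    endpoint and header name; adapt to your provider.)
    """
    idempotency_key = str(uuid.uuid4())  # same key reused on every attempt
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.post(
                url,
                json=payload,
                headers={"Idempotency-Key": idempotency_key},
                timeout=2,
            )
            if resp.status_code < 500:
                return resp  # success, or a client error not worth retrying
        except requests.RequestException:
            pass  # transient network failure: fall through to backoff
        if attempt == max_attempts:
            raise RuntimeError("request failed after retries")
        # Exponential backoff with full jitter: ~0.5s, 1s, 2s, 4s ... capped at 30s
        time.sleep(random.uniform(0, min(30, base_delay * 2 ** (attempt - 1))))
```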

High Availability vs Fault Tolerance: The Distinction That Matters

Many candidates use these terms interchangeably. Interviewers notice.

High availability is about uptime. A system that is available 99.99% of the time is down for only 52 minutes per year. High availability is achieved by eliminating single points of failure—running multiple instances of every component across multiple availability zones or regions so that if one fails, another handles the traffic.

Fault tolerance is about correctness under failure. A fault-tolerant system does not just stay online when a component fails—it continues producing correct results. A database that loses a replica and silently serves stale data is available but not fault-tolerant. A database that detects the replica loss, promotes a follower with current data, and continues serving consistent reads is both available and fault-tolerant.

| Dimension | High Availability | Fault Tolerance |
| --- | --- | --- |
| Goal | Minimize downtime | Maintain correct behavior during failures |
| Metric | Uptime percentage (nines) | Recovery time, data consistency |
| Achieved through | Redundancy, load balancing, failover | Graceful degradation, circuit breakers, retries |
| Scope | Across components (if one fails, another takes over) | Within components (the system handles the failure intelligently) |
| Example | Two load balancers in active-passive; if the primary fails, the secondary handles traffic | Netflix's recommendation service returns trending content when the personalization engine fails |

Interview tip: When the interviewer asks "How do you make this system highly available?", address both: "I would achieve high availability through multi-AZ deployment with automated failover, and fault tolerance through circuit breakers and graceful degradation so the system continues serving useful results even when individual services are degraded."

Availability Targets: Speaking the Language of Nines

Interviewers expect you to quantify availability using SLA targets, not vague phrases like "highly available."

| Availability | Annual Downtime | Common Use Cases |
| --- | --- | --- |
| 99% (two nines) | 3.65 days | Internal tools, batch processing |
| 99.9% (three nines) | 8.76 hours | Standard SaaS applications |
| 99.99% (four nines) | 52.6 minutes | E-commerce, payment systems |
| 99.999% (five nines) | 5.26 minutes | Critical infrastructure, telecom, healthcare |

Interview application: "For this payment processing system, I would target 99.99% availability—52 minutes of downtime per year maximum. Achieving this requires multi-AZ deployment, automated failover with sub-30-second detection, and no single points of failure in the write path."

Each additional nine dramatically increases engineering complexity and cost. Designing for 99.999% requires multi-region active-active deployment, which introduces distributed consensus challenges. Knowing when to target three nines versus five nines—and explaining the cost trade-off—demonstrates mature engineering judgment.
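
The downtime figures in the table above are just arithmetic on the availability percentage; a quick Python sketch of the conversion:

```python
def annual_downtime_minutes(availability_pct):
    """Return the allowed downtime per year, in minutes, for a given availability."""
    minutes_per_year = 365.25 * 24 * 60  # ~525,960 minutes
    return minutes_per_year * (1 - availability_pct / 100)

for nines in (99.0, 99.9, 99.99, 99.999):
    print(f"{nines}% -> {annual_downtime_minutes(nines):.1f} minutes/year")
# 99.0%   -> ~5259.6 minutes (~3.65 days)
# 99.9%   -> ~526.0 minutes  (~8.8 hours)
# 99.99%  -> ~52.6 minutes
# 99.999% -> ~5.3 minutes
```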

The Seven Core Strategies

1. Redundancy: Eliminating Single Points of Failure

Every component in your architecture should run as multiple instances. If any single instance fails, the system continues operating through the remaining instances.

Load balancers: Deploy in active-passive or active-active pairs. If the primary load balancer fails, the secondary takes over. AWS ALB and NLB provide this natively across availability zones.

Application servers: Run a minimum of 3 instances behind a load balancer with auto-scaling. If one server crashes, the load balancer routes traffic to the remaining servers and auto-scaling launches a replacement.

Databases: Use leader-follower replication with automatic failover. Aurora PostgreSQL supports up to 15 read replicas with automatic promotion of a replica to primary if the leader fails. DynamoDB provides built-in multi-AZ replication.
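
At the application layer, a common way to use the leader/replica split is to send writes to the leader endpoint and reads to a replica. A minimal sketch using psycopg2, with placeholder hostnames (with Aurora these would be the cluster writer endpoint and the reader endpoint):

```python
import random

import psycopg2

# Placeholder DSNs: substitute your writer and reader endpoints.
WRITER_DSN = "host=primary.db.internal dbname=app user=app password=secret"
READER_DSNS = [
    "host=replica-1.db.internal dbname=app user=app password=secret",
    "host=replica-2.db.internal dbname=app user=app password=secret",
]


def get_connection(readonly=False):
    """Route reads to a randomly chosen replica and writes to the leader."""
    dsn = random.choice(READER_DSNS) if readonly else WRITER_DSN
    return psycopg2.connect(dsn)
```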

Caches: Run Redis with Sentinel-managed replicas. If the primary Redis node fails, Sentinel promotes a replica to primary within seconds.
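
With the redis-py client, Sentinel-aware connections look roughly like this; the Sentinel hostnames and the `mymaster` service name are placeholders for your deployment:

```python
from redis.sentinel import Sentinel

# Sentinel nodes monitor the Redis primary and its replicas and handle promotion.
sentinel = Sentinel(
    [("sentinel-1.cache.internal", 26379),
     ("sentinel-2.cache.internal", 26379),
     ("sentinel-3.cache.internal", 26379)],
    socket_timeout=0.5,
)

primary = sentinel.master_for("mymaster", socket_timeout=0.5)   # writes
replica = sentinel.slave_for("mymaster", socket_timeout=0.5)    # reads

primary.set("session:123", "active")
print(replica.get("session:123"))
```

After a failover, the same `master_for` handle transparently resolves to the newly promoted primary.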

Interview phrasing: "I have identified three single points of failure in this design: the load balancer, the primary database, and the cache. I would eliminate each by deploying redundant instances with automated failover."

2. Replication: Keeping Multiple Copies of Data

Replication ensures that data survives hardware failures and remains accessible from multiple locations.

Leader-follower (master-slave): One node handles writes; followers replicate data and handle reads. Simple but creates a write bottleneck. PostgreSQL, MySQL, and MongoDB support this natively.

Multi-leader (multi-master): Multiple nodes accept writes independently. Higher write throughput but introduces conflict resolution challenges. DynamoDB Global Tables use this model, as do multi-primary MySQL and CouchDB deployments.

Leaderless: Any node can accept reads or writes. Quorum-based consistency ensures correctness. Cassandra supports tunable consistency with quorum reads/writes.
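
For the leaderless model, the standard rule of thumb is that a read quorum R and a write quorum W over N replicas always overlap, so every read contacts at least one replica holding the latest acknowledged write, when R + W > N. A tiny sketch of that check:

```python
def quorum_is_strongly_consistent(n, r, w):
    """Return True if every read quorum overlaps every write quorum.

    n: total replicas, r: replicas contacted per read, w: acks required per write.
    """
    return r + w > n

# Common Cassandra-style configuration: N=3 with QUORUM reads and writes (2 each).
print(quorum_is_strongly_consistent(3, 2, 2))  # True  -> quorums overlap
print(quorum_is_strongly_consistent(3, 1, 1))  # False -> reads may be stale
```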

Interview trade-off: "I chose leader-follower replication because our system is read-heavy (10:1 ratio). The leader handles all writes, and 3 read replicas distribute the read load. The trade-off is that followers may serve slightly stale data during replication lag—acceptable for our social media feed but not for the payment service."

3. Failover: Automatic Recovery From Failures

Failover is the process of automatically switching to a redundant component when the primary fails.

Active-passive failover: A standby component waits idle and takes over when the primary fails. Simple but wastes resources during normal operation. Used for databases (Aurora failover) and load balancers.

Active-active failover: Multiple components share the load simultaneously. If one fails, the others absorb the additional traffic. More efficient but requires all components to stay in sync. Used for application servers and CDN edge locations.

Failover detection: Health checks ping each component at regular intervals (typically every 5–30 seconds). If a component fails multiple consecutive checks, it is removed from the pool and failover begins. Route 53 health checks, ELB health checks, and Redis sentinel all implement this pattern.
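
The detection logic itself is simple: probe each instance on an interval, count consecutive failures, and eject an instance once it crosses a threshold. A stripped-down sketch (URLs, thresholds, and the /healthz path are illustrative assumptions, not any particular load balancer's implementation):

```python
import time

import requests

HEALTH_URLS = {
    "app-1": "http://app-1.internal:8080/healthz",
    "app-2": "http://app-2.internal:8080/healthz",
}
FAILURE_THRESHOLD = 3   # consecutive failures before ejection
CHECK_INTERVAL = 10     # seconds between probe rounds

failures = {name: 0 for name in HEALTH_URLS}
healthy_pool = set(HEALTH_URLS)

while True:
    for name, url in HEALTH_URLS.items():
        try:
            ok = requests.get(url, timeout=2).status_code == 200
        except requests.RequestException:
            ok = False
        failures[name] = 0 if ok else failures[name] + 1
        if failures[name] >= FAILURE_THRESHOLD:
            healthy_pool.discard(name)   # stop routing traffic to this instance
        elif ok:
            healthy_pool.add(name)       # instance recovered; add it back
    time.sleep(CHECK_INTERVAL)
```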

4. Circuit Breakers: Preventing Cascading Failures

When a downstream service fails, a circuit breaker prevents upstream services from repeatedly calling it—which would waste resources, increase latency, and potentially cause cascading failures across the entire system.

The circuit breaker has three states: closed (requests flow normally), open (requests are rejected immediately without calling the failing service), and half-open (a limited number of test requests check if the service has recovered).

Real-world example: Netflix built Hystrix, a circuit breaker library that prevents cascading failures across their 1,000+ microservices. When the recommendation service fails, Hystrix opens the circuit and the UI shows trending content instead—the user experience degrades gracefully rather than the entire page failing.

Interview phrasing: "Between the notification service and the email provider, I would implement a circuit breaker. If the email provider fails 5 consecutive requests within 30 seconds, the circuit opens. Notifications are queued for retry instead of being dropped. After 60 seconds, the circuit enters half-open state and tests with a single request."

5. Graceful Degradation: Serving Reduced Functionality

A fault-tolerant system does not crash when a non-critical component fails. It continues serving core functionality with reduced features.

Examples from production systems:

Netflix: If the recommendation engine fails, users see trending content instead of personalized suggestions. The core service (video streaming) continues uninterrupted.

Twitter: If the "who to follow" service fails, the sidebar shows a generic prompt. The core service (timeline) continues uninterrupted.

Amazon: If the review service fails, product pages display without reviews. The core service (search and purchase) continues uninterrupted.

Interview application: "I would classify each service as critical or non-critical. The payment service is critical—if it fails, we show an error. The recommendation service is non-critical—if it fails, we show popular items instead of personalized ones. This graceful degradation preserves the core user experience."

6. Multi-Region Deployment: Surviving Regional Outages

A single data center or availability zone can suffer complete outages from power failures, natural disasters, or network partitions. Multi-region deployment distributes the system across geographically separated regions.

Active-passive multi-region: One region handles all traffic; the other is a standby that receives replicated data. If the primary region fails, DNS (Route 53) redirects traffic to the secondary. Simpler but wastes standby resources and has higher failover time.

Active-active multi-region: Both regions handle traffic simultaneously. Users are routed to the nearest region via latency-based DNS. If one region fails, the other absorbs all traffic. More complex—requires conflict resolution for cross-region writes—but provides lower latency and faster failover.

Interview phrasing: "For four-nines availability, I would deploy active-active across us-east-1 and eu-west-1. Route 53 with latency-based routing directs users to the nearest region. DynamoDB Global Tables replicate data bidirectionally with single-digit millisecond lag. If us-east-1 fails completely, eu-west-1 handles 100% of traffic within 60 seconds."

7. Monitoring and Observability: Detecting Failures Before Users Do

High availability is meaningless without the ability to detect and respond to failures quickly. Monitoring provides the visibility that makes all other strategies effective.

The three pillars of observability: Metrics (quantitative measurements like latency, error rate, throughput), Logs (detailed records of individual events), and Traces (end-to-end path of a request through distributed services).

SLOs and error budgets: An SLO (Service Level Objective) of 99.99% availability means your error budget is 52 minutes of downtime per year. When the error budget is consumed, new deployments stop until reliability improves. Google popularized this approach through their SRE practices.
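
Error budgets are straightforward arithmetic over the SLO. A sketch of computing a rolling 30-day budget and how much of it remains (the consumed figure is an example value):

```python
def error_budget_minutes(slo_pct, window_days=30):
    """Allowed downtime (minutes) within a rolling window for a given SLO."""
    return window_days * 24 * 60 * (1 - slo_pct / 100)

budget = error_budget_minutes(99.99)   # ~4.3 minutes per 30 days
consumed = 3.0                         # minutes of downtime so far (example value)
remaining = budget - consumed
print(f"budget={budget:.1f} min, remaining={remaining:.1f} min")
if remaining <= 0:
    print("Error budget exhausted: pause feature deployments, focus on reliability")
```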

Interview phrasing: "I would set up CloudWatch alarms on p99 latency (threshold: 500ms), error rate (threshold: 1%), and consumer lag (threshold: 10,000 messages). If any alarm fires, the on-call engineer receives a PagerDuty notification within 60 seconds."

For structured practice incorporating high availability and fault tolerance into complete system design solutions, Grokking the System Design Interview includes reliability strategies in every design problem. For advanced patterns including multi-region active-active deployment, consensus protocols, and chaos engineering, Grokking the Advanced System Design Interview covers production-scale reliability architectures.

How to Discuss Availability and Fault Tolerance in Interviews

During requirements: "What is the availability target? Is this a system where 52 minutes of annual downtime is acceptable (four nines) or do we need five nines?" This question demonstrates that you think about reliability from the start.

During high-level design: "I am designing with no single points of failure. Every component runs as multiple instances across at least two availability zones." Point to each component on your diagram and confirm redundancy.

During deep dive: "The database is the most critical component. I would use Aurora with 3 read replicas and automated failover. If the primary fails, Aurora promotes a replica within 30 seconds. During failover, writes are briefly unavailable but reads continue from replicas."

During trade-offs: "Achieving four nines with multi-AZ deployment is relatively straightforward. Achieving five nines requires active-active multi-region, which introduces cross-region replication lag and conflict resolution complexity. For this use case, four nines is sufficient and the operational simplicity is worth the trade-off."

Frequently Asked Questions

What is the difference between high availability and fault tolerance?

High availability minimizes downtime by running redundant components so the system stays accessible. Fault tolerance ensures the system continues functioning correctly when components fail, through graceful degradation, circuit breakers, and intelligent error handling. A system can be highly available but not fault-tolerant if it serves incorrect results during failures.

How do I discuss availability targets in a system design interview?

Quantify using "nines": 99.9% (8.76 hours/year downtime), 99.99% (52 minutes/year), 99.999% (5.26 minutes/year). State the target appropriate for the use case, explain what strategies achieve it, and discuss the cost trade-off of adding more nines.

What are the most common high availability patterns?

Redundancy (multiple instances of every component), replication (multiple copies of data), load balancing (distributing traffic), automated failover (switching to healthy components), and multi-region deployment (surviving regional outages).

What is a circuit breaker in system design?

A pattern that prevents cascading failures by stopping requests to a failing downstream service. It has three states: closed (normal flow), open (requests rejected immediately), and half-open (testing if the service recovered). Netflix's Hystrix is the most famous implementation.

How does Netflix achieve high availability?

Netflix runs 1,000+ microservices on AWS across multiple regions. They use Chaos Monkey to intentionally kill production instances, circuit breakers (Hystrix) to prevent cascading failures, a custom CDN (Open Connect) with ISP-level caching, and graceful degradation (trending content replaces failed personalization). This architecture serves 200M+ subscribers with minimal perceived downtime.

What is graceful degradation?

A fault tolerance strategy where non-critical features are disabled or replaced with simpler alternatives when their backing services fail, while core functionality continues. Example: Amazon shows product pages without reviews when the review service fails. The core purchase flow remains unaffected.

How many replicas should a database have for high availability?

Three replicas across separate availability zones is the standard recommendation. This tolerates the failure of any single replica while maintaining both read capacity and data durability. For multi-region availability, add replicas in a second region.

What is the relationship between CAP theorem and high availability?

The CAP theorem states that during a network partition, you must choose between consistency and availability. Systems that prioritize availability (Cassandra, DynamoDB in default mode) serve requests even when some replicas are unreachable, potentially returning stale data. Systems that prioritize consistency (PostgreSQL, Spanner) reject requests during partitions to prevent stale reads.

When should I discuss availability in a system design interview?

During requirements gathering, ask about the availability target. During high-level design, identify and eliminate single points of failure. During the deep dive, explain failover mechanisms for critical components. During trade-offs, discuss the cost of achieving higher availability levels. Proactively raising availability at each phase signals senior-level thinking.

How do I test fault tolerance in production?

Chaos engineering: intentionally inject failures (kill servers, introduce network latency, corrupt data) in a controlled manner and verify the system recovers correctly. Netflix's Simian Army, Gremlin, and AWS Fault Injection Simulator are common tools. Mentioning chaos engineering in an interview signals production-grade thinking.

TL;DR

High availability (staying accessible) and fault tolerance (maintaining correctness during failures) are distinct but complementary system design concepts. Achieve high availability through redundancy (no single points of failure), replication (multiple data copies), load balancing (traffic distribution), and automated failover. Achieve fault tolerance through graceful degradation (reduced functionality instead of crash), circuit breakers (preventing cascading failures), retries with backoff, and idempotent operations. Quantify availability using nines: 99.99% allows 52 minutes of downtime annually. In interviews, discuss availability during every phase—requirements, design, deep dive, and trade-offs. Netflix's architecture is the canonical example: Chaos Monkey verifies fault tolerance by intentionally killing production servers, circuit breakers prevent cascading failures, and graceful degradation ensures users always see content even when personalization services fail.
