How do you prevent split‑brain during failover?

Split brain is a failure state where a cluster becomes partitioned and two nodes both believe they are the active leader. That creates conflicting writes, broken invariants, and unrecoverable user impact. Preventing split brain during failover is not about a single feature. It is an end-to-end discipline that combines quorum, fencing, leases, and careful operational defaults so that only one writer can exist at any time.

Why It Matters

Split brain is the failure that turns a routine failover into data loss. When two primaries accept writes at the same time, you risk double charges, lost updates, and unique-key violations that stay invisible until reconciliation.

In a system design interview, this topic reveals whether you can reason beyond health checks and think in terms of safety, quorum, and recovery. In production, a good failover plan balances availability and consistency while keeping latency predictable and cost reasonable.

Frameworks like CAP and PACELC remind us that stronger consistency under partitions implies lower availability, and that even without partitions you still trade latency for consistency. The goal is to be deliberate about where you sit on that spectrum while removing the risk of two writers.

How It Works (Step by Step)

1. Detect failure using quorum-based health checks

A single heartbeat timeout is never enough to trigger failover. Systems use majority-based health checks across multiple control nodes. If most nodes agree that the leader is unreachable, a failure is declared. This avoids false positives from transient network delays.
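
As a minimal sketch (assuming each control node reports whether it can currently reach the leader; names and thresholds are illustrative), the majority check might look like this:

```python
def leader_unreachable(reports: dict[str, bool], total_nodes: int) -> bool:
    """Declare the leader failed only when a strict majority of control
    nodes report that they cannot reach it."""
    majority = total_nodes // 2 + 1
    unreachable_votes = sum(1 for reachable in reports.values() if not reachable)
    return unreachable_votes >= majority

# Example: 3 control nodes, 2 of which cannot reach the leader -> failover is declared.
print(leader_unreachable({"ctrl-1": False, "ctrl-2": False, "ctrl-3": True}, total_nodes=3))  # True
```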

2. Freeze or fence the old leader before promoting a new one

Before electing a new leader, the system must ensure the old one cannot continue writing. Fencing is implemented through tokens or leases, as in the sketch after the list below.

  • Fencing tokens: Each leader election generates a strictly increasing token. Any node presenting an older token is rejected.
  • Leases: The leader holds a time-limited lease that must be renewed periodically. Once expired, it cannot write until re-elected.
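
A minimal storage-side sketch that combines both ideas, assuming the coordinator issues a strictly increasing epoch at each election along with a lease expiry timestamp (all names here are illustrative):

```python
import time

class FencedStore:
    """Accepts a write only from the holder of the newest epoch whose lease is still valid."""

    def __init__(self) -> None:
        self.highest_epoch_seen = 0

    def write(self, key: str, value: str, epoch: int, lease_expiry: float) -> None:
        # Reject any writer presenting an older fencing token.
        if epoch < self.highest_epoch_seen:
            raise PermissionError(f"stale epoch {epoch}; current epoch is {self.highest_epoch_seen}")
        # Reject a leader whose lease has lapsed, even if its epoch is current.
        if time.time() >= lease_expiry:
            raise PermissionError("lease expired; re-elect before writing")
        self.highest_epoch_seen = max(self.highest_epoch_seen, epoch)
        # ... apply the write to storage here ...
```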

3. Elect a new leader using a majority vote

Consensus algorithms such as Raft and Paxos ensure that only the partition holding a majority can promote a new leader; minority partitions remain read-only. Because a majority can exist on only one side of a partition, two nodes can never both win a valid election at the same time. Fencing (next step) handles the stale leader that does not yet know it has been deposed.

4. Enforce fencing across every component

Every subsystem (API gateways, databases, cache writers, and background jobs) should validate the current epoch or lease before accepting writes. Even a single unfenced path can reintroduce split-brain risk.

5. Redirect writes through a single logical endpoint

Clients send write requests to a virtual “writer endpoint,” which always points to the active leader. This endpoint is updated after failover, and clients use a short TTL or frequent DNS refresh to avoid writing to outdated nodes.
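
A minimal client-side sketch of such an endpoint with a short TTL cache; the resolver callable stands in for a DNS or service-registry lookup:

```python
import time

class WriterEndpoint:
    """Caches the current leader address for a short TTL, then re-resolves."""

    def __init__(self, resolve, ttl_seconds: float = 5.0) -> None:
        self._resolve = resolve          # e.g. DNS or service-registry lookup
        self._ttl = ttl_seconds
        self._cached_addr = None
        self._expires_at = 0.0

    def current(self) -> str:
        now = time.time()
        if self._cached_addr is None or now >= self._expires_at:
            self._cached_addr = self._resolve()   # always points at the active leader
            self._expires_at = now + self._ttl
        return self._cached_addr

# Usage: endpoint = WriterEndpoint(lambda: "db-leader.internal:5432")
```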

6. Safely recover and rejoin the failed leader

When the old leader comes back online, it rejoins as a follower. It synchronizes its data logs, reconciles any missing entries, and only then resumes serving read traffic. Writes are not allowed until full replication is complete.

7. Handle special cases like two-node clusters

Two-node setups can’t form a majority quorum. A third “witness” node or tiebreaker vote is required. Without it, automatic failover should be disabled, and the system should enter read-only mode until both nodes reconnect.

8. Apply infrastructure-level fencing as a last resort

In environments where tokens or leases aren’t enforceable, systems use external fencing mechanisms like STONITH (Shoot The Other Node In The Head), which physically powers off the old leader to guarantee only one active writer.

9. Continuously test failover safety

Chaos testing and simulated network partitions help validate that fencing, election timing, and rejoin logic behave as expected. Measuring time-to-detect, time-to-fence, and time-to-recover helps tune thresholds for optimal reliability.
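
A minimal sketch of how a drill harness might record those timings; every callable here is a placeholder for real chaos and observability tooling:

```python
import time

def run_partition_drill(inject_partition, old_leader_is_fenced, new_leader_serving_writes):
    """Inject one partition and record time-to-fence and time-to-recover.
    Each callable is a stand-in for real fault-injection or probe tooling."""
    start = time.monotonic()
    inject_partition()                      # e.g. drop packets between the leader and the majority
    while not old_leader_is_fenced():       # the old leader's writes are being rejected
        time.sleep(0.1)
    time_to_fence = time.monotonic() - start
    while not new_leader_serving_writes():  # clients succeed against the new leader
        time.sleep(0.1)
    time_to_recover = time.monotonic() - start
    return {"time_to_fence_s": time_to_fence, "time_to_recover_s": time_to_recover}
```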

Real World Example

Consider a retail checkout service operating at Amazon-like scale. The stateful tier is a relational database with a handful of replicas in one region and asynchronous replicas in a second region. The system uses a coordinator for leader election and fencing.

  • Health and election. The control plane runs on three coordinator nodes. The database instance that holds the leader role renews a lease every second. If the majority cannot see a renewal for several seconds, the lease expires and a new leader is elected from the pool.

  • Fencing. Every transaction carries the current epoch received from the coordinator. The storage tier rejects any write that presents an older epoch. If the old leader comes back after a network partition, its epoch is stale and all writes are refused.

  • Traffic routing. The application uses a writer endpoint that always points to the coordinator chosen leader. Clients resolve it frequently and do not cache beyond a short TTL.

  • Recovery. When the failed instance returns, it is forced to follow the new leader, fetch the missing logs, and only then serve reads.
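
As a rough sketch of the lease mechanics in this setup, the leader might run a renewal loop like the following; the coordinator client and its methods are hypothetical stand-ins:

```python
import time

def lease_loop(coordinator, node_id: str, heartbeat_s: float = 1.0) -> None:
    """Renew the leadership lease every heartbeat and step down the moment renewal fails.

    `coordinator` is a hypothetical client whose renew_lease() returns False once the
    lease has expired or a newer epoch has been granted to another node.
    """
    while coordinator.renew_lease(node_id):
        time.sleep(heartbeat_s)
    # Renewal failed: stop accepting writes immediately and rejoin as a follower.
    coordinator.step_down(node_id)
```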

This recipe scales across many products. Kubernetes control planes rely on etcd with Raft to ensure a single control-plane leader. Many message logs and databases embed fencing into their write paths. The common thread is simple: a majority decides leadership, and tokens enforce it end to end.

Common Pitfalls and Trade-offs

  • Blind failover with no fencing. Promoting a new node without blocking the old one is the textbook route to split brain. Always pair promotion with token or lease enforcement.

  • Two-node clusters without a witness. With two nodes, both sides can believe they are right. Add a neutral vote or accept read-only as the safe mode.

  • Leases without clock discipline. Lease-based leadership relies on clocks that do not drift beyond tolerated bounds. Use NTP, bound the allowed skew in configuration, and prefer majority-based election over pure time-based locks; see the sizing sketch after this list.

  • Incomplete enforcement. It is not enough to put fencing at the database. Gate the job queue, cache writers, and background workers too. Any producer of writes must validate the current epoch.

  • Unbounded client caching. Clients that cache the leader address for minutes continue to send writes to an old node. Use a virtual writer name with a short TTL and teach SDKs to refresh it.

  • Overeager timeouts. If thresholds are too tight, transient pauses lead to unnecessary promotions and the cluster flaps. Tune by measuring real traffic, disk latency, and network jitter.
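
On the clock-discipline pitfall above, one simple sizing rule is to have the leader treat its lease as expiring earlier than followers do, by at least the worst-case clock skew. A minimal sketch:

```python
def safe_lease_deadline(grant_time: float, lease_duration: float, max_clock_skew: float) -> float:
    """The leader should stop writing at grant_time + lease_duration - max_clock_skew,
    so that even with worst-case drift it gives up before followers consider the
    lease expired. All values are in seconds."""
    return grant_time + lease_duration - max_clock_skew

# Example: a 10 s lease with up to 500 ms of skew is treated locally as a 9.5 s lease.
print(safe_lease_deadline(grant_time=0.0, lease_duration=10.0, max_clock_skew=0.5))  # 9.5
```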

Interview Tip

Interviewers often ask, “How would you promote a follower to primary without risking dual writers?” A strong answer highlights majority-based election and fencing tokens. For example: “I keep an epoch in ZooKeeper; only the elected leader can increment it, and all writers must attach the current epoch. Storage rejects stale epochs, so even if the old leader wakes up, its writes fail.” Close with one recovery step, such as a read-only rejoin plus log catch-up.

Key Takeaways

  • Split brain means two writers. Prevent it with majority election plus fencing.

  • Store a monotonically increasing epoch in a coordinator and attach it to every write.

  • Use a virtual writer endpoint with short TTL to move clients quickly during failover.

  • Two node clusters need a witness vote or a read only safe mode.

  • Practice partition drills to tune thresholds before production discovers the gaps.

Comparison Table

| Approach | How it prevents split brain | Latency impact | Operational complexity | Best fit |
| --- | --- | --- | --- | --- |
| Majority election with Raft or similar | Only a partition with a majority can elect a leader; the minority stays read-only | Low to moderate, due to log replication and commit on majority | Moderate; needs a coordinator quorum and stable disks | Control planes, metadata stores, configuration leaders |
| Lease-based leadership from a coordinator | Leader writes are accepted only while the lease is valid and renewed on time | Low; a simple lease read with periodic renewal | Low to moderate; depends on clock sync guarantees | Stateless services choosing one active worker, cache primaries |
| Fencing tokens with monotonically increasing epoch | Every write carries the epoch; storage rejects stale epochs | Low; one extra check per write | Moderate; requires changes across producers and storage | Databases without native consensus, job schedulers, file stores |
| Witness or tie-breaker vote | Prevents two-node deadlock by granting a third vote to one side | Very low | Low; often a simple service in a neutral zone | Two-node clusters in small footprints |
| Infrastructure fencing such as STONITH | Hard power-off blocks the old leader from touching shared storage | None on the write path, but failover may take longer to complete | High; risky if misconfigured | Legacy stacks or storage that cannot enforce tokens |

FAQs

Q1. What is split brain in a distributed system?

It is a failure state in which, after a partition, two nodes both act as leader and accept writes. The fix is majority-based election plus fencing so that only one writer can proceed.

Q2. How do fencing tokens work?

A coordinator issues a new epoch on each promotion. Every write includes that epoch. Storage and consumers reject any request that presents an older epoch, which blocks the previous leader from writing after failover.

Q3. Do I need a witness node in a two node setup?

Yes, if you want automatic promotion. Without a third vote you cannot prove a majority. Use a witness or run in a read-only safe mode until connectivity is restored.

Q4. Are quorum reads and writes enough to stop split brain?

Quorum reads and writes reduce divergence, but they do not manage leadership. You still need a single writer chosen by majority and enforced by tokens or leases. Use quorum on the data plane and consensus on the control plane.

Q5. How should I pick lease and timeout values?

Start with heartbeats around one second, a lease a few seconds longer, and a promotion delay slightly above that. Measure real jitter and tune. The goal is to avoid flapping while keeping failover quick.
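
As a purely illustrative starting point (not universal defaults):

```python
# Illustrative starting values only; measure real jitter and tune for your system.
HEARTBEAT_INTERVAL_S = 1.0
LEASE_DURATION_S = 4 * HEARTBEAT_INTERVAL_S      # tolerate a few missed heartbeats
PROMOTION_DELAY_S = LEASE_DURATION_S + 1.0       # promote only after the old lease has surely lapsed
```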

Q6. How do I test for split brain risks before launch?

Run partition drills that cut links between racks or zones, delay packets asymmetrically, and pause the leader process. Verify that old leaders are fenced, that clients find the new writer quickly, and that rejoin goes through read only catch up.

Further Learning

Build a solid foundation with the step by step patterns in the course Grokking System Design Fundamentals. For deeper failover playbooks, consensus, and fencing patterns used in large scale production, explore Grokking Scalable Systems for Interviews. Both courses include exercises that mirror real interview prompts and production incidents.
