How do you implement write fences post‑failover to prevent stale writes?
Failover ensures availability, but it also introduces the risk of stale writes: an old leader may continue writing after a new one is elected. To prevent this, systems use write fences, a technique that validates whether a write request comes from the current leader. This guarantees that only the most recently elected leader can modify data after failover.
Why It Matters
Without write fences, a system recovering from a partition or crash might let outdated leaders overwrite fresh data. This leads to corruption, double increments, or lost updates. In system design interviews, explaining write fences shows deep understanding of consistency models, leader election, and failover safety—key expectations in scalable architecture discussions.
How It Works (Step-by-Step)
- **Leader Election and Epoch Creation:** When a new leader is elected, a monotonic epoch number (generation ID) is created and stored in a reliable coordination service (like ZooKeeper or etcd).
- **Epoch Propagation:** The new leader attaches this epoch to all write requests. Every replica or data node uses it as a “fence token.”
- **Storage Validation:** Each shard or partition maintains its last accepted epoch. Before committing a write, the system checks if the request’s epoch is greater than or equal to this stored epoch (see the sketch after this list).
- **Reject Stale Writes:** If a write’s epoch is lower than the current epoch, it’s automatically rejected. This prevents old leaders or lagging nodes from modifying up-to-date data.
- **Atomic Persistence:** Epoch validation and data commit must occur atomically to avoid race conditions between writers.
- **Extend to Async Systems:** In message-driven or event-sourced architectures, include the epoch token in each message header. Consumers reject messages from outdated epochs.
- **Monitoring and Recovery:** Track metrics such as `stale_write_rejections` or `epoch_mismatch` to identify failover-related inconsistencies.
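To make the storage-validation and rejection steps concrete, here is a minimal Python sketch of a fenced shard. The `FencedShard` class, its method names, and the in-memory dictionary are illustrative assumptions rather than any specific database's API; real systems perform the same comparison inside their commit path.

```python
import threading


class StaleWriteRejected(Exception):
    """Raised when a write carries an epoch older than the shard's last accepted epoch."""


class FencedShard:
    """Illustrative per-shard store that keeps the last accepted epoch as a fence token."""

    def __init__(self):
        self._lock = threading.Lock()   # validation and commit happen under one lock (atomic)
        self._last_epoch = 0            # highest epoch this shard has accepted so far
        self._data = {}                 # key -> value

    def write(self, key, value, epoch):
        with self._lock:
            # Fence check: reject writes from leaders with an older epoch.
            if epoch < self._last_epoch:
                raise StaleWriteRejected(
                    f"epoch {epoch} < current epoch {self._last_epoch}"
                )
            # Accept the write and remember the newest epoch seen.
            self._last_epoch = max(self._last_epoch, epoch)
            self._data[key] = value
```

The same comparison extends to async consumers: read the epoch from the message header and drop anything older than the highest epoch already applied.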
Real-World Example
Suppose a payment system running in two regions experiences a failover. Region A’s leader (epoch 3) goes down, and Region B becomes leader with epoch 4. Any lingering requests from Region A (epoch 3) are now invalid. When these stale writes reach the database, they are rejected because the stored epoch is already 4. This protects financial integrity across regions—just like how Amazon or Stripe handle failovers safely.
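Using the `FencedShard` sketch above, this scenario plays out like the following (the key and values are illustrative):

```python
shard = FencedShard()

# Region B is now leader with epoch 4 and writes successfully.
shard.write("payment:42", {"status": "captured"}, epoch=4)

# A lingering request from Region A's old leader (epoch 3) is rejected.
try:
    shard.write("payment:42", {"status": "pending"}, epoch=3)
except StaleWriteRejected as err:
    print("rejected stale write:", err)  # a real system would bump the stale_write_rejections metric here
```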
Common Pitfalls or Trade-offs
- **Checking Epoch at API Layer Only:** Enforcing fencing only at the application level leaves storage vulnerable to direct writes from outdated services.
- **Using Timestamps Instead of Epochs:** Clock drift between nodes can cause false acceptance or rejection. Use integer epochs instead of timestamps.
- **Forgetting Background Workers:** Batch jobs or message consumers can accidentally perform stale writes if they don’t validate epochs.
- **Non-Atomic Updates:** If epoch and data updates aren’t atomic, a race condition can let invalid writes slip in (see the conditional-update sketch after this list).
- **Global vs. Shard-Level Epochs:** A global epoch simplifies logic but can cause unnecessary rejections. Per-shard epochs offer better granularity.
- **Ignoring Idempotency:** Write fences stop stale writes but not duplicates. Always pair them with idempotent write operations.
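One way to avoid the non-atomic-update pitfall is to fold the epoch check into the write itself as a single conditional statement. The sketch below uses SQLite purely for illustration; the `accounts` table, its columns, and the per-row epoch are assumptions, and a production system would express the same pattern in its own storage engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER, epoch INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('acct-1', 100, 4)")  # last accepted epoch = 4


def fenced_update(conn, account_id, new_balance, epoch):
    """Update the balance only if the writer's epoch is not older than the stored one."""
    cur = conn.execute(
        "UPDATE accounts SET balance = ?, epoch = ? WHERE id = ? AND epoch <= ?",
        (new_balance, epoch, account_id, epoch),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows affected means the write was fenced off


print(fenced_update(conn, "acct-1", 120, epoch=4))  # True: current leader
print(fenced_update(conn, "acct-1", 90, epoch=3))   # False: stale leader rejected
```

Because the comparison and the write happen in one statement, there is no window between the check and the commit for a stale writer to exploit, and keeping the epoch on the row (or shard) gives the per-shard granularity mentioned above.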
Interview Tip
In interviews, if asked how you avoid stale writes after failover, say: “I would assign a monotonically increasing epoch to every new leader and attach it to all writes. The storage layer enforces this by rejecting any write with a lower epoch. This ensures that only the current leader can make changes after failover.”
Key Takeaways
- Write fences block stale writes from old leaders using epoch validation.
- Always enforce checks near data, not just at the gateway.
- Pair with idempotency and atomic commits for stronger safety.
- Use per-shard epochs for fine-grained control.
- Monitor stale write rejection metrics to ensure resilience.
Comparison Table
| Technique | Main Guarantee | Enforcement Layer | Strengths | Weaknesses | Best Use Case |
|---|---|---|---|---|---|
| Write Fence (Epoch Token) | Only current leader can write | Storage / Commit Path | Simple, deterministic | Requires token propagation | Leader-based systems |
| Quorum Writes | Majority confirmation per write | Consensus Layer | Strong protection | Higher latency | Critical consistency systems |
| Idempotency Keys | Prevent duplicate effects | API & DB | Good for retries | Doesn’t block stale leaders | Payments, orders |
| Compare-and-Set (CAS) | Detect concurrent updates | DB Row | Prevents overwrites | Doesn’t revoke old leaders | Config updates, profiles |
| Write Lock / Lease | Only one active writer | Coordination Service | Simple for small systems | Risk of lock expiry | Single master workloads |
| Paused Writes During Failover | No writes during transition | Control Plane | Simple, safe | Temporarily reduces availability | Manual failovers |
FAQs
Q1. What is a write fence in distributed systems?
A write fence ensures only the current leader can modify data after failover by checking a monotonically increasing epoch token.
Q2. Why do stale writes happen after failover?
They occur when the old leader continues sending write requests after a new leader is elected, leading to data overwrites.
Q3. How do you generate epoch numbers?
Epochs are typically generated and stored by a coordination service like ZooKeeper or etcd, which ensures they increase strictly on every leadership change.
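As a hedged illustration of that answer, the snippet below derives an epoch from a ZooKeeper sequential znode using the kazoo client; the znode path is made up, and an etcd-based setup could derive a similar number from a key's revision.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Each leadership change creates a new sequential znode; ZooKeeper appends a
# strictly increasing, zero-padded counter to the node name.
path = zk.create(
    "/payments/leader/epoch-",
    b"",
    ephemeral=True,   # the epoch node disappears if this leader's session dies
    sequence=True,
    makepath=True,
)
epoch = int(path.rsplit("-", 1)[1])  # e.g. ".../epoch-0000000004" -> 4
print("acquired epoch", epoch)
```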
Q4. Do write fences slow down performance?
Minimal impact. Epoch checks are lightweight integer comparisons, usually cached or co-located with metadata.
Q5. Are write fences enough to ensure consistency?
They prevent stale writes but should be combined with idempotency and replication consistency mechanisms.
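A brief sketch of that pairing, with the request-ID field and the in-memory dedupe set as illustrative assumptions:

```python
class FencedIdempotentStore:
    """Combines an epoch fence (blocks old leaders) with idempotency keys (blocks duplicates)."""

    def __init__(self):
        self._last_epoch = 0
        self._seen_request_ids = set()  # idempotency keys already applied
        self._data = {}

    def apply(self, request_id, key, value, epoch):
        if epoch < self._last_epoch:
            return "rejected_stale"      # fence: write from an old leader
        if request_id in self._seen_request_ids:
            return "duplicate_ignored"   # idempotency: retried request
        self._last_epoch = max(self._last_epoch, epoch)
        self._seen_request_ids.add(request_id)
        self._data[key] = value
        return "applied"
```

The epoch check blocks old leaders, while the request-ID check makes retries from the current leader safe to replay.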
Q6. How are write fences tested in production systems?
Through chaos engineering and simulated failover tests that verify old writers are consistently rejected.
Further Learning
- Deepen your understanding of leader election and failover in Grokking the System Design Interview — packed with visual examples and real-world architectures.
- For foundational topics like consistency, replication, and failure handling, explore Grokking System Design Fundamentals.
- To master distributed coordination and fault-tolerant patterns, advance to Grokking Scalable Systems for Interviews.