How do you implement write fences post‑failover to prevent stale writes?
Failover ensures availability, but it also introduces the risk of stale writes: an old leader may continue writing after a new one is elected. To prevent this, systems use write fences, a technique that validates whether a write request comes from the current leader. This guarantees that only the most recently elected leader can modify data after failover.
Why It Matters
Without write fences, a system recovering from a partition or crash might let outdated leaders overwrite fresh data. This leads to corruption, double increments, or lost updates. In system design interviews, explaining write fences shows deep understanding of consistency models, leader election, and failover safety—key expectations in scalable architecture discussions.
How It Works (Step-by-Step)
- **Leader Election and Epoch Creation:** When a new leader is elected, a monotonic epoch number (generation ID) is created and stored in a reliable coordination service (like ZooKeeper or etcd).
- **Epoch Propagation:** The new leader attaches this epoch to all write requests. Every replica or data node uses it as a “fence token.”
- **Storage Validation:** Each shard or partition maintains its last accepted epoch. Before committing a write, the system checks if the request’s epoch is greater than or equal to this stored epoch (see the sketch after this list).
- **Reject Stale Writes:** If a write’s epoch is lower than the current epoch, it’s automatically rejected. This prevents old leaders or lagging nodes from modifying up-to-date data.
- **Atomic Persistence:** Epoch validation and data commit must occur atomically to avoid race conditions between writers.
- **Extend to Async Systems:** In message-driven or event-sourced architectures, include the epoch token in each message header. Consumers reject messages from outdated epochs.
- **Monitoring and Recovery:** Track metrics such as `stale_write_rejections` or `epoch_mismatch` to identify failover-related inconsistencies.
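To make the storage-validation and rejection steps concrete, here is a minimal Python sketch of a fenced shard. The `FencedShard` class, its method names, and the in-memory dictionary are illustrative assumptions rather than any specific database's API; real systems perform the same comparison inside their commit path.

```python
import threading


class StaleWriteRejected(Exception):
    """Raised when a write carries an epoch older than the shard's last accepted epoch."""


class FencedShard:
    """Illustrative per-shard store that keeps the last accepted epoch as a fence token."""

    def __init__(self):
        self._lock = threading.Lock()   # validation and commit happen under one lock (atomic)
        self._last_epoch = 0            # highest epoch this shard has accepted so far
        self._data = {}                 # key -> value

    def write(self, key, value, epoch):
        with self._lock:
            # Fence check: reject writes from leaders with an older epoch.
            if epoch < self._last_epoch:
                raise StaleWriteRejected(
                    f"epoch {epoch} < current epoch {self._last_epoch}"
                )
            # Accept the write and remember the newest epoch seen.
            self._last_epoch = max(self._last_epoch, epoch)
            self._data[key] = value
```

The same comparison extends to async consumers: read the epoch from the message header and drop anything older than the highest epoch already applied.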
Real-World Example
Suppose a payment system running in two regions experiences a failover. Region A’s leader (epoch 3) goes down, and Region B becomes leader with epoch 4. Any lingering requests from Region A (epoch 3) are now invalid. When these stale writes reach the database, they are rejected because the stored epoch is already 4. This protects financial integrity across regions—just like how Amazon or Stripe handle failovers safely.
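Using the `FencedShard` sketch above, this scenario plays out like the following (the key and values are illustrative):

```python
shard = FencedShard()

# Region B is now leader with epoch 4 and writes successfully.
shard.write("payment:42", {"status": "captured"}, epoch=4)

# A lingering request from Region A's old leader (epoch 3) is rejected.
try:
    shard.write("payment:42", {"status": "pending"}, epoch=3)
except StaleWriteRejected as err:
    print("rejected stale write:", err)  # a real system would bump the stale_write_rejections metric here
```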
Common Pitfalls or Trade-offs
- **Checking Epoch at API Layer Only:** Enforcing fencing only at the application level leaves storage vulnerable to direct writes from outdated services.
- **Using Timestamps Instead of Epochs:** Clock drift between nodes can cause false acceptance or rejection. Use integer epochs instead of timestamps.
- **Forgetting Background Workers:** Batch jobs or message consumers can accidentally perform stale writes if they don’t validate epochs.
- **Non-Atomic Updates:** If epoch and data updates aren’t atomic, a race condition can let invalid writes slip in (see the conditional-update sketch after this list).
- **Global vs. Shard-Level Epochs:** A global epoch simplifies logic but can cause unnecessary rejections. Per-shard epochs offer better granularity.
- **Ignoring Idempotency:** Write fences stop stale writes but not duplicates. Always pair them with idempotent write operations.
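One way to avoid the non-atomic-update pitfall is to fold the epoch check into the write itself as a single conditional statement. The sketch below uses SQLite purely for illustration; the `accounts` table, its columns, and the per-row epoch are assumptions, and a production system would express the same pattern in its own storage engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER, epoch INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('acct-1', 100, 4)")  # last accepted epoch = 4


def fenced_update(conn, account_id, new_balance, epoch):
    """Update the balance only if the writer's epoch is not older than the stored one."""
    cur = conn.execute(
        "UPDATE accounts SET balance = ?, epoch = ? WHERE id = ? AND epoch <= ?",
        (new_balance, epoch, account_id, epoch),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows affected means the write was fenced off


print(fenced_update(conn, "acct-1", 120, epoch=4))  # True: current leader
print(fenced_update(conn, "acct-1", 90, epoch=3))   # False: stale leader rejected
```

Because the comparison and the write happen in one statement, there is no window between the check and the commit for a stale writer to exploit, and keeping the epoch on the row (or shard) gives the per-shard granularity mentioned above.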
Interview Tip
In interviews, if asked how you avoid stale writes after failover, say: “I would assign a monotonically increasing epoch to every new leader and attach it to all writes. The storage layer enforces this by rejecting any write with a lower epoch. This ensures that only the current leader can make changes after failover.”
Key Takeaways
- Write fences block stale writes from old leaders using epoch validation.
- Always enforce checks near data, not just at the gateway.
- Pair with idempotency and atomic commits for stronger safety.
- Use per-shard epochs for fine-grained control.
- Monitor stale write rejection metrics to ensure resilience.
Comparison Table
| Technique | Main Guarantee | Enforcement Layer | Strengths | Weaknesses | Best Use Case |
|---|---|---|---|---|---|
| Write Fence (Epoch Token) | Only current leader can write | Storage / Commit Path | Simple, deterministic | Requires token propagation | Leader-based systems |
| Quorum Writes | Majority confirmation per write | Consensus Layer | Strong protection | Higher latency | Critical consistency systems |
| Idempotency Keys | Prevent duplicate effects | API & DB | Good for retries | Doesn’t block stale leaders | Payments, orders |
| Compare-and-Set (CAS) | Detect concurrent updates | DB Row | Prevents overwrites | Doesn’t revoke old leaders | Config updates, profiles |
| Write Lock / Lease | Only one active writer | Coordination Service | Simple for small systems | Risk of lock expiry | Single master workloads |
| Paused Writes During Failover | No writes during transition | Control Plane | Simple, safe | Temporarily reduces availability | Manual failovers |
FAQs
Q1. What is a write fence in distributed systems?
A write fence ensures only the current leader can modify data after failover by checking a monotonically increasing epoch token.
Q2. Why do stale writes happen after failover?
They occur when the old leader continues sending write requests after a new leader is elected, leading to data overwrites.
Q3. How do you generate epoch numbers?
Epochs are typically generated and stored by a coordination service like ZooKeeper or etcd, which ensures they increase strictly on every leadership change.
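As a hedged illustration of that answer, the snippet below derives an epoch from a ZooKeeper sequential znode using the kazoo client; the znode path is made up, and an etcd-based setup could derive a similar number from a key's revision.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Each leadership change creates a new sequential znode; ZooKeeper appends a
# strictly increasing, zero-padded counter to the node name.
path = zk.create(
    "/payments/leader/epoch-",
    b"",
    ephemeral=True,   # the epoch node disappears if this leader's session dies
    sequence=True,
    makepath=True,
)
epoch = int(path.rsplit("-", 1)[1])  # e.g. ".../epoch-0000000004" -> 4
print("acquired epoch", epoch)
```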
Q4. Do write fences slow down performance?
Minimal impact. Epoch checks are lightweight integer comparisons, usually cached or co-located with metadata.
Q5. Are write fences enough to ensure consistency?
They prevent stale writes but should be combined with idempotency and replication consistency mechanisms.
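A brief sketch of that pairing, with the request-ID field and the in-memory dedupe set as illustrative assumptions:

```python
class FencedIdempotentStore:
    """Combines an epoch fence (blocks old leaders) with idempotency keys (blocks duplicates)."""

    def __init__(self):
        self._last_epoch = 0
        self._seen_request_ids = set()  # idempotency keys already applied
        self._data = {}

    def apply(self, request_id, key, value, epoch):
        if epoch < self._last_epoch:
            return "rejected_stale"      # fence: write from an old leader
        if request_id in self._seen_request_ids:
            return "duplicate_ignored"   # idempotency: retried request
        self._last_epoch = max(self._last_epoch, epoch)
        self._seen_request_ids.add(request_id)
        self._data[key] = value
        return "applied"
```

The epoch check blocks old leaders, while the request-ID check makes retries from the current leader safe to replay.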
Q6. How are write fences tested in production systems?
Through chaos engineering and simulated failover tests that verify old writers are consistently rejected.
Further Learning
- Deepen your understanding of leader election and failover in Grokking the System Design Interview — packed with visual examples and real-world architectures.
- For foundational topics like consistency, replication, and failure handling, explore Grokking System Design Fundamentals.
- To master distributed coordination and fault-tolerant patterns, advance to Grokking Scalable Systems for Interviews.