How do you implement write fences post‑failover to prevent stale writes?

Failover ensures availability, but it also introduces the risk of stale writes—when an old leader continues writing after a new one is elected. To prevent this, systems use write fences, a technique that validates whether a write request comes from the current leader. This ensures that only the most recent and valid source can modify data after failover.

Why It Matters

Without write fences, a system recovering from a partition or crash might let outdated leaders overwrite fresh data. This leads to corruption, double increments, or lost updates. In system design interviews, explaining write fences shows deep understanding of consistency models, leader election, and failover safety—key expectations in scalable architecture discussions.

How It Works (Step-by-Step)

  1. Leader Election and Epoch Creation: When a new leader is elected, a monotonic epoch number (generation ID) is created and stored in a reliable coordination service (such as ZooKeeper or etcd).

  2. Epoch Propagation: The new leader attaches this epoch to all write requests. Every replica or data node uses it as a “fence token.”

  3. Storage Validation: Each shard or partition maintains its last accepted epoch. Before committing a write, the system checks whether the request’s epoch is greater than or equal to this stored epoch.

  4. Reject Stale Writes: If a write’s epoch is lower than the current epoch, it is automatically rejected. This prevents old leaders or lagging nodes from modifying up-to-date data.

  5. Atomic Persistence: Epoch validation and data commit must occur atomically to avoid race conditions between writers.

  6. Extend to Async Systems: In message-driven or event-sourced architectures, include the epoch token in each message header so consumers can reject messages from outdated epochs.

  7. Monitoring and Recovery: Track metrics such as stale_write_rejections or epoch_mismatch to identify failover-related inconsistencies.
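The steps above can be sketched as a minimal fenced storage layer. This is an illustrative in-memory model, not a production design; the names `FencedStore`, `write`, and `StaleWriteRejected` are invented for the example.

```python
class StaleWriteRejected(Exception):
    """Raised when a write carries an epoch older than the shard's stored epoch."""


class FencedStore:
    """Toy key-value store that enforces per-shard epoch fencing."""

    def __init__(self):
        self.data = {}            # key -> value
        self.shard_epoch = {}     # shard -> last accepted epoch (step 3)
        self.rejections = 0       # stale_write_rejections metric (step 7)

    def write(self, shard, key, value, epoch):
        current = self.shard_epoch.get(shard, 0)
        if epoch < current:
            # Step 4: reject writes from an old leader.
            self.rejections += 1
            raise StaleWriteRejected(f"epoch {epoch} < stored epoch {current}")
        # Step 5: in a real system the epoch bump and the data commit
        # must happen in one transaction; here the sketch is single-threaded.
        self.shard_epoch[shard] = epoch
        self.data[key] = value
```

A write tagged with the current leader's epoch succeeds and advances the stored epoch; a later write tagged with an older epoch raises `StaleWriteRejected` and increments the rejection counter.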

Real-World Example

Suppose a payment system running in two regions experiences a failover. Region A’s leader (epoch 3) goes down, and Region B becomes leader with epoch 4. Any lingering requests from Region A (epoch 3) are now invalid. When these stale writes reach the database, they are rejected because the stored epoch is already 4. This protects financial integrity across regions—just like how Amazon or Stripe handle failovers safely.
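The failover scenario can be replayed with a few lines. The shard name `payments` and the helper `accept_write` are hypothetical; the accept rule is the one described above, where the stored epoch is already 4 once Region B takes over.

```python
stored_epoch = {"payments": 4}   # set when Region B became leader

def accept_write(shard, epoch):
    """Accept a write only if its epoch is at least the stored epoch."""
    return epoch >= stored_epoch[shard]

assert accept_write("payments", 4)       # Region B, current leader
assert not accept_write("payments", 3)   # lingering Region A request: rejected
```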

Common Pitfalls or Trade-offs

  • Checking Epoch at API Layer Only: Enforcing fencing only at the application level leaves storage vulnerable to direct writes from outdated services.

  • Using Timestamps Instead of Epochs: Clock drift between nodes can cause false acceptance or rejection. Use integer epochs instead of timestamps.

  • Forgetting Background Workers: Batch jobs or message consumers can accidentally perform stale writes if they don’t validate epochs.

  • Non-Atomic Updates: If epoch and data updates aren’t atomic, a race condition can let invalid writes slip in.

  • Global vs. Shard-Level Epochs: A global epoch simplifies logic but can cause unnecessary rejections. Per-shard epochs offer better granularity.

  • Ignoring Idempotency: Write fences stop stale writes but not duplicates. Always pair them with idempotent write operations.
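To illustrate the non-atomic-updates pitfall, here is a sketch in which the epoch check and the commit share one critical section. The lock stands in for what a real storage engine would provide via a transaction or a conditional (compare-and-set) update; `AtomicFencedCell` is an invented name.

```python
import threading


class AtomicFencedCell:
    """Single value protected by an epoch; check and commit are atomic."""

    def __init__(self):
        self._lock = threading.Lock()
        self.epoch = 0
        self.value = None

    def write(self, value, epoch):
        # Check-and-commit in one critical section: no other writer can
        # interleave between the epoch comparison and the update.
        with self._lock:
            if epoch < self.epoch:
                return False        # stale write rejected
            self.epoch = epoch
            self.value = value
            return True
```

Without the lock, an old leader could pass the epoch check, pause, and then commit after a newer leader's write, exactly the race this pitfall warns about.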

Interview Tip

In interviews, if asked how you avoid stale writes after failover, say: “I would assign a monotonically increasing epoch to every new leader and attach it to all writes. The storage layer enforces this by rejecting any write with a lower epoch. This ensures that only the current leader can make changes after failover.”

Key Takeaways

  • Write fences block stale writes from old leaders using epoch validation.

  • Always enforce checks near data, not just at the gateway.

  • Pair with idempotency and atomic commits for stronger safety.

  • Use per-shard epochs for fine-grained control.

  • Monitor stale write rejection metrics to ensure resilience.

Comparison of Techniques

| Technique | Main Guarantee | Enforcement Layer | Strengths | Weaknesses | Best Use Case |
|---|---|---|---|---|---|
| Write Fence (Epoch Token) | Only current leader can write | Storage / Commit Path | Simple, deterministic | Requires token propagation | Leader-based systems |
| Quorum Writes | Majority confirmation per write | Consensus Layer | Strong protection | Higher latency | Critical consistency systems |
| Idempotency Keys | Prevent duplicate effects | API & DB | Good for retries | Doesn’t block stale leaders | Payments, orders |
| Compare-and-Set (CAS) | Detect concurrent updates | DB Row | Prevents overwrites | Doesn’t revoke old leaders | Config updates, profiles |
| Write Lock / Lease | Only one active writer | Coordination Service | Simple for small systems | Risk of lock expiry | Single master workloads |
| Paused Writes During Failover | No writes during transition | Control Plane | Simple, safe | Temporarily reduces availability | Manual failovers |

FAQs

Q1. What is a write fence in distributed systems?

A write fence ensures only the current leader can modify data after failover by checking a monotonically increasing epoch token.

Q2. Why do stale writes happen after failover?

They occur when the old leader continues sending write requests after a new leader is elected, leading to data overwrites.

Q3. How do you generate epoch numbers?

Epochs are typically generated and stored by a coordination service like ZooKeeper or etcd, which ensures they increase strictly on every leadership change.

Q4. Do write fences slow down performance?

Minimal impact. Epoch checks are lightweight integer comparisons, usually cached or co-located with metadata.

Q5. Are write fences enough to ensure consistency?

They prevent stale writes but should be combined with idempotency and replication consistency mechanisms.

Q6. How are write fences tested in production systems?

Through chaos engineering and simulated failover tests that verify old writers are consistently rejected.
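In the spirit of such failover tests, a minimal simulation can assert that an old writer is rejected after a leadership change. The `make_store` factory is invented for the example and uses the same accept rule described throughout this article.

```python
def make_store():
    """Return (state, write) for a toy epoch-fenced store."""
    state = {"epoch": 0, "value": None}

    def write(value, epoch):
        if epoch < state["epoch"]:
            return False                      # stale write rejected
        state["epoch"], state["value"] = epoch, value
        return True

    return state, write


state, write = make_store()
assert write("from-old-leader", 1)            # leader at epoch 1 writes
assert write("from-new-leader", 2)            # failover: new leader, epoch 2
assert not write("late-old-write", 1)         # simulated stale writer rejected
assert state["value"] == "from-new-leader"
```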

TAGS
System Design Interview
System Design Fundamentals
CONTRIBUTOR
Design Gurus Team
Copyright © 2025 Design Gurus, LLC. All rights reserved.