01 Why This Is the Hardest Concept on the Rubric
Most candidates can recite "CAP theorem." Most cannot reason about whether their database choice should be CP or AP for a given workload. Most can name "eventual consistency." Most cannot explain what specifically goes wrong if you read a value before its replication has caught up. The gap between knowing the words and using them is wider here than for any other concept on the rubric.
This is also the area where staff loops grade hardest. Senior loops accept "we'd use Postgres with read replicas, eventual consistency on the replicas, strong consistency on the primary." Staff loops want you to walk through what specifically happens when a user reads from a lagging replica, what the application does about it, and how the overall guarantees compose. The vocabulary alone is not enough.
This page treats replication and consistency as one topic. They look like two but they aren't separable: replication creates the consistency problem, and consistency models exist to describe what replication produces. The page is organized as a single argument with two faces.
The Senior Move
Don't recite CAP. Use it. The strong sentence is "Postgres with synchronous replication is CP under partition: writes block until the replica acknowledges, so during a partition we lose availability of writes but never lose consistency. Cassandra with quorum reads and writes is AP-leaning: it stays available under most partitions but you can read stale data." That's the move. CAP as a tool, applied to specific systems, with the consequence named.
02 Why We Replicate (and Why It's a Tradeoff)
Replication means keeping more than one copy of the data on different machines and keeping those copies in sync somehow. We do this for three reasons:
- Availability. If one copy goes down, another can serve. Single-machine durability is high but single-machine availability is low. Replication is how you survive hardware failure, network partition, and operational mistakes.
- Read scale. Multiple replicas can serve reads in parallel. A primary plus three replicas can serve roughly 4x the read load of the primary alone, with the caveat that reads from replicas may be slightly stale.
- Geographic latency. A user in Tokyo querying a database in Virginia waits at least 150ms round-trip. A replica in Tokyo cuts that to a few milliseconds. Geographic replication is how you serve a global audience.
Each reason creates a different design pressure. Replicating for availability wants synchronous replication so no data is lost on failover. Replicating for read scale wants asynchronous replication so the primary doesn't slow down. Replicating for geography wants something in between, because synchronous cross-region replication adds the geographic latency you were trying to avoid.
The three goals are in tension. The replication topology you choose, and the consistency guarantees you accept, are how you negotiate the tension. Most production systems pick a default position and document which dimension they sacrificed.
03 The Three Replication Topologies
Three topologies cover almost every production system. Each has a default consistency model that follows from how it works. The topology choice is the load-bearing decision; consistency falls out of it.
Three Replication Topologies
The three replication topologies. Leader-follower is the default for most products. Multi-leader appears for active-active geographic deployments. Leaderless shows up in highly-available NoSQL systems.
Leader-Follower
Default. Use when in doubt.
One node accepts writes (the leader, sometimes called primary). Other nodes (followers, replicas) receive a stream of changes from the leader and apply them in order. Writes go to the leader; reads can go to either, with the caveat that follower reads may be slightly stale.
Simple to reason about, mature operationally, good fit for most workloads. The default for Postgres, MySQL, MongoDB, and most managed services.
Multi-Leader
Active-active across regions
Two or more nodes each accept writes and replicate to each other asynchronously. Used when you need writes accepted in multiple regions for latency, with the understanding that the same record might be written in two regions before they sync up.
Low write latency in every region. Conflict resolution is the cost: the system has to decide what to do when two regions wrote the same record.
Leaderless
High availability, tunable consistency
No designated leader. The client writes to multiple nodes in parallel and reads from multiple nodes, requiring a quorum to consider the operation successful. Cassandra, DynamoDB, and Riak use this model.
Maximum availability under partitions. Consistency is tunable per-query through the W (write quorum) and R (read quorum) parameters. The cost is operational complexity.
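A minimal sketch of the quorum arithmetic, assuming the usual notation (N replicas per key, W write acknowledgements, R read responses): the standard rule is that if R + W > N, every read quorum overlaps every write quorum, so a read always touches at least one replica that holds the latest acknowledged write.

```python
# Quorum math in a leaderless store (Dynamo/Cassandra style).
# N = replicas per key, W = write quorum, R = read quorum.

def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """True if every read quorum must overlap every write quorum."""
    return r + w > n

# Typical tunings for N = 3:
print(is_strongly_consistent(3, 2, 2))  # True  -> quorum reads and writes overlap
print(is_strongly_consistent(3, 1, 1))  # False -> fast, but reads can be stale
print(is_strongly_consistent(3, 3, 1))  # True  -> write to all, read from any one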
Synchronous vs asynchronous within leader-follower
Within leader-follower, there's a second decision: synchronous or asynchronous replication.
- Synchronous. The leader doesn't acknowledge a write until at least one follower has confirmed it. Data is durably replicated before the user sees success. The cost is higher write latency: the write waits for the slowest synchronous follower.
- Asynchronous. The leader acknowledges the write immediately and ships the change to followers in the background. Writes are fast. The cost is potential data loss: if the leader fails before the change is replicated, the write is gone.
- Semi-synchronous. The leader waits for at least one follower to acknowledge before responding to the client. Other followers receive the change asynchronously. A practical compromise that gets most of the durability benefit without the latency cost of waiting for everyone.
Most production systems run semi-synchronously by default and tolerate some replication lag. Reaching for full synchronous replication is appropriate for financial systems where any data loss is unacceptable.
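A sketch of how the three modes differ in when the leader acknowledges, using a toy in-memory leader rather than any real database's API; the follower objects and their apply method are assumptions for illustration only.

```python
import concurrent.futures

# Toy leader. `followers` is a hypothetical list of objects exposing a blocking
# apply(entry) method; `pool` is a thread pool used for background shipping.

def write_synchronous(log: list, followers: list, entry) -> str:
    log.append(entry)
    for f in followers:              # wait for every follower before acking
        f.apply(entry)
    return "ack"                     # latency is set by the slowest follower

def write_asynchronous(log: list, followers: list, entry,
                       pool: concurrent.futures.Executor) -> str:
    log.append(entry)
    for f in followers:
        pool.submit(f.apply, entry)  # ship the change in the background
    return "ack"                     # fast, but the entry can be lost on failover

def write_semi_synchronous(log: list, followers: list, entry,
                           pool: concurrent.futures.Executor) -> str:
    log.append(entry)
    futures = [pool.submit(f.apply, entry) for f in followers]
    concurrent.futures.wait(futures,
                            return_when=concurrent.futures.FIRST_COMPLETED)
    return "ack"                     # durable on at least one follower; rest catch up
```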
04 The Consistency Staircase
Consistency is not binary. It is a staircase of guarantees, with "no consistency at all" at the bottom and "everyone sees the same thing in the same order at the same time" at the top. Each step costs more than the one below it. Most production systems live somewhere in the middle, and the senior signal is knowing where on the staircase your application actually needs to be.
The five steps below cover almost every consistency level you'll discuss in interviews. There are more (the academic literature is full of variants), but these five are the ones that show up in practice.
Eventual Consistency
Cost: cheapest
If you stop writing, all replicas will eventually agree. No bound on when. No guarantee about what intermediate reads see. This is the weakest meaningful consistency model.
Despite the weak guarantee, it covers most product workloads. Activity feeds, profile data, content listings, analytics dashboards: all of these tolerate "the data might be a few seconds stale" without breaking.
When to use: Most read-heavy, non-financial workloads. The default for follower reads in leader-follower, the default for Cassandra and DynamoDB at quorum-of-one.
Read-Your-Writes
Cost: low
Stronger than eventual: after a user writes, that same user immediately sees their own write on subsequent reads. Other users may not see it yet, but the writer always does.
This matches how users intuitively expect applications to behave. When you post a comment, you expect to see your comment immediately. If you don't, the application looks broken even if technically nothing is wrong.
How it's implemented: Route the writer's reads to the leader (or to a known up-to-date replica) for some time after the write. Many ORMs and database clients handle this transparently.
Monotonic Reads
Cost: low to medium
If a user reads a value at time T, subsequent reads will not return an older value. The user might still see staleness, but the staleness only moves forward in time.
Without monotonic reads, a user can see a value, then see an older value on the next read. This happens when consecutive reads hit different replicas with different replication lag. It's confusing and visible to users.
How it's implemented: Pin a user's reads to a specific replica, or include a "minimum version" token in subsequent reads so the system can pick a replica fresh enough to satisfy it.
Causal Consistency
Cost: medium to high
If event A causally precedes event B (B was written knowing about A), then any reader who sees B must also see A. Causally-related events are seen in order; concurrent events may be seen in any order.
This matters for systems where causality is meaningful and visible. Comment threads, where reply B is causally after parent A. Social systems where a post and its likes have a causal relationship. Without causal consistency, users see weird orderings: a reply before its parent, a like before the thing being liked.
How it's implemented: Vector clocks or version vectors track causal relationships explicitly. The cost is operational complexity. Most systems approximate causal consistency with simpler mechanisms (per-user session pinning, application-layer ordering).
Linearizable (Strong) Consistency
Cost: highest
Every operation appears to take effect at some single instant between its start and end. Any read that starts after a write completes is guaranteed to see that write. This is "strong consistency" in everyday usage.
Linearizability is what most people mean when they say "consistent." It's also expensive: the system needs coordination on every write, which limits throughput, increases latency, and reduces availability under partition.
When to use: Financial transactions, inventory at checkout, authentication state, any case where reading stale data would cause a real-world incorrect outcome. Spanner, FoundationDB, and similar systems offer linearizability globally; Postgres offers it on the primary.
Where most production systems live
Most production systems are eventual or read-your-writes for most of their data, with linearizable available for the operations that need it. A typical web application might have:
- User account writes: linearizable on the primary database
- Profile reads: read-your-writes for the writer, eventual for everyone else
- Activity feed: eventual consistency, refreshes on demand
- Payment transactions: linearizable, possibly with multi-phase commit
- Search results: eventual consistency, indexed asynchronously
This is the senior insight: you don't pick one consistency model for your whole application. You pick consistency models per-operation, defaulting to eventual, escalating to stronger only where the application demands it. Defaulting to linearizable everywhere is the over-engineering trap.
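One way to make the per-operation choice explicit is a policy table the data access layer consults. The operation names and levels below are illustrative, not any specific framework's API.

```python
from enum import Enum

class Consistency(Enum):
    EVENTUAL = "eventual"
    READ_YOUR_WRITES = "read_your_writes"
    LINEARIZABLE = "linearizable"

# Per-operation policy: default to eventual, escalate only where needed.
CONSISTENCY_POLICY = {
    "account.write": Consistency.LINEARIZABLE,
    "profile.read":  Consistency.READ_YOUR_WRITES,   # for the writer
    "feed.read":     Consistency.EVENTUAL,
    "payment.write": Consistency.LINEARIZABLE,
    "search.read":   Consistency.EVENTUAL,
}

def pool_for(operation: str) -> str:
    """Route linearizable operations to the primary; everything else to replicas."""
    level = CONSISTENCY_POLICY.get(operation, Consistency.EVENTUAL)
    return "primary" if level is Consistency.LINEARIZABLE else "replicas"
```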
Consistency is not binary. It is a staircase. The senior signal is knowing where on the staircase each piece of data actually needs to be.
05 CAP and PACELC as Reasoning Tools
CAP and PACELC are the two formal frameworks for thinking about consistency tradeoffs in distributed systems. They are taught in every system design guide. They are also misused in almost every interview. The point of the frameworks is not to recite them; it is to use them to reason about your specific design.
CAP, the original
CAP says: in the presence of a network Partition, you must choose between Consistency and Availability. You cannot have all three when partitions happen.
The framework's value is in forcing the choice during a partition. A partitioned database has to either refuse to serve writes (giving up availability) or serve writes that might conflict with the other side (giving up consistency). There is no third option.
- CP systems sacrifice availability for consistency. Examples: Spanner, FoundationDB, MongoDB with majority writes, single-primary Postgres. During a partition, they refuse to serve writes that can't be coordinated.
- AP systems sacrifice consistency for availability. Examples: Cassandra at low quorum, Riak, DynamoDB with eventual consistency reads. During a partition, they serve writes anyway and reconcile later.
The mistake candidates make is thinking CAP applies all the time. It doesn't. When there's no partition (which is most of the time), you can have both consistency and availability. CAP only forces the choice when the network breaks.
PACELC, the refinement
PACELC extends CAP with a question about what happens when there isn't a partition. If there's a Partition, choose between Availability and Consistency. Else, choose between Latency and Consistency.
The "else" half is the more interesting half in practice, because partitions are rare but everyday operations have to choose between latency and consistency continuously. A synchronous-replication database that waits for all replicas to acknowledge writes pays higher latency to gain consistency. An async-replication database that acknowledges writes immediately has low latency at the cost of a window where the data might not be replicated yet.
Common database classifications:
| Database | P → A or C? | E → L or C? | Class |
|---|---|---|---|
| Postgres (sync) | C | C | PC/EC |
| Postgres (async) | C | L | PC/EL |
| Cassandra | A | L (default) | PA/EL |
| DynamoDB (eventual) | A | L | PA/EL |
| Spanner | C | C | PC/EC |
| MongoDB (majority) | C | C | PC/EC |
The classification reveals the architectural philosophy. Cassandra prioritizes availability and latency at the cost of consistency. Spanner prioritizes consistency at the cost of latency. Both choices are valid; they target different workloads.
The interview move on CAP and PACELC
Don't recite the letters. Use them to explain a specific choice. "I'd use Postgres synchronous replication for the payment system. PACELC says it's PC/EC: we accept higher write latency and reduced availability under partition in exchange for consistency. The application can't tolerate stale or lost financial data, so the tradeoff is the right one." That sentence does the work.
A Common Misunderstanding
"CP" doesn't mean "always consistent." It means "if there's a partition, prefer consistency." When there's no partition (the common case), CP systems are still as consistent as they can be. The framework is about behavior under partition specifically. Saying "Cassandra is AP so it's eventually consistent" conflates two things: AP describes partition behavior, eventual consistency describes the default consistency model. Cassandra can be configured for stronger consistency through quorum settings; it's still AP under partition.
06 Replication Lag and What Breaks Because of It
In any asynchronous-replication system, there's a window of time between when the leader has accepted a write and when all followers have applied it. This is replication lag. Most of the time it's milliseconds. Sometimes it's seconds. Under load or partial failure, it can be minutes or hours. Each of these regimes causes different application-visible problems.
Failure Mode 01
Reading your own writes from a stale replica
The user posts a comment. The write goes to the leader. The user immediately reads from a follower that hasn't received the change yet. The comment doesn't appear. The user thinks the application is broken. They post the comment again. Now there are two comments.
This is the most common consistency-related bug in production. The fix is read-your-writes: route the user's reads to the leader (or a known fresh replica) for some time after their write. Most application frameworks support this with "sticky sessions" or "read-after-write" routing.
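A minimal sketch of that routing, assuming a hypothetical data-access layer with separate primary and replica pools: record when a user last wrote, and pin that user's reads to the primary for a short window afterwards.

```python
import time

PIN_SECONDS = 30
_last_write_at: dict[str, float] = {}   # user_id -> timestamp of last write

def record_write(user_id: str) -> None:
    """Call after a successful write so the user's next reads go to the primary."""
    _last_write_at[user_id] = time.monotonic()

def choose_pool(user_id: str) -> str:
    """Primary while the pin window is open, replica pool otherwise."""
    pinned_until = _last_write_at.get(user_id, 0.0) + PIN_SECONDS
    return "primary" if time.monotonic() < pinned_until else "replicas"
```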
Failure Mode 02
Monotonic read violations
The user reads a value, sees X. Refreshes the page. Sees an older value, X-1. Refreshes again. Sees X. The data appears to time-travel.
This happens when consecutive reads hit different replicas with different lag. Reader hits replica A (caught up), then replica B (lagging), then replica A again. The fix is to pin a user's reads to a single replica, or to track a "minimum version" the reader has seen and select replicas that meet it.
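A sketch of the minimum-version approach, assuming each replica can report the last log position it has applied (modeled here as a plain integer per replica name).

```python
# Last applied log position per replica, e.g. a WAL offset reported by each node.
replica_positions = {"replica_a": 1042, "replica_b": 998, "replica_c": 1040}

def pick_replica(min_version_seen: int) -> str:
    """Return any replica at least as fresh as what the reader has already seen."""
    fresh = [name for name, pos in replica_positions.items() if pos >= min_version_seen]
    return fresh[0] if fresh else "primary"   # fall back to the leader if none qualify

# The client threads the highest version it has observed through its reads:
print(pick_replica(1000))   # replica_a or replica_c; replica_b is too stale
print(pick_replica(2000))   # "primary"
```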
Failure Mode 03
Causal violation in feeds
User posts a comment. The comment appears in the feed (replicated to the feed's followers). A reply to the comment is posted before the original comment finishes replicating to the followers. The reply appears before the comment it's replying to. Discussion threads look broken.
The fix is causal consistency at the application or middleware layer, often with vector clocks. Most production systems approximate this by ordering related writes through the same replication stream so they can't be reordered.
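For illustration, a tiny vector-clock comparison: each write carries a map of per-node counters, and write A happens-before write B when every counter in A is less than or equal to the corresponding counter in B and at least one is strictly smaller; otherwise the writes are concurrent and may be shown in any order.

```python
def happens_before(a: dict[str, int], b: dict[str, int]) -> bool:
    """True if a causally precedes b under vector-clock ordering."""
    nodes = set(a) | set(b)
    leq = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    strict = any(a.get(n, 0) < b.get(n, 0) for n in nodes)
    return leq and strict

parent = {"n1": 1}
reply = {"n1": 1, "n2": 1}            # written after seeing the parent
print(happens_before(parent, reply))  # True: never show the reply without the parent
```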
Failure Mode 04
Lost writes during failover
The leader accepts a write but fails before replicating it to any follower. A follower is promoted to leader. The write is gone, with no record that it ever happened. The user sees their action succeed (because the original leader acknowledged) but it's not in the new database.
The fix is synchronous or semi-synchronous replication for writes that must not be lost. Accept the latency cost. For financial systems, this is the only acceptable choice.
The depth probe
"What happens if your replica is 30 seconds behind the leader?" is a question with concrete consequences depending on the application. Strong response: "Reads from that replica would be 30 seconds stale. For our profile data, fine. For a checkout flow, we'd route to the leader. For activity feeds, we'd accept the staleness because users won't notice the difference between 'now' and '30 seconds ago' on most feed content." That's reasoning about lag, not just naming the concept.
07 Conflict Resolution in Multi-Leader and Leaderless Systems
If two writers write to the same record on different leaders before the leaders sync up, there's a conflict. The system has to decide which write wins, or whether to keep both, or whether to merge them.
Three common strategies:
Last-Write-Wins (LWW)
Each write carries a timestamp. The write with the highest timestamp wins; older writes are discarded. Simple, fast, and operationally cheap.
The cost is silent data loss. If two users write to the same field, one of their writes is gone. The user whose write was discarded has no indication anything went wrong. This is acceptable for some workloads (caches, last-active timestamps, user preferences where the "latest" really is the answer) and unacceptable for others (anywhere the user expects their write to persist).
Implementation gotcha: clock skew across servers can produce wrong winners. Most LWW systems use logical clocks (Lamport timestamps, hybrid logical clocks) rather than wall-clock time to avoid this.
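A sketch of LWW keyed on a logical counter plus a node-id tie-break instead of wall-clock time; the counter values and node names are illustrative.

```python
def lww_merge(a: tuple[int, str, object], b: tuple[int, str, object]):
    """Each write is (logical_counter, node_id, value); the larger key wins."""
    return a if (a[0], a[1]) > (b[0], b[1]) else b

write_us = (17, "node_us", {"theme": "dark"})
write_eu = (18, "node_eu", {"theme": "light"})
print(lww_merge(write_us, write_eu))   # EU write wins; the US write is silently dropped
```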
Application-layer merge
The system surfaces the conflict to the application, which decides how to resolve it. Useful when the application can compute a meaningful merge: a shopping cart can union two conflicting carts; a counter can sum two conflicting increments; a comment can keep both versions.
The cost is application complexity. Every record that can conflict needs merge logic. The application has to handle three-way comparisons. Tests get harder. Debugging gets harder.
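A sketch of the shopping-cart merge, under the assumption that unioning items and keeping the larger quantity seen for each item is the acceptable business rule.

```python
def merge_carts(cart_a: dict[str, int], cart_b: dict[str, int]) -> dict[str, int]:
    """Carts map item_id -> quantity; keep every item, take the max quantity seen."""
    merged = dict(cart_a)
    for item, qty in cart_b.items():
        merged[item] = max(merged.get(item, 0), qty)
    return merged

print(merge_carts({"sku_1": 2}, {"sku_1": 1, "sku_2": 1}))  # {'sku_1': 2, 'sku_2': 1}
```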
CRDTs (Conflict-free Replicated Data Types)
Specialized data structures that mathematically guarantee conflict-free merging. A G-Counter (grow-only counter) can be merged by taking the maximum of each replica's count. An OR-Set can union concurrent additions. The system can merge replicas without coordination and the result is deterministic.
CRDTs are powerful but limited. They work for specific data shapes (counters, sets, sequences). They're complex to implement and to debug. They're seeing adoption in collaborative editing systems (Figma uses a CRDT-like structure; Google Docs uses the related operational-transformation approach) and some distributed databases (Riak, AntidoteDB), but they're not yet a default tool.
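A minimal G-Counter sketch showing why the merge is conflict-free: each replica only increments its own slot, and merging takes the per-replica maximum, so merges commute and converge regardless of order.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merge by per-slot maximum."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

    def value(self) -> int:
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(3); b.increment(2)
a.merge(b); b.merge(a)
print(a.value(), b.value())   # 5 5 -> both replicas converge without coordination
```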
The interview move on conflicts
If the question involves multi-region writes or leaderless replication, the interviewer is going to probe how you'd handle conflicts. The strong answer commits to a strategy and explains the cost. "We'd use last-write-wins for user preferences because the cost of losing a stale value is low. For shopping carts we'd use application-layer merging because users expect their items to persist. We'd avoid CRDTs unless we had a specific data type that warranted the complexity." That sentence is the difference between someone who has thought about conflicts and someone who hasn't.
08 Cross-Region Replication: The Hardest Case
Replicating across data centers in the same region is straightforward. Replicating across regions (US to Europe to Asia) introduces hard physical constraints. The speed of light bounds round-trip latency: New York to Tokyo is roughly 150ms, even on perfect fiber. Synchronous replication across regions adds that latency to every write.
Three architectural patterns for cross-region replication:
Single-region primary with async replicas elsewhere
One region is the primary. Writes go there. Other regions have asynchronous read replicas. Writes from non-primary regions cross the region boundary, paying the latency cost. Reads from non-primary regions are local but possibly stale.
This is the simplest pattern and the right default. Use it unless you have specific requirements that demand something more complex.
Active-active with conflict resolution
Each region accepts writes locally and replicates asynchronously to other regions. Writes are fast in every region. The cost is conflicts when the same record is written in two regions before they sync up.
Use this when local write latency matters more than perfect consistency, and when conflicts are rare or the data shape supports merge. Common in collaborative tools and some social products.
Globally-distributed strongly-consistent (Spanner-style)
Specialized databases that provide strong consistency across regions through coordinated commits, often using atomic clocks or hybrid logical clocks. Writes are slow (you pay cross-region latency on every commit) but the consistency model is the same as a single-region database.
Use this when you need both global presence and strong consistency, and you can absorb the latency cost. Financial systems and inventory at global scale are the canonical examples.
2026 Reality Check
For most products, cross-region replication is for disaster recovery, not for active-active write throughput. A primary region with read replicas in other regions handles most use cases. The complexity of true active-active with conflict resolution is rarely worth it unless you have specific latency or sovereignty requirements that demand it.
Data sovereignty as a forcing function
Increasingly, cross-region replication is shaped by legal requirements, not just technical ones. GDPR restricts where EU user data can be stored and how it can be transferred outside the EU. Various national data protection laws have similar provisions. The replication topology has to honor these constraints.
The pattern that emerges is per-region partitioning of data plus cross-region replication only for non-personal data. A user's profile and activity live in their region. Reference data (product catalogs, content) replicates globally. The replication design follows the sovereignty boundary.
09 How Replication and Consistency Interact With Other Concepts
- Replication × Sharding. Sharding gives you horizontal scale; replication gives you availability. You usually need both. Each shard has its own replicas. Failover is per-shard, not system-wide. The sharding deep-dive covers the interaction.
- Replication × Caching. Caches add another consistency layer. The cache might be fresher than a lagging replica, or staler. Reasoning about end-to-end consistency requires accounting for all layers, not just the database. The caching deep-dive covers cache invalidation.
- Replication × Database selection. Different databases have different replication models, and the model often matters more than the database name. Choosing Postgres versus Cassandra is largely a choice between leader-follower and leaderless replication semantics. The database selection deep-dive covers this.
- Consistency × Application logic. The right consistency level depends on what the application is trying to do. Authentication can't tolerate eventual consistency; activity feeds can. Picking the consistency model is a product decision, not just an infrastructure decision.
- Replication × Observability. Replication lag is one of the metrics you have to monitor in production. Lag spikes are an early warning of capacity problems, network issues, or hot keys. The on-call runbook for "replica lag is 30 seconds" should be specific.
For more on cross-concept interactions, see the concepts library hub.
10 Practice Scenarios
Three scenarios. Read the setup. Decide your replication and consistency approach before opening the reveal.
Scenario 01
A user posts a comment, then immediately reloads the page and doesn't see their comment. What's happening, and what do you do?
The system has a single Postgres primary and three async read replicas. Reads are load-balanced across replicas. Replication lag is normally 50ms, sometimes spikes to 2 seconds under load.
How to think about this
This is a read-your-writes failure. The user wrote to the primary, then read from a replica that hadn't yet caught up. The user sees their own write missing.
Two reasonable fixes:
1. Route the writer's reads to the primary. For some short window after a write (say 30 seconds), the writer's reads bypass the replica pool and go directly to the primary. Simple to implement, slightly increases primary read load.
2. Track minimum freshness in the read. The write returns a version number. Subsequent reads include "give me data at least as fresh as version N." The router selects a replica that satisfies the constraint, falling back to the primary if needed. More complex but doesn't add load to the primary unnecessarily.
Strong answer: "Read-your-writes routing for the writer for 30 seconds after their write. Most users won't notice; the writer's experience is the one that needs to be perfect."
Scenario 02
You're designing a global e-commerce platform. Inventory must be accurate (overselling is bad). User profiles can be slightly stale. How do you handle replication?
Customers in 40 countries. Three primary regions: US, EU, APAC. Inventory is shared globally (a product is either available or sold out). Profiles are per-region (tied to the user's home region for sovereignty).
How to think about this
Two different replication strategies for two different data types.
Inventory: Strong consistency required. Overselling is real money lost. The right choice is a globally-distributed, strongly consistent store, probably Spanner or CockroachDB. Writes are slower (cross-region commit latency), but inventory operations aren't latency-critical compared to inventory accuracy. We accept the cost.
Profiles: Per-region storage with eventual consistency across regions. EU users' profiles live in EU; US in US; APAC in APAC. Reads are local and fast. Cross-region reads are rare and tolerate staleness.
Strong answer: "Spanner or equivalent for inventory because correctness matters more than write latency. Per-region Postgres for profiles with async replication across regions for failover only. The two stores are isolated; we don't try to globally-consistently replicate everything."
Scenario 03
A collaborative document editor needs to allow simultaneous edits from multiple users. How do you handle replication and conflicts?
Like Google Docs or Figma. Multiple users edit the same document at the same time. Edits should appear in real-time across all users. The document state should converge to the same final state regardless of order.
How to think about this
This is the canonical CRDT use case. Operational Transformation (OT) is the older technique; CRDTs (specifically sequence CRDTs like RGA or Y.js's data structures) are the modern preference.
The architecture: each client maintains a local copy of the document. Edits are represented as operations (insert character at position, delete range, etc.) that can be applied in any order and produce the same result. The client sends operations to a coordinator which broadcasts them to all other clients. Each client applies operations as they arrive; the CRDT properties guarantee convergence.
Replication is effectively peer-to-peer through the coordinator. There's no leader; every client's local copy is a valid version. Persistent storage is a separate concern (snapshot the document state periodically, log operations between snapshots).
Strong answer: "CRDT-based replication for the document. Each edit is a CRDT operation. Operations broadcast through a coordinator to all clients. Persistent storage holds snapshots and an operation log. We don't try to make the document strongly consistent across users; the CRDT properties give us convergence without coordination."
11 Replication and Consistency FAQ
What's the difference between replication and sharding?
Replication keeps multiple copies of the same data; sharding splits data across machines. Replication gives you availability and read scale; sharding gives you write scale and storage scale. They're complementary: most systems shard their data and replicate each shard. Confusing the two is a common mistake. The sharding deep-dive covers the differences.
Is "eventual consistency" weaker than "strong consistency" or just different?
Weaker: any guarantee strong consistency provides, eventual consistency cannot match. The "eventual" part is the gap: at any moment in time, a read might be stale. The system promises convergence, not immediacy. For most workloads, eventual consistency is sufficient and the latency benefits are worth the staleness window. For workloads where staleness causes incorrect outcomes (financial, inventory, authentication), you need stronger guarantees.
How do I choose between synchronous and asynchronous replication?
Synchronous if data loss is unacceptable (financial, regulatory). Asynchronous if write latency matters more than the small risk of losing recent writes during a failover. Semi-synchronous (wait for one replica to acknowledge) is a practical default that gets most of the durability benefit without paying full latency cost. Start with semi-synchronous and adjust if requirements demand otherwise.
What happens to in-flight writes during a failover?
It depends on the replication mode. Synchronous: writes that were acknowledged are durable; in-flight writes that hadn't yet replicated are rejected. Asynchronous: writes that were acknowledged but hadn't yet replicated can be lost. Semi-synchronous: writes acknowledged by at least one replica are durable; writes not yet replicated to anyone can be lost. The user's experience during failover varies accordingly.
How long should I expect failover to take?
For managed services (RDS, Cloud SQL), typically 30 seconds to 2 minutes including failure detection and DNS update. For self-managed Postgres with tools like Patroni, similar range. Manual failover is much slower (minutes to hours) and is appropriate only for systems where automatic failover would create more risk than the manual process. The on-call runbook should specify expected failover time.
Should I use multi-leader replication?
Probably not, unless you have a specific reason. Multi-leader replication makes conflict resolution a constant operational burden. The use cases where it's worth the complexity (low-latency writes from many regions, offline-capable mobile clients with sync) are real but specialized. For most products, single-leader with read replicas is the right default, and async cross-region replication for disaster recovery handles geographic concerns.
What's the difference between consensus and consistency?
Consistency is what the application sees; consensus is how the system agrees on what happened. Consensus protocols (Raft, Paxos) are the mechanisms that produce strong consistency in distributed systems. They're how Spanner, etcd, and similar systems coordinate writes across replicas. You don't usually need to know consensus deeply for system design interviews, but knowing the names and that "strong consistency requires consensus" is fair game at staff level.
How does this interact with sharding?
Each shard has its own replicas. The system has N shards × M replicas per shard. Failover is per-shard (one shard can fail over without affecting others). Cross-shard transactions are hard regardless of replication, which is why sharded systems often abandon them and force application-layer reconciliation. The sharding deep-dive covers this in more detail.