How would you build document versioning with diff storage + conflicts?
Document versioning with diff storage solves two problems at once: it keeps a full history of a document without exploding storage cost, and it enables safe collaboration when people edit at the same time. The core idea is simple. Store occasional full snapshots, and store the changes between versions as diffs. When two edits race, detect the conflict, try an automatic merge, and fall back to human resolution when needed. This pattern shows up in code hosts, wikis, and collaborative editors. In a system design interview, it lets you demonstrate data modeling, consistency choices, and merge strategy with clear trade-offs.
Why It Matters
- Storage efficiency. Text edits are usually small compared to the whole document. Diffs reduce cost by orders of magnitude while preserving history for audit and rollback.
- Fast collaboration. With a good conflict strategy users do not block each other with locks. They keep working and the system converges safely.
- Compliance and audit. Version trails with authorship, timestamps, and reason codes make reviews and rollbacks trustworthy.
- Performance flexibility. You can balance read cost against write cost with snapshot cadence and cache policy, which is a classic interview trade-off in scalable architecture and distributed systems.
How It Works Step by Step
Step 1: Model the data
Create four primary entities.
- Document. Stable identifier, owner, access policy, current head version id.
- Version. Monotonic sequence number, parent version id, author, timestamp, checksum of materialized content, optional branch.
- Patch. A diff from a concrete parent version to a target version. Store the diff payload, the diff type such as line or character, and stats such as insert and delete counts.
- Snapshot. A materialized copy of the document at a particular version. Use snapshots to bound reconstruction time.
Link them as a directed chain. Document points to the latest Version. Each Version either has a Snapshot payload or a Patch payload with a pointer to its parent. Maintain referential integrity so you can replay the chain.
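Here is a minimal sketch of those entities in TypeScript. Every field name and type below is an illustrative assumption rather than a prescribed schema; the key idea is the payload union, which forces each Version to carry either a snapshot or a patch with a parent pointer.

```typescript
// Illustrative shapes for the four entities; all names are assumptions.
interface Document {
  id: string;             // stable identifier
  ownerId: string;
  accessPolicyId: string;
  headVersionId: string;  // current head version
}

interface Version {
  id: string;
  documentId: string;
  seq: number;                    // monotonic sequence number
  parentVersionId: string | null; // null only for the first version
  authorId: string;
  createdAt: string;              // ISO timestamp
  checksum: string;               // checksum of the materialized content
  branch?: string;                // optional branch label
  payload:
    | { kind: "snapshot"; snapshotId: string }
    | { kind: "patch"; patchId: string };
}

interface Patch {
  id: string;
  fromVersionId: string;  // concrete parent version
  toVersionId: string;    // target version this diff produces
  diffType: "line" | "character" | "operation";
  ops: string;            // serialized diff payload
  stats: { inserts: number; deletes: number };
}

interface Snapshot {
  id: string;
  versionId: string;
  content: string;        // full materialized document text
}
```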
Step 2: Choose a diff representation
Pick a representation that fits your content and merge needs.
- Line based diffs are simple and fast for code and prose with paragraphs.
- Word or character based diffs enable higher fidelity merges for rich text.
- Operation-style diffs, such as insert/delete/retain triplets or JSON Patch, are compact and easy to transform during merges.
- For large structured documents consider chunking into paragraphs or blocks and diff per block to reduce conflict scope.
Internally you can compute diffs with the classic Myers algorithm or with libraries like diff-match-patch. Store the result as a compact sequence of operations plus a checksum of the target text for verification.
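A small sketch of that using the diff-match-patch JavaScript port on npm (the class is typically the package's default export; adjust the import to your build setup). The helper names makePatch and applyStoredPatch are assumptions of this article, and later sketches reuse applyStoredPatch.

```typescript
import DiffMatchPatch from "diff-match-patch";
import { createHash } from "node:crypto";

const sha256 = (text: string) => createHash("sha256").update(text).digest("hex");

// Produce the stored Patch payload: serialized ops plus a target checksum.
export function makePatch(baseText: string, targetText: string) {
  const dmp = new DiffMatchPatch();
  const patches = dmp.patch_make(baseText, targetText);
  return { ops: dmp.patch_toText(patches), targetChecksum: sha256(targetText) };
}

// Apply a stored patch and verify the result against the recorded checksum.
export function applyStoredPatch(baseText: string, ops: string, targetChecksum: string): string {
  const dmp = new DiffMatchPatch();
  const [result, applied] = dmp.patch_apply(dmp.patch_fromText(ops), baseText);
  if (!applied.every(Boolean) || sha256(result) !== targetChecksum) {
    throw new Error("patch did not reproduce the expected target text");
  }
  return result;
}
```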
Step 3: Balance snapshots and diffs
Reconstructing version N requires applying a chain of diffs starting from the nearest snapshot before N. If you snapshot every K versions you trade storage for read latency.
- Read heavy workload. Snapshot more often and cache the most accessed versions.
- Write heavy workload. Snapshot less often to favor low write cost and rely on background compaction jobs that roll a long chain into a fresh snapshot.
A practical default is a snapshot every 20 to 50 versions with adaptive tuning based on observed latency.
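A back-of-envelope model makes the trade-off concrete; the latency figures in the example are assumptions, not benchmarks.

```typescript
// With a snapshot every K versions, a uniformly random historical read
// replays (K - 1) / 2 patches on average and K - 1 in the worst case.
function expectedReadLatencyMs(k: number, snapshotFetchMs: number, patchApplyMs: number): number {
  return snapshotFetchMs + ((k - 1) / 2) * patchApplyMs;
}

// Example: at 10 ms per snapshot fetch and 1 ms per patch apply,
// K = 20 gives about 10 + 9.5 = 19.5 ms of expected reconstruction time.
console.log(expectedReadLatencyMs(20, 10, 1)); // 19.5
```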
Step 4: Write path with optimistic concurrency
Clients edit from some base version. On save they send a patch plus the base version id. The server checks that the base matches the current head.
- If base equals head: apply the patch, verify the checksum, create a new Version record that points to the patch, and advance head.
- If base differs from head: that is a conflict. Do not overwrite. Return a conflict response that includes the latest head and enough context for a three-way merge (see the sketch below).
Use a short-lived precondition token tied to the base version to prevent lost updates, and store idempotency keys so a retried save does not create duplicate versions.
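A sketch of that save path, assuming a VersionStore whose appendVersion performs an atomic compare-and-set on head; applyStoredPatch is the checksum-verifying helper from Step 2.

```typescript
// Helper from the Step 2 sketch.
declare function applyStoredPatch(baseText: string, ops: string, targetChecksum: string): string;

interface VersionStore {
  getHead(documentId: string): Promise<{ versionId: string; content: string }>;
  findByIdempotencyKey(key: string): Promise<string | null>;
  // Atomically appends a version iff head is still parentVersionId.
  appendVersion(documentId: string, parentVersionId: string, ops: string,
                targetChecksum: string, idempotencyKey: string): Promise<string>;
}

type SaveResult =
  | { status: "ok"; headVersionId: string }
  | { status: "conflict"; headVersionId: string; headContent: string };

async function save(store: VersionStore, documentId: string, baseVersionId: string,
                    ops: string, targetChecksum: string, idempotencyKey: string): Promise<SaveResult> {
  // A retried save returns the version created by the first attempt.
  const prior = await store.findByIdempotencyKey(idempotencyKey);
  if (prior) return { status: "ok", headVersionId: prior };

  const head = await store.getHead(documentId);
  if (head.versionId !== baseVersionId) {
    // Stale base: hand back the head so the client can run a three-way merge.
    return { status: "conflict", headVersionId: head.versionId, headContent: head.content };
  }

  applyStoredPatch(head.content, ops, targetChecksum); // verify before persisting
  const newHead = await store.appendVersion(documentId, baseVersionId, ops,
                                            targetChecksum, idempotencyKey);
  return { status: "ok", headVersionId: newHead };
}
```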
Step 5: Conflict handling and merging
There are three common strategies with increasing capability and complexity.
- Last write wins. The easiest option, and a risky one: you acknowledge the conflict but keep only one edit. This is fine for single-writer fields such as a title.
- Three-way merge. Use the base, the latest head, and the local edit to compute a merged result. Non-overlapping edits apply cleanly. Overlaps become conflict hunks that the user must resolve. This is the standard for code hosts and wikis.
- Real-time collaboration with transforms or CRDTs. Model edits as operations instead of rigid patches. Operational transform remaps remote operations against local ones to preserve intent. A CRDT assigns stable identifiers to content atoms such as characters and guarantees eventual convergence across replicas, even offline. These models improve collaboration but increase storage and CPU overhead.
A pragmatic approach is a hybrid: use three-way merge as the default, and switch to an operation model for high-concurrency documents like shared meeting notes.
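For intuition, here is a deliberately simplified paragraph-level three-way merge. It assumes the paragraph count is unchanged so paragraphs align by index; real mergers align regions with a diff first.

```typescript
type ConflictHunk = { index: number; base: string; ours: string; theirs: string };
type MergeResult =
  | { status: "clean"; text: string }
  | { status: "conflict"; hunks: ConflictHunk[] };

function threeWayMerge(base: string, ours: string, theirs: string): MergeResult {
  const b = base.split("\n\n");
  const o = ours.split("\n\n");
  const t = theirs.split("\n\n");
  const merged: string[] = [];
  const hunks: ConflictHunk[] = [];
  for (let i = 0; i < b.length; i++) {
    const oursChanged = o[i] !== b[i];
    const theirsChanged = t[i] !== b[i];
    if (oursChanged && theirsChanged && o[i] !== t[i]) {
      // Both sides rewrote the same paragraph differently: a conflict hunk.
      hunks.push({ index: i, base: b[i], ours: o[i], theirs: t[i] });
      merged.push(o[i]); // placeholder until a human resolves the hunk
    } else {
      // At most one side changed it, or both made the identical change.
      merged.push(oursChanged ? o[i] : t[i]);
    }
  }
  return hunks.length > 0
    ? { status: "conflict", hunks }
    : { status: "clean", text: merged.join("\n\n") };
}
```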
Step 6: Read path and caching
To fetch version N, resolve the nearest snapshot S at or before N, then apply patches from S + 1 through N. Cache the materialized result of popular versions such as head to avoid repeated replay. Keep a small patch cache as well, because recent reads often need the same sequence.
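A sketch of that read path, assuming a ReadStore with the two lookups below and reusing the applyStoredPatch helper from Step 2.

```typescript
// Helper from the Step 2 sketch.
declare function applyStoredPatch(baseText: string, ops: string, targetChecksum: string): string;

interface ReadStore {
  nearestSnapshotAtOrBefore(documentId: string, seq: number):
    Promise<{ seq: number; content: string }>;
  patchesBetween(documentId: string, fromSeq: number, toSeq: number):
    Promise<Array<{ ops: string; targetChecksum: string }>>;
}

async function materialize(store: ReadStore, cache: Map<string, string>,
                           documentId: string, seq: number): Promise<string> {
  const key = `${documentId}:${seq}`;
  const hit = cache.get(key);
  if (hit !== undefined) return hit; // popular versions, such as head, stay hot

  const snapshot = await store.nearestSnapshotAtOrBefore(documentId, seq);
  let text = snapshot.content;
  for (const patch of await store.patchesBetween(documentId, snapshot.seq + 1, seq)) {
    text = applyStoredPatch(text, patch.ops, patch.targetChecksum); // verify as you replay
  }
  cache.set(key, text);
  return text;
}
```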
Step 7: Compaction and integrity
Background jobs should:
- Collapse long chains by creating a new snapshot and deleting redundant patches while preserving history (a sketch follows this list).
- Rechunk blocks when document structure changes so future diffs remain small.
- Verify checksums end to end so a corrupt patch is detected before it reaches users.
- Reclaim dead content from abandoned branches.
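The chain-collapse job from the first bullet might look like this; the store methods are assumptions.

```typescript
interface CompactionStore {
  headSeq(documentId: string): Promise<number>;
  lastSnapshotSeq(documentId: string): Promise<number>;
  materialize(documentId: string, seq: number): Promise<string>; // e.g. the Step 6 read path
  writeSnapshot(documentId: string, seq: number, content: string): Promise<void>;
}

// Once the patch chain since the last snapshot exceeds a threshold, write a
// fresh snapshot at head. Patches bypassed by the new snapshot become
// redundant and can be deleted per retention policy.
async function collapseChainIfNeeded(store: CompactionStore, documentId: string,
                                     maxChainLength: number): Promise<void> {
  const head = await store.headSeq(documentId);
  const lastSnapshot = await store.lastSnapshotSeq(documentId);
  if (head - lastSnapshot <= maxChainLength) return;
  const content = await store.materialize(documentId, head);
  await store.writeSnapshot(documentId, head, content);
}
```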
Step 8: Security and access control
Version trails can leak sensitive text through diffs. Protect them.
- Enforce access checks for every version and patch read.
- Allow redaction at the diff level so protected sections do not show raw text.
- Sign or authenticate write operations so auditors can trust authorship.
Step 9: Scale out
Partition by document id so all writes for a document go to a single shard which simplifies conflict checks. Replicate shards for availability. Expose a change feed per document so clients can subscribe to updates and rebase quickly. Emit metrics such as merge success rate, average patch size, reconstruction latency, and percent of reads served from cache. This is the language interviewers expect when you talk about scalable architecture.
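Routing by document id can be as simple as hashing into a fixed shard count; the hash choice below is illustrative, not cryptographic.

```typescript
import { createHash } from "node:crypto";

// Every write for a document lands on the same shard, so the head
// compare-and-set from Step 4 stays a single-shard operation.
function shardFor(documentId: string, numShards: number): number {
  const digest = createHash("md5").update(documentId).digest();
  return digest.readUInt32BE(0) % numShards;
}

// Example: shardFor("doc-123", 64) always routes doc-123 to the same shard.
```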
Real World Example
Think about an internal catalog editor at Amazon where thousands of employees and vendors update product descriptions all day. Each product description is a document. Most edits are small. The service stores a snapshot every few versions and diffs in between. When two editors touch the same paragraph at the same time, the system runs a three-way merge. If the same sentence was changed in two different ways, the editor shows a guided conflict view. Popular products are read more often than they are written, so the service caches materialized head versions near the application layer. This approach mirrors what you would build for a wiki or a subtitle editor at a streaming service like Netflix. The pattern is the same even if the content and scale differ.
Common Pitfalls and Trade-offs
- Long chains increase read latency. Use adaptive snapshot cadence and compaction.
- Diff granularity mismatch. Line diffs on rich text can cause needless conflicts. Pick character or block diffs for fine control.
- Naive last write wins on shared sections destroys user intent. Prefer three-way merge or operational transforms for concurrent paragraphs.
- Binary content is not diff-friendly. Use content-addressable chunks with deduplication and treat them as attachments with their own versioning.
- Unbounded history. Apply retention policies such as keep everything for thirty days, then compact to weekly snapshots.
- Missing integrity checks. Every patch should carry a checksum of the target so the system detects corruption immediately.
- Cache incoherence. Invalidate or update caches on every new head version and tag cached entries with the version id.
Interview Tip
Expect a numbers-based follow-up. For example, you are told that the average document size is fifty kilobytes, the average patch is two kilobytes, and reads are three times writes. What snapshot cadence minimizes expected read latency under a storage cap? You could argue for a snapshot every twenty versions, which bounds reconstruction to nineteen patches, then propose a head cache that serves eighty percent of reads directly. Tie this back to workload shape and show how you would tune the cadence from telemetry.
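A quick back-of-envelope check with those numbers (the eighty percent cache figure is the assumption argued above):

```typescript
// Figures from the prompt above.
const docKB = 50;    // average document size
const patchKB = 2;   // average patch size
const cadence = 20;  // snapshot every 20 versions

const diffStorageKB = docKB + (cadence - 1) * patchKB; // 50 + 38 = 88 KB per window
const fullSnapshotKB = cadence * docKB;                // 20 * 50 = 1000 KB per window
console.log(fullSnapshotKB / diffStorageKB);           // ≈ 11x storage saving

// With an 80% head-cache hit rate, only 1 in 5 reads pays the replay cost,
// which the cadence bounds at 19 patch applications.
```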
Key Takeaways
- Store sparse history by mixing snapshots and diffs with checksums for safety.
- Use optimistic concurrency with a base version precondition to avoid lost updates.
- Default to three-way merge and escalate to transforms or CRDTs for heavy collaboration.
- Manage reconstruction cost with adaptive snapshots and caching.
- Treat security and audit as first-class concerns. Diffs can leak sensitive text if not handled carefully.
Comparison Table
| Approach | Storage Cost | Read Cost | Write Cost | Merge Model | Best For |
|---|---|---|---|---|---|
| Diff-based versioning with snapshots | Low | Medium (depends on chain length) | Low | Three-way merge by default | Docs with frequent small edits and audit needs |
| Full snapshot per version | High | Low (single fetch) | High | No merge or simple overwrite | Small documents or strict immutability |
| Event sourcing with text operations | Low to Medium | Medium to High (due to replay) | Low | Operational transform or CRDT | Live collaboration and offline edits |
| CRDT only model | Medium (per character metadata) | Low (once converged) | Low | Native convergence without central locking | High conflict environments and mobile offline |
FAQs
Q1. What is the difference between forward and reverse diffs?
Forward diffs turn an older version into a newer version. Reverse diffs do the opposite. If you frequently read the latest head, store reverse diffs from head backward so head reads are single fetch. If you frequently read history, forward diffs from old to new can be better. Many systems mix both and rely on snapshots to cap replay cost.
Q2. How do you detect and surface conflicts without locks?
Use optimistic concurrency. When a client saves, it includes the base version id it edited from. If the server head changed since that base, the save is rejected as a conflict. The client fetches the latest head and runs a three-way merge with its local edits. Only overlapping edits require human input.
Q3. When should I choose operational transform or CRDT?
Pick an operation model when you need live multi-user editing with low friction. Transforms are simpler to store when a central server orders operations. CRDTs shine when clients can be offline for long periods because convergence does not require a single order. Both cost more CPU and storage metadata than a basic three-way merge, so use them where the collaboration benefits justify the extra complexity.
Q4. How often should I take snapshots?
Tune cadence from telemetry. If reconstruction time grows beyond a target p95, add more snapshots. Start with one snapshot every few dozen versions, measure read latency, and let a background job adjust the interval per document based on the read/write ratio.
Q5. How do you version large binary files like images or PDFs?
Do not diff raw bytes line by line. Use chunked storage with content addressable hashes and deduplication. Treat each logical asset as a set of chunks. A version then becomes a manifest of chunk identifiers. This keeps storage small and makes uploads resumable.
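A sketch of the manifest idea with simple fixed-size chunks; production systems often prefer content-defined chunking (a rolling hash) so a small insert does not shift every later boundary.

```typescript
import { createHash } from "node:crypto";

// A version of a binary asset becomes a list of chunk hashes; chunks are
// stored once, keyed by hash, so identical chunks deduplicate across versions.
function chunkManifest(data: Buffer, chunkSizeBytes = 4 * 1024 * 1024): string[] {
  const hashes: string[] = [];
  for (let offset = 0; offset < data.length; offset += chunkSizeBytes) {
    const chunk = data.subarray(offset, offset + chunkSizeBytes);
    hashes.push(createHash("sha256").update(chunk).digest("hex"));
  }
  return hashes; // the version record stores this manifest, not the bytes
}
```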
Q6. How do you keep history secure?
Protect read paths with access checks, encrypt at rest, and allow redaction so sensitive sections are not exposed through diffs. Keep an audit log that records who changed what and why, and sign important actions so auditors can prove authenticity.
Further Learning
Level up your foundation with Grokking System Design Fundamentals and practice the core storage and merge patterns with guided exercises. If you want to push deeper into scale and collaboration patterns, enroll in Grokking Scalable Systems for Interviews to drill trade-offs with realistic scenarios and interview checklists.