How would you build e‑discovery & legal hold for enterprise data?

Ediscovery and legal hold sound like pure policy, yet they are a deep systems problem. You are asked to preserve, search, and export enterprise data across mail, chat, docs, code, tickets, storage, and SaaS apps while proving integrity to auditors. Done well, the platform becomes a reliable safety net for litigation and regulation without crushing developer velocity. This guide shows how to design an ediscovery and legal hold system that scales, stays defensible, and holds up in a real system design interview.

Why It Matters

Modern enterprises generate petabytes across distributed systems. When a matter starts, you must freeze relevant content, prevent deletion, and enable fast search and production. The design touches scalable architecture, privacy controls, data residency, and cost. Interviewers use this topic to test how you model policy in code, apply immutability, and balance in-place preservation with archive strategies. A good design protects the company, the customer, and the engineer who operates it.

How It Works Step by Step

Scope the matter and custodians
- Intake a matter with description, jurisdiction, and time window.
- Map custodians to identities across directories and SaaS apps.
- Resolve identity graph for aliases, group membership, and role changes over time.
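The identity resolution step can be sketched as a lookup over time-bounded account links. This is a minimal illustration, not a full identity graph: the `IdentityLink` record, its field names, and the sample accounts are all hypothetical, and group membership is left out.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class IdentityLink:
    custodian_id: str                 # canonical person ID (hypothetical)
    source: str                       # e.g. "mail", "chat"
    account: str                      # alias or account name in that source
    valid_from: date
    valid_to: Optional[date] = None   # None means the link is still active


def accounts_in_window(links, custodian_id, start, end):
    """Return (source, account) pairs active for the custodian at any point in [start, end]."""
    hits = set()
    for link in links:
        if link.custodian_id != custodian_id:
            continue
        link_end = link.valid_to or date.max
        # Two closed intervals overlap when each starts before the other ends.
        if link.valid_from <= end and link_end >= start:
            hits.add((link.source, link.account))
    return sorted(hits)


links = [
    IdentityLink("c-1", "mail", "a.lee@example.com", date(2020, 1, 1), date(2023, 6, 30)),
    IdentityLink("c-1", "mail", "alex.lee@example.com", date(2023, 7, 1)),
    IdentityLink("c-1", "chat", "alee", date(2021, 3, 1)),
    IdentityLink("c-2", "mail", "b.kim@example.com", date(2019, 1, 1)),
]
early_2023 = accounts_in_window(links, "c-1", date(2023, 1, 1), date(2023, 5, 31))
```

Keeping validity dates on each link means a matter window that predates a rename still resolves to the old alias.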
Connect to data sources
- Build connectors for email, chat, docs, wiki, tickets, source control, object storage, and databases.
- Prefer event feeds, journaling, and change data capture to avoid brittle full scans.
- Normalize metadata such as sender, participants, timestamps, tenant, region, and data classification.
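To make the normalization step concrete, here is a small sketch of a connector mapping one source event onto a shared record. The event payload shape (`id`, `from`, `members`, `ts_epoch`, `label`) is an assumption made up for illustration and does not correspond to any particular chat product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class NormalizedItem:
    source: str
    item_id: str
    sender: str
    participants: tuple
    timestamp: datetime      # always stored in UTC
    tenant: str
    region: str
    classification: str


def normalize_chat_event(event: dict, tenant: str, region: str) -> NormalizedItem:
    """Map a hypothetical chat event payload onto the shared schema."""
    ts = datetime.fromtimestamp(event["ts_epoch"], tz=timezone.utc)
    members = tuple(sorted({m.lower() for m in event["members"]}))
    return NormalizedItem(
        source="chat",
        item_id=event["id"],
        sender=event["from"].lower(),
        participants=members,
        timestamp=ts,
        tenant=tenant,
        region=region,
        classification=event.get("label", "unclassified"),
    )


item = normalize_chat_event(
    {"id": "m-42", "from": "ALee", "members": ["ALee", "bkim"], "ts_epoch": 1700000000},
    tenant="acme",
    region="eu",
)
```

Each connector would carry its own mapping function, while everything downstream (hashing, indexing, export) only sees `NormalizedItem`.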
Choose preservation mode
- In-place hold uses storage features such as object lock and retention policies to prevent deletion in the primary system.
- Copy to archive writes an immutable copy into a WORM bucket with versioning and compliance mode.
- Many platforms do both as a belt-and-suspenders measure in high-risk matters.
Immutability and chain of custody
- Enforce WORM semantics in storage with compliance mode and retention dates.
- Compute content hashes such as SHA-256 and store them alongside metadata.
- Record every transition in an append-only audit log with actor, action, timestamp, and reason codes.
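A minimal sketch of the hashing and audit-log ideas follows. Chaining each entry to the previous entry's hash is one possible way to make an append-only log tamper-evident; it is an assumption added here, not something the steps above prescribe, and the `AuditLog` class and its field names are hypothetical.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class AuditLog:
    """In-memory append-only log where each entry commits to the previous entry's hash."""

    GENESIS = "0" * 64

    def __init__(self):
        self._entries = []

    def append(self, actor, action, timestamp, reason, content_hash):
        prev = self._entries[-1]["entry_hash"] if self._entries else self.GENESIS
        body = {"actor": actor, "action": action, "timestamp": timestamp,
                "reason": reason, "content_hash": content_hash, "prev_hash": prev}
        entry_hash = sha256_hex(json.dumps(body, sort_keys=True).encode("utf-8"))
        self._entries.append({**body, "entry_hash": entry_hash})
        return entry_hash

    def verify(self) -> bool:
        """Recompute every entry hash and check that the chain links are intact."""
        prev = self.GENESIS
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "entry_hash"}
            expected = sha256_hex(json.dumps(body, sort_keys=True).encode("utf-8"))
            if entry["prev_hash"] != prev or entry["entry_hash"] != expected:
                return False
            prev = entry["entry_hash"]
        return True


log = AuditLog()
doc_hash = sha256_hex(b"quarterly forecast v3")
log.append("svc-hold", "HOLD_PLACED", "2024-01-08T09:00:00Z", "matter-17", doc_hash)
log.append("rev-ana", "VIEWED", "2024-01-09T14:30:00Z", "review", doc_hash)
chain_ok = log.verify()
```

In a real deployment the log itself would live in WORM storage; the hash chain only makes tampering detectable, it does not prevent it.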
Index for search and review
- Create a metadata index for filters such as custodian, date, and source.
- Create a full text index with language detection, OCR for images and PDFs, and audio transcription when needed.
- Support proximity, fuzzy, and fielded queries with access checks applied at query time.
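As a toy illustration of fielded filtering with access checks applied at query time, the sketch below uses an in-memory inverted index. The `TinyIndex` class is hypothetical, supports only AND queries over whole terms, and does not attempt proximity or fuzzy matching, which a real search engine would provide.

```python
from collections import defaultdict


class TinyIndex:
    def __init__(self):
        self._docs = {}                        # doc id -> metadata
        self._postings = defaultdict(set)      # term -> doc ids

    def add(self, doc_id, text, custodian, region):
        self._docs[doc_id] = {"custodian": custodian, "region": region}
        for term in text.lower().split():
            self._postings[term].add(doc_id)

    def search(self, terms, allowed_regions, custodian=None):
        """AND query over terms; the caller's allowed regions are checked on every query."""
        term_sets = [self._postings.get(t.lower(), set()) for t in terms]
        candidates = set.intersection(*term_sets) if term_sets else set(self._docs)
        results = []
        for doc_id in sorted(candidates):
            meta = self._docs[doc_id]
            if meta["region"] not in allowed_regions:
                continue                       # access check, not a ranking filter
            if custodian is not None and meta["custodian"] != custodian:
                continue                       # fielded metadata filter
            results.append(doc_id)
        return results


idx = TinyIndex()
idx.add("d1", "Merger timeline draft", custodian="c-1", region="eu")
idx.add("d2", "merger budget", custodian="c-2", region="us")
idx.add("d3", "lunch menu", custodian="c-1", region="eu")
eu_only = idx.search(["merger"], allowed_regions={"eu"})
```

Because the region check runs inside `search`, a reviewer whose permissions change sees the new scope on the next query without any reindexing.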
Policy and overrides
- Retention policies continue to expire non-held data, while a hold overrides deletion for items in scope.
- Data minimization rules exclude private or out-of-scope fields and apply redaction when exporting.
- Encryption keys respect organization policy, including customer managed keys where required.
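The interaction between retention and holds can be sketched as a single deletion check. This is an illustrative simplification: the `Hold` and `Item` types are hypothetical, and hold scope is reduced to custodian plus a creation-date window.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Hold:
    matter_id: str
    custodians: frozenset
    start: date
    end: date


@dataclass(frozen=True)
class Item:
    item_id: str
    custodian: str
    created: date


def may_delete(item, today, retention_days, active_holds):
    """An item may be deleted only if retention has expired and no active hold covers it."""
    for hold in active_holds:
        if item.custodian in hold.custodians and hold.start <= item.created <= hold.end:
            return False                      # the hold wins over retention expiry
    return today >= item.created + timedelta(days=retention_days)


hold = Hold("m-1", frozenset({"c-1"}), date(2023, 1, 1), date(2023, 6, 30))
held_item = Item("i-1", "c-1", date(2023, 3, 1))
free_item = Item("i-2", "c-2", date(2023, 3, 1))
today = date(2024, 6, 1)
held_deletable = may_delete(held_item, today, 365, [hold])
free_deletable = may_delete(free_item, today, 365, [hold])
```

Passing an empty list of holds models a closed matter, at which point the same item falls back to ordinary retention.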
Review workflow
- Provide de-duplication and near-duplicate detection.
- Thread emails, group chat conversations by time windows, and stitch document versions using version IDs.
- Allow reviewers to tag responsive, non-responsive, and privileged with inter-reviewer consistency checks.
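De-duplication and near-duplicate detection can be illustrated with content hashes for exact matches and Jaccard similarity over word shingles for near matches. This is a sketch under simplifying assumptions: the pairwise comparison is quadratic, the 0.5 threshold is arbitrary, and a platform at scale would more likely use something like MinHash with locality-sensitive hashing.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles for a text (the whole text if it is shorter than k)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_duplicates(docs, threshold=0.5):
    """docs maps id -> text. Returns (exact duplicate groups, near-duplicate pairs)."""
    by_hash = {}
    for doc_id, text in docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        by_hash.setdefault(digest, []).append(doc_id)
    exact = [sorted(ids) for ids in by_hash.values() if len(ids) > 1]
    # Compare one representative per exact group to avoid redundant pairs.
    reps = sorted(min(ids) for ids in by_hash.values())
    near = []
    for i, a in enumerate(reps):
        for b in reps[i + 1:]:
            if jaccard(shingles(docs[a]), shingles(docs[b])) >= threshold:
                near.append((a, b))
    return exact, near


docs = {
    "a": "please review the attached contract draft today",
    "b": "please review the attached contract draft today",
    "c": "please review the attached contract draft tomorrow",
    "d": "lunch at noon on friday works for me",
}
exact_groups, near_pairs = group_duplicates(docs)
```

Here `a` and `b` are byte-identical, while `a` and `c` share four of six distinct shingles and land above the threshold.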
Export and production
- Export to common formats with Bates numbering and load files for review tools.
- Preserve native encodings and include complete metadata sidecars.
- Package a manifest that lists hashes, custody chain, and storage proofs.
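The export manifest can be sketched as a list of per-item hashes plus a hash over the manifest itself. The field names, the `ACME` prefix, and the seven-digit width are assumptions for illustration, and the sketch assigns one Bates number per item, whereas real productions usually number individual pages.

```python
import hashlib
import json


def bates_number(prefix, sequence, width=7):
    """Format a sequential identifier such as ACME0000001."""
    return f"{prefix}{sequence:0{width}d}"


def build_manifest(matter_id, items, prefix="ACME", start=1):
    """items is a list of (item_id, content_bytes). Returns a manifest dict with per-item hashes."""
    entries = []
    for offset, (item_id, content) in enumerate(items):
        entries.append({
            "bates": bates_number(prefix, start + offset),
            "item_id": item_id,
            "sha256": hashlib.sha256(content).hexdigest(),
            "size_bytes": len(content),
        })
    # Hash the canonical JSON of the entries so the manifest can be checked as a whole.
    payload = json.dumps(entries, sort_keys=True).encode("utf-8")
    return {
        "matter_id": matter_id,
        "entries": entries,
        "manifest_sha256": hashlib.sha256(payload).hexdigest(),
    }


manifest = build_manifest("matter-17", [("d1", b"first document"), ("d2", b"second document")])
```

A recipient can recompute each `sha256` from the produced files and compare it with the manifest to confirm nothing changed in transit.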
Release and deletion
- When a matter closes, lift holds and resume normal retention.
- Track delayed deletions for objects sitting in extended retention tiers to maintain compliance.
Scale and reliability
- Use a streaming pipeline to ingest events at high volume with backpressure and retries.
- Partition indices by tenant and time.
- Run replica shards and cross region replication for resilience and locality.
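Partitioning by tenant and time can be sketched as a naming scheme for index partitions. Monthly buckets, the CRC32-based shard choice, and the name format are all assumptions for illustration; timestamps are expected to be timezone-aware.

```python
import zlib
from datetime import datetime, timezone


def index_partition(tenant, timestamp, item_id, shards_per_month=4):
    """Pick the partition for one item: tenant, UTC month, then a stable hash-based shard."""
    month = timestamp.astimezone(timezone.utc).strftime("%Y-%m")
    shard = zlib.crc32(item_id.encode("utf-8")) % shards_per_month
    return f"{tenant}-{month}-s{shard}"


def partitions_for_window(tenant, start, end, shards_per_month=4):
    """List every partition a query over the months from start to end must touch."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        for shard in range(shards_per_month):
            names.append(f"{tenant}-{year:04d}-{month:02d}-s{shard}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


window = partitions_for_window(
    "acme",
    datetime(2023, 11, 1, tzinfo=timezone.utc),
    datetime(2024, 1, 15, tzinfo=timezone.utc),
    shards_per_month=2,
)
target = index_partition(
    "acme", datetime(2023, 11, 14, 22, 0, tzinfo=timezone.utc), "m-42", shards_per_month=2
)
```

A matter with a six-month window then fans out to a bounded set of partitions for one tenant instead of scanning the whole index.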

Security and privacy
- Enforce role-based access control for legal, compliance, and admin teams.
- Add attribute-based constraints such as geography, business unit, or classification.
- Mask sensitive fields and log every access for later audit.
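The combination of role checks, attribute constraints, and access logging can be sketched as one authorization function. The role names, permission sets, and numeric classification levels below are hypothetical and chosen only to show the shape of the check.

```python
from dataclasses import dataclass

# Hypothetical role-to-permission mapping.
ROLE_PERMISSIONS = {
    "legal_reviewer": {"search", "view", "tag"},
    "compliance_admin": {"search", "view", "place_hold", "release_hold", "export"},
}


@dataclass(frozen=True)
class Principal:
    user: str
    role: str
    regions: frozenset
    max_classification: int


@dataclass(frozen=True)
class Resource:
    item_id: str
    region: str
    classification: int


def authorize(principal, action, resource, access_log):
    """Allow only if the role grants the action AND the attributes match; log every decision."""
    role_ok = action in ROLE_PERMISSIONS.get(principal.role, set())
    attrs_ok = (resource.region in principal.regions
                and resource.classification <= principal.max_classification)
    allowed = role_ok and attrs_ok
    access_log.append((principal.user, action, resource.item_id, allowed))
    return allowed


access_log = []
reviewer = Principal("rev1", "legal_reviewer", frozenset({"eu"}), max_classification=2)
eu_doc = Resource("d1", "eu", classification=1)
us_doc = Resource("d2", "us", classification=1)
can_view_eu = authorize(reviewer, "view", eu_doc, access_log)
can_export_eu = authorize(reviewer, "export", eu_doc, access_log)
can_view_us = authorize(reviewer, "view", us_doc, access_log)
```

Denied attempts are logged alongside granted ones, so the audit trail also shows what a reviewer tried and failed to reach.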

Real World Example

Consider a global social platform similar to Instagram. Data spans mail, chat, code, images, and logs across three regions. A matter opens on Monday for thirty custodians and a six-month window. The platform issues in-place holds on mail and chat, and copies image and document versions into an immutable archive with compliance mode for one year. A streaming job journals new messages within seconds, computes hashes, and updates the full text index. Reviewers filter by custodian and date, run phrase queries, and export responsive items with Bates numbers. The audit trail shows every hold placement, search, view, and export. When the case closes, the system releases holds, resumes standard retention, and documents deletions with proofs.

Common Pitfalls or Trade-offs

Over-collection versus targeted scope: Collecting entire mailboxes and drives raises cost and privacy risk. Use matter filters and custodian mapping to minimize scope.
Index lag and missed items: If connectors only crawl nightly, you can lose critical messages. Prefer journaling or CDC with retries and dead letter queues.
Weak immutability: Application flags are not enough. Use storage-level WORM with compliance mode and prevent rogue changes by requiring multi-party approvals.
Residency and cross-border movement: Copying data to a single region can violate local law. Keep preservation and indexes in region and aggregate only metadata across regions.
Permission drift: Reviewer access that bypasses tenant or region boundaries creates exposure. Apply access checks at query and export time with defense in depth.
Cold storage rehydration time: Archiving everything too early slows review. Keep recent items in warm tiers and migrate to cold after the peak review window.

Interview Tip

Expect a follow-up like this: how would you prove that an object produced from archive is identical to the original in primary storage? Mention content hashes, custody chain, storage proof from object lock, and a reproducible export that replays the integrity checks. Bonus points for describing how you would detect gaps using sequence numbers and audit reconciliation.
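Both parts of that answer can be sketched in a few lines: re-hash the exported bytes against the hash recorded at preservation time, and scan per-source sequence numbers for holes. The function names are hypothetical, and the sketch assumes each source assigns consecutive integer sequence numbers to its events.

```python
import hashlib


def matches_recorded_hash(exported_bytes, recorded_sha256):
    """Check that exported content hashes to the value recorded when it was preserved."""
    return hashlib.sha256(exported_bytes).hexdigest() == recorded_sha256


def find_sequence_gaps(sequence_numbers):
    """Return (first_missing, last_missing) ranges within the observed sequence numbers."""
    gaps = []
    ordered = sorted(set(sequence_numbers))
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps


recorded = hashlib.sha256(b"original message body").hexdigest()
intact = matches_recorded_hash(b"original message body", recorded)
altered = matches_recorded_hash(b"original message body!", recorded)
gaps = find_sequence_gaps([1, 2, 3, 6, 7, 10])
```

A non-empty gap list is a signal to replay the affected range from the source journal before certifying the collection as complete.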

Key Takeaways

Ediscovery and legal hold translate policy into code using immutability, indexing, and precise access control.
In-place holds reduce data movement while archive copies add a safety net for deletion bugs.
Chain of custody and hashing make the platform defensible in court and audit.
Streaming ingestion and partitioned indexes keep the system fast at enterprise scale.
Privacy, residency, and cost tuning are first class design constraints.

Table of Comparison

Approach	Alternative	When to Choose	Scale and Cost	Key Risks
In-place hold in primary system	Copy to immutable archive	Prefer when primary storage supports object lock and retention	Lower storage cost and less duplication	Vendor gaps can weaken immutability
Immutable storage with object lock (compliance mode)	Application-level lock flags	Use for high-risk matters and regulated data	Higher cost and stricter operations	Operational friction and longer release workflows
Centralized global index	Federated per-region index with brokered search	Choose centralized when residency is simple and latency budgets allow	Cheaper to operate at small scale	Cross-border data movement and single blast radius
Snapshot-based preservation	Continuous journaling and CDC	Snapshots fit small data and low change rates	Simple but can miss quick edits	Gap risk between snapshots
Full-text index with OCR and transcription	Metadata-only index	Use full text for complex matters and chat-heavy orgs	Higher compute and storage	Privacy exposure if access checks are weak
Role-based access control (RBAC)	Attribute-based control (ABAC)	RBAC is simpler for small orgs	Lower policy complexity	Coarse controls and privilege bloat

FAQs

Q1. What is a legal hold in ediscovery?

A legal hold instructs systems to preserve specific data for a matter. It overrides normal deletion or retention until the matter closes and the hold is released.

Q2. How is legal hold different from retention policy?

Retention policy defines how long data is kept by default. Legal hold is a targeted exception that freezes specific content regardless of normal expiry.

Q3. What makes preservation defensible in court?

Immutability with WORM storage, complete custody chain, reproducible hashing, and audited workflows that show who accessed or changed what and when.

Q4. Do I need both an in-place hold and an archive copy?

Many enterprises use both. An in-place hold reduces data movement. An archive copy adds resilience against bugs, misconfigurations, or admin mistakes.

Q5. How do you handle data residency requirements?

Keep preservation and search indexes in region and synchronize only minimal metadata such as hash and document ID to a global control plane.

Q6. How do you scale search across billions of items?

Partition by tenant and time, use tiered indexes, apply query-time access checks, and cache frequent filters. Stream new events to keep the index fresh.

Further Learning

Build confidence for interview day with the practical patterns in Grokking the System Design Interview, where you can map policy to architecture step by step. If you want deeper scale strategies for ingestion, indexing, and storage tiers, enroll in Grokking Scalable Systems for Interviews and practice trade-off analysis on real workloads.