How do you tune compaction in an LSM tree storage engine?
Compaction in log-structured storage is the quiet background worker that keeps your LSM tree fast and lean. Data is first written to memory, flushed into sorted files on disk, then periodically merged so that reads do not scan a pile of overlapping files. Tuning compaction is the art of deciding when to merge, how much to merge, and which files to merge so that you hit your target service level while staying within CPU, memory, and IO budgets.
Why It Matters
Compaction directly controls three user-visible outcomes in any distributed system that relies on an LSM tree.
- Read latency and tail behavior, because overlapping files create extra lookups and cache misses.
- Write throughput and durability, because aggressive merges can steal IO from foreground appends or cause write stalls.
- Cost, because unnecessary rewriting increases write amplification, which burns SSD lifespan and compute time.

For a system design interview, the strongest answers show you can reason about these trade-offs and choose a compaction strategy that matches workload shape, retention policy, and hardware.
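To put a number on the cost bullet, a crude back-of-the-envelope model of write amplification is often enough to compare strategies. The sketch below is purely illustrative; the level count, size multiplier, and daily ingest volume are assumptions, not measurements from any particular engine.

```python
# Crude write-amplification models, useful only for order-of-magnitude comparisons.

def leveled_write_amp(levels: int, multiplier: int) -> int:
    # Rough model: each byte is eventually pushed through every level, and each
    # push merges it with roughly `multiplier` bytes already resident there.
    return 1 + levels * multiplier

def size_tiered_write_amp(tiers: int) -> int:
    # Rough model: each byte is rewritten about once per tier it passes through.
    return 1 + tiers

ingest_tb_per_day = 1.0  # assumed application write volume
print("leveled, disk TB/day:", ingest_tb_per_day * leveled_write_amp(levels=5, multiplier=10))
print("size-tiered, disk TB/day:", ingest_tb_per_day * size_tiered_write_amp(tiers=5))
# Roughly 51 TB versus 6 TB of physical writes for the same 1 TB of logical writes,
# which is why strategy choice shows up directly in SSD wear and compute cost.
```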
How It Works, Step by Step
- Incoming writes go to a memtable and a write-ahead log for durability.
- When the memtable fills, the engine flushes it to disk as an immutable sorted file, often called an SSTable. These fresh files usually land in level zero, where they overlap in key space.
- A background scheduler picks compaction jobs. Each job reads a set of sorted files, merges keys, discards tombstoned or expired rows, and writes new files to deeper levels with less overlap (a minimal merge sketch follows this list).
- The engine limits read-write interference using compaction rate limits, job concurrency caps, and soft or hard stall thresholds when level zero grows too large.
- The scheduler tracks compaction debt, an estimate of the bytes that still need merging to keep the tree healthy. Your goal is to keep debt bounded while meeting tail latency goals.
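To make the merge step concrete, here is a minimal sketch of one compaction job. It assumes each input SSTable is a list of `(key, seq, write_ts, value)` tuples sorted by key, with `None` standing in for a tombstone; real engines stream sorted blocks from disk instead of holding rows in memory, and only drop a tombstone once no older version can survive in deeper levels.

```python
import heapq
import time

TOMBSTONE = None  # assumed marker for a deleted key

def compact(sstables, ttl_seconds=None, now=None):
    """Merge overlapping sorted runs into one non-overlapping output run.

    Each input is a list of (key, seq, write_ts, value) tuples sorted by key,
    with each key appearing at most once per input. Higher seq wins.
    """
    now = now if now is not None else time.time()
    # Order by key, newest sequence number first, across all inputs.
    merged = heapq.merge(*sstables, key=lambda row: (row[0], -row[1]))

    output, last_key = [], object()
    for key, seq, write_ts, value in merged:
        if key == last_key:
            continue                      # older version shadowed by a newer write
        last_key = key
        if value is TOMBSTONE:
            continue                      # delete marker: reclaim the space
        if ttl_seconds is not None and now - write_ts > ttl_seconds:
            continue                      # expired row: a compaction filter drops it here
        output.append((key, seq, write_ts, value))
    return output

# Two overlapping level-zero files collapse into one clean run.
older = [("a", 1, time.time(), "v1"), ("b", 2, time.time(), "v1")]
newer = [("a", 3, time.time(), "v2"), ("b", 4, time.time(), TOMBSTONE)]
print(compact([older, newer]))            # [("a", 3, ..., "v2")]
```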
Key tuning knobs you will use (a configuration sketch after this list pulls several of them together):
- Choose a strategy that fits the workload.
  - Size-tiered compaction merges a handful of similar-sized files at the same level. This favors fast writes and low compaction pressure at the cost of higher read amplification and more overlap.
  - Leveled compaction keeps mostly non-overlapping key ranges per level, so each key appears at most once per level. This improves point reads and range scans but increases the number of bytes rewritten.
  - Universal compaction picks small sets of files regardless of level and is often used for SSD-only deployments with smaller datasets or frequent updates.
  - Time-window compaction groups files by time window, which is great for time series data and TTL-heavy workloads because old windows can be merged and dropped with little overlap with fresh data.
- Shape the level sizes.
  - Pick a base level size and a level size multiplier. Larger multipliers reduce the total number of levels, which can help writes, while smaller multipliers reduce read amplification.
  - Let the engine adapt the base level size to the dataset. Dynamic level sizing helps when data volume grows quickly.
- Control file size and file count.
  - The target file size should align with your IO block size and cache. Larger files cut down on metadata and open-file overhead but make each compaction job heavier.
  - Set level zero file thresholds. A soft limit triggers more aggressive scheduling, and a hard limit stalls writes to protect read latency.
  - Enable subcompactions when available so a large job is split into independent ranges that run in parallel.
- Bound background pressure.
  - Use separate thread pools for flush and compaction.
  - Cap the number of concurrent jobs per level so compactions do not thrash the same hot ranges.
  - Rate limit compaction IO to preserve headroom for foreground reads and writes.
- Manage tombstones and TTLs.
  - Turn on compaction filters to drop expired data early.
  - If range tombstones are common, schedule priority passes that compact overlapping ranges quickly to reclaim space and reduce read amplification from shadowed keys.
- Compression and caching.
  - Keep compression light or disabled on level zero and the first disk level to minimize write latency.
  - Use stronger compression on deeper levels, where files change less often.
  - Size your block cache and Bloom filters to take full advantage of the reduced overlap after compaction.
- Separate hot and cold data.
  - Use column families or table-level strategies so hot, write-heavy tables can stay on size-tiered or time-window compaction while read-critical tables use leveled.
  - For cold historical data, widen time windows and increase file size to reduce compaction frequency.
- Monitor the right signals.
  - Level zero file count, pending compaction bytes, and estimated compaction debt.
  - Foreground write stall time and the ratio of bytes read to bytes written by compaction.
  - P99 and P999 read latency during compaction, cache hit rate, and Bloom filter false positive rate.
  - SSD wear metrics if you manage your own hardware.
- Tune by experiment, not guesswork.
  - Start from a strategy that matches the workload, then change one knob at a time and measure.
  - Benchmark with production-like key distributions, value sizes, and TTL patterns.
  - Track both average and tail behavior, since compaction effects often show up as outliers.
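These knobs tend to travel together, so it helps to look at them as one profile. The sketch below is a hypothetical configuration object; the field names and default values are assumptions chosen to mirror the list above, not the options of any particular engine.

```python
from dataclasses import dataclass, field

@dataclass
class CompactionProfile:
    """Hypothetical tuning profile mirroring the knobs discussed above."""
    strategy: str = "leveled"              # or "size_tiered", "universal", "time_window"
    base_level_bytes: int = 256 * 2**20    # size target for the first disk level
    level_size_multiplier: int = 10        # each deeper level is roughly 10x larger
    target_file_bytes: int = 64 * 2**20    # per-SSTable size target
    l0_soft_file_limit: int = 8            # schedule more aggressively past this
    l0_hard_file_limit: int = 20           # stall writes to protect read latency
    flush_threads: int = 2                 # flushes get their own pool
    compaction_threads: int = 4
    max_jobs_per_level: int = 2            # avoid thrashing the same hot ranges
    compaction_rate_mb_per_s: int = 200    # leave IO headroom for the foreground
    subcompactions: int = 4                # split large jobs into parallel ranges
    compression_by_level: dict = field(default_factory=lambda: {
        0: "none", 1: "none", 2: "lz4", 3: "zstd", 4: "zstd",
    })

# A write-heavy, TTL-driven timeline table might instead start from:
timeline_profile = CompactionProfile(
    strategy="time_window",
    target_file_bytes=128 * 2**20,
    l0_soft_file_limit=12,
    compression_by_level={0: "none", 1: "lz4", 2: "zstd"},
)
```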
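Because the health signals are what drive the scheduler, here is a second small sketch showing how a debt estimate and the level zero thresholds could translate into actions. The debt formula, thresholds, and action names are illustrative assumptions.

```python
def compaction_debt_bytes(level_sizes, base_level_bytes, multiplier):
    """Estimate bytes that must merge before every level is back under its
    target size. Deliberately simplified; real engines use richer models."""
    debt, target = 0, base_level_bytes
    for size in level_sizes[1:]:           # level zero is governed by file count instead
        debt += max(0, size - target)
        target *= multiplier
    return debt

def scheduler_action(l0_files, debt_bytes, soft_l0=8, hard_l0=20,
                     debt_budget=64 * 2**30):
    """Map health signals to a coarse action, mimicking soft and hard thresholds."""
    if l0_files >= hard_l0:
        return "stall_writes"              # last resort: protect read latency
    if l0_files >= soft_l0 or debt_bytes > debt_budget:
        return "boost_compaction"          # raise concurrency, relax the rate limit
    return "steady_state"                  # stay rate-limited in the background

# 30 GB sitting in level 1 against a 10 GB target is about 20 GB of debt.
debt = compaction_debt_bytes([0, 30 * 2**30, 80 * 2**30],
                             base_level_bytes=10 * 2**30, multiplier=10)
print(debt // 2**30, scheduler_action(l0_files=12, debt_bytes=debt))  # 20 boost_compaction
```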
Real World Example
Imagine a feed service like Instagram that stores timeline events in an LSM-backed store. The workload is append-heavy with periodic fan-out reads over recent time ranges. During a major product launch, the service sees a spike of small writes with a short TTL, since many events expire in days.
- Step one: Choose time-window compaction for the timeline table (a window-planning sketch follows this list). Files are grouped by hourly windows for the recent horizon and daily windows for older data. Expired data can be dropped by window with minimal overlap.
- Step two: Increase the target file size and enable subcompactions so that a single large merge across a busy window is split into parallel ranges.
- Step three: Keep compression light on the newest windows and strong on older levels.
- Step four: Raise the soft threshold for level zero files during the launch window so the scheduler keeps up without hard stalls.

The result is predictable tail latency on recent reads, bounded write amplification, and quick space reclamation when TTLs expire.
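Here is a minimal sketch of the window bookkeeping behind step one. It assumes each SSTable records the minimum and maximum event timestamps of its rows; the horizon, TTL, and helper names are illustrative choices, not fixed parameters of any engine.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

HOT_HORIZON = timedelta(days=2)   # assumed: recent data gets hourly windows
TTL = timedelta(days=7)           # assumed: timeline events expire after a week

def window_for(max_ts: datetime, now: datetime) -> datetime:
    """Bucket a file by its newest row: hourly while hot, daily afterwards."""
    if now - max_ts <= HOT_HORIZON:
        return max_ts.replace(minute=0, second=0, microsecond=0)
    return max_ts.replace(hour=0, minute=0, second=0, microsecond=0)

def plan_compaction(files, now):
    """Group files by time window; whole windows past the TTL are simply deleted."""
    windows, expired = defaultdict(list), []
    for name, min_ts, max_ts in files:
        if now - max_ts > TTL:
            expired.append(name)          # dropped without rewriting a single byte
        else:
            windows[window_for(max_ts, now)].append(name)
    # Files are only ever merged with others in the same window, so fresh data
    # is never rewritten alongside old data.
    return dict(windows), expired

now = datetime(2024, 5, 10, 12, 0, tzinfo=timezone.utc)
files = [
    ("sst_001", now - timedelta(hours=3), now - timedelta(hours=1)),   # hot, hourly window
    ("sst_002", now - timedelta(days=3), now - timedelta(days=3)),     # warm, daily window
    ("sst_003", now - timedelta(days=9), now - timedelta(days=8)),     # past TTL, delete
]
print(plan_compaction(files, now))
```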
Common Pitfalls or Trade-offs
- One-size compaction across all tables: Different tables have different access patterns. Mixing hot transactional data and cold archival data under the same strategy usually hurts both.
- Over-aggressive leveled compaction on cheap SSDs: You may end up rewriting the same keys many times, which increases wear and can push you into write stalls.
- Large files with small memory budgets: If file size grows but block cache and filter memory do not, read amplification rises because lookups increasingly hit blocks that are not cached.
- Ignoring level zero: Many incidents start when level zero grows unchecked. Protect it with clear soft and hard thresholds and with burst capacity in the scheduler.
- Compaction at the wrong time: Nightly jobs can look safe until a backfill or a reindex arrives. Spread compaction across the day and rely on rate limiting rather than single big jobs.
Interview Tip
Frame your answer around the workload first: say whether it is read-heavy or write-heavy, point lookups or range scans, what the TTL and update patterns look like, and how the data grows. Pick a matching strategy and call out two tuning moves with expected outcomes. For example, for write-heavy time series, propose time-window compaction with hourly windows for the hot horizon, compaction filters for TTL, subcompactions enabled, and strong compression on cold levels. Close with the metric you would watch, such as level zero file count or compaction debt.
Key Takeaways
- Compaction is a continuous merge process that shapes read latency, write throughput, and cost.
- Strategy choice is workload driven: size-tiered favors writes, leveled favors reads, and time-window fits time series and TTL-heavy data.
- Keep level zero healthy with thresholds, parallelism, and rate limits.
- Use compaction filters and time windows to reclaim space early and reduce read amplification.
- Tune by measurement. Watch tail latency, compaction debt, and the ratio of compaction bytes written to foreground bytes written.
Table of Comparison
Below is a compact view you can lift into your interview notes.
| Compaction strategy | Best workload | Read amp | Write amp | Space amp | Operational notes |
|---|---|---|---|---|---|
| Size tiered | High write throughput with moderate read needs | Higher | Lower | Moderate | Simple scheduling with more overlap, good for hot ingest |
| Leveled | Low latency point reads and narrow range scans | Lower | Higher | Lower | Keys mostly unique per level, excellent for indexes |
| Universal | Small to medium datasets on SSD with frequent updates | Moderate | Moderate | Lower | Flexible file picking, useful when update rate is high |
| Time window | Time series and TTL heavy data | Lower for recent windows | Lower to moderate | Lower | Natural expiry per window, minimal overlap across time |
FAQs
Q1. What is compaction in an LSM tree storage engine?
Compaction is a background merge that rewrites overlapping sorted files into fewer non overlapping files, discards tombstones and expired rows, and restores read efficiency.
Q2. How do I choose between size-tiered and leveled compaction?
Pick size-tiered for write-centric ingest where reads can tolerate more overlap. Pick leveled when low read latency and predictable tail behavior matter most.
Q3. What causes write stalls during compaction and how do I avoid them?
Stalls appear when level zero grows past a hard threshold or when compaction jobs consume all IO. Raise concurrency carefully, set soft and hard thresholds, and apply rate limits.
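As one illustration of the rate limit part of that answer, here is a tiny token-bucket sketch that a compaction thread could call before each write burst. The class, numbers, and method names are hypothetical, not any engine's built-in limiter.

```python
import time

class IoRateLimiter:
    """Token bucket capping compaction write bandwidth (hypothetical sketch)."""

    def __init__(self, bytes_per_second: int, burst_bytes: int):
        self.rate = bytes_per_second
        self.capacity = burst_bytes
        self.tokens = float(burst_bytes)
        self.last_refill = time.monotonic()

    def acquire(self, nbytes: int) -> None:
        """Block the compaction thread until it may write `nbytes` more bytes."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            # Sleep just long enough for the missing tokens to accumulate; that
            # pause is exactly the IO headroom handed back to foreground work.
            time.sleep((nbytes - self.tokens) / self.rate)

limiter = IoRateLimiter(bytes_per_second=200 * 2**20, burst_bytes=32 * 2**20)
limiter.acquire(8 * 2**20)   # an 8 MB compaction write proceeds immediately
```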
Q4. What file size should I use for SSTables?
Match file size to your block cache and IO characteristics. Larger files reduce metadata overhead but create heavier merges. Start moderate, then profile.
Q5. When should I use time window compaction?
Use it when data arrives mostly in time order and expires by TTL. It keeps recent windows small and fast while allowing old windows to compact and drop cheaply.
Q6. How does compression interact with compaction?
Keep compression light on fresh levels for fast writes and stronger on deep levels where files change rarely. This balances latency and space savings.
Further Learning
Ready to level up your storage design skills?
- Master the fundamentals of LSM trees, caching, and sharding in Grokking System Design Fundamentals.
- Practice end-to-end architectures and trade-off analysis in Grokking the System Design Interview.
- Build intuition for workload-driven choices such as compaction strategy and rate limiting in Grokking Scalable Systems for Interviews.