LSM‑trees vs B‑trees: Which index fits your workload?
Choosing between an LSM tree and a B tree is a classic storage-engine decision in a system design interview. Both are ordered indexes, yet they favor very different cost profiles. LSM trees batch writes and turn many random I/O operations into sequential work. B trees update pages in place and offer predictable point reads with excellent locality. The right fit comes from matching your workload shape to each index model.
Why It Matters
Indexes define the core latency and throughput envelope of your data layer. Pick the wrong structure and you fight constant write stalls, compaction storms, or cold-read slowness. Pick the right one and you unlock a scalable architecture with stable tail latencies. Interviewers expect you to reason from workload to structure using concrete signals such as read-to-write ratio, point versus range access, data temperature, device type, and service-level objectives. This is where you turn generic database talk into specific design choices.
How It Works Step by Step
LSM tree write path
- Append each write to a write-ahead log (WAL) for durability.
- Apply the write to an in-memory memtable, usually a skip list or red-black tree.
- When the memtable fills, flush it to disk as an immutable sorted run, often called an SSTable.
- Background compaction merges overlapping runs across levels. You choose leveled compaction (lower space amplification, more write amplification) or tiered compaction (higher space amplification, fewer rewrites).
- Reads check the memtable first, then consult per-file Bloom filters and index blocks to avoid unnecessary I/O, and finally read the target block. A minimal sketch of this path follows the list.
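To make both paths concrete, here is a minimal sketch, assuming a toy in-memory representation: `SimpleLSM`, its `memtable_limit`, and the list-of-tuples runs are illustrative inventions, not the layout of RocksDB or any real engine.

```python
import bisect

class SimpleLSM:
    """Toy LSM tree: WAL + in-memory memtable + immutable sorted runs.
    Illustrative only; real engines add levels, compaction, and block caches."""

    def __init__(self, memtable_limit=4):
        self.wal = []                 # stand-in for the on-disk write-ahead log
        self.memtable = {}            # in-memory buffer (engines use skip lists)
        self.memtable_limit = memtable_limit
        self.runs = []                # newest-first list of immutable sorted runs

    def put(self, key, value):
        self.wal.append((key, value))       # 1. durability first
        self.memtable[key] = value          # 2. buffer in memory
        if len(self.memtable) >= self.memtable_limit:
            self._flush()                   # 3. spill a sorted run ("SSTable")

    def _flush(self):
        run = sorted(self.memtable.items())  # one sequential, batched write
        self.runs.insert(0, run)             # newest run is consulted first
        self.memtable.clear()
        self.wal.clear()                     # flushed data no longer needs the WAL

    def get(self, key):
        if key in self.memtable:             # newest data wins
            return self.memtable[key]
        for run in self.runs:                # read amplification: may touch many runs
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = SimpleLSM()
for k in ["b", "a", "d", "c", "e"]:   # the fourth put triggers a flush
    db.put(k, k.upper())
print(db.get("a"), len(db.runs))      # -> A 1
```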
B tree write path
- Append the change to the write-ahead log.
- Find the leaf page via a short root-to-leaf traversal and modify it in place inside the buffer pool.
- If a page is full, split it into siblings and update the parent.
- Checkpoints and background flushes write dirty pages to disk. A sketch of the insert-and-split step follows the list.
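Here is a comparable sketch of the insert-and-split step, again with invented names (`Node`, `insert_into_leaf`, `MAX_KEYS`) and a two-level tree; a production B tree recurses up the tree, stores values, and does all of this on fixed-size pages inside a buffer pool.

```python
import bisect

MAX_KEYS = 4  # toy page capacity; real pages hold hundreds of keys per 8-16 KB page

class Node:
    def __init__(self, leaf=True):
        self.leaf = leaf
        self.keys = []       # sorted keys in this page
        self.children = []   # child pages (internal nodes only)

def insert_into_leaf(parent, leaf, key):
    """Insert in place; on overflow, split the leaf and promote a separator.
    Simplified to one root over leaves, with no recursion and no stored
    values, to show just the in-place update and the split."""
    bisect.insort(leaf.keys, key)          # in-place page modification
    if len(leaf.keys) <= MAX_KEYS:
        return                             # common case: a single page touched
    mid = len(leaf.keys) // 2              # page overflow: split into siblings
    right = Node(leaf=True)
    right.keys, leaf.keys = leaf.keys[mid:], leaf.keys[:mid]
    sep = right.keys[0]                    # separator key goes up to the parent
    i = bisect.bisect_left(parent.keys, sep)
    parent.keys.insert(i, sep)
    parent.children.insert(i + 1, right)

root = Node(leaf=False)
first = Node(leaf=True)
root.children = [first]
for k in [10, 20, 30, 40, 50]:   # the fifth insert overflows the page and splits it
    insert_into_leaf(root, first, k)
print(root.keys, first.keys)     # -> [30] [10, 20]
```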
Performance model: mental shortcuts
- LSM trees reduce random writes by batching but pay with read amplification and compaction work. Bloom filters and the block cache hide much of the pain for point reads on hot keys.
- B trees deliver low read amplification and great range scans because pages are already in order. They pay with random write cost and potential fragmentation when the working set does not fit in memory.
- On modern SSDs, sequential versus random access matters less than on spinning disks, but compaction write amplification and cache hit rates still dominate cost.
- Think in three amplifications: space, write, and read. LSM trees often win on space and write amplification for ingest-heavy flows. B trees win on read amplification for point reads and mixed transactional traffic. The back-of-envelope sketch below puts rough numbers on this.
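To put rough numbers on these shortcuts, here is a deliberately simplified back-of-envelope model; the formulas are common approximations (each level roughly `fanout` times larger, a key rewritten about `fanout` times per level), not the exact behavior of any engine.

```python
import math

def leveled_lsm_estimates(data_gb, memtable_mb=64, fanout=10):
    """Rough leveled-compaction model (assumed, simplified): each level is
    `fanout` times larger than the one above, and a key is rewritten about
    `fanout` times as it sinks through each level."""
    levels = max(1, math.ceil(math.log(data_gb * 1024 / memtable_mb, fanout)))
    write_amp = 1 + levels * fanout   # one flush plus ~fanout rewrites per level
    read_probes = 1 + levels          # memtable + roughly one run per level, pre-Bloom
    return levels, write_amp, read_probes

levels, wa, ra = leveled_lsm_estimates(data_gb=500)
print(f"~{levels} levels, write amp ~{wa}x, up to {ra} probes per point read")
# A B tree writes the same change roughly once (page + WAL), but as random I/O.
```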
Real World Example
Imagine two services inside an ecommerce platform.
Service A is a clickstream and metrics pipeline that ingests tens of millions of events per minute, runs time-bounded queries, and expires data after a few days. The access pattern is append-heavy with occasional range scans by time. An LSM-tree-backed store shines here: sequential flushes absorb spikes, tombstones make TTL deletion easy, and leveled compaction keeps space under control.
Service B is a product catalog and order ledger with many secondary indexes, strict consistency, and lots of small point reads inside transactions. B tree indexes fit perfectly: locality keeps hot leaves warm, splits are predictable, and range scans for pagination are simple.
Common Pitfalls and Trade-offs
- LSM trees can suffer from read amplification. Without solid Bloom filters, a single get may touch many files.
- Compaction stalls can create tail latency spikes. Tuning level sizes, throttling, and I/O quotas is essential.
- Tombstones from deletes and updates can accumulate and slow reads until they are compacted away.
- B trees on write-heavy workloads cause random write pressure and page fragmentation that hurt SSD lifespan and increase flush churn.
- Small range scans can be slower on LSM trees if data is spread across many runs, as the merge sketch after this list shows. Large range scans favor careful compaction and large block sizes.
- Secondary indexes are easier to keep transactionally consistent with B trees. With LSM trees you must plan for write amplification across multiple LSM structures.
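The range-scan pitfall is easiest to see in code. This k-way merge sketch uses an invented `range_scan` helper; real engines merge block iterators and resolve versions by sequence number, but the shape is the same: even a tiny scan must open an iterator on every run.

```python
import heapq

def range_scan(runs, lo, hi):
    """Merge-scan across sorted runs: even a small [lo, hi) range opens an
    iterator on every run, which is the LSM range-scan tax. `runs` is ordered
    newest first, so the lowest `age` tag on a key is its newest version."""
    def tagged(run, age):
        return ((key, age, value) for key, value in run if lo <= key < hi)

    slices = [tagged(run, age) for age, run in enumerate(runs)]
    seen = set()
    for key, _age, value in heapq.merge(*slices):   # k-way merge: by key, then age
        if key not in seen:                         # newest version wins
            seen.add(key)
            yield key, value

runs = [
    [("b", "B2"), ("e", "E")],                # newest run: "b" was updated here
    [("a", "A"), ("b", "B1"), ("c", "C")],    # older run still holds the stale "b"
]
print(list(range_scan(runs, "a", "d")))       # -> [('a', 'A'), ('b', 'B2'), ('c', 'C')]
```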
Interview Tip
Start with three signals: one, the read-to-write ratio; two, point versus range access; three, durability and TTL behavior. Then say the sentence. If writes are bursty with time-based expiration and we can tolerate slightly higher read cost, we choose an LSM tree with leveled compaction and Bloom filters sized to a target false-positive rate. If we need low-latency point reads and many secondary indexes inside transactions, we choose B tree indexes with a tuned fill factor and a generous buffer pool. This shows structured reasoning rather than preference.
Key Takeaways
- An LSM tree batches writes, excels at high ingest, and relies on compaction plus Bloom filters to control read cost.
- A B tree updates pages in place, excels at point reads and short range scans, and benefits from large buffer pools.
- Think in space, write, and read amplification and choose the profile that matches your workload.
- Device matters. Spinning disks favor LSM trees more than SSDs do, but compaction tuning still dominates cost.
- TTL-heavy and log-like data fits an LSM tree. Transactional OLTP with many secondary indexes fits a B tree.
Comparison Table
| Dimension | LSM Tree | B Tree |
|---|---|---|
| Primary fit | High ingest, logging, metrics, time series, cache warm keys | OLTP, product catalog, user profile, order ledger |
| Point reads | Good with Bloom filters and cache; otherwise multiple files | Excellent; short root-to-leaf path with locality |
| Range scans | Good with leveled compaction and large blocks; may span runs | Excellent; consecutive pages are naturally ordered |
| Write cost | Sequential flush then background compaction; fewer random writes | In-place updates with random writes and page splits |
| Space amplification | Low to moderate; depends on compaction strategy | Low with tuned fill factor and regular vacuum |
| Write amplification | Moderate to high due to compaction, especially with churn | Low to moderate; mostly single page plus WAL |
| Read amplification | Higher; mitigated by Bloom filters and index blocks | Low |
| Latency tail | Risk of compaction stalls and L0 pile-ups | Risk of checkpoint spikes and write stalls under memory pressure |
| TTL and deletes | Easy via tombstones; compacted away later | Immediate page update but leaves fragmentation |
| Secondary indexes | More complex; amplified writes across multiple LSMs | Straightforward transactional maintenance |
| Device synergy | Great on HDD and good on SSD | Good on SSD; weaker on HDD under random write load |
| Tuning knobs | Level sizes, Bloom bits per key, compaction policy, block size | Page size, fill factor, buffer pool, checkpoints |
| Representative engines | RocksDB, LevelDB, Cassandra, HBase | Postgres, InnoDB, SQL Server, many KV engines |
FAQs
Q1. When should I choose an LSM tree for a system design interview?
Pick LSM when the workload is ingest heavy, data is append like or time ordered, you can size Bloom filters, and you are comfortable tuning compaction to manage read and space amplification.
Q2. When should I choose a B tree?
Pick B tree for transactional services with many point reads, strict consistency, and several secondary indexes where locality and predictable read cost matter.
Q3. How do Bloom filters help LSM reads?
They let the read path skip files that cannot contain the key. A well-sized filter with about ten bits per key drives false positives below one percent and cuts random I/O, as the quick check below shows.
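You can verify the ten-bits-per-key claim with the standard Bloom filter approximation, assuming the optimal hash count k = (m/n) ln 2:

```python
import math

def bloom_false_positive_rate(bits_per_key):
    """Standard Bloom filter model: with optimal hash count k = (m/n) * ln 2,
    the false-positive rate is about (1 - e^(-k * n/m))^k ~= 0.6185^(m/n)."""
    k = round(bits_per_key * math.log(2))        # optimal number of hash functions
    return (1 - math.exp(-k / bits_per_key)) ** k

for bpk in (8, 10, 16):
    print(f"{bpk} bits/key -> ~{bloom_false_positive_rate(bpk):.4%} false positives")
# 10 bits/key -> ~0.8194% false positives, i.e. under one percent
```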
Q4. Are LSM trees always faster on SSDs?
No. SSDs reduce the penalty of random writes, but compaction still multiplies write cost. If the service is read-dominated, a B tree may still lead.
Q5. What about large range scans?
B trees are naturally strong here. LSM trees can approach similar performance with leveled compaction, large blocks, and prefetching, but scans may touch many runs.
Q6. How do deletes behave across both structures?
An LSM tree uses tombstones that persist until compaction removes the dead keys, as sketched below. A B tree updates pages immediately and may need periodic defragmentation.
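A toy compaction step makes the tombstone lifecycle visible. The `TOMBSTONE` marker and `compact` helper are invented for illustration, not any engine's actual record format; real engines must keep the tombstone until it reaches the bottom level, for the reason in the comment.

```python
TOMBSTONE = None  # assumed marker; real engines use a typed delete record

def compact(newer, older, is_bottom_level=True):
    """Merge two runs, newest wins. Only at the bottom level can the tombstone
    itself be dropped, because no older version can exist beneath it."""
    merged = dict(older)
    merged.update(newer)              # newer entries (including tombstones) win
    out = {}
    for key, value in merged.items():
        if value is TOMBSTONE and is_bottom_level:
            continue                  # the dead key and its tombstone disappear
        out[key] = value
    return dict(sorted(out.items()))

older = {"a": 1, "b": 2, "c": 3}
newer = {"b": TOMBSTONE, "d": 4}      # delete "b", insert "d"
print(compact(newer, older))          # -> {'a': 1, 'c': 3, 'd': 4}
```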
Further Learning
To practice the reasoning patterns that interviewers reward, study our step-by-step approach in Grokking the System Design Interview.
If you want a deeper exploration of storage engines, caching, and throughput latency trade offs inside distributed systems, enroll in Grokking Scalable Systems for Interviews. Both courses include projects and review checklists you can reuse in interviews.