LSM‑trees vs B‑trees: Which index fits your workload?
Choosing between an LSM tree and a B tree is a classic storage-engine decision in a system design interview. Both are ordered indexes, yet they favor very different cost profiles. LSM trees batch writes and turn many random I/O operations into sequential work. B trees update pages in place and offer predictable point reads with excellent locality. The right fit comes from matching your workload shape to each index model.
Why It Matters
Indexes define the core latency and throughput envelope of your data layer. Pick the wrong structure and you fight constant write stalls, compaction storms, or cold-read slowness. Pick the right one and you unlock a scalable architecture with stable tail latencies. Interviewers expect you to reason from workload to structure using concrete signals such as read-to-write ratio, point versus range access, data temperature, device type, and service-level objectives. This is where you turn generic database talk into specific design choices.
How It Works Step by Step
LSM tree write path
- Append each write to a write-ahead log (WAL) for durability.
- Apply the write to an in-memory memtable, usually a skip list or red-black tree.
- When the memtable fills, flush it to disk as an immutable sorted run, often called an SSTable.
- Background compaction merges overlapping runs across levels. You choose leveled compaction (lower space amplification, more write amplification) or tiered compaction (higher space amplification, fewer rewrites).
- Reads check the memtable first, then consult per-file Bloom filters and index blocks to avoid unnecessary I/O, and finally read the target block. A minimal sketch of this path follows the list.
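To make both paths concrete, here is a minimal sketch, assuming a toy in-memory representation: `SimpleLSM`, its `memtable_limit`, and the list-of-tuples runs are illustrative inventions, not the layout of RocksDB or any real engine.

```python
import bisect

class SimpleLSM:
    """Toy LSM tree: WAL + in-memory memtable + immutable sorted runs.
    Illustrative only; real engines add levels, compaction, and block caches."""

    def __init__(self, memtable_limit=4):
        self.wal = []                 # stand-in for the on-disk write-ahead log
        self.memtable = {}            # in-memory buffer (engines use skip lists)
        self.memtable_limit = memtable_limit
        self.runs = []                # newest-first list of immutable sorted runs

    def put(self, key, value):
        self.wal.append((key, value))       # 1. durability first
        self.memtable[key] = value          # 2. buffer in memory
        if len(self.memtable) >= self.memtable_limit:
            self._flush()                   # 3. spill a sorted run ("SSTable")

    def _flush(self):
        run = sorted(self.memtable.items())  # one sequential, batched write
        self.runs.insert(0, run)             # newest run is consulted first
        self.memtable.clear()
        self.wal.clear()                     # flushed data no longer needs the WAL

    def get(self, key):
        if key in self.memtable:             # newest data wins
            return self.memtable[key]
        for run in self.runs:                # read amplification: may touch many runs
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = SimpleLSM()
for k in ["b", "a", "d", "c", "e"]:   # the fourth put triggers a flush
    db.put(k, k.upper())
print(db.get("a"), len(db.runs))      # -> A 1
```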
B tree write path
- Append the change to the write-ahead log.
- Find the leaf page via a short root-to-leaf traversal and modify it in place inside the buffer pool.
- If a page is full, split it into siblings and update the parent.
- Checkpoints and background flushes write dirty pages to disk. A sketch of the insert-and-split step follows the list.
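Here is a comparable sketch of the insert-and-split step, again with invented names (`Node`, `insert_into_leaf`, `MAX_KEYS`) and a two-level tree; a production B tree recurses up the tree, stores values, and does all of this on fixed-size pages inside a buffer pool.

```python
import bisect

MAX_KEYS = 4  # toy page capacity; real pages hold hundreds of keys per 8-16 KB page

class Node:
    def __init__(self, leaf=True):
        self.leaf = leaf
        self.keys = []       # sorted keys in this page
        self.children = []   # child pages (internal nodes only)

def insert_into_leaf(parent, leaf, key):
    """Insert in place; on overflow, split the leaf and promote a separator.
    Simplified to one root over leaves, with no recursion and no stored
    values, to show just the in-place update and the split."""
    bisect.insort(leaf.keys, key)          # in-place page modification
    if len(leaf.keys) <= MAX_KEYS:
        return                             # common case: a single page touched
    mid = len(leaf.keys) // 2              # page overflow: split into siblings
    right = Node(leaf=True)
    right.keys, leaf.keys = leaf.keys[mid:], leaf.keys[:mid]
    sep = right.keys[0]                    # separator key goes up to the parent
    i = bisect.bisect_left(parent.keys, sep)
    parent.keys.insert(i, sep)
    parent.children.insert(i + 1, right)

root = Node(leaf=False)
first = Node(leaf=True)
root.children = [first]
for k in [10, 20, 30, 40, 50]:   # the fifth insert overflows the page and splits it
    insert_into_leaf(root, first, k)
print(root.keys, first.keys)     # -> [30] [10, 20]
```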
Performance model: mental shortcuts
- LSM trees reduce random writes by batching but pay with read amplification and compaction work. Bloom filters and the block cache hide much of the pain for point reads on hot keys.
- B trees deliver low read amplification and great range scans because pages are already in order. They pay with random write cost and potential fragmentation when the working set does not fit in memory.
- On modern SSDs, sequential versus random access matters less than on spinning disks, but compaction write amplification and cache hit rates still dominate cost.
- Think in three amplifications: space, write, and read. LSM trees often win on space and write amplification for ingest-heavy flows. B trees win on read amplification for point reads and mixed transactional traffic. The back-of-envelope sketch below puts rough numbers on this.
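To put rough numbers on these shortcuts, here is a deliberately simplified back-of-envelope model; the formulas are common approximations (each level roughly `fanout` times larger, a key rewritten about `fanout` times per level), not the exact behavior of any engine.

```python
import math

def leveled_lsm_estimates(data_gb, memtable_mb=64, fanout=10):
    """Rough leveled-compaction model (assumed, simplified): each level is
    `fanout` times larger than the one above, and a key is rewritten about
    `fanout` times as it sinks through each level."""
    levels = max(1, math.ceil(math.log(data_gb * 1024 / memtable_mb, fanout)))
    write_amp = 1 + levels * fanout   # one flush plus ~fanout rewrites per level
    read_probes = 1 + levels          # memtable + roughly one run per level, pre-Bloom
    return levels, write_amp, read_probes

levels, wa, ra = leveled_lsm_estimates(data_gb=500)
print(f"~{levels} levels, write amp ~{wa}x, up to {ra} probes per point read")
# A B tree writes the same change roughly once (page + WAL), but as random I/O.
```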
Real World Example
Imagine two services inside an ecommerce platform.
Service A is a clickstream and metrics pipeline that ingests tens of millions of events per minute, runs time-bounded queries, and expires data after a few days. The access pattern is append-heavy with occasional range scans by time. An LSM-tree-backed store shines here: sequential flushes absorb spikes, tombstones make TTL deletion easy, and leveled compaction keeps space under control.
Service B is a product catalog and order ledger with many secondary indexes, strict consistency, and lots of small point reads inside transactions. B tree indexes fit perfectly: locality keeps hot leaves warm, splits are predictable, and range scans for pagination are simple.
Common Pitfalls and Trade-offs
- LSM trees can suffer from read amplification. Without solid Bloom filters, a single get may touch many files.
- Compaction stalls can create tail latency spikes. Tuning level sizes, throttling, and I/O quotas is essential.
- Tombstones from deletes and updates can accumulate and slow reads until they are compacted away.
- B trees on write-heavy workloads cause random write pressure and page fragmentation that hurt SSD lifespan and increase flush churn.
- Small range scans can be slower on LSM trees if data is spread across many runs, as the merge sketch after this list shows. Large range scans favor careful compaction and large block sizes.
- Secondary indexes are easier to keep transactionally consistent with B trees. With LSM trees you must plan for write amplification across multiple LSM structures.
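The range-scan pitfall is easiest to see in code. This k-way merge sketch uses an invented `range_scan` helper; real engines merge block iterators and resolve versions by sequence number, but the shape is the same: even a tiny scan must open an iterator on every run.

```python
import heapq

def range_scan(runs, lo, hi):
    """Merge-scan across sorted runs: even a small [lo, hi) range opens an
    iterator on every run, which is the LSM range-scan tax. `runs` is ordered
    newest first, so the lowest `age` tag on a key is its newest version."""
    def tagged(run, age):
        return ((key, age, value) for key, value in run if lo <= key < hi)

    slices = [tagged(run, age) for age, run in enumerate(runs)]
    seen = set()
    for key, _age, value in heapq.merge(*slices):   # k-way merge: by key, then age
        if key not in seen:                         # newest version wins
            seen.add(key)
            yield key, value

runs = [
    [("b", "B2"), ("e", "E")],                # newest run: "b" was updated here
    [("a", "A"), ("b", "B1"), ("c", "C")],    # older run still holds the stale "b"
]
print(list(range_scan(runs, "a", "d")))       # -> [('a', 'A'), ('b', 'B2'), ('c', 'C')]
```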
Interview Tip
Start with three signals: one, the read-to-write ratio; two, point versus range access; three, durability and TTL behavior. Then say the sentence. If writes are bursty with time-based expiration and we can tolerate slightly higher read cost, we choose an LSM tree with leveled compaction and Bloom filters sized to a target false-positive rate. If we need low-latency point reads and many secondary indexes inside transactions, we choose B tree indexes with a tuned fill factor and a generous buffer pool. This shows structured reasoning rather than preference.
Key Takeaways
- An LSM tree batches writes, excels at high ingest, and relies on compaction plus Bloom filters to control read cost.
- A B tree updates pages in place, excels at point reads and short range scans, and benefits from large buffer pools.
- Think in space, write, and read amplification and choose the profile that matches your workload.
- Device matters. Spinning disks favor LSM trees more than SSDs do, but compaction tuning still dominates cost.
- TTL-heavy and log-like data fits an LSM tree. Transactional OLTP with many secondary indexes fits a B tree.
Comparison Table
| Dimension | LSM Tree | B Tree |
|---|---|---|
| Primary fit | High ingest, logging, metrics, time series, cache warm keys | OLTP, product catalog, user profile, order ledger |
| Point reads | Good with Bloom filters and cache; otherwise multiple files | Excellent; short root-to-leaf path with locality |
| Range scans | Good with leveled compaction and large blocks; may span runs | Excellent; consecutive pages are naturally ordered |
| Write cost | Sequential flush then background compaction; fewer random writes | In-place updates with random writes and page splits |
| Space amplification | Low to moderate; depends on compaction strategy | Low with tuned fill factor and regular vacuum |
| Write amplification | Moderate to high due to compaction, especially with churn | Low to moderate; mostly single page plus WAL |
| Read amplification | Higher; mitigated by Bloom filters and index blocks | Low |
| Latency tail | Risk of compaction stalls and L0 pile-ups | Risk of checkpoint spikes and write stalls under memory pressure |
| TTL and deletes | Easy via tombstones; compacted away later | Immediate page update but leaves fragmentation |
| Secondary indexes | More complex; amplified writes across multiple LSMs | Straightforward transactional maintenance |
| Device synergy | Great on HDD and good on SSD | Good on SSD; weaker on HDD under random write load |
| Tuning knobs | Level sizes, Bloom bits per key, compaction policy, block size | Page size, fill factor, buffer pool, checkpoints |
| Representative engines | RocksDB, LevelDB, Cassandra, HBase | Postgres, InnoDB, SQL Server, many KV engines |
FAQs
Q1. When should I choose an LSM tree for a system design interview?
Pick LSM when the workload is ingest heavy, data is append like or time ordered, you can size Bloom filters, and you are comfortable tuning compaction to manage read and space amplification.
Q2. When should I choose a B tree?
Pick B tree for transactional services with many point reads, strict consistency, and several secondary indexes where locality and predictable read cost matter.
Q3. How do Bloom filters help LSM reads?
They let the read path skip files that cannot contain the key. A well-sized filter with about ten bits per key drives false positives below one percent and cuts random I/O, as the quick check below shows.
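You can verify the ten-bits-per-key claim with the standard Bloom filter approximation, assuming the optimal hash count k = (m/n) ln 2:

```python
import math

def bloom_false_positive_rate(bits_per_key):
    """Standard Bloom filter model: with optimal hash count k = (m/n) * ln 2,
    the false-positive rate is about (1 - e^(-k * n/m))^k ~= 0.6185^(m/n)."""
    k = round(bits_per_key * math.log(2))        # optimal number of hash functions
    return (1 - math.exp(-k / bits_per_key)) ** k

for bpk in (8, 10, 16):
    print(f"{bpk} bits/key -> ~{bloom_false_positive_rate(bpk):.4%} false positives")
# 10 bits/key -> ~0.8194% false positives, i.e. under one percent
```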
Q4. Are LSM trees always faster on SSDs?
No. SSDs reduce the penalty of random writes, but compaction still multiplies write cost. If the service is read-dominated, a B tree may still lead.
Q5. What about large range scans?
B trees are naturally strong here. LSM trees can approach similar performance with leveled compaction, large blocks, and prefetching, but scans may touch many runs.
Q6. How do deletes behave across both structures?
An LSM tree uses tombstones that persist until compaction removes the dead keys, as sketched below. A B tree updates pages immediately and may need periodic defragmentation.
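A toy compaction step makes the tombstone lifecycle visible. The `TOMBSTONE` marker and `compact` helper are invented for illustration, not any engine's actual record format; real engines must keep the tombstone until it reaches the bottom level, for the reason in the comment.

```python
TOMBSTONE = None  # assumed marker; real engines use a typed delete record

def compact(newer, older, is_bottom_level=True):
    """Merge two runs, newest wins. Only at the bottom level can the tombstone
    itself be dropped, because no older version can exist beneath it."""
    merged = dict(older)
    merged.update(newer)              # newer entries (including tombstones) win
    out = {}
    for key, value in merged.items():
        if value is TOMBSTONE and is_bottom_level:
            continue                  # the dead key and its tombstone disappear
        out[key] = value
    return dict(sorted(out.items()))

older = {"a": 1, "b": 2, "c": 3}
newer = {"b": TOMBSTONE, "d": 4}      # delete "b", insert "d"
print(compact(newer, older))          # -> {'a': 1, 'c': 3, 'd': 4}
```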
Further Learning
To practice the reasoning patterns that interviewers reward, study our step-by-step approach in Grokking the System Design Interview.
If you want a deeper exploration of storage engines, caching, and throughput latency trade offs inside distributed systems, enroll in Grokking Scalable Systems for Interviews. Both courses include projects and review checklists you can reuse in interviews.