On this page
Why Database Storage Matters
The Traditional Approach: B-Trees
Why B-Trees Are So Good at Reads
The Cost of Updating B-Trees
Enter the LSM Tree
How an LSM Tree Works
Step 1: Write to Memory
Step 2: Write Ahead Log
Step 3: Flush to Disk
Step 4: Compaction
Why LSM Trees Are Extremely Fast for Writes
Reads Become More Complicated
The Hidden Cost of Compaction
Range Queries: B-Tree Wins
Write Throughput: LSM Tree Wins
Space Amplification vs Write Amplification
Which Databases Use Which?
B-Tree Based
LSM Tree Based
Choosing the Right Storage Engine
What Interviewers Expect
Learn Database Internals Through System Design
Final Thoughts
LSM Trees vs. B-Trees: How Modern Databases Handle Reads and Writes


On This Page
Why Database Storage Matters
The Traditional Approach: B-Trees
Why B-Trees Are So Good at Reads
The Cost of Updating B-Trees
Enter the LSM Tree
How an LSM Tree Works
Step 1: Write to Memory
Step 2: Write Ahead Log
Step 3: Flush to Disk
Step 4: Compaction
Why LSM Trees Are Extremely Fast for Writes
Reads Become More Complicated
The Hidden Cost of Compaction
Range Queries: B-Tree Wins
Write Throughput: LSM Tree Wins
Space Amplification vs Write Amplification
Which Databases Use Which?
B-Tree Based
LSM Tree Based
Choosing the Right Storage Engine
What Interviewers Expect
Learn Database Internals Through System Design
Final Thoughts
When engineers compare databases, the discussion usually revolves around SQL versus NoSQL, consistency models, replication, or sharding. Those are important architectural decisions, but they often hide a more fundamental question:
How does the database actually organize data on disk?
The answer has a profound impact on performance.
Whether your application serves millions of reads per second, ingests a constant stream of IoT data, or powers a social media feed, much of its performance depends on the underlying storage engine. Two of the most influential data structures in modern databases are B-Trees and Log-Structured Merge Trees (LSM Trees).
Although they solve the same problem—efficiently storing and retrieving large amounts of data—they take dramatically different approaches. B-Trees optimize for balanced read and write performance by updating data in place, while LSM Trees optimize for high write throughput by treating storage as an append-only system and reorganizing data later.
Whether you're preparing for a system design interview or designing production systems, understanding storage engines gives you a much deeper appreciation of why databases behave differently under different workloads. If you're new to these concepts, Grokking System Design Fundamentals provides an excellent introduction to database internals, caching, replication, partitioning, and other building blocks that every software engineer should understand before diving into advanced distributed systems.
In this article, we'll explore how B-Trees and LSM Trees work, why they were designed differently, their strengths and weaknesses, and when each one is the better choice.
Why Database Storage Matters
Every database must solve the same basic challenge.
Applications continuously issue operations like:
- Insert a new row.
- Update an existing record.
- Delete obsolete data.
- Find a specific key.
- Retrieve a range of values.
These operations sound simple until you consider scale.
A production database may store billions of records, far exceeding available memory. Most of the data lives on disk, where reads and writes are significantly slower than in-memory operations. The storage engine's primary job is to minimize expensive disk I/O while keeping data organized for efficient access.
This is why database internals matter. Two databases may expose similar SQL syntax or APIs, yet perform very differently because their storage engines optimize for different workloads.
Before choosing a database, it's important to understand the workload you're optimizing for. Modern system design interviews often expect candidates to explain not only which database they would choose, but also why that storage engine fits the application's read/write characteristics. This type of architectural reasoning is covered extensively in Grokking the System Design Interview, where database selection is always discussed in the context of scalability, latency, and real-world trade-offs rather than as isolated technologies.
The Traditional Approach: B-Trees
For decades, B-Trees have been the dominant indexing structure for relational databases.
The idea is elegant.
Instead of storing every key sequentially, a B-Tree organizes keys into balanced pages. Each page contains multiple sorted keys along with pointers to child pages. Because every leaf node sits at roughly the same depth, searching for any key requires only a small number of page reads.
Imagine looking up a customer ID in a table with hundreds of millions of rows. Rather than scanning the entire dataset, the database traverses the tree from the root to the appropriate leaf node, narrowing the search space at each level.
A simplified B-Tree might look like this:
Searching for 60 follows a predictable path:
50 → 75 → 60
Even though the actual tree may contain billions of records, the number of disk accesses remains relatively small because each node stores many keys rather than just one.
Why B-Trees Are So Good at Reads
B-Trees excel because their data remains sorted at all times.
Suppose you need every order placed between 10:00 AM and 11:00 AM.
A B-Tree can quickly locate the first matching record and then read neighboring leaf pages sequentially. Since nearby keys are stored close together, range scans become extremely efficient.
This is one reason traditional relational databases rely heavily on B-Tree indexes for queries involving:
- Date ranges
- Alphabetical ordering
- Numeric intervals
- Prefix searches
- ORDER BY clauses
The sorted structure naturally supports these operations without requiring additional work.
The Cost of Updating B-Trees
The same structure that makes reads efficient also makes writes more expensive.
Consider inserting a new key into the middle of an already full page.
The database may need to:
- Read the existing page from disk.
- Insert the new key while maintaining sorted order.
- Split the page if necessary.
- Update parent pointers.
- Write multiple modified pages back to disk.
These random disk writes are significantly more expensive than sequential writes.
As write rates increase, maintaining the tree becomes increasingly costly.
For workloads dominated by continuous writes, another approach performs much better.
Enter the LSM Tree
Log-Structured Merge Trees were designed around a different assumption.
Disk writes are expensive.
Sequential writes are cheap.
Instead of modifying existing data immediately, an LSM Tree accumulates changes in memory and writes them sequentially to disk later.
The philosophy is simple:
Never rewrite existing files if you can append instead.
This dramatically reduces random disk I/O.
How an LSM Tree Works
The write path consists of several stages.
Step 1: Write to Memory
Incoming writes are stored inside an in-memory structure called a MemTable.
Because memory is fast, inserts complete quickly.
Step 2: Write Ahead Log
Before acknowledging success, the database also appends the operation to a Write-Ahead Log (WAL).
If the server crashes, the WAL allows the MemTable to be reconstructed.
Step 3: Flush to Disk
Once the MemTable reaches a threshold, it is written sequentially to disk as an immutable file called an SSTable (Sorted String Table).
No existing data is modified.
A new file is simply created.
Step 4: Compaction
Over time, multiple SSTables accumulate.
Background compaction merges overlapping files, removes obsolete versions, and discards deleted records.
This keeps the storage engine manageable while preserving efficient lookups.
Why LSM Trees Are Extremely Fast for Writes
Imagine receiving one million sensor readings every minute.
A B-Tree would continuously modify existing pages.
An LSM Tree simply keeps appending new records.
Sequential disk writes are much faster than scattered random writes, particularly on traditional hard drives and still beneficial on SSDs.
This is why many write-heavy systems choose LSM Trees.
Examples include:
- Cassandra
- RocksDB
- LevelDB
- ScyllaDB
- HBase
These databases are optimized for continuous ingestion rather than complex analytical queries.
Reads Become More Complicated
The trade-off appears during reads.
Suppose a key exists.
Where should the database look?
It may be:
- In the MemTable.
- In the newest SSTable.
- In an older SSTable.
- In an even older SSTable.
Without optimization, every lookup could require checking multiple files.
To reduce unnecessary disk reads, LSM Trees use Bloom Filters.
A Bloom Filter quickly answers:
"This key definitely isn't here."
or
"This key might be here."
If the filter says the key cannot exist inside a particular SSTable, the database skips reading that file entirely.
Bloom Filters dramatically improve lookup performance while consuming very little memory.
The Hidden Cost of Compaction
Compaction is both the strength and weakness of LSM Trees.
Eventually, scattered SSTables must be merged.
During compaction the database:
- Reads existing files.
- Merges records.
- Removes duplicates.
- Deletes obsolete versions.
- Writes new SSTables.
Although this happens in the background, it consumes CPU, memory, and disk bandwidth.
Large compactions can temporarily increase latency if not managed carefully.
Modern databases devote significant engineering effort to scheduling compactions efficiently.
Range Queries: B-Tree Wins
Suppose an application needs:
SELECT \*
FROM Orders
WHERE CreatedAt BETWEEN ...
B-Trees excel.
Data already exists in sorted order.
The database finds the first matching key and scans forward sequentially.
LSM Trees can support the same query, but data may reside across multiple SSTables created at different times. The engine often performs additional merging during reads, making large range scans more expensive.
This is one reason relational databases remain popular for transactional business applications.
Write Throughput: LSM Tree Wins
Now imagine an IoT platform ingesting telemetry from five million devices.
Every second brings another flood of writes.
Random updates become the bottleneck.
Sequential appends become the advantage.
LSM Trees handle these workloads exceptionally well because they transform random writes into sequential disk operations.
Applications involving:
- Logging
- Metrics
- Time-series data
- Event ingestion
- Social activity feeds
often benefit from LSM-based storage engines.
Space Amplification vs Write Amplification
Neither data structure is free.
B-Trees primarily pay the cost during updates.
LSM Trees pay the cost later through compaction.
Engineers describe these trade-offs using three concepts:
- Read amplification
- Write amplification
- Space amplification
Read amplification measures how many disk reads are required for one logical read.
Write amplification measures how much physical writing occurs for one logical write.
Space amplification measures how much additional storage is consumed beyond the actual dataset.
B-Trees generally have lower read amplification.
LSM Trees generally achieve higher write throughput but incur greater write amplification because data is rewritten during compaction.
Understanding these trade-offs is far more useful than memorizing which database uses which structure.
Which Databases Use Which?
B-Tree Based
- PostgreSQL
- MySQL (InnoDB)
- SQLite
- Oracle Database
- SQL Server
These databases prioritize predictable reads, indexing, and transactional workloads.
LSM Tree Based
- Cassandra
- RocksDB
- LevelDB
- ScyllaDB
- HBase
These systems prioritize sustained write performance and horizontal scalability.
Choosing the Right Storage Engine
A common system design mistake is choosing databases based only on popularity.
Instead, start with workload characteristics.
Ask questions like:
- Is the system read-heavy or write-heavy?
- Are range scans common?
- Is data mostly append-only?
- Are updates frequent?
- Is write latency more important than read latency?
The answers often determine whether a B-Tree or LSM Tree is the better foundation.
What Interviewers Expect
In system design interviews, you rarely need to explain the implementation details of B-Trees or LSM Trees.
Instead, interviewers want to know whether you understand the trade-offs.
A strong answer might sound like:
"For this write-heavy event ingestion system, I'd prefer an LSM Tree–based database because sequential writes provide much higher ingestion throughput. If the workload were dominated by transactional queries and range scans, I'd lean toward a B-Tree–based relational database instead."
That explanation demonstrates architectural reasoning rather than memorization.
Learn Database Internals Through System Design
Storage engines are only one piece of modern database architecture.
To make good architectural decisions, you also need to understand indexing, caching, replication, partitioning, consistency models, load balancing, and distributed communication. These concepts work together, and interviewers typically expect you to reason about them as a complete system rather than as individual topics.
If you're building your foundation, Grokking System Design Fundamentals explains the core building blocks of scalable systems, including databases, caches, messaging systems, replication, and storage architecture. It's an excellent starting point before tackling more complex system design problems.
Once you're comfortable with the fundamentals, Grokking the System Design Interview shows how these concepts come together in real-world designs such as Instagram, Twitter, Uber, URL shorteners, and other large-scale systems. Rather than memorizing architectures, you'll learn how to evaluate trade-offs and justify design decisions—the exact skill interviewers are looking for.
One of the biggest advantages of understanding B-Trees and LSM Trees is that database selection stops becoming guesswork. Instead of choosing PostgreSQL, MySQL, Cassandra, or RocksDB based on popularity, you'll understand the workload characteristics that make each storage engine a good architectural fit.
Final Thoughts
B-Trees and LSM Trees represent two different philosophies for managing persistent data.
B-Trees keep information continuously organized, making reads and range queries exceptionally efficient. LSM Trees defer organization, optimizing for fast sequential writes and reorganizing data later through compaction.
Neither approach is universally superior.
The best choice depends entirely on workload characteristics.
Read-heavy transactional systems often benefit from B-Trees.
Write-intensive event-driven systems often benefit from LSM Trees.
As distributed systems continue growing in scale, understanding these storage engines becomes increasingly valuable. They are not merely implementation details hidden inside databases—they influence latency, throughput, storage efficiency, and ultimately the architecture of the systems we build.
What our users say
Matzuk
Algorithms can be daunting, but they're less so with the right guide. This course - https://www.designgurus.io/course/grokking-the-coding-interview, is a great starting point. It covers typical problems you might encounter in interviews.
Arijeet
Just completed the “Grokking the system design interview”. It's amazing and super informative. Have come across very few courses that are as good as this!
AHMET HANIF
Whoever put this together, you folks are life savers. Thank you :)
Access to 50+ courses
New content added monthly
Certificate of completion
$31.08
/month
Billed Annually
Recommended Course

Grokking the System Design Interview
178,110+ students
4.7
The #1 system design course for FAANG interviews, built by ex-FAANG hiring managers.
View CourseRead More
Scaling SQL Databases: 8 Challenges of Horizontally Scaling SQL Databases
Arslan Ahmad
Database Indexes Are Not Free: The Read, Write, and Storage Trade-offs You Need to Know
Arslan Ahmad
PostgreSQL vs. MongoDB vs. DynamoDB
Arslan Ahmad
Grokking the Fundamentals of Database Replication for System Design Interviews
Arslan Ahmad