Database Indexing Explained: B-Tree, Hash, Bitmap, R-Tree, and When to Use Each


On This Page
What a database index actually is
How indexes speed up queries
B-Tree indexes: the default, and usually the right choice
Hash indexes: blazing fast, but narrow
Bitmap indexes: the analytics workhorse
R-Tree indexes: the spatial specialist
Other index types worth knowing
Clustered vs non-clustered indexes
Composite indexes and the leftmost-prefix rule
When to add an index — and when not to
How indexes affect writes: the trade-off nobody talks about
Common interview questions and follow-ups
Putting it all together
Keep learning
Here's a question engineers answer wrong in interviews constantly: "How does a database find a specific row in a table with a billion rows without scanning all billion?"
The answer is the index. And once you understand how indexes actually work, database performance stops feeling like black magic. Slow queries become diagnosable. Query plans become readable. The difference between "this table is unusable at scale" and "this table serves 100K QPS from one box" is almost always a question of whether the right index exists on the right column.
This guide covers database indexing end to end. By the time you're done, you'll know:
- What an index actually is under the hood — not the hand-wave version
- The four index types you'll actually encounter (B-Tree, Hash, Bitmap, R-Tree) and when each one wins
- The hidden trade-offs: storage, write amplification, the cost of over-indexing
- Composite indexes and the leftmost-prefix rule (the senior-level knowledge most articles skip)
- How to answer "would you index this column?" cleanly in a system design interview
Let's dig in.
What a database index actually is
Think about how you use a physical book. If someone asks you "where does the book discuss photosynthesis?", you don't read every page from cover to cover. You go to the index at the back, find "photosynthesis," and jump to the listed pages.
A database index is exactly that. It's an auxiliary data structure — a separate, smaller structure stored alongside your table — that maps column values to row locations. When the database needs to find rows matching a WHERE clause on an indexed column, it searches the index instead of scanning the table.
Concretely: if you create an index on the email column of a users table, the database builds a sorted structure containing every email value paired with a pointer to the row holding that email. A query like SELECT * FROM users WHERE email = 'alice@example.com' searches the small sorted index (fast), finds the pointer, and fetches the one row (also fast).
Without the index, the database would scan every row in the users table comparing each email to 'alice@example.com'. At 100 million rows, that's the difference between milliseconds and minutes.
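You can watch this happen in a query planner. The following sketch uses SQLite's in-memory engine (via Python's sqlite3 module); the users table and idx_users_email index name are illustrative, and the exact plan wording varies by SQLite version:

```python
import sqlite3

# Build a small table with no index on email.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [(f"user{i}@example.com",) for i in range(1000)],
)

# EXPLAIN QUERY PLAN rows are (id, parent, notused, detail).
def plan(sql):
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][3]

# Without an index, the planner falls back to a full table scan.
plan_before = plan("SELECT * FROM users WHERE email = 'user42@example.com'")
print(plan_before)  # e.g. "SCAN users"

# With an index, the same query becomes an index search.
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = plan("SELECT * FROM users WHERE email = 'user42@example.com'")
print(plan_after)  # e.g. "SEARCH users USING INDEX idx_users_email (email=?)"
```

The same experiment works in PostgreSQL or MySQL with EXPLAIN; SQLite just makes it easy to run in one file.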
How indexes speed up queries
The math matters because it's what interviewers listen for when they ask "why does an index help?"
Without an index, finding a row with WHERE name = 'Alice' on an N-row table is O(N) — you inspect every row in the worst case. At 100 million rows, that's 100 million comparisons.
With a B-Tree index on the name column, the same lookup is O(log N) — roughly 27 comparisons for 100 million rows. That's a 3.7-million-fold reduction in work.
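The arithmetic is worth checking yourself, since interviewers sometimes probe it. A minimal sketch:

```python
import math

N = 100_000_000

# Worst-case comparisons for a full scan: one per row.
linear = N

# Comparisons for a balanced binary search over N sorted keys.
binary = math.ceil(math.log2(N))

print(linear, binary)   # 100000000 vs 27
print(linear // binary) # roughly a 3.7-million-fold reduction
```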
Under the hood, the index lookup goes something like:
- Index lookup: The database checks the index (sorted by the target column) instead of scanning the full table.
- Traverse the tree: For a B-Tree index, it follows pointers from root to leaf, narrowing down with a handful of comparisons.
- Retrieve the row: Once the index finds the matching key, it uses the stored pointer to fetch the full row from the main table.
This last step matters — the index stores column values + pointers, not the full rows. That's why indexes are much smaller than the tables they index. It's also why you'll hear about covering indexes — specialized indexes that include all the columns a query needs, so the database can answer the query from the index alone without fetching the row at all. More on that later.
Now, the choice of data structure determines what queries the index can accelerate. B-Trees handle range queries. Hash indexes handle exact matches. Bitmap indexes handle low-cardinality filters. R-Trees handle spatial queries. Let's walk through each.
B-Tree indexes: the default, and usually the right choice
B-Tree indexes are the default in virtually every relational database — PostgreSQL, MySQL, SQL Server, Oracle. When you type CREATE INDEX without specifying a type, you get a B-Tree. When a database's query planner picks an index without you specifying one, it's almost always a B-Tree.
How it works: A B-Tree (technically, most production databases use a B+ Tree variant) is a self-balancing, multi-way tree. Each node holds many keys in sorted order, with pointers to child nodes. The tree stays balanced — meaning all leaf nodes sit at the same depth — which guarantees predictable O(log N) lookup time.
The key property is that the tree stays wide and shallow. A node can hold hundreds of keys, so even a table with 100 million rows might have a B-Tree only 3–4 levels deep. Each level corresponds to one disk read. Finding any row takes ~4 disk reads, regardless of table size.
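To see why the tree stays shallow, plug in a fanout. The 300 keys per node below is an assumed, plausible figure for a B+ Tree page, not a number from any specific database:

```python
import math

fanout = 300        # assumed keys per node; real values depend on page and key size
rows = 100_000_000

# Depth of a balanced multi-way tree: log base `fanout` of the row count.
depth = math.ceil(math.log(rows) / math.log(fanout))
print(depth)  # 4 levels, so ~4 page reads to reach any row
```

Compare that to the 27 levels a binary tree would need: the width of each node is what keeps disk reads per lookup in the low single digits.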
Strengths:
- Versatile. Handles exact-match queries (WHERE id = 42), range queries (WHERE created_at BETWEEN '2026-01-01' AND '2026-02-01'), prefix matches (WHERE name LIKE 'Al%'), and ORDER BY clauses. One index handles all of these.
- Efficient at scale. O(log N) lookup remains fast even for billion-row tables.
- High-cardinality friendly. Works equally well on columns with many distinct values (like user IDs or timestamps).
- Self-balancing. Inserts and deletes don't degrade performance — the tree rebalances automatically.
Weaknesses:
- Write amplification. Every insert, update, or delete has to update the index. If a table has 10 indexes, every row modification does 10 extra writes.
- Storage overhead. An index on a column costs roughly as much disk space as the column itself, plus pointers.
When to use it: As the default. If you don't have a specific reason to pick something else, use a B-Tree. For range queries and sorted retrieval, it's the only right answer — hash and bitmap indexes don't support these at all.
Real example: An e-commerce app indexes orders.order_date with a B-Tree. The query WHERE order_date BETWEEN '2026-01-01' AND '2026-03-31' returns Q1 orders in milliseconds instead of scanning the whole orders table. Same app indexes customers.last_name with a B-Tree so lookups like "find all Smiths" are instant.
Hash indexes: blazing fast, but narrow
A hash index uses a completely different data structure: a hash table. You give it a key, it hashes the key, and it jumps directly to the bucket holding that key's pointer. Average lookup time is O(1) — constant, regardless of table size.
How it works: When the index is built, each indexed value is passed through a hash function that produces a bucket location. Pointers are stored at those buckets. To look up a value, the database hashes the query value and goes directly to the matching bucket. No tree traversal, no comparisons with other keys.
The strength is obvious — O(1) is as fast as lookup gets. The weakness is less obvious but more important: hash indexes only support exact equality lookups. They can't help with anything else.
Strengths:
- O(1) exact-match lookups. Faster than B-Tree for equality queries, particularly at high cardinality.
- Minimal storage per entry. No sort order to maintain.
- Low overhead. Simpler to update than a tree.
Weaknesses:
- No range queries. WHERE id > 1000 can't use a hash index. The hash scrambles the natural ordering.
- No sorting. ORDER BY on a hash-indexed column requires a separate sort step.
- No prefix matches. WHERE name LIKE 'Al%' is a full scan with only a hash index.
- Collisions. Even with a good hash function, two distinct keys can produce the same bucket, requiring a collision resolution mechanism (which adds overhead).
- No partial keys. You must provide the full exact value.
When to use it: When you're certain the only query pattern is WHERE column = :value — exact equality lookups with no range, no sort, no prefix. In practice, this is less common than you'd expect. Classic cases:
- In-memory key-value stores like Redis, which are fundamentally built around hash indexing.
- PostgreSQL hash indexes for equality-only lookup columns (though PostgreSQL's B-Tree is nearly as fast for this case, so hash indexes are a niche optimization).
- MySQL MEMORY tables where hash is the default.
- Some NoSQL databases for partition key lookups.
The common mistake: Defaulting to a hash index because "it's faster." Yes, it's faster for the one case it supports — but the moment you need a range query or a sort, you're back to a full scan. B-Tree is the safer default. Only switch to hash if you've measured the benefit and you're sure you'll never need range semantics.
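A toy hash index makes the limitation concrete. This is a sketch over made-up in-memory data, not a database internal — Python's dict plays the role of the hash table:

```python
# A pretend table: row id -> email. All data is illustrative.
rows = {1: "alice@example.com", 2: "bob@example.com", 3: "carol@example.com"}

# Build the hash "index": value -> list of matching row ids.
hash_index = {}
for row_id, email in rows.items():
    hash_index.setdefault(email, []).append(row_id)

# Exact equality: O(1) average — hash the key, probe one bucket.
print(hash_index.get("bob@example.com"))  # [2]

# A prefix (or range) query gets no help: hashing destroys ordering,
# so the only option is to scan every row.
prefix_matches = [rid for rid, e in rows.items() if e.startswith("a")]
print(prefix_matches)  # [1]
```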
Bitmap indexes: the analytics workhorse
Bitmap indexes are the specialty tool of the indexing world. You probably won't use them in OLTP applications (your typical web app backend), but they're foundational in data warehouses and analytical systems. Understanding them is senior-level knowledge.
How it works: Instead of a tree or hash table, a bitmap index uses bit arrays. Suppose a users table has a status column with three possible values: Active, Inactive, Banned. A bitmap index on status creates three bit arrays, one per distinct value:
Active: 1 0 1 1 0 0 1 0 1 ...
Inactive: 0 1 0 0 1 0 0 0 0 ...
Banned: 0 0 0 0 0 1 0 1 0 ...
Each bit position corresponds to a row in the table. A 1 in the Active bitmap at position 3 means row 3's status is Active.
The magic comes from combining bitmaps with fast bitwise operations. A query like WHERE status = 'Active' AND region = 'North' AND gender = 'Female' becomes:
result = Active_bitmap AND North_bitmap AND Female_bitmap
Three bitwise AND operations across the three bitmaps give you exactly the rows matching all three conditions. On modern CPUs, bitwise operations on large bitmaps are extraordinarily fast — often processed 64 or 256 bits at a time via SIMD instructions.
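The mechanism fits in a few lines. This sketch uses Python integers as bitmaps (bit i set when row i matches); the column values are made up for illustration:

```python
# A tiny fact "table": (status, region, gender) per row. Illustrative data.
rows = [
    ("Active",   "North", "Female"),
    ("Inactive", "North", "Male"),
    ("Active",   "South", "Female"),
    ("Banned",   "North", "Female"),
]

def build_bitmap(rows, col, value):
    """One bitmap per distinct value: bit i is set when row i matches."""
    bm = 0
    for i, row in enumerate(rows):
        if row[col] == value:
            bm |= 1 << i
    return bm

active = build_bitmap(rows, 0, "Active")
north  = build_bitmap(rows, 1, "North")
female = build_bitmap(rows, 2, "Female")

# The three-way filter collapses into bitwise ANDs.
result = active & north & female
matches = [i for i in range(len(rows)) if result >> i & 1]
print(matches)  # only row 0 is Active AND North AND Female
```

A real bitmap index adds compression (run-length encoding) and word-at-a-time operations, but the core idea is exactly this AND.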
Strengths:
- Extremely fast for multi-column filters on low-cardinality data. This is the killer use case — queries like "active premium users in the EU region" that combine several filters each on columns with few distinct values.
- Space-efficient for low-cardinality columns. A boolean column's bitmap compresses dramatically with run-length encoding.
- Easy to combine. Bitmaps are trivially combined with AND, OR, NOT operations.
Weaknesses:
- Terrible for high-cardinality columns. If a column has a million distinct values, you'd need a million bit arrays — impractical and memory-hungry.
- Write-hostile. Updating a single row means flipping bits across potentially many bitmaps. In high-concurrency write workloads, this causes severe lock contention.
- Limited database support. Oracle and some data warehouse systems (Vertica, Greenplum) support persistent bitmap indexes. PostgreSQL doesn't store bitmap indexes on disk, but its planner builds transient bitmaps at query time (bitmap index and bitmap heap scans) to combine regular indexes. MySQL and SQL Server don't have native bitmap indexes (though they have alternative structures that behave similarly).
When to use it: Data warehouses and analytical databases with:
- Low-cardinality columns (gender, country, status, product_category, year)
- Read-heavy workloads with infrequent bulk updates
- Complex multi-column filtering (the "slice and dice" analytics pattern)
If this sounds more like an analytical query workload than a transactional one, you're right — bitmap indexes are one of the structural reasons OLAP systems exist separately from OLTP systems.
Real example: A retail company's sales fact table has 100 million rows with columns region, product_category, fiscal_quarter, customer_segment — all low-cardinality. Bitmap indexes on all four let analysts query "total Electronics sales in the North region in Q1 for Enterprise customers" by combining four bitmaps in milliseconds. The same query on B-Tree indexes would be much slower because each index would find a larger intermediate set that then has to be intersected.
R-Tree indexes: the spatial specialist
R-Trees are built for a problem B-Trees can't handle: multi-dimensional data. Think geographic coordinates, bounding boxes, 3D points — anything where "closeness" requires considering multiple dimensions simultaneously.
How it works: An R-Tree groups nearby objects into bounding rectangles (or bounding boxes in higher dimensions). These rectangles are organized hierarchically — each parent node's rectangle fully contains its children's rectangles. To answer a spatial query like "find all objects within this area," the database descends the tree, pruning entire branches whose bounding rectangles don't overlap the query area. This avoids scanning every point.
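The pruning step is the heart of it. This is a sketch of the idea only — a fixed two-level structure with made-up coordinates, not a real R-Tree (no balancing, no node splits):

```python
def overlaps(a, b):
    """Rectangles as (xmin, ymin, xmax, ymax): True if a and b intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

# Two "leaf nodes", each a bounding rectangle plus the points inside it.
nodes = [
    ((0, 0, 10, 10),   [(1, 2), (4, 5), (9, 9)]),
    ((50, 50, 60, 60), [(51, 52), (59, 58)]),
]

def search(nodes, q):
    hits = []
    for rect, points in nodes:
        if not overlaps(rect, q):
            continue  # prune the whole branch without touching its points
        for x, y in points:
            if q[0] <= x <= q[2] and q[1] <= y <= q[3]:
                hits.append((x, y))
    return hits

print(search(nodes, (3, 3, 10, 10)))  # hits come from the first node only
```

The second node's points are never examined — at scale, that pruning is the difference between checking thousands of candidates and checking millions.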
Strengths:
- Efficient spatial queries. "Find all coffee shops within 5 miles of this point," "find all routes passing through this region," "find all buildings overlapping this zoning area" — all fast.
- Handles 2D, 3D, and higher. R-Trees naturally extend to more dimensions.
- Native in GIS databases. PostGIS (PostgreSQL), MySQL with spatial extensions, MongoDB's geospatial indexes, and specialized systems like Elasticsearch's geo queries all use R-Tree-like structures.
Weaknesses:
- Specialized. R-Trees add no value for non-spatial queries — they just add overhead.
- More complex balancing. Because they index areas rather than points, R-Trees need more elaborate algorithms for inserts, deletes, and balancing.
- Not universally supported. You need to use a database with spatial extensions enabled.
When to use it: Any time the data has a spatial component and the queries involve spatial relationships. Classic uses:
- Location-based services: ride-sharing ("find nearest drivers"), mapping ("restaurants near me"), delivery apps
- Gaming: collision detection, proximity queries
- IoT and sensor networks: geo-fenced alerts
- Real estate and urban planning: "properties within this neighborhood"
In a system design interview: if the question is designing Uber or a location-based service, name R-Tree or its cousin the geohash as the spatial indexing strategy explicitly. This is a senior-level signal interviewers look for.
Other index types worth knowing
Beyond the big four, a few specialized index types come up often enough that you should at least recognize them:
GIN (Generalized Inverted Index). Designed for composite values like arrays, JSON documents, and full-text search. When a single row contains many indexable elements (e.g., an array of tags, or words in a document), GIN indexes each element separately. PostgreSQL uses GIN for full-text search and JSONB indexing.
GiST (Generalized Search Tree). A framework for building custom tree-based indexes. R-Tree is actually implemented via GiST in PostgreSQL. Also used for full-text search, geometric data, and network address (inet) indexing.
Covering indexes (also called "index-only scans"). A regular index that happens to include all columns a query needs. The database can answer the query entirely from the index without fetching the actual row. Example: if you index (user_id, created_at, status) and run SELECT status FROM orders WHERE user_id = 42 ORDER BY created_at, the query plan can be served from the index alone — a huge performance win.
Partial indexes. An index that only indexes rows matching a WHERE clause. Example: CREATE INDEX idx_active_users ON users (email) WHERE status = 'active'. The index is smaller, faster, and targets a specific query pattern.
Unique indexes. An index that enforces uniqueness on the indexed column. Every primary key has one. Essentially a B-Tree with a uniqueness constraint attached.
Clustered vs non-clustered indexes
This distinction trips up a lot of engineers, but it's central to how your database physically stores data.
Clustered index: Determines the physical order of rows on disk. There can be only one clustered index per table, because the table can only be sorted one way physically. In most SQL databases, the primary key is the clustered index by default. Rows are stored in key order, so range queries on the clustered key are extremely fast (adjacent rows live on adjacent disk pages).
Non-clustered index: A separate structure that points to rows. Doesn't affect the physical order of the table. You can have many non-clustered indexes per table. When you query via a non-clustered index, the database finds the pointer in the index, then fetches the row from the table (one extra hop).
In practice: MySQL's InnoDB engine uses the primary key as a clustered index by default. PostgreSQL's storage model is slightly different — tables aren't physically sorted by any index, though you can CLUSTER a table on an index to force sort order (a one-time rewrite that isn't maintained as new rows arrive). SQL Server supports both clustered and non-clustered indexes explicitly.
When this matters: If you have a range-query-heavy workload (e.g., "fetch orders between these dates"), making that column the clustered index dramatically improves performance because range reads become sequential disk reads.
Composite indexes and the leftmost-prefix rule
This is the index concept that trips up the most engineers — including seniors.
A composite index indexes multiple columns together as a sorted tuple. Example: CREATE INDEX idx_orders_on_user_date ON orders (user_id, created_at). The index is sorted by user_id first, then by created_at within each user_id.
The critical property: a composite index can only accelerate queries that filter on a leftmost prefix of the indexed columns.
Given the index on (user_id, created_at):
- WHERE user_id = 42 → uses the index (matches first column)
- WHERE user_id = 42 AND created_at > '2026-01-01' → uses the index (matches both, in order)
- WHERE created_at > '2026-01-01' → does NOT use the index (skips the first column)
- WHERE user_id = 42 AND status = 'shipped' → partially uses the index (first column matches, status isn't in it)
This is called the leftmost-prefix rule. The order of columns in the index definition matters enormously. If you'll query user_id alone AND (user_id, created_at) together, put user_id first. If you'll query created_at alone, create a separate index on created_at.
Getting this wrong is one of the most common sources of mysteriously slow queries. An engineer creates CREATE INDEX ON orders (created_at, user_id) and then wonders why WHERE user_id = 42 is slow — the index exists but can't be used for that query because user_id isn't the leftmost column.
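You can reproduce the rule directly against a query planner. A sketch using SQLite (illustrative schema; plan wording varies by version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT, status TEXT)""")
conn.execute("CREATE INDEX idx_user_date ON orders (user_id, created_at)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail).
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][3]

# Filter on the leading column: the composite index applies.
p1 = plan("SELECT * FROM orders WHERE user_id = 42")
print(p1)  # a SEARCH using idx_user_date

# Skip the leading column: the index can't help, so the table is scanned.
p2 = plan("SELECT * FROM orders WHERE created_at > '2026-01-01'")
print(p2)  # a full scan
```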
When to add an index — and when not to
Indexes are not free. Every index you add:
- Consumes disk space (roughly the size of the indexed column plus pointers).
- Slows down writes. Every INSERT, UPDATE, and DELETE must update every relevant index on the table.
- Consumes memory. Hot portions of indexes live in RAM for fast access, competing with other memory needs.
- Complicates the query planner's job. Too many indexes can actually degrade performance as the planner wastes time considering more options.
The decision framework:
Add an index when:
- The column is frequently used in WHERE clauses
- The column is used for JOINs
- The column is used for ORDER BY or GROUP BY
- The query's current performance is unacceptable AND the EXPLAIN plan shows a sequential scan
- Measured read load significantly exceeds write load
Don't add an index when:
- The table is small (under ~1,000 rows) — the planner will often pick a sequential scan anyway
- The table has very high write churn and queries are rare
- The column already has an index that covers the query pattern
- You're adding it "just in case" without a measured need
Senior engineer framing: every index is a trade-off between read speed and write cost. Measure both before deciding. Look at your query logs and your write patterns. Don't index preemptively; index reactively based on observed slow queries and EXPLAIN output.
How indexes affect writes: the trade-off nobody talks about
The under-discussed cost of indexes is write amplification. If your orders table has:
- A primary key (usually B-Tree clustered, 1 index)
- An index on user_id (for lookups)
- An index on created_at (for date range queries)
- An index on status (for filtering)
- A composite index on (user_id, status) (for compound filtering)
- A unique index on order_number
That's six indexes. Every single INSERT into the orders table triggers six separate index updates. Every UPDATE that modifies an indexed column updates that column's index (potentially multiple, if the update affects multiple indexed columns). Every DELETE removes entries from all six.
For write-heavy workloads, this is devastating. A table with 10 indexes can have 10x the write cost of the same table with just the primary key.
This is why high-write systems like event logs, telemetry ingestion, and write-ahead logs tend to have minimal indexing — and why read-heavy vs write-heavy workload patterns fundamentally change indexing strategy. Read-heavy systems get aggressive indexing; write-heavy systems stay lean.
Also worth noting: some databases handle write amplification better than others. Log-structured storage engines (LevelDB, RocksDB, the LSM-based backends in Cassandra and ScyllaDB) use fundamentally different index designs that prioritize write throughput over read latency — a trade-off different from the B-Tree-based approach in traditional SQL databases.
Common interview questions and follow-ups
When indexing comes up in system design or database-focused interviews, these are the follow-ups to expect:
- "What's the difference between a clustered and non-clustered index?" (Clustered = physical row order. Non-clustered = separate structure with pointers.)
- "Why not just index every column?" (Write amplification, storage, and planner confusion.)
- "Why is WHERE UPPER(name) = 'ALICE' not using the index on name?" (Because applying a function to the column prevents the planner from using the index — unless you create a functional index on UPPER(name).)
- "What's a covering index?" (An index that includes all columns a query needs, allowing index-only scans.)
- "How do you diagnose a slow query?" (Use EXPLAIN/EXPLAIN ANALYZE to see the query plan. Look for sequential scans, missing indexes, or wrong index usage.)
- "How does indexing change in a distributed/sharded database?" (Secondary indexes become expensive across shards. Most distributed databases encourage denormalization or use sharded secondary indexes with performance caveats. See the database sharding guide for the deep dive.)
- "How do indexes work in NoSQL databases?" (Varies widely — key-value stores typically index only the partition key; document stores index per-field; column-family stores like Cassandra have restricted secondary index capabilities. See NoSQL databases for system design.)
If you can handle five of these without hesitating, you've signaled solid database competence.
Putting it all together
The one-sentence version: B-Tree is the default and you should use it unless you have a specific reason; Hash is for pure equality lookups; Bitmap is for low-cardinality analytics; R-Tree is for spatial data. Indexes speed up reads and slow down writes — always a trade-off, never free.
The way seniors talk about indexing in interviews sounds different from the way juniors do. Juniors say "add an index to the column." Seniors say "the query filters on user_id and status together, so I'd add a composite index on (user_id, status) with user_id first because equality filters on user_id are more selective, and I'd accept the write-amplification cost because this table is read-dominant at a 100:1 ratio." That sentence structure — naming the query pattern, naming the index choice, naming the trade-off — is what the interviewer is listening for.
Good luck with your next interview.
Keep learning
Related posts for the database topics this guide touched:
- PostgreSQL vs MongoDB vs DynamoDB — how different databases implement indexes differently.
- Database Sharding Guide — indexing across sharded systems (a different problem).
- Scaling SQL Databases — indexing is central to SQL scaling strategy.
- NoSQL Databases for System Design — how NoSQL databases handle indexing differently.
- ACID and Database Transactions — how index updates participate in transactions.
- Read-Heavy vs Write-Heavy Workloads — the workload pattern that determines your indexing strategy.
- Caching for System Design Interviews — the index-vs-cache trade-off.
- Redis Guide for System Design — the in-memory key-value store that's fundamentally hash-indexed.
For the full system design interview roadmap, start with my complete system design interview guide.