How do you build a global secondary index across shards?
A global secondary index across shards lets you query by a non-primary field at scale without scanning every shard. You keep a dedicated index that maps a secondary attribute to the primary records, and you keep it in sync as data moves and changes. The trick is to design the write path and the query path so they stay fast and correct under growth.
Introduction
A global secondary index is a data structure that supports fast lookups, across the entire cluster, by a field that is not the primary key. In a sharded system, your base data is partitioned by a primary key. If you want to query by email, status, tag, username, or product category, you need an index that spans every shard. That is the global index. It is often stored in its own partitions and maintained by the write path or by a change data capture pipeline.
Why It Matters
This pattern shows up in search by username in social apps, product browse in retail systems, tag-based catalogs, and any workload where the read path filters by something other than the primary key. In interviews, this concept tests how you think about consistency, write amplification, backfills, resharding, and failure recovery in distributed systems. In production, it decides whether your feature feels instant or slow and flaky.
How It Works, Step by Step
1. Define the index key and value
Pick the secondary attribute you will index and the format of the postings. A common record is:
- Index key: the secondary attribute value.
- Index value: a small pointer that includes the primary key, shard id, a logical version, and maybe a small ranking score.
- Decide whether the index key is a single field or a composite such as status plus tenant plus time bucket. A minimal record sketch follows below.
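As a minimal sketch, an index record like the one described above might look like the following Python structure. The field names and example key format are illustrative assumptions, not a required schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class IndexEntry:
    """One posting in the global secondary index (illustrative field names)."""
    index_key: str                 # normalized secondary attribute, e.g. "email:alice@example.com"
    primary_key: str               # id of the base record
    shard_id: int                  # hint for where the base record currently lives
    version: int                   # monotonic version or logical timestamp for idempotent upserts
    score: Optional[float] = None  # optional ranking score for ordering results

# Example: index a user row by email
entry = IndexEntry(index_key="email:alice@example.com",
                   primary_key="user:1001", shard_id=7, version=12)
```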
2. Choose where the index lives
- Option A: keep the index in a dedicated index cluster. This lets you scale it independently.
- Option B: colocate index partitions with the base data. This reduces cross-node hops but can be harder to scale. Most teams pick a separate index store.
3. Partition the index
Hash or range partition by the index key. Hash gives an even spread. Range helps with prefix and ordering queries. For keys that are skewed, like very popular tags, add bucketing by a small hash prefix so one hot key becomes many buckets.
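A hedged sketch of the bucketing idea: spread one hot index key across a fixed number of buckets by hashing the primary key. The bucket count and key format here are assumptions you would tune per workload.

```python
import hashlib

HOT_KEY_BUCKETS = 16  # assumed bucket count; tune per hot key or per workload

def bucketed_index_key(index_key: str, primary_key: str) -> str:
    """Spread a hot index key across HOT_KEY_BUCKETS partitions."""
    digest = hashlib.sha1(primary_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % HOT_KEY_BUCKETS
    return f"{index_key}#bucket={bucket}"

# Writers append the bucket suffix; readers query all buckets for the key and merge.
print(bucketed_index_key("tag:sale", "product:98765"))  # e.g. tag:sale#bucket=3
```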
4. Normalize and compress
Lowercase user-facing strings, trim spaces, and store a normalized key. Keep index rows small. Use delta or block compression for large postings lists. Avoid storing full documents in the index.
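A small sketch of key normalization, assuming simple case-folding and whitespace rules; real systems often add locale handling and stricter Unicode policies.

```python
import unicodedata

def normalize_index_key(raw: str) -> str:
    """Normalize a user-facing string into a stable index key."""
    value = unicodedata.normalize("NFKC", raw)  # fold compatibility characters
    value = value.strip().lower()               # trim whitespace, case-fold
    return " ".join(value.split())              # collapse internal runs of spaces

print(normalize_index_key("  Alice   SMITH "))  # "alice smith"
```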
5. Plan the write path
You have two classic choices.
- Synchronous dual write to the base and the index using two-phase commit or a lightweight per-key transaction. This gives strong consistency but adds write latency and failure coupling.
- Asynchronous write using change data capture. The base write emits an event to a log. One or more indexer workers consume the log and update the index. This keeps base writes fast, but the index becomes eventually consistent. This is the common choice, sketched below.
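The asynchronous option can be sketched as follows: the base write commits first, then a change event is appended to a durable log for the indexers. The `write_base_row` and `append_to_changelog` helpers are hypothetical stand-ins for your storage and log clients, and the event shape is an assumption.

```python
import json
import time

def write_base_row(shard, row):       # hypothetical base-store client (dict stands in for a shard)
    shard[row["id"]] = row

def append_to_changelog(log, event):  # hypothetical durable log (e.g. a queue topic)
    log.append(json.dumps(event))

def handle_user_update(shard, log, row):
    """Commit to the base shard, then emit a change event for the indexers."""
    row["version"] = int(time.time() * 1000)   # logical version used later for idempotent apply
    write_base_row(shard, row)
    append_to_changelog(log, {
        "op": "upsert",
        "index_key": f"email:{row['email']}",
        "primary_key": row["id"],
        "shard_id": row["shard_id"],
        "version": row["version"],
    })

shard, log = {}, []
handle_user_update(shard, log, {"id": "user:1001", "email": "alice@example.com", "shard_id": 7})
```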
6. Make updates idempotent
Include a monotonic version or logical timestamp with each change. Indexers upsert only if the incoming version is newer than what is stored. For deletes, write a tombstone with a version. This handles retries and at-least-once delivery.
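A minimal sketch of the version check an indexer might apply, assuming an in-memory dict in place of the real index store; stale and duplicate deliveries are silently dropped.

```python
index_store = {}  # (index_key, primary_key) -> {"version": ..., "deleted": ..., "shard_id": ...}

def apply_change(event: dict) -> bool:
    """Idempotently apply an upsert or delete; return True if the index changed."""
    key = (event["index_key"], event["primary_key"])
    current = index_store.get(key)
    if current is not None and current["version"] >= event["version"]:
        return False  # stale or duplicate delivery; safe to ignore
    index_store[key] = {
        "version": event["version"],
        "deleted": event["op"] == "delete",   # tombstone instead of a hard delete
        "shard_id": event.get("shard_id"),
    }
    return True

apply_change({"op": "upsert", "index_key": "email:alice@example.com",
              "primary_key": "user:1001", "shard_id": 7, "version": 5})
apply_change({"op": "upsert", "index_key": "email:alice@example.com",
              "primary_key": "user:1001", "shard_id": 7, "version": 4})  # ignored: older version
```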
7. Build the query path
A read by secondary attribute first hits the index shard to get a set of primary keys plus shard hints. Then it batch-fetches the base records from the right shards using scatter-gather with bounded concurrency. Use paging tokens to stream large result sets. Consider caching popular index hits.
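A sketch of that two-step read path, assuming hypothetical `index_client.lookup` and per-shard `multi_get` clients; the bounded concurrency comes from a small thread pool.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def query_by_secondary(index_client, base_clients, index_key, limit=50):
    """Index lookup first, then a bounded scatter-gather batch fetch from the base shards."""
    postings = index_client.lookup(index_key, limit=limit)  # hypothetical: [(primary_key, shard_id), ...]
    by_shard = defaultdict(list)
    for primary_key, shard_id in postings:
        by_shard[shard_id].append(primary_key)

    def fetch(shard_id):
        # hypothetical base client: fetch several rows from one shard in a single round trip
        return base_clients[shard_id].multi_get(by_shard[shard_id])

    with ThreadPoolExecutor(max_workers=4) as pool:          # bounded concurrency
        batches = list(pool.map(fetch, list(by_shard)))
    return [row for batch in batches for row in batch if row is not None]
```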
8. Handle resharding and growth
The index and the base can scale independently. Keep a small directory map that tells clients where index partitions are. When you add nodes, move index partitions by range or via consistent hashing. For the base shards, remember that the index already stores shard hints, so those hints must be updated when records migrate. Keep a background job that rewrites affected index entries when you change the base shard map.
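One way to sketch the directory map and the hint-rewrite job, assuming the index entries can be scanned (here, the in-memory dict from the earlier sketch) and that the migration produces a map of moved primary keys.

```python
# Directory map: index partition id -> node address. Clients refresh this on a miss.
index_directory = {0: "idx-node-a:9000", 1: "idx-node-b:9000"}

def rewrite_shard_hints(index_store, moved_keys):
    """After a base reshard, fix the shard hints stored inside affected index entries.

    moved_keys maps primary_key -> new shard_id for the rows that migrated.
    """
    for (index_key, primary_key), entry in index_store.items():
        new_shard = moved_keys.get(primary_key)
        if new_shard is not None and entry.get("shard_id") != new_shard:
            entry["shard_id"] = new_shard   # base row moved; update the hint in place
```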
9. Backfill and bootstrap
To create a new index on existing data, take a snapshot, scan it in parallel, and write index entries in bulk. Then switch to streaming the change log from the snapshot position until you catch up. Validate with counts and random spot checks.
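A sketch of the backfill-then-catch-up sequence, assuming a snapshot object that records its log position and a change log that can be replayed from that position; all client objects here are hypothetical.

```python
def backfill_index(snapshot, changelog, apply_change, make_event):
    """Bulk-build the index from a snapshot, then replay the log from the snapshot position."""
    start_position = snapshot.position()        # log offset captured together with the snapshot
    for row in snapshot.scan():                 # in practice, parallelize this per base shard
        apply_change(make_event(row))           # same idempotent upsert path as live traffic
    for event in changelog.read_from(start_position):
        apply_change(event)                     # catch up; versions make replays safe
```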
10. Deal with failure modes
Expect duplicates, out-of-order events, and partial outages. Use at-least-once consumption with dedupe by key and version. Monitor end-to-end lag, insert rate, update rate, and error counts. Keep a repair service that can rescan a primary shard and reconcile the index on demand.
11. Design for consistency levels
If your feature needs read-your-own-writes, route those reads to an index replica that has applied the local change, or perform a fallback check against the base for the caller's record. If you can tolerate staleness, you get simpler and faster writes with the asynchronous pipeline.
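A sketch of the fallback check for read-your-own-writes, assuming the caller's own record can be fetched cheaply from the base by primary key and that a predicate can decide whether a base row still matches the query. The client objects are hypothetical.

```python
def search_with_own_writes(index_client, base_client, index_key, caller_pk, matches):
    """Serve from the index, but reconcile the caller's own record against the base.

    `matches` is a predicate: does a base row satisfy the secondary-attribute query?
    """
    hits = {pk: row for pk, row in index_client.lookup_rows(index_key)}  # hypothetical client
    caller_row = base_client.get(caller_pk)                              # authoritative read
    if caller_row is not None and matches(caller_row):
        hits[caller_pk] = caller_row      # index is lagging behind a recent local write
    else:
        hits.pop(caller_pk, None)         # record no longer matches (or was deleted)
    return list(hits.values())
```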
12. Think about cost
Indexing increases storage and write amplification. For very large postings lists, consider a top-N section plus a pointer to overflow storage. Evict rarely used entries with a retention policy when domain rules allow.
Real World Example
Imagine a social app that supports search by username. Base user profiles are sharded by user id. The product team wants type-ahead search by name. A global index maps normalized name prefixes to user id plus shard id. The write path normalizes the name, creates keys for prefixes like ali, alic, and alice, and publishes one change event per affected key. An indexer service consumes the events and upserts entries in the index cluster. The read path for search autocompletion looks up the prefix in the index partitions, returns the top candidate ids, then batch-fetches the small profile cards from the correct base shards. Stale entries are rare and are fixed lazily when a fetch misses the base record.
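As a hedged illustration of the write path in this example, prefix keys might be generated like this; the minimum and maximum prefix lengths and the key format are assumptions.

```python
def prefix_index_events(user_id: str, shard_id: int, name: str, version: int,
                        min_len: int = 2, max_len: int = 10):
    """Emit one index event per normalized name prefix for type-ahead search."""
    normalized = " ".join(name.strip().lower().split())
    for length in range(min_len, min(len(normalized), max_len) + 1):
        yield {
            "op": "upsert",
            "index_key": f"name_prefix:{normalized[:length]}",
            "primary_key": user_id,
            "shard_id": shard_id,
            "version": version,
        }

for event in prefix_index_events("user:1001", 7, "Alice", version=12):
    print(event["index_key"])   # name_prefix:al, name_prefix:ali, ..., name_prefix:alice
```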
Common Pitfalls and Trade-offs
- Hot keys for very common values like an unknown or default category. Add bucketing by hash prefix or store only a pointer to a time-sliced child list.
- Giant postings lists that do not fit in memory. Use segmented lists with compression and paging.
- Index drift from eventual consistency. Add a background verifier and read-time repair for misses.
- Coupled failures with synchronous dual writes. A blip in the index tier starts failing base writes. Prefer change data capture unless you need strict guarantees.
- Cost creep. Every new secondary field can nearly double storage for some datasets. Treat new indexes like features with clear owners and metrics.
- Painful reshard events. Keep shard hints in the index and a fast rewrite job that updates them when the base map changes.
Interview Tip
If asked for strong consistency for a critical lookup, describe a small-scope transaction that touches only one base row and the corresponding index row. For cross-shard changes, prefer a change log and idempotent upserts, then explain a read path that falls back to the base when the index looks stale for the caller. Show that you understand both options and when to pick each.
Key Takeaways
- A global secondary index maps a non-primary attribute to primary keys across all shards.
- Most systems maintain it via change data capture to keep writes fast and decoupled.
- Idempotent upserts, versioning, and tombstones make index updates safe under retries.
- The query path is an index lookup plus a batch fetch from the base shards, with paging and caching.
- Plan for backfill, resharding, failure repair, and cost from day one.
Comparison Table
| Approach | Read Latency | Write Cost | Consistency Default | When to Prefer |
|---|---|---|---|---|
| Global Secondary Index | Low for targeted lookups | Higher due to index maintenance | Eventual unless using transactions | High-selectivity queries on non-primary fields |
| Local Index per Shard | Medium (fanout to shards) | Lower than global | Consistent within a shard | Queries that mostly hit one shard |
| Scatter-Gather Query | High with many shards | Low write cost | Strong (reads base data directly) | Rare queries or small cluster sizes |
| External Search Service | Very low and flexible | High ingestion and sync cost | Eventual | Free-text search and ranking-rich queries |
FAQs
Q1. What is a global secondary index in a sharded database?
It is a cluster-wide structure that lets you look up records by a field that is not the primary key. It spans all shards and returns pointers to the right base rows.
Q2. Do I need distributed transactions to keep it correct?
Not always. Two-phase commit gives strong guarantees, but most teams use change data capture with idempotent upserts and accept eventual consistency.
Q3. How do I avoid hot keys?
Add a short hash prefix or another bucketing dimension so that one popular value fans out to many index partitions. Also cap list sizes and use paging.
Q4. How is the index rebuilt after a crash or during a new deployment?
Run a backfill from a consistent snapshot, then replay the change log until caught up. Verify with counts and random reads, and keep a repair job for drift.
Q5. What happens when I reshard the base data?
You migrate base rows and then rewrite shard hints inside index entries. A small directory map lets clients find current index partitions during the move.
Q6. Can I get read-your-own-writes semantics?
Yes. Route the read to an index replica that has applied the local change, or check the base store as a fallback for the caller's record.
Further Learning
Ready to master patterns like global indexes, change data capture, and resharding that show up in real interviews? Study the full playbook in Grokking the System Design Interview, with hands-on case studies and review notes.
If you want a deeper treatment of partitioning, consistency, and high-scale indexing, learn from Grokking Scalable Systems for Interviews, which focuses on practical architecture decisions for large-traffic systems.