How would you run online reindexing without downtime?
Online reindexing means building or rebuilding database or search indexes while the application continues to serve users. The goal is simple yet nontrivial: faster queries and better resource usage without any user-visible pause. This is a classic system design interview topic because the plan touches storage internals, replication, backfill strategy, write amplification, and safe cutover. In this guide you will learn a clean step-by-step playbook that works across relational stores, search engines, and key-value systems.
Why It Matters
Indexes speed up reads, enforce constraints, and shape query plans. As data grows or access patterns change, an index that was perfect last quarter can become a bottleneck. Reindexing restores performance, but naive rebuilds can lock tables or starve the cluster. In a production-scale architecture you must avoid brownouts, replication stalls, cache storms, and unexpected plan flips. Interviewers look for your ability to reason about throughput, isolation, and rollback while keeping the system available.
How It Works, Step by Step
Below is a vendor-neutral runbook that maps well to engines like Postgres with concurrent index builds, MySQL with online DDL, SQL Server with the online index option, Elasticsearch with alias swaps, and NoSQL platforms that support background index population.
Step 1: Baseline and guardrails
- Capture current query latency, error rate, lock waits, replication lag, and storage headroom.
- Set resource limits for the backfill. Throttle reads, cap concurrency, and set IO priorities to avoid starving user traffic.
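To make the guardrails concrete, they can live in a small config that the backfill job checks before every batch. A minimal sketch in Python; the metric names and thresholds are hypothetical stand-ins for values derived from your own baseline:

```python
# Hypothetical guardrail thresholds; derive real values from your baseline.
GUARDRAILS = {
    "max_p99_read_ms": 120,       # pause the backfill if reads degrade past this
    "max_replication_lag_s": 5,   # pause if replicas fall behind
    "min_free_disk_pct": 20,      # abort if disk headroom drops below this
}

def within_guardrails(metrics: dict) -> bool:
    """Return True if the backfill may proceed for another batch."""
    return (
        metrics["p99_read_ms"] <= GUARDRAILS["max_p99_read_ms"]
        and metrics["replication_lag_s"] <= GUARDRAILS["max_replication_lag_s"]
        and metrics["free_disk_pct"] >= GUARDRAILS["min_free_disk_pct"]
    )
```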
Step 2: Create a shadow index
- Use a background or concurrent build if the engine supports it. The system scans existing rows and populates the new index.
- If the engine lacks native support, create the target index on a separate replica or shadow cluster, then catch up using a change stream.
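For Postgres, the native route is a single statement. A minimal sketch using psycopg2; the connection string, table, and index names are illustrative:

```python
import psycopg2

# CREATE INDEX CONCURRENTLY builds the index in the background without a
# long write-blocking lock. It cannot run inside a transaction block, so
# autocommit must be on. Connection details and names are illustrative.
conn = psycopg2.connect("dbname=app host=primary.db.internal")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_feed_items_user_created "
        "ON feed_items (user_id, created_at DESC)"
    )
```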
Step 3: Backfill existing data
- Run a range-based backfill to avoid hot spots. For relational tables use primary key ranges or time windows. For search engines use a scroll-style read or a partitioned export.
- Match the batch size to your storage and cache behavior. Batches that are too large contend with user traffic and inflate tail latency; batches that are too small stretch the cutover window.
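A minimal sketch of a throttled, resumable backfill driven by primary-key ranges; the table name, batch size, sleep interval, and process_batch sink are all assumptions to adapt:

```python
import time

BATCH_SIZE = 5_000           # hypothetical; tune to storage and cache behavior
SLEEP_BETWEEN_BATCHES = 0.2  # seconds; raise this if lag or tail latency drifts

def backfill_by_pk_range(conn, process_batch, last_id=0):
    """Walk the table in primary-key order, one bounded batch at a time.

    process_batch is a placeholder for whatever writes rows into the
    shadow index. last_id is a checkpoint, so a crashed job can resume
    instead of rescanning from the start.
    """
    while True:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, user_id, created_at FROM feed_items "
                "WHERE id > %s ORDER BY id LIMIT %s",
                (last_id, BATCH_SIZE),
            )
            rows = cur.fetchall()
        if not rows:
            break                              # reached the end of the table
        process_batch(rows)                    # write this range to the shadow index
        last_id = rows[-1][0]                  # advance the checkpoint
        time.sleep(SLEEP_BETWEEN_BATCHES)      # throttle to protect user traffic
```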
Step 4: Stream live changes
- While the backfill runs, new writes keep arriving. Keep the shadow index in sync using CDC from the write-ahead log, built-in triggers, or dual writes at the application layer.
- Make the apply path idempotent so each change takes effect exactly once, even when the stream reconnects and replays events.
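A common way to get that idempotency is to version each entry by its log position and skip anything stale. A sketch; the index_writer interface and the event shape are hypothetical:

```python
def apply_change(index_writer, event: dict) -> None:
    """Apply one CDC event to the shadow index so that replays are no-ops."""
    # Skip events at or behind the version already stored for this key,
    # so a reconnect that replays part of the stream changes nothing.
    current = index_writer.get_version(event["pk"])   # hypothetical helper
    if current is not None and current >= event["log_position"]:
        return
    if event["op"] == "delete":
        index_writer.delete(event["pk"], version=event["log_position"])
    else:
        # Inserts and updates collapse into a single upsert.
        index_writer.upsert(event["pk"], event["row"], version=event["log_position"])
```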
Step 5: Validate the shadow index
- Run parity checks. Sample keys and compare row counts per range. For search engines, compare the top N results for popular queries.
- Use a query mirror to execute read-only requests against both paths, then diff results and latency. Alert on divergence.
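A sampled parity check can be as simple as the sketch below; the old_reader and new_reader lookups are hypothetical wrappers around the two read paths:

```python
import random

def parity_check(old_reader, new_reader, keys, sample_size=1000):
    """Compare a random sample of keys across both paths; return mismatches."""
    sample = random.sample(keys, min(sample_size, len(keys)))
    return [k for k in sample if old_reader.lookup(k) != new_reader.lookup(k)]
```

Run it repeatedly during the catch-up phase; the mismatch list should shrink toward empty as the change stream closes the gap.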
Step 6: Warm the read path
- Canary a small percentage of read traffic to the new index. Watch p50 and p99 latency, cache hit rate, and CPU.
- If the canary is healthy, raise the fraction progressively until most reads prefer the new index.
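Hash-based routing keeps each caller pinned to one path, which makes before-and-after latency comparisons clean. A sketch; the five percent starting fraction and the use of a request or user id as the routing key are assumptions:

```python
import hashlib

CANARY_FRACTION = 0.05  # start small; raise as the canary stays healthy

def use_new_index(routing_key: str) -> bool:
    """Deterministically send a stable slice of traffic to the new index."""
    h = int(hashlib.sha256(routing_key.encode()).hexdigest(), 16)
    return (h % 10_000) / 10_000 < CANARY_FRACTION
```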
Step 7: Cut over safely
- For relational engines, the planner usually selects the best index automatically. If the change replaces an existing index, perform an atomic rename or constraint swap where supported.
- For search engines, use an alias swap to point traffic to the new index in one atomic step.
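For Elasticsearch-style engines, one _aliases request that removes the old binding and adds the new one is applied atomically, so readers never observe an in-between state. A sketch with illustrative host and index names:

```python
import requests

# Swap the "feed" alias from the old index to the new one in a single
# atomic _aliases call. Rolling back is the same request with the two
# index names exchanged.
resp = requests.post(
    "http://es.internal:9200/_aliases",
    json={
        "actions": [
            {"remove": {"index": "feed_v1", "alias": "feed"}},
            {"add": {"index": "feed_v2", "alias": "feed"}},
        ]
    },
)
resp.raise_for_status()
```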
Step 8: Rollback plan
- If metrics degrade, switch the alias back or drop the new index and revert to the previous plan. Rollback must be a single command or a fast toggle.
Step 9: Clean up and observe
- Remove old indexes, free disk, and reset throttles.
- Keep a watch on replication lag and storage compaction for a full day. Large backfills can defer compaction and surprise you later.
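In Postgres, for example, the retired index can be removed without blocking writers; the index name below is illustrative, and conn is the autocommit connection from the earlier sketch:

```python
# DROP INDEX CONCURRENTLY, like the concurrent build, cannot run inside
# a transaction block, so it needs the same autocommit connection.
with conn.cursor() as cur:
    cur.execute("DROP INDEX CONCURRENTLY IF EXISTS idx_feed_items_user_legacy")
```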
Real World Example
A video feed service notices slow lookups on the feed_items table as the dataset crosses hundreds of millions of rows. The team needs a composite index on (user_id, created_at). The plan is:
- Build a concurrent composite index in the primary without blocking user queries.
- Throttle the scan rate and sleep between batches to keep p99 latency stable.
- While the build runs, a CDC pipeline tails the write-ahead log to apply late-arriving changes into the new index structure.
- Run a read mirror where five percent of feed reads are executed through both paths and compared.
- Once parity is proven, allow the planner to adopt the new index and drop the legacy index.
- In parallel, a search index for the same feed is rebuilt in a new collection and traffic is switched using an alias. The alias swap is instant and reversible, which gives a safe rollback.
This pattern improves read latency without user downtime and lets the team roll back with one command if needed.
Common Pitfalls and Trade-offs
- Metadata locks and brief stalls: Even with online options, some engines take short metadata locks. Schedule the start for off-peak hours and keep transactions short to reduce risk.
- Replication lag: Heavy backfill writes can saturate the binlog or logical stream. Add headroom to replica IO, and slow the backfill if lag grows.
- Cold cache after cutover: A new index starts with a cold buffer cache. Use canaries and ramp up slowly so caches warm without a surge of page faults.
- Plan regressions: A new index can change join order or cardinality estimates and make some queries slower. Keep an allow list of queries to inspect, and add targeted hints only if the planner gets confused.
- Storage pressure: During the transition you temporarily hold two indexes. Confirm you have enough disk for the worst-case footprint.
- Dual-write complexity: Application-level dual writes add failure modes. Prefer CDC or engine-maintained background builds when possible.
Interview Tip
If asked how to reindex online, present a crisp plan with observability and rollback. Mention change streams for live catch-up, parity checks before cutover, and an alias swap or atomic rename for instant rollback. A strong follow-up is to explain how you throttle the backfill to keep user latency flat and how you watch replication lag to avoid a failover surprise.
Key Takeaways
- Online reindexing is a controlled background build with a safe and instant cutover.
- Use range based backfill, CDC for live changes, and parity checks before enabling reads.
- Always have a one step rollback using alias swap or index rename.
- Throttle and monitor to keep latency and replication lag within budget.
- Clean up promptly to reclaim disk and avoid long compaction debt.
Table of Comparison
Below is a quick comparison of common strategies for zero-downtime indexing.
| Strategy | Downtime Risk | Data Drift Risk | Operational Effort | Best Fit |
|---|---|---|---|---|
| Concurrent build in primary | Low | Low | Low to Medium | Relational engines with online index support |
| Reindex in read replica then promote | Low | Medium | Medium | Systems without true online DDL |
| Blue-green cluster with CDC sync | Very Low | Low (if CDC robust) | High | Large estates with strict SLO and change windows |
| Search index rebuild with alias swap | Very Low | Low | Medium | Elasticsearch or similar engines with aliases |
| Partition-by-partition rebuild | Low | Low | Medium | Very large tables with time-based partitions |
FAQs
Q1. What is the safest way to reindex a huge table without downtime?
Create the new index in the primary using a concurrent or online option, throttle the backfill, stream live changes with CDC, validate with a query mirror, then cut over with an atomic rename or let the planner adopt it.
Q2. How do I throttle a backfill so user latency stays flat?
Process small batches, add sleeps between batches, and cap parallel workers. Watch p95 and p99 read latency and replication lag. Lower the batch size if either metric drifts.
Q3. What if my engine does not support online index builds?
Use a read replica or a blue-green cluster. Build the index offline on the target while you stream changes, then promote it or swap traffic in a single step.
Q4. How do I verify correctness before cutover?
Use parity checks on sampled keys, row count by range, and a read mirror that compares results and latency between old and new paths. Only proceed when divergence is zero or within a tiny known bound.
Q5. What if the new index changes the query plan and some reports get slower?
Canary the read path first, measure query specific latency, and roll back quickly if needed. Tune the index definition or use targeted hints for the few queries that regress.
Q6. How do I handle storage pressure during the transition?
Project the peak footprint with both indexes present. If space is tight, rebuild partition by partition, or use a replica or blue-green approach that keeps the extra footprint off the primary.
Further Learning
Build confidence with a structured approach to capacity, data models, and safe rollouts in the Grokking System Design Fundamentals course by DesignGurus. It covers indexing, query planning, and migration basics with clear walkthroughs: Grokking System Design Fundamentals.
If you want a full runbook for safe changes in large-scale services, including CDC patterns, traffic ramps, and rollback design, enroll in Grokking Scalable Systems for Interviews, which includes hands-on modules on backfill, reindexing, and zero-downtime cutovers: Grokking Scalable Systems for Interviews.
For interview practice that ties it all together, explore real prompts and model answers in Grokking the System Design Interview with practical case studies that feature indexing and search scale decisions: Grokking the System Design Interview.