How do you run safe schema migrations on hot paths?

Safe schema migrations on hot paths mean changing database structure without breaking the most latency-sensitive requests: login, checkout, feed load, or search. You keep the system available, keep reads and writes correct, and keep tail latency stable while the shape of the data evolves. The core idea is simple: make only additive changes first, run careful backfills, do dual reads or dual writes behind a flag, then cut over and finally remove the old shape. Done right, users never notice and on-call sleeps well.

Why It Matters

Hot paths are the money paths, where a few milliseconds can change conversion and engagement. A careless ALTER can lock a table, trigger a full rewrite, or spike replication lag, which pushes p99 through the roof. Rolling deploys across many services mean readers and writers with mixed versions will hit the database at the same time, which demands forward and backward compatibility for a window of time. Interviewers want to see that you can design a migration plan that preserves availability, consistency, and performance across a distributed system and a scalable architecture.

How It Works (step by step)

1) Map the hot path and the data contracts. Identify the exact queries, indexes, and cache keys used by the hot path. List each service that reads or writes the affected tables. Capture the current contract: column types, nullability, defaults, and derived values. Pin down which changes are destructive: drops, renames, type changes, and unique constraints.

2) Choose the grow and shrink pattern. Adopt an expand stage, then a migrate stage, then a contract stage. In the expand stage you only add: new table, new column, new index, new view, new feature flag. The contract stage removes old columns or tables only after cutover is complete.

3) Expand safely with additive changes. Add the new column or table as nullable and without a default so the database avoids a full table rewrite. Create needed indexes with online or concurrent options where your engine supports them: for Postgres use CREATE INDEX CONCURRENTLY on large tables; for the MySQL family use online DDL options that avoid a blocking lock. If you need a new not null column, first create it as nullable, backfill values, validate, then add the not null constraint later.
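
A minimal sketch of the expand stage, assuming Postgres with psycopg2; the shop database, orders table, payment_context column, and index name are all illustrative, not from a real schema.

```python
# Expand stage sketch (Postgres, psycopg2). All names are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=shop")
conn.autocommit = True  # CREATE INDEX CONCURRENTLY refuses to run inside a transaction
cur = conn.cursor()

# Additive, metadata-only change: nullable with no default, so no table rewrite.
cur.execute("ALTER TABLE orders ADD COLUMN payment_context text")

# Concurrent build avoids the write-blocking lock of a plain CREATE INDEX.
cur.execute(
    "CREATE INDEX CONCURRENTLY idx_orders_payment_context "
    "ON orders (payment_context)"
)

# Only after backfill and validation, in a later deploy:
# ALTER TABLE orders ALTER COLUMN payment_context SET NOT NULL
```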

4) Introduce dual write behind a flag. Update the write path so each request writes both the old and the new shape. Guard with a feature flag that can be rolled back instantly. Make the write logic idempotent: repeated writes must not corrupt data. Log mismatches and increment a metric when the two shapes diverge.
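
A minimal dual-write sketch in Python; the in-memory dicts stand in for the old and new tables, and the flag and metric names are assumptions, not any real library.

```python
# Dual write behind a flag. Dicts stand in for real stores; names are illustrative.
FLAGS = {"orders_dual_write": True}   # flipped by your flag service in production
METRICS: dict[str, int] = {}

old_store: dict[str, dict] = {}       # current source of truth
new_store: dict[str, dict] = {}       # new shape being introduced

def to_new_shape(payload: dict) -> dict:
    # Derive the new field from existing data; illustrative logic only.
    return {**payload, "payment_context": payload.get("payment_method", "unknown")}

def write_order(order_id: str, payload: dict) -> None:
    old_store[order_id] = payload     # old shape keeps serving reads

    if FLAGS.get("orders_dual_write"):
        try:
            # Keyed upsert: a retry overwrites the same row, so repeats are safe.
            new_store[order_id] = to_new_shape(payload)
        except Exception:
            # Never fail the hot path on the shadow write; count the divergence.
            METRICS["orders_dual_write_errors"] = METRICS.get("orders_dual_write_errors", 0) + 1
```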

5) Run a backfill safely. Choose batch or streaming; a batch sketch follows the list below.

  • Batch: process in small chunks ordered by primary key with a low throttle and a sleep between batches. Keep transactions short to avoid long lock queues.
  • Streaming: subscribe to change data capture and transform on the fly. This keeps the new shape close to real time. Monitor rows processed per second, replication lag, deadlock count, and cache hit rate. Backfills are long-running jobs, so build resume and progress checkpoints.
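
Here is a batch backfill sketch, assuming Postgres with psycopg2: keyset pagination on the primary key, one short transaction per chunk, a throttle sleep, and a file-based checkpoint so the job can resume. Table and column names are illustrative.

```python
# Batch backfill sketch (Postgres, psycopg2). Names are illustrative.
import time
import psycopg2

BATCH = 1000          # keep chunks small so locks stay short
SLEEP_SECONDS = 0.2   # throttle to protect replicas and caches

def load_checkpoint() -> int:
    try:
        with open("backfill.ckpt") as f:
            return int(f.read())
    except FileNotFoundError:
        return 0

def save_checkpoint(last_id: int) -> None:
    with open("backfill.ckpt", "w") as f:
        f.write(str(last_id))

conn = psycopg2.connect("dbname=shop")
last_id = load_checkpoint()
while True:
    with conn, conn.cursor() as cur:  # one short transaction per chunk
        cur.execute(
            """UPDATE orders
               SET payment_context = payment_method
               WHERE id IN (SELECT id FROM orders
                            WHERE id > %s AND payment_context IS NULL
                            ORDER BY id LIMIT %s)
               RETURNING id""",
            (last_id, BATCH),
        )
        ids = [row[0] for row in cur.fetchall()]
    if not ids:
        break                      # no unprocessed rows left
    last_id = max(ids)
    save_checkpoint(last_id)       # resume point if the job dies
    time.sleep(SLEEP_SECONDS)
```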

6) Dual read with canary and parity checks. Let the read path prefer the old shape while you compute the new one in parallel for a small percentage of requests. Compare results and publish a parity metric. Only when parity is stable do you raise the canary percentage.
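
A dual-read parity sketch: the old shape still serves every user while a small slice of requests also reads the new shape and records agreement. The in-memory dicts and counter are illustrative stand-ins.

```python
# Dual read with a canary parity check. Names and stores are illustrative.
import random

CANARY_PERCENT = 1.0                 # raise only after parity holds
PARITY = {"match": 0, "mismatch": 0}

old_rows = {"o1": {"total": 100}}    # stand-ins for the old and new shapes
new_rows = {"o1": {"total": 100}}

def read_order_summary(order_id: str) -> dict:
    result = old_rows[order_id]      # the old shape still serves the user

    if random.uniform(0, 100) < CANARY_PERCENT:
        shadow = new_rows.get(order_id)          # shadow read of the new shape
        PARITY["match" if shadow == result else "mismatch"] += 1
        # Log order_id on mismatch so divergent rows can be repaired.
    return result
```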

7) Cut over gradually. Flip reads to the new shape for one service and one region. Watch p95 and p99 latency, error rate, and cache hit rate. If healthy, ramp to more services or a larger percentage. Keep dual writes on for a while to allow fast rollback.
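
One way to ramp, sketched below: a deterministic hash bucket per user, keyed by service and region, so a given user consistently sees one shape while the rollout percentage grows. The rollout table and names are illustrative assumptions.

```python
# Gradual cutover sketch: per-service, per-region percentage rollout with
# deterministic user bucketing. All names are illustrative.
import hashlib

ROLLOUT = {("checkout", "eu-west-1"): 10,     # 10 percent canary
           ("checkout", "us-east-1"): 0}      # not yet started

def use_new_shape(service: str, region: str, user_id: str) -> bool:
    percent = ROLLOUT.get((service, region), 0)
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent

# use_new_shape("checkout", "eu-west-1", "user42") is stable across requests,
# so roughly 10 percent of users read the new shape, and always the same ones.
```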

8) Contract and clean up. Turn off dual write, remove read fallbacks, and drop old indexes and old columns. Ship the drops in a separate deploy from application changes to isolate risk. Update documentation and data lineage.
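
The contract stage can then be a tiny, separate deploy, sketched here for Postgres with psycopg2; the index and column names are illustrative.

```python
# Contract stage sketch (Postgres, psycopg2), shipped as its own deploy
# after the soak period. Names are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=shop")
conn.autocommit = True  # DROP INDEX CONCURRENTLY also refuses a transaction block
cur = conn.cursor()

cur.execute("DROP INDEX CONCURRENTLY IF EXISTS idx_orders_legacy")   # non-blocking drop
cur.execute("ALTER TABLE orders DROP COLUMN payment_method_legacy")  # fast metadata change
```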

9) Safeguards and guardrails

  • Never rename or drop in a single step on a hot path
  • Add constraints after data is clean and validated
  • Prefer application defaults to database defaults during backfill windows
  • Seal with an automated runbook: flags to flip, metrics to watch, and exact revert steps

Real World Example

Imagine a checkout service at an ecommerce giant that must add a new payment attribute to the Orders table and use it on the critical read of the order summary. The plan is simple. In the expand stage, add a nullable column payment_context and a secondary index created with a non-blocking method. Writers start dual writing behind a flag. A batch backfill reads historical orders in primary key ranges and computes payment_context. The read path calculates the context in memory for a canary slice and compares it with the value sitting in the new column. After parity holds for one day, the team flips reads to the new column for one region, watches p99 and error budgets, then ramps globally. After a week, dual writes are disabled and the legacy code path is removed. The old derived logic is deleted in a final cleanup deploy.

The same pattern fits Instagram feed ranking as well: a new ranking input field is added and backfilled while the request handler continues to read the old input. Once parity and latency look good in shadow mode, the handler reads the new field by default and the old field is removed later.

Common Pitfalls

  • Adding a not null column with a default on a large table. On some engines this rewrites every row and can hold a lock for a long time. Add nullable first, backfill, then add the constraint.

  • Creating a non-concurrent index on a large table. That can block writes or reads and will blow up p99. Use a concurrent or online option and test on a copy with realistic data volume.

  • Dropping or renaming too early. Readers and writers roll out at different times. Keep the old shape until all services are upgraded and the cutover is stable.

  • Backfill that floods replicas or caches. An aggressive batch size can create replication lag or evict hot keys from caches. Throttle, sleep between chunks, and watch lag.

  • Dual write without idempotency. Retries or partial failures produce divergent states. Design upserts and use idempotency keys (see the upsert sketch after this list).

  • Forgetting cache and search. If caches or search indexes reflect the old shape, your reads will look inconsistent. Plan cache invalidation and search reindex steps.

  • Missing rollback plan. If the new read path fails under load, you need instant revert. Keep a flag to fall back to the old shape and keep dual writes for a grace period.
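
A minimal upsert sketch for the idempotency point above, assuming Postgres with a unique key on order_id; the table and values are illustrative.

```python
# Idempotent write via upsert (Postgres, psycopg2). Retries with the same key
# converge on one row instead of diverging. Names are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=shop")
with conn, conn.cursor() as cur:
    cur.execute(
        """INSERT INTO orders_v2 (order_id, payment_context)
           VALUES (%s, %s)
           ON CONFLICT (order_id) DO UPDATE   -- needs a unique index on order_id
           SET payment_context = EXCLUDED.payment_context""",
        ("order-123", "card_on_file"),
    )
```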

Interview Tip

A favorite prompt is: you need to move from a text user id to a numeric user id in a table that sits in the login hot path. Describe a plan with zero downtime. Start with grow and shrink: add the numeric column as nullable, build an index concurrently, enable dual write, backfill in batches with idempotent jobs, dual read with parity metrics, canary flip reads by a small percent, monitor p99 and error rate, then remove the old column later. Call out database and cache details, and finish with a clear rollback plan.

Key Takeaways

  • Make additive changes first, destructive changes last

  • Use dual write and dual read to prove parity before a cutover

  • Backfill with throttle, checkpoints, and strong observability

  • Build instant rollback with feature flags and keep dual write on during soak

  • Coordinate caches, search, and downstream consumers as part of the plan

Table of Comparison

  • Grow and shrink with dual write and dual read. Risk to hot path: very low when done with flags and concurrent indexes. Compatibility window: long, works across mixed versions. Complexity: medium. Best for: most schema changes on critical paths.
  • Blue green database cluster with sync. Risk to hot path: low if sync is reliable and the switch is gradual. Compatibility window: short, but needs full data sync. Complexity: high. Best for: large version upgrades or an engine upgrade.
  • Change data capture with streaming backfill. Risk to hot path: low if consumers are idempotent. Compatibility window: long, with near real time updates. Complexity: high. Best for: wide table splits or materialized view builds.
  • Big bang single step migration. Risk to hot path: very high due to locks and rewrites. Compatibility window: none. Complexity: low to implement, high blast radius. Best for: small tables outside any hot path.
  • Compatibility layer using views or read mappers. Risk to hot path: low. Compatibility window: medium to long depending on cleanup speed. Complexity: medium. Best for: renames and type changes that must hide behind an abstraction.

FAQs

Q1. What is a hot path in a database migration?

A hot path is a user-facing request path that must stay fast and correct at all times, like login, checkout, feed load, or search. Any lock or rewrite here hurts p99 and availability. In a system design interview you must prioritize these paths.

Q2. How do I rename a column with zero downtime?

Do not rename directly. Add a new column, dual write to both columns, backfill, dual read with parity checks, cut over reads to the new column, then drop the old column in a final cleanup.

Q3. How do I add a not null column safely?

Create it as nullable first, backfill in batches, validate that all rows are populated, then add the not null constraint in a separate step.
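
On Postgres, one low-lock way to do the final step is a CHECK constraint added NOT VALID and then validated, since validation scans the table without blocking writes; on Postgres 12 and later, SET NOT NULL can then skip its own full scan because the valid constraint already proves it. A sketch, with illustrative names:

```python
# Low-lock NOT NULL sequence (Postgres, psycopg2). Names are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=shop")
conn.autocommit = True
cur = conn.cursor()

# Earlier steps: ADD COLUMN payment_context text (nullable, no default),
# then a batched backfill that populated every row.

# Cheap to add: NOT VALID means nothing is checked yet, so the lock is brief.
cur.execute(
    "ALTER TABLE orders ADD CONSTRAINT payment_context_not_null "
    "CHECK (payment_context IS NOT NULL) NOT VALID"
)

# Full scan, but without blocking concurrent reads and writes.
cur.execute("ALTER TABLE orders VALIDATE CONSTRAINT payment_context_not_null")
```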

Q4. What metrics should I watch during cutover?

Track p95 and p99 latency, error rate, replication lag, parity rate between old and new reads, and cache hit rate. Set alerts tied to your error budget.

Q5. Should I use batch backfill or streaming change data capture?

Batch is simple and fine for medium tables with flexible timelines. Streaming CDC keeps the new shape near real time and suits very large tables or when you want continuous freshness.

Q6. How do I roll back if the new read path misbehaves?

Keep a feature flag that reverts reads to the old shape and continue dual writes so no data is lost. Investigate with logs and parity metrics, then try again.
