01 · Why Sharding Is Unavoidable, and Why It's Brutal
At some point, your data exceeds what a single machine can hold. Or your write throughput exceeds what a single primary can sustain. Or your read load exceeds what replicas can absorb. When that happens, you shard: split the data across multiple machines, each holding a piece.
This is unavoidable at scale. It is also brutal. Sharding solves the resource ceiling but introduces a long list of new problems: hot keys, cross-shard queries, resharding pain, transactional limits, operational complexity. The reason sharding shows up in every senior interview is that the hard parts are subtle, and candidates who haven't operated a sharded system stumble in predictable ways when probed.
This page covers the decisions you'll make and the failure modes you'll encounter. The shard key is the center of gravity. Most other sharding decisions are downstream of choosing it well. Get the shard key right and the rest is tractable. Get it wrong and you'll spend years recovering.
The Senior Move
The senior signal in sharding interviews is not knowing every strategy. It is knowing when not to shard. A single well-tuned Postgres instance with read replicas can handle workloads that the candidate's intuition says require sharding. Defaulting to "we'd shard" before the requirements demand it is a junior pattern. Naming the threshold at which you'd shard is the senior pattern.
02 · What Sharding Actually Does (and Doesn't)
Sharding does three things, and only three things:
- Distributes storage. Each shard holds a fraction of the total data. If your dataset exceeds a single machine's disk, sharding is the fix.
- Distributes write load. Each shard handles writes for its portion of the data. If a single primary can't keep up with write throughput, sharding spreads it.
- Enables horizontal scaling. When you need more capacity, you add more shards. The resource ceiling moves from "scale up the machine" to "add more machines."
Notice what's not on this list. Sharding does not, by itself:
- Improve read latency. A single read still goes to one shard, with the same disk and CPU costs. If anything, sharding adds the small overhead of routing the request.
- Fix availability. A shard that goes down still loses its share of the data. Replication fixes availability; sharding does not.
- Solve data-skew problems. If your data is naturally skewed (some users have 1000x more activity than others), sharding may concentrate the skew rather than spread it. The hot-key problem we cover in Section 5 is exactly this.
- Replace good database choice. Sharding a Postgres instance gives you sharded Postgres with all of Postgres's per-shard limits. If the underlying database is wrong for the workload, sharding doesn't fix it.
The interview move when sharding comes up is to name what it does and what it doesn't. "Sharding solves storage and write throughput. We still need replication for availability and we still need a good shard key to avoid concentration." That sentence positions you as someone who has thought about the boundaries of the technique.
04 · The Four Sharding Strategies
Once you have a shard key, you need a strategy to map shard key values to specific shards. Four common strategies, each with tradeoffs.
Four Ways to Map Keys to Shards
The four sharding strategies. Hash and range are the most common; geographic appears when data sovereignty matters; directory is the most flexible but has operational tradeoffs.
Hash-Based
Default for most systems
Apply a hash function to the shard key, take the modulo of the shard count, and that's the destination shard. shard = hash(user_id) % N.
Most modern distributed databases default to this. Cassandra, DynamoDB, and Redis Cluster all use hash-based sharding by default. The hash distributes keys evenly regardless of input distribution.
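A minimal sketch of the routing math, assuming an MD5-derived stable hash (names and the shard count are illustrative):

```python
import hashlib

NUM_SHARDS = 4  # N: total shard count (illustrative)

def shard_for(user_id: str) -> int:
    """Route a key to a shard with naive hash-mod-N."""
    # Use a stable hash: Python's built-in hash() is salted per process,
    # so it would route the same key differently across restarts.
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

assert shard_for("user_42") == shard_for("user_42")  # deterministic routing
```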
Tradeoff: Distribution is maximally even. Range queries on the original key become impossible (the hash randomizes the order). Resharding is painful with naive hash-mod-N because adding a shard remaps almost everything; consistent hashing addresses this.
Range-Based
When range queries matter
Each shard holds a contiguous range of shard key values. Shard A holds user_id 0–999999, shard B holds user_id 1000000–1999999, and so on.
Common in HBase, BigTable, and any system where you need to do range scans efficiently. A query for "users created last week" can hit just one or two shards if the shard key is timestamp-ordered.
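A sketch of range-based routing, assuming the boundaries from the example above are kept as a sorted list of exclusive upper bounds:

```python
import bisect

# Exclusive upper bound of each shard's range, in order (illustrative).
RANGE_BOUNDS = [1_000_000, 2_000_000, 3_000_000, 4_000_000]

def shard_for(user_id: int) -> int:
    """Find the shard whose contiguous range contains user_id."""
    idx = bisect.bisect_right(RANGE_BOUNDS, user_id)
    if idx == len(RANGE_BOUNDS):
        raise KeyError(f"{user_id} is beyond the last shard's range")
    return idx

assert shard_for(999_999) == 0      # shard A: 0-999999
assert shard_for(1_500_000) == 1    # shard B: 1000000-1999999
```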
Tradeoff: Range queries are fast. Distribution is sensitive to skew: if your shard key is timestamp and most activity is recent, the most recent shard becomes a hot spot. Range-based sharding requires care to keep shards balanced.
Geographic
When locality or sovereignty matters
Shard by geographic region. EU users live on the EU shard, US users on the US shard. Each shard is in its corresponding region for low-latency access and compliance with data residency requirements.
Common in products subject to GDPR or other regional data laws. Also common when latency to local users matters more than global scale (gaming, real-time collaboration).
Tradeoff: Latency and compliance benefits are real. Distribution is almost always uneven (one region usually dominates traffic). Cross-region operations are expensive. Often combined with hash-based sharding within each region.
Directory-Based
When you need maximum flexibility
A separate lookup service holds the mapping from shard key value to shard. Queries first hit the directory to find the right shard, then route to it.
Used when shard placement needs to be flexible: tenant migrations between shards, custom placement rules, or per-key shard affinity. The directory itself becomes a system component to operate.
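A sketch of the lookup path, with an in-memory stand-in for the directory (in production this is a replicated store, and the cache must be invalidated when a tenant migrates; all names are illustrative):

```python
from functools import lru_cache

# Stand-in for the directory service; a real one lives in a replicated
# store (a small database table, etcd, or similar).
DIRECTORY = {"tenant_big": "shard-7", "tenant_small_1": "shard-0"}
DEFAULT_SHARD = "shard-0"  # placement rule for unmapped tenants (assumption)

@lru_cache(maxsize=100_000)  # cache aggressively to keep the extra hop cheap
def shard_for(tenant_id: str) -> str:
    # Migrating a tenant means updating DIRECTORY and then invalidating
    # this cache (shard_for.cache_clear()).
    return DIRECTORY.get(tenant_id, DEFAULT_SHARD)
```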
Tradeoff: Maximum flexibility. The directory is a single point of failure that needs its own replication and caching. Adds an extra network hop on every operation. Operational overhead is real.
Consistent hashing: the resharding-aware variant
The naive hash-mod-N strategy has a brutal property: when you change N (add or remove a shard), almost every key gets remapped. If you go from 4 shards to 5, roughly 80% of your data needs to move. This makes resharding effectively impossible at scale.
Consistent hashing solves this. Instead of hash % N, you place shards on a hash ring (positions 0 to 2^32). Each key hashes to a position on the ring and lives on the next shard clockwise. When you add a shard, only the keys whose ring position falls between the new shard and its predecessor need to move. Roughly 1/N of the data, not (N-1)/N.
Consistent hashing shows up in almost every distributed system at scale: Cassandra, DynamoDB, Redis Cluster, content delivery networks, distributed caches. If sharding is the topic and you don't mention consistent hashing during the deep-dive, you've left a depth signal on the table.
Virtual nodes (the practical refinement)
Pure consistent hashing has a problem: with a small number of physical shards, the ring positions can produce uneven distribution by chance. Virtual nodes fix this. Each physical shard owns hundreds or thousands of virtual nodes scattered around the ring. The averaging effect produces near-uniform distribution and lets you weight shards differently (a stronger machine can own more virtual nodes).
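A sketch of a ring with virtual nodes, assuming 64-bit positions derived from MD5 (any stable hash works; the class and names are illustrative). The demo at the end shows the ~1/N movement property from the previous section:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """64-bit ring position from a stable hash."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class ConsistentHashRing:
    def __init__(self, shards: list[str], vnodes: int = 128):
        # Each physical shard owns many virtual nodes scattered on the ring.
        self._ring = sorted(
            (_hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._ring]

    def shard_for(self, key: str) -> str:
        # A key lives on the next virtual node clockwise from its position.
        idx = bisect.bisect(self._positions, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring4 = ConsistentHashRing([f"shard-{i}" for i in range(4)])
keys = [f"user_{i}" for i in range(10_000)]
before = {k: ring4.shard_for(k) for k in keys}
ring5 = ConsistentHashRing([f"shard-{i}" for i in range(5)])
moved = sum(before[k] != ring5.shard_for(k) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")  # ~1/5, vs ~4/5 for hash % N
```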
05 · The Hot-Key Problem
Sharding distributes data across shards. It does not distribute access patterns. If 10% of your users generate 90% of your traffic, sharding by user_id gives you a few extremely busy shards and many idle ones. This is the hot-key problem and it shows up in nearly every senior interview that touches sharding.
Hot-Key Problem · Why Even Sharding Doesn't Save You
Even with perfectly balanced storage, traffic skew can concentrate load on a single shard. The hot shard becomes the system's bottleneck. The other shards' spare capacity doesn't help because requests can't be redirected without the data.
The classic case: celebrity users
The canonical example is Twitter. Most users have hundreds of followers. Some celebrities have tens of millions. If you shard tweets by author, the celebrity's shard receives traffic for every read of any of their tweets, plus every write fan-out, plus every follower interaction. The celebrity shard burns; the other shards idle.
Variants of the same problem appear everywhere. The big tenant in a multi-tenant SaaS. The trending product in an e-commerce catalog. The popular video on a streaming platform. The viral post on any social product. Whenever access patterns are power-law distributed (which they almost always are), naive sharding concentrates the load.
Three ways to handle hot keys
Strategy 01
Caching the hot keys
The simplest mitigation. The hottest keys are also the most cache-friendly: high read rate, predictable working set, slow change. Put a cache in front of the hot shard. The cache absorbs most reads; the shard handles the rest. The caching deep-dive covers cache placement strategies. For hot-key cases, in-process plus distributed cache is usually the right combination.
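A minimal read-through sketch; the TTL and the stand-in shard read are assumptions:

```python
import time

def read_from_shard(key: str):
    """Stand-in for the real read against the hot shard (illustrative)."""
    return {"key": key}

CACHE: dict[str, tuple[float, object]] = {}
TTL_SECONDS = 5.0  # short TTL: hot keys change slowly but not never (assumption)

def get(key: str):
    entry = CACHE.get(key)
    if entry and time.monotonic() - entry[0] < TTL_SECONDS:
        return entry[1]                     # cache absorbs most reads
    value = read_from_shard(key)            # only misses reach the hot shard
    CACHE[key] = (time.monotonic(), value)
    return value
```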
Strategy 02
Splitting the hot key
Treat the hot value as multiple logical entries instead of one. For a celebrity user, partition their data across N sub-shards by adding a suffix or hash to the key (celebrity_id_0, celebrity_id_1, etc). Reads scatter-gather across the sub-shards; the load distributes. The cost is application complexity: every read needs to know it's a hot key and assemble across sub-shards.
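A sketch of the key-splitting logic, assuming a hot-key set maintained out of band by a detection job; names like celebrity_123 are illustrative:

```python
import random

N_SUBSHARDS = 8                  # sub-keys per hot key (assumption)
HOT_KEYS = {"celebrity_123"}     # maintained by a hot-key detection job

def write_key(key: str) -> str:
    """Spread writes for a hot key across sub-keys; cold keys are untouched."""
    if key in HOT_KEYS:
        return f"{key}_{random.randrange(N_SUBSHARDS)}"
    return key

def read_keys(key: str) -> list[str]:
    """Reads of a hot key must scatter-gather across every sub-key."""
    if key in HOT_KEYS:
        return [f"{key}_{i}" for i in range(N_SUBSHARDS)]
    return [key]
```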
Strategy 03
Asymmetric handling at the application layer
Treat hot keys differently in the application logic. Twitter's hybrid fan-out is the classic example: normal users get push-based timeline assembly (write-time work); celebrities skip the push and get pull-based assembly (read-time work). The shard-level concentration is bypassed by changing the access pattern itself. The Twitter walkthrough on the framework page shows this in detail.
The depth probe
"What about the celebrity user case?" is the most common follow-up after any feed-shaped sharding discussion. The strong response acknowledges the problem, picks one of the three strategies, defends the choice, and notes the operational implications. The weak response either ignores the problem or hedges across strategies without committing.
Sharding distributes data. It does not distribute access patterns. The interview move is to name this gap and state how you'd handle it.
06 · Cross-Shard Queries: The Pain You Didn't Ask For
Sharding is wonderful when every query filters by the shard key. The query goes to one shard, runs locally, returns. Latency stays low. The architecture stays clean.
Most production systems aren't this lucky. Some queries don't filter by the shard key. Maybe an admin dashboard counts users by signup date. Maybe a search feature looks for a string across all records. Maybe a report aggregates revenue by region. None of these queries can be served by one shard.
Scatter-gather: the technique and its costs
The standard solution is scatter-gather: send the query to every shard in parallel, collect the partial results, merge them in the application or in a coordinator. This works but it has costs that compound at scale.
- Latency floor. The query is as slow as its slowest shard. With 16 shards, the odds that at least one shard is having a p99-slow moment on a given query are roughly 15% (1 - 0.99^16), so per-shard tail latency becomes routine latency for the whole query. This tail latency amplification is dramatic.
- Resource cost. Every shard does work for every cross-shard query. If scatter-gather queries are 10% of total traffic, each shard serves all of that 10% on top of its own slice of the remaining 90%.
- Coordinator complexity. Some database systems do this for you (Vitess, Citus, MongoDB's mongos). Others require you to build the coordinator at the application layer. The coordinator itself becomes a system component to operate.
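A minimal scatter-gather sketch (thread-pool fan-out with an illustrative per-shard stub):

```python
from concurrent.futures import ThreadPoolExecutor

SHARDS = [f"shard-{i}" for i in range(16)]  # illustrative shard pool

def query_shard(shard: str, where: str) -> list[dict]:
    """Stand-in for running the query on one shard's database."""
    return []  # a real version would hit that shard and return rows

def scatter_gather(where: str) -> list[dict]:
    # Fan out in parallel: total latency is max(per-shard latencies),
    # which is why the slowest shard's tail sets the whole query's tail.
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda shard: query_shard(shard, where), SHARDS)
    return [row for rows in partials for row in rows]  # merge step
```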
Avoiding cross-shard queries when you can
Three patterns to consider:
- Denormalization. Duplicate data so that every common query has a path to a single shard. Write amplification is the cost; read simplicity is the gain.
- Secondary indexes. Maintain a separate index sharded by a different key. Searches by the secondary key go to the index, then resolve to the primary store. Adds complexity to writes.
- OLAP offloading. Stream changes from your sharded OLTP store to a separate analytics store (BigQuery, Snowflake, ClickHouse). Cross-cutting analytical queries run there, away from the operational shards. Change data capture is the bridge.
The depth probe
"How would you support [admin query that doesn't filter by shard key]?" is a question candidates often handle poorly. The strong response picks one of the patterns above and explains the tradeoff. "I'd stream changes to a BigQuery warehouse and run admin queries there. The trade is some staleness, but admin queries don't need real-time freshness. The OLTP shards stay focused on user-facing traffic."
07 · Resharding Without Downtime
You picked your shard key well. You chose a good strategy. Distribution is even. And then your traffic doubles, or your dataset outgrows your shards, or you discover a hot key that needs to be split. You need to reshard.
Resharding is one of the most operationally painful procedures in distributed systems. The constraints make it hard:
- The system must stay online during the transition. Customers don't accept "scheduled maintenance."
- Writes are happening continuously. The data being moved is also being changed.
- The new shards have to be consistent with the old shards at some cutover moment. Mistakes cause data loss or duplication.
- The application's understanding of "which shard owns which key" has to change without breaking in-flight requests.
The four-phase resharding pattern
Most resharding procedures, regardless of database, follow some variant of this pattern:
- Provision new shards. Stand up the destination capacity. No data flowing yet. The system runs on the old shards.
- Backfill historical data. Copy existing data from the old shards to the new ones. This can take hours or days for large datasets. The old shards keep serving traffic during this phase.
- Dual-write phase. Writes go to both old and new shards. The two stay in sync as the backfill catches up. Reads still serve from the old shards. This is the riskiest phase: any write that lands on one but not the other is a divergence bug.
- Cutover. Reads switch to the new shards. The old shards become read-only as a safety net for some period, then are decommissioned. The cutover itself should be a flag-flip, not a deploy.
Most of the operational work is in phase 3: detecting divergence between dual writes, handling reconciliation, and deciding when the new shards are caught up enough to cut over.
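A routing-layer sketch of phases 3 and 4, with illustrative flag names and in-memory stand-ins for the shard pools:

```python
class ShardSet:
    """Stand-in for a routed shard pool (illustrative)."""
    def __init__(self, name: str):
        self.name, self.data = name, {}
    def write(self, key, value):
        self.data[key] = value
    def read(self, key):
        return self.data.get(key)

old_shards, new_shards = ShardSet("old"), ShardSet("new")
divergence_log: list[str] = []

# Feature flags in the routing layer: cutover is a flag flip, not a deploy,
# and flipping read_from_new back is the fast rollback path.
FLAGS = {"dual_write": True, "read_from_new": False}

def write(key, value):
    old_shards.write(key, value)          # old shards stay authoritative
    if FLAGS["dual_write"]:
        try:
            new_shards.write(key, value)  # keep new shards in sync
        except Exception:
            divergence_log.append(key)    # reconciliation job replays these

def read(key):
    source = new_shards if FLAGS["read_from_new"] else old_shards
    return source.read(key)
```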
Why consistent hashing makes this dramatically easier
If you used consistent hashing from the start, the resharding pattern is much smaller in scope. Adding a new shard only requires moving roughly 1/N of the data (the keys whose ring positions fall in the new shard's range). The rest of the system is undisturbed. Compare with naive hash-mod-N where adding one shard remaps almost every key. The difference is between an operationally sane procedure and an operationally infeasible one.
The Operational Probe
"Walk me through how you'd reshard this system in production." This is a staff-level probe. The strong response describes the four phases, names the dual-write reconciliation as the trickiest part, mentions consistent hashing as a way to reduce scope, and notes that this is one of the procedures where you build feature flags into the routing layer specifically to support a fast rollback if cutover goes wrong.
08 · When Not to Shard
The senior signal in sharding interviews is knowing when to push back on sharding rather than reaching for it. Many candidates assume sharding is the answer to any scale concern. It usually isn't, until the requirements actually demand it.
Postgres on a single primary handles more than candidates think
A well-tuned Postgres instance on modern hardware (say, 64 cores, 256GB of RAM, NVMe storage) can comfortably handle:
- Tens of TB of data on a single primary
- Tens of thousands of writes per second on hot tables
- Many tens of thousands of reads per second from the primary, plus more from read replicas
Adding read replicas extends read capacity further. Adding query optimization, partitioning (within a single Postgres instance, not sharding across instances), and caching extends the ceiling further still. Many companies that "need to shard" actually need a better Postgres configuration first.
The threshold to actually shard
Reasonable triggers to consider sharding:
- Storage approaches a single machine's limit, or compute on the primary saturates and read replicas can't absorb the read load.
- Write throughput exceeds what a single primary can sustain. Roughly 30K-50K writes per second on a hot table is where most teams start to feel pain.
- Operational considerations. Backups take days. Maintenance windows are too long. The blast radius of a single primary failure is unacceptable.
Below those thresholds, the cost of sharding (operational complexity, application changes, cross-shard query pain) usually exceeds the benefit. Right-size the database to the realistic workload.
The interview move
When asked "would you shard this?" the strong default is "not yet." Then explain at what point you would shard. "At our current scale of 10K writes per second, Postgres on a single primary with read replicas handles this comfortably. I'd reconsider sharding at roughly 50K writes per second on the same table, or when storage approaches the single-machine limit. Until then, sharding adds operational debt without enough benefit."
Reaching for sharding before you need it is over-engineering. The senior signal is naming the threshold at which sharding becomes the right answer, not assuming it's the answer now.
09 · How Sharding Interacts With Other Concepts
Sharding doesn't live alone. The interesting depth probes often cross concept boundaries.
- Sharding × Replication. Sharding gives you horizontal scale. Replication gives you availability. You usually need both. The interaction matters: replicate each shard, or shard your replicas, or both. The replication deep-dive covers the choices.
- Sharding × Caching. A hot-key problem in your sharded store often becomes a job for caching. Cache the hot keys and the shard concentration disappears. The caching deep-dive covers placement strategies for this.
- Sharding × Database selection. Some databases shard natively (Cassandra, DynamoDB). Some require manual sharding (Postgres, MySQL). The "how hard is sharding later?" question is one of the inputs to database selection.
- Sharding × Load balancing. The load balancer needs to know which shard owns which key. With consistent hashing, this knowledge can be pushed to the routing layer. With other strategies, the application or a coordinator does the routing.
- Sharding × Search. Cross-shard search queries are notoriously hard. Most teams solve this by streaming data to a separate search index (Elasticsearch, OpenSearch) sharded differently from the primary store.
For more on cross-concept interactions, see the concepts library hub.
10 · Practice Scenarios
Three scenarios. Read the setup. Decide your sharding approach before opening the reveal. The reveal is one defensible answer, not the only correct answer.
Scenario 01
A SaaS product serves multiple business customers. Some customers are 1000x bigger than others. How do you shard?
Each customer's data is logically isolated (no queries span customers). Customers range from 10 users to 100,000 users. The biggest customer has 1000x the data and traffic of the median customer. New customers sign up regularly and need provisioning.
How to think about this
This is a classic multi-tenant sharding problem. The shard key is clearly tenant_id (queries don't span customers). The challenge is the size skew.
Naive hash-based sharding by tenant_id balances tenant counts, not load: the biggest customer lands wherever its hash falls and overwhelms that shard, while the shards holding only small customers sit comparatively idle. Cross-tenant queries are not a concern since they don't exist, but per-shard load is wildly uneven.
Strong answer: directory-based sharding. A lookup table maps tenant_id to shard. Big customers get their own dedicated shard. Smaller customers share a shard. The directory is small enough to cache aggressively. New customers get placed based on current shard load. This is exactly the use case where directory-based sharding earns its operational complexity: maximum flexibility in placement, customer-specific provisioning rules, and the ability to migrate a customer to a different shard as they grow.
Scenario 02
A messaging product needs to store every message. Reads are mostly recent messages. Writes are continuous and high-volume. How do you shard?
Roughly 10 billion messages stored. 200K messages written per second. 500K message reads per second. Most reads target messages from the last 24 hours. Older messages are read occasionally for history.
How to think about this
The natural shard key here is conversation_id (messages stay together by conversation, queries usually filter by it). Hash-based sharding by conversation_id gives even distribution.
The interesting twist is the time-based access pattern. Most reads are recent. If we sharded by time (range-based on timestamp), the latest shard would be hot for both writes and reads. That's a problem.
Strong answer: hash-based by conversation_id, with each shard internally using a wide-column store (Cassandra-style) that uses timestamp as the clustering key. This gives you even distribution across conversations (write throughput spreads across shards) and within each conversation, time-ordered storage with efficient recent-message reads. The hot-key risk is large group conversations; we'd handle those with the same hot-key strategies described earlier (caching recent messages, splitting the conversation if it gets enormous).
Scenario 03
A startup's Postgres database is at 80% disk utilization and writes are slowing down. Should they shard?
Single Postgres instance, 4TB used of 5TB allocated. 8K writes per second on the busiest table, p99 write latency has crept from 5ms to 30ms over the last quarter. The team has 8 engineers, none with sharding experience.
How to think about this
The honest senior answer here is: probably don't shard yet. Investigate first.
5TB is approaching but not at the limit of what Postgres handles well on modern hardware. The latency increase is more concerning. Three things to investigate before sharding:
1. Is the slowdown actually about scale? Or is it about a missing index, a slow query, or vacuum behavior? The pattern of "writes slowing as data grows" sometimes points to bloat or fragmentation, not sharding need.
2. Vertical scaling first. Going from 4TB to 16TB on a bigger instance is a configuration change. Going from one shard to four shards is a multi-month project. The cost-benefit is wildly different.
3. Read replicas and partitioning. Postgres native partitioning (within a single instance) handles many "we need sharding" cases by spreading the table internally. Read replicas absorb read load. Both are dramatically simpler than sharding.
Strong answer: "I wouldn't shard yet. The team is small, sharding adds significant operational debt, and 5TB is in the range where Postgres still works well. I'd investigate the latency increase first, scale vertically next, and only consider sharding if those don't address the problem."
11 · Sharding FAQ
How many shards should I start with?
More than you think. If you commit to sharding, start with at least 4-8 shards even if 2 would handle current load. Adding shards later means resharding, which is operationally painful even with consistent hashing. Starting with more shards costs you slightly more infrastructure today; under-provisioning costs you a multi-month migration later. The exception is if you're using a database that handles resharding transparently (DynamoDB, Spanner), in which case start small and let the database handle the scaling.
Should I shard before launch, or wait until I need it?
Wait. Sharding before you have actual traffic is over-engineering, and the shard key you guess at without real access patterns is usually wrong. The cost of sharding later (even with the operational pain) is usually less than the cost of sharding wrong from day one. Postgres or a managed database service handles plenty of scale without sharding.
What's the difference between sharding and partitioning?
Terminology varies but the common usage: sharding means splitting data across separate database instances. Partitioning means splitting data within a single database instance (a table partitioned by date, for example). Postgres has native partitioning that handles many cases people reach for sharding to solve. Partitioning is dramatically simpler operationally because everything still lives in one database. Mention this distinction in interviews; it's a depth signal.
What if my queries don't all filter by the shard key?
Most production systems have this. The patterns to use: secondary indexes sharded by a different key, denormalization to make all common queries shard-key-aligned, or offloading cross-cutting analytical queries to a separate OLAP store. Pick one or combine them. The mistake is to assume scatter-gather queries are free; they aren't, especially at p99.
How do I handle transactions that span shards?
Cross-shard transactions are expensive and complicated. Three approaches: avoid them by choosing a shard key that keeps related data on the same shard; use a database that supports cross-shard transactions natively (Spanner, CockroachDB, FoundationDB) and accept the latency cost; or relax to eventual consistency and use compensating transactions or sagas. The first option is by far the easiest. The third is the next-best when the first isn't feasible.
What about resharding triggered by a single hot key?
Resharding to spread one hot key is usually the wrong fix. The hot key would still be hot on whatever shard it lands on, just on a slightly less crowded shard. Better fixes: cache the hot key, split it into sub-keys at the application layer, or change the access pattern so the load doesn't concentrate. Resharding is for global imbalance, not local hot keys.
Should I use a database that shards automatically, or shard manually?
Strongly prefer automatic when possible. DynamoDB, Spanner, Cassandra, and others handle sharding internally. The application writes to "the database" without knowing which shard. The operational cost is dramatically lower than manually sharding Postgres. Choose manual sharding (Postgres + Citus, MySQL + Vitess) only when you have a strong reason to keep the database and the workload demands it.
How does sharding work with foreign keys?
It usually doesn't. Most sharded systems give up referential integrity at the database layer because cross-shard foreign keys aren't enforceable without expensive coordination. The application becomes responsible for maintaining referential consistency. This is one of the costs of sharding that candidates often overlook: you lose database guarantees you used to take for granted.