Range vs hash vs hybrid sharding: how to choose and migrate?

Sharding means horizontal partitioning of data so that storage and traffic are spread across many nodes. The goal is simple. Keep latency predictable, keep write and read throughput high, and allow independent scaling as the product grows. The three families you will use most are range sharding, hash sharding, and a hybrid that combines the strengths of both. You will learn how to map access patterns to the right shard key, how to avoid hotspots, and how to plan a safe online migration.

Why It Matters

Sharding choices decide your tail latency, your operational complexity, and the cost curve of the platform. A good shard key protects you from noisy neighbor effects, keeps rebalancing cheap, and lets you evolve without painful rewrites. In interviews, the shard key is the single biggest lever you control for scalable architecture, so being precise about why you chose range, hash, or hybrid signals that you can run a real distributed system.

How It Works Step by Step

You can think of sharding as two decisions. First, choose a shard key that reflects how your application reads and writes. Second, choose a placement strategy that balances distribution and query efficiency.

Range sharding

  1. Pick an ordered key such as user id, time, or lexicographic order of a composite key.
  2. Define contiguous intervals. For example, user ids zero to nine million on shard A, nine million to eighteen million on shard B.
  3. Route by comparing the key to the shard map, as in the sketch after this list.
  4. Rebalance by splitting or merging ranges, then moving whole ranges between nodes.
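
Here is a minimal routing sketch for steps 3 and 4, assuming the shard map is a sorted list of exclusive upper bounds; the names SHARD_BOUNDS and SHARD_NAMES are illustrative, not from any particular system.

```python
import bisect

# Illustrative shard map: boundaries are exclusive upper bounds per shard.
# Keys below 9,000,000 land on shard A, below 18,000,000 on shard B, etc.
SHARD_BOUNDS = [9_000_000, 18_000_000, 27_000_000]
SHARD_NAMES = ["shard-a", "shard-b", "shard-c"]

def route_range(user_id: int) -> str:
    """Route by comparing the key to the ordered shard map."""
    idx = bisect.bisect_right(SHARD_BOUNDS, user_id)
    if idx >= len(SHARD_NAMES):
        raise ValueError(f"key {user_id} is outside every defined range")
    return SHARD_NAMES[idx]

print(route_range(4_200_000))  # shard-a
print(route_range(9_000_000))  # shard-b, since boundaries are half open
```

Splitting a hot range then amounts to inserting a new boundary and moving only the rows on one side of it.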

What it gives you

  • Efficient range scans like top posts within id bounds or events within a time window
  • Natural locality for ordered data and time series
  • Cheap to split hot ranges if the system supports fast range moves

What to watch

  • Hotspots if the key is monotonic and recent values dominate writes
  • Uneven growth across ranges unless you split proactively
  • Scatter gather joins when the query does not align with the key

Hash sharding

  1. Choose a stable hash function and a number of buckets greater than the number of physical shards.
  2. Compute the hash of the shard key, mod by bucket count, then map buckets to shards, as in the sketch after this list.
  3. Rebalance by adding shards and remapping some buckets to the new nodes.
  4. For smoother growth, use virtual nodes and consistent hashing to limit the amount of data moved.
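
A minimal sketch of steps 1 through 3, assuming 256 logical buckets spread over four physical shards (both counts are illustrative). It uses a stable hash rather than Python's built-in hash, which is randomized per process:

```python
import hashlib

NUM_BUCKETS = 256  # many more logical buckets than physical shards

# Illustrative bucket-to-shard assignment; rebalancing edits this map
# instead of rehashing keys.
BUCKET_TO_SHARD = {b: f"shard-{b % 4}" for b in range(NUM_BUCKETS)}

def stable_hash(key: str) -> int:
    """Stable across processes, unlike Python's salted built-in hash."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

def route_hash(key: str) -> str:
    bucket = stable_hash(key) % NUM_BUCKETS
    return BUCKET_TO_SHARD[bucket]

print(route_hash("user:12345"))
```

Adding a fifth shard is then a matter of reassigning a slice of bucket ids in BUCKET_TO_SHARD and moving just those buckets.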

What it gives you

  • Even write distribution for unpredictable traffic
  • Simple routing with a pure function that never needs range comparisons
  • Strong protection against hot partitions when the key has natural variety

What to watch

  • Range queries are expensive since neighbor keys are spread across shards
  • Secondary indexes and aggregations can become cross shard operations
  • Backfills or rehashing require careful throttling and verification

Hybrid sharding

Hybrid uses a composite strategy. You segment on one dimension, then hash within each segment. Common patterns, with a routing sketch after the list:

  • Time bucket then hash. Daily or hourly buckets for events, then hash by user id within the bucket
  • Tenant then hash. Customer id as top level partition, then hash by document id
  • Geographic region then hash. Region based routing reduces latency, local hash keeps load even
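
A minimal sketch of the first pattern, time bucket then hash; the daily segment format and the inner fan-out of sixteen are illustrative choices:

```python
import hashlib
from datetime import datetime, timezone

BUCKETS_PER_SEGMENT = 16  # inner hash fan-out inside each time segment

def hybrid_partition(event_time: datetime, user_id: str) -> str:
    """Segment by day, then hash by user id within the segment."""
    segment = event_time.strftime("%Y-%m-%d")  # daily time bucket
    digest = hashlib.md5(user_id.encode()).digest()
    inner = int.from_bytes(digest[:4], "big") % BUCKETS_PER_SEGMENT
    return f"{segment}:{inner}"  # e.g. "2025-06-01:11"

now = datetime.now(timezone.utc)
print(hybrid_partition(now, "user:12345"))
```

A time window query touches only the segments in that window, while the inner hash keeps concurrent writers spread across each segment's sixteen buckets.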

What it gives you

  • Balanced writes plus efficient scans within the segment
  • Safer migrations since you can move or archive one segment at a time
  • Better multi tenant isolation without giving up uniform load

What to watch

  • More moving parts in the routing layer and in the shard map
  • Queries that cross segments still require scatter gather
  • Capacity planning must respect both the segment distribution and the hash distribution

Real World Example

Think about three services you already know.

A social feed similar to Instagram. Most reads fetch a single user timeline or a set of recent stories. Hash by user id to keep writes even and to avoid a single celebrity user flooding one shard. For discovery features that need time scans, keep a separate index that is range partitioned by time and materialize feed candidates asynchronously. This mix captures the best of both worlds.

A catalog similar to Amazon. Search and browse prefer range on product attributes such as category or price band. However, checkout and order history lean on user id. Use hybrid. Top level partition by category family for browse heavy workloads. Within each family, hash by product id for even writes. Maintain secondary stores for order history that hash by user id.

A telemetry stream similar to what a large video platform would use. Events arrive in time order and queries like recent errors in the last hour are common. Use time bucket then hash within each bucket. This keeps hot recent data local to a few buckets while the inner hash evens the load.

Common Pitfalls or Trade-offs

  • Picking a monotonic shard key with range sharding creates a hot head where all new writes hit the same shard

  • Hash sharding makes ad hoc range queries expensive and encourages full scatter gather which raises p99 latency

  • Hybrid sharding can hide complexity until operations need to rebalance across both the segment and the hash layer

  • Secondary indexes need their own sharding strategy or they will become the new bottleneck

  • Cross shard transactions are costly, so design write paths that touch only one shard whenever possible

How to choose

Use this simple checklist.

  • If most queries are equality lookups by a high cardinality key and you care about uniform write load, choose hash

  • If you need fast range scans or time window queries, choose range with smart splitting or choose hybrid time bucket then hash

  • If you need tenant isolation for noisy neighbors or legal boundaries, choose tenant segmented hybrid then hash inside each tenant

  • If your key distribution is unpredictable or prone to spikes, prefer hash or hybrid rather than pure range

  • If you need global ordering on writes, choose range and fix hotspots with dynamic range splitting and write throttling

Concrete signals from SLOs

  • High write p99 with uneven queue depth hints at skew. Move toward hash or hybrid

  • Expensive global range queries hint at the need for time or category segments. Add a segment dimension or maintain a dedicated range index

  • Cost spikes during rebalancing point to having too few buckets. Introduce virtual nodes so future growth moves less data

Migration strategies that work in production

You can migrate online without user impact if you follow a careful sequence.

  1. Pick the target strategy and define the new shard key. For example migrate from pure hash by user id to hybrid time bucket then hash by user id.

  2. Build or extend a routing service that understands both the old and new maps. Make the map versioned so you can roll back.

  3. Dual write. For a fixed window, every write goes to both the old and the new cluster. Writes must be idempotent and carry a request id so duplicates are safe (see the sketch after this list).

  4. Backfill. Copy historical data from old to new with a controlled batch job. Order by primary key to get stable progress markers.

  5. Verify. Use row count checks, checksums per bucket, and sampled reads that compare old and new results.

  6. Shadow read. Route a small percentage of reads to the new cluster in parallel and compare responses server side.

  7. Shift traffic gradually. Start at five percent, then twenty five, then fifty, while watching tail latency and error rates.

  8. Decommission the old path after a cool off period where you have full confidence in the new layout.
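
A minimal sketch of the dual write step, with an in-memory stand-in for each cluster; the _request_id field is an illustrative way to make upserts idempotent:

```python
import uuid

class Store:
    """Stand-in for a cluster client with idempotent upserts."""
    def __init__(self):
        self.rows = {}

    def upsert(self, key, record):
        existing = self.rows.get(key)
        if existing and existing["_request_id"] == record["_request_id"]:
            return  # duplicate delivery of the same write: safe no-op
        self.rows[key] = record

def dual_write(old: Store, new: Store, key: str, value: dict) -> None:
    record = {**value, "_request_id": str(uuid.uuid4())}
    old.upsert(key, record)  # old cluster stays the source of truth
    new.upsert(key, record)  # new cluster receives the same tagged write

old, new = Store(), Store()
dual_write(old, new, "user:42", {"name": "Ada"})
assert old.rows["user:42"] == new.rows["user:42"]
```

The same request id makes backfill retries safe, since a redelivered write becomes a no-op instead of a double count.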

Tips by source and destination

  • Range to hash. Recompute the hash while backfilling and write into many small buckets to avoid long exclusive moves later
  • Hash to range. Pre build the range boundaries by sampling the distribution to avoid hot ranges on day one
  • Hash to more hash shards. Use consistent hashing with many virtual nodes so that new capacity only steals a small portion of buckets, as in the ring sketch below
  • Toward hybrid. Introduce the segment dimension in the key and migrate segment by segment with dual writes
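
A minimal consistent hashing sketch with virtual nodes; the class shape and the vnode count of 100 are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Hash ring with virtual nodes: new capacity steals only a small
    fraction of keys from existing nodes."""

    def __init__(self, vnodes: int = 100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (point, node) pairs

    def _point(self, label: str) -> int:
        return int.from_bytes(hashlib.md5(label.encode()).digest()[:8], "big")

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._point(f"{node}#{i}"), node))

    def route(self, key: str) -> str:
        idx = bisect.bisect_right(self._ring, (self._point(key), ""))
        if idx == len(self._ring):
            idx = 0  # wrap around the ring
        return self._ring[idx][1]

ring = ConsistentHashRing()
for node in ("shard-a", "shard-b", "shard-c"):
    ring.add_node(node)
print(ring.route("user:12345"))
```

Adding a fourth node inserts 100 more points, so it takes over roughly a quarter of the key space while every other key keeps its current home.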

Operational details that save the day

  • Keep the shard map in a strongly consistent store such as a quorum backed key value service and propagate with TTL so stale maps expire (see the sketch after this list)
  • Use write ahead logs or change data capture to power the backfill and to retry failures without double counting
  • Monitor queue depth per shard, not just global metrics. Imbalance shows up there first
  • Protect downstream systems by shaping traffic during backfill and by pausing heavy buckets during peak user hours
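
A client-side sketch of the first point, assuming a fetch_map callable that reads the versioned map from a quorum backed store such as etcd or ZooKeeper; the names and the 30 second TTL are illustrative:

```python
import time

class CachedShardMap:
    """Clients cache a versioned shard map and refetch after a TTL,
    so routing changes propagate without restarting every client."""

    def __init__(self, fetch, ttl_seconds: float = 30.0):
        self._fetch = fetch  # callable returning (version, mapping)
        self._ttl = ttl_seconds
        self._expires = 0.0
        self.version, self._mapping = None, {}

    def lookup(self, bucket: int) -> str:
        if time.monotonic() >= self._expires:
            self.version, self._mapping = self._fetch()
            self._expires = time.monotonic() + self._ttl
        return self._mapping[bucket]

def fetch_map():
    # Hypothetical read from the strongly consistent store.
    return ("v7", {0: "shard-a", 1: "shard-b"})

shard_map = CachedShardMap(fetch_map)
print(shard_map.lookup(1), shard_map.version)  # shard-b v7
```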

Interview Tip

Interviewers love to ask which shard key you will choose. Start from access patterns. Say out loud which queries must be single shard and which can be scatter gather. Point to hotspots, for example monotonically increasing ids in a range layout. Propose a migration plan with dual writes and backfill. That tells the interviewer you can run the system, not just design it.

Key Takeaways

  • Choose the shard key based on dominant reads and writes, not only on data model convenience
  • Use hash for uniform distribution, range for efficient scans, hybrid for the best blend of both
  • Plan migration as a product feature with routing versioning, dual writes, and rigorous verification
  • Secondary indexes and analytics may need different shard keys than the primary store
  • Capacity planning starts with many small buckets so that rebalancing remains cheap

Table of Comparison

Range
  • Best for: time-based or ordered scans
  • Strengths: fast range queries and data locality
  • Limitations: hotspots with monotonic keys, uneven data growth
  • Typical shard key: timestamp, ordered id, composite prefix key
  • Rebalancing method: split or merge ranges, move whole ranges
  • Ops complexity: moderate, requires proactive range management

Hash
  • Best for: even write distribution and random access
  • Strengths: uniform load, simple routing, prevents hotspots
  • Limitations: expensive range queries, scatter-gather overhead
  • Typical shard key: user id, order id, document id
  • Rebalancing method: move hash buckets using consistent hashing
  • Ops complexity: low, except during rehash operations

Hybrid
  • Best for: mixed workloads combining range and equality queries
  • Strengths: balanced load, supports efficient scans within segments
  • Limitations: more complex routing and capacity planning
  • Typical shard key: segment key (time or tenant) plus hash subkey
  • Rebalancing method: move segments or hash buckets inside segments
  • Ops complexity: high, flexible but operationally heavier

FAQs

Q1. What is the difference between sharding and partitioning?

In practice the terms often overlap. Sharding usually implies application aware routing across nodes. Partitioning is a more general term used by storage engines for splitting data internally.

Q2. How do I pick a shard key for a social app?

Choose user id for the primary store so profile and timeline writes hit one shard. Keep a separate index for discovery that is range based on time to support recent content scans.

Q3. How many shards should I start with?

Start with many logical buckets, often a few hundred, even if you only have a few physical nodes. Map buckets to nodes. This keeps future growth cheap.

Q4. What is consistent hashing and why use it?

It is a way to map keys to buckets so that adding or removing nodes moves only a small fraction of keys. It is useful for hash or hybrid strategies that expect regular growth.

Q5. How do I handle cross shard joins?

Avoid them in the request path. Denormalize or pre materialize results into a store that is keyed to the query. Keep heavy joins in offline analytics.

Q6. How do I prevent hotspots with time based keys?

Use time buckets that roll over frequently and add an inner hash on a secondary field to spread the writes across nodes.
