How would you design a time‑series database (ingest, downsampling, retention)?
A time series database stores measurements that change over time and lets you write and read them at very high scale with tight control over cost. Think of billions of small points such as CPU usage, ad impressions, sensor readings, or click counts arriving every second. The core of a good design is a fast ingest path, smart downsampling that reduces volume while keeping signal, and clear retention rules that match the value of the data over its lifetime.
In this guide you will learn how to design each part so it survives a production workload and shines in a system design interview.
Why It Matters
Most metrics are hottest right after they arrive. Engineers need second-level visibility to debug incidents, minute-level trends for capacity, and hour-level history for business reporting. Keeping raw data forever is wasteful. A purpose-built time series design gives you predictable writes, compressed storage, and queries that stay fast even as data grows. In an interview you will be judged on how you control cardinality, how you roll up data without losing meaning, and how you enforce retention without hurting queries. Mastering these topics signals real-world readiness for scalable architecture.
How It Works, Step by Step
Step 1: Model the data. Use a simple tuple per point:
- measurement name, such as cpu or requests
- timestamp as a 64-bit integer, in epoch nanoseconds or milliseconds
- tags as string key-value pairs, such as host, region, and service
- fields as numeric values, such as value, count, sum, gauge, or rate
Keep tags low cardinality. Never put user ID or request ID in tags. If you need per-user analytics, push that to a different store.
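As a rough illustration, here is what that point model could look like in Python; the `Point` class and the forbidden-key check are illustrative assumptions, not any particular product's API.

```python
from dataclasses import dataclass

# Tag keys that almost always explode cardinality; reject them at ingest (illustrative list).
FORBIDDEN_TAG_KEYS = {"user_id", "request_id", "session_id"}

@dataclass
class Point:
    measurement: str          # e.g. "cpu" or "requests"
    timestamp_ns: int         # 64-bit epoch nanoseconds
    tags: dict[str, str]      # low-cardinality identity: host, region, service
    fields: dict[str, float]  # numeric payload: value, count, sum, ...

    def validate(self) -> None:
        bad = FORBIDDEN_TAG_KEYS & self.tags.keys()
        if bad:
            raise ValueError(f"high-cardinality tag keys not allowed: {bad}")
```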
Step 2: Partition and route writes. Derive a series ID from the measurement and tags by hashing. Route to a shard by series ID and to a time partition by timestamp. A common layout is shard S plus time bucket T, for example daily partitions per shard. This spreads hot points and makes retention a cheap drop of old partitions.
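A minimal sketch of that hashing and routing, assuming SHA-256 truncated to 64 bits, 16 shards, and daily partitions; real systems tune all three.

```python
import hashlib

def series_id(measurement: str, tags: dict[str, str]) -> int:
    """Stable 64-bit series ID from the measurement plus sorted tag pairs."""
    canonical = measurement + "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    return int.from_bytes(hashlib.sha256(canonical.encode()).digest()[:8], "big")

def route(measurement: str, tags: dict[str, str], timestamp_ns: int,
          num_shards: int = 16) -> tuple[int, str]:
    """Shard by hashed series ID, partition by the day of the timestamp."""
    sid = series_id(measurement, tags)
    shard = sid % num_shards
    day_bucket = timestamp_ns // (86_400 * 1_000_000_000)  # days since epoch
    return shard, f"shard-{shard}/day-{day_bucket}"
```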
Step 3: Make ingest durable and fast. On arrival:
- validate and batch points
- append to a write ahead log for crash recovery
- buffer in memory structures similar to an LSM-tree memtable
Flush by time or size to immutable segment files. Acknowledge to clients after the log is fsynced, so you have durability even if a node crashes.
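Here is a simplified sketch of that ingest path, assuming a local file as the write-ahead log and a size-based flush; production systems batch fsyncs and write columnar segments instead of the placeholder shown.

```python
import json, os

class IngestBuffer:
    def __init__(self, wal_path: str, flush_threshold: int = 10_000):
        self.wal = open(wal_path, "ab")
        self.memtable: dict[int, list[tuple[int, dict]]] = {}  # series_id -> points
        self.flush_threshold = flush_threshold
        self.count = 0

    def write(self, sid: int, timestamp_ns: int, fields: dict) -> None:
        record = json.dumps({"sid": sid, "ts": timestamp_ns, "fields": fields})
        self.wal.write(record.encode() + b"\n")
        self.wal.flush()
        os.fsync(self.wal.fileno())          # acknowledge the client only after this
        self.memtable.setdefault(sid, []).append((timestamp_ns, fields))
        self.count += 1
        if self.count >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # Sort each series by time and hand off to an immutable segment writer.
        for sid, points in self.memtable.items():
            points.sort(key=lambda p: p[0])
            # write_segment(sid, points)     # placeholder for the segment encoder
        self.memtable.clear()
        self.count = 0
```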
Step 4: Encode and compress segments. For each series, store data in time-sorted blocks. Use time-specific encodings such as delta-of-deltas for timestamps and Gorilla-style bit packing for float values. Dictionary-encode tag keys and tag values once per block. These tricks cut storage by an order of magnitude and improve scan speed.
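A toy illustration of delta-of-deltas on timestamps; real encoders bit-pack the output (as in the Gorilla paper), but this shows why regular reporting intervals compress to almost nothing.

```python
def delta_of_deltas(timestamps: list[int]) -> list[int]:
    """First value, first delta, then the change between consecutive deltas."""
    if len(timestamps) < 2:
        return list(timestamps)
    out = [timestamps[0], timestamps[1] - timestamps[0]]
    prev_delta = out[1]
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        delta = cur - prev
        out.append(delta - prev_delta)   # mostly zeros for regular intervals
        prev_delta = delta
    return out

# Points arriving every 10 units compress to a long run of zeros:
# delta_of_deltas([1000, 1010, 1020, 1030, 1041]) -> [1000, 10, 0, 0, 1]
```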
Step 5: Build the right indexes. You need two fast lookups:
- a series catalog that maps tag filters to series IDs, using inverted indexes per tag key
- a time index that maps a series ID to its segment blocks per partition
To control memory, cache active tag postings in memory and keep the full catalog on disk with a Bloom or cuckoo filter in front. Track per-tag-key cardinality and alert when it grows too fast.
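A minimal inverted-index sketch for the series catalog; the intersection logic is the core idea, while production systems compress postings lists and cache only hot keys.

```python
from collections import defaultdict

class SeriesCatalog:
    def __init__(self):
        # (tag_key, tag_value) -> set of series IDs (a "postings list")
        self.postings: dict[tuple[str, str], set[int]] = defaultdict(set)
        self.cardinality: dict[str, set[str]] = defaultdict(set)  # tag key -> seen values

    def index(self, sid: int, tags: dict[str, str]) -> None:
        for key, value in tags.items():
            self.postings[(key, value)].add(sid)
            self.cardinality[key].add(value)   # watch len() of this to alert on growth

    def lookup(self, filters: dict[str, str]) -> set[int]:
        """Intersect postings lists for all tag filters, smallest list first."""
        lists = sorted((self.postings.get(kv, set()) for kv in filters.items()), key=len)
        if not lists:
            return set()
        return set.intersection(*lists)
```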
Step 6: Handle out-of-order and late data. Real producers send points slightly out of order. Accept late points within a bounded window, such as one to five minutes. Keep a small reorder buffer in memory. If a late point lands in a flushed segment, write a tiny delta block or create a tombstone range and rewrite during compaction. Expose a watermark that tells the downsampling jobs how far it is safe to aggregate.
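A sketch of the late-arrival check and the watermark it implies; the five-minute allowance is an assumption you would tune from real arrival distributions.

```python
import time

ALLOWED_LATENESS_NS = 5 * 60 * 1_000_000_000  # assumed 5-minute bounded late window

def watermark_ns(now_ns: int | None = None) -> int:
    """No aggregation window ending before this instant should change again."""
    now_ns = now_ns if now_ns is not None else time.time_ns()
    return now_ns - ALLOWED_LATENESS_NS

def classify(point_ts_ns: int, now_ns: int) -> str:
    if point_ts_ns >= watermark_ns(now_ns):
        return "in_window"   # goes to the in-memory reorder buffer
    return "late"            # needs a delta block or a compaction-time rewrite
```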
Step 7: Downsample in tiers. Create rollups at multiple resolutions on a separate stream:
- raw resolution stored for a short window, such as one to seven days
- one-minute aggregates for weeks
- five-minute or fifteen-minute aggregates for months
- one-hour aggregates for long-term history
For each resolution, compute count, min, max, and sum, and store a start-aligned time window so queries stitch cleanly. Keep rollups idempotent and backfill friendly by using deterministic window boundaries and writing complete windows only after the watermark passes.
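A sketch of a start-aligned, idempotent rollup for a single resolution; because window boundaries come only from the timestamp, reruns and backfills produce identical buckets.

```python
from collections import defaultdict

def rollup(points: list[tuple[int, float]], resolution_ns: int,
           watermark_ns: int) -> dict[int, dict[str, float]]:
    """Aggregate (timestamp_ns, value) pairs into start-aligned windows.

    Only windows that close at or before the watermark are published, so late
    points inside the allowed window cannot cause double counting.
    """
    buckets: dict[int, dict[str, float]] = defaultdict(
        lambda: {"count": 0, "sum": 0.0, "min": float("inf"), "max": float("-inf")})
    for ts, value in points:
        window_start = (ts // resolution_ns) * resolution_ns   # absolute alignment
        b = buckets[window_start]
        b["count"] += 1
        b["sum"] += value
        b["min"] = min(b["min"], value)
        b["max"] = max(b["max"], value)
    return {start: agg for start, agg in buckets.items()
            if start + resolution_ns <= watermark_ns}
```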
Step 8: Enforce retention and tiering. Make retention a policy per measurement or tenant. Typical defaults:
- raw expires after seven days
- one-minute rollups expire after thirty to ninety days
- five-minute rollups expire after six to twelve months
- one-hour rollups expire after one to three years
Implement retention by dropping whole time partitions. For cold layers, move old rollups to cheap object storage with an index pointer so queries can still scan them when needed.
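A sketch of retention enforced as whole-partition drops; the policy table mirrors the defaults above and is illustrative.

```python
RETENTION_DAYS = {        # resolution -> days to keep (illustrative defaults)
    "raw": 7,
    "1m": 90,
    "5m": 365,
    "1h": 3 * 365,
}

def partitions_to_drop(existing_day_buckets: list[int], resolution: str,
                       today_bucket: int) -> list[int]:
    """Daily partitions older than the policy; dropping each one is a cheap metadata operation."""
    cutoff = today_bucket - RETENTION_DAYS[resolution]
    return [day for day in existing_day_buckets if day < cutoff]
```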
Step 9: Compact and repair. Background jobs merge small segments into larger ones and apply better compression. Compaction also resolves tombstones from late corrections. Throttle compaction to protect ingest. Expose metrics such as write amplification and compaction lag to operations.
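A toy compaction pass that merges time-sorted segments and drops tombstoned ranges; real compactors also re-encode blocks and throttle themselves against ingest.

```python
import heapq

def compact(segments: list[list[tuple[int, float]]],
            tombstones: list[tuple[int, int]]) -> list[tuple[int, float]]:
    """Merge several time-sorted segments into one, dropping points in tombstoned ranges."""
    def deleted(ts: int) -> bool:
        return any(start <= ts < end for start, end in tombstones)
    merged = heapq.merge(*segments, key=lambda p: p[0])
    return [(ts, v) for ts, v in merged if not deleted(ts)]
```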
Step 10: Plan the query path. A query has two phases:
- index phase: evaluate tag filters to get series IDs, then pick the right resolution based on the requested time range and accuracy
- scan phase: read blocks for those series and time ranges, apply functions such as rate, sum, avg, and percentile, and produce groups by tag keys
Push down aggregations so the engine reads only the lowest resolution needed. If a range spans multiple tiers, the planner can combine one-hour blocks for old time and one-minute blocks for recent time.
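A sketch of the planner's resolution choice, assuming a fixed tier table and a point budget per query: long ranges resolve to coarse tiers, short ranges to raw.

```python
# Available tiers, coarsest first: (name, resolution in seconds, retention in seconds); assumed values.
TIERS = [("1h", 3600, 3 * 365 * 86_400),
         ("5m", 300, 365 * 86_400),
         ("1m", 60, 90 * 86_400),
         ("raw", 1, 7 * 86_400)]

def choose_resolution(range_seconds: int, max_points: int = 1_000) -> str:
    """Finest tier whose point count for this range stays within the budget."""
    chosen = TIERS[0][0]                      # fall back to the coarsest tier
    for name, resolution, _retention in TIERS:
        if range_seconds // resolution <= max_points:
            chosen = name                     # finer tiers override while they still fit
    return chosen

# A 30-day range resolves to "1h"; a 10-minute range resolves to "raw".
```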
Step 11: High availability and replication. Replicate shards three ways across zones. A write is durable after a quorum fsync of the log. Keep metadata such as series catalog ownership under a small consensus group, and stream data replication asynchronously so ingest stays fast. For cross-region disaster recovery, replicate rollup tiers first, since they are smaller and good enough for most queries.
Step 12: Governance and multi-tenant control. Protect the cluster:
- per-tenant quotas on series cardinality and ingest rate
- label validation rules to block high-cardinality tags
- query limits on time range, fan out, and series scan count
This keeps noisy neighbors from hurting everyone.
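A sketch of a per-tenant guard on series cardinality and ingest rate; the limits, in-memory counters, and once-per-second reset are illustrative of what a gateway would enforce.

```python
class TenantGuard:
    def __init__(self, max_series: int = 1_000_000, max_points_per_sec: int = 500_000):
        self.max_series = max_series
        self.max_points_per_sec = max_points_per_sec
        self.known_series: set[int] = set()
        self.points_this_second = 0

    def admit(self, sid: int) -> bool:
        if sid not in self.known_series:
            if len(self.known_series) >= self.max_series:
                return False              # reject new series: cardinality quota hit
            self.known_series.add(sid)
        if self.points_this_second >= self.max_points_per_sec:
            return False                  # rate limited; caller should retry or drop
        self.points_this_second += 1
        return True

    def tick(self) -> None:
        """Call once per second to reset the rate window."""
        self.points_this_second = 0
```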
Real World Example
Picture a large streaming platform with millions of devices. Each player reports startup time, rebuffer count, and bitrate every few seconds. Ingest comes through a gateway that batches by device and region and writes to the time series cluster with write-ahead logging. The cluster partitions by region and day. Schedulers create one-minute and five-minute rollups once the watermark passes each window. Operations watch recent raw data for incident response, while product analytics pulls one-hour rollups to study long-term quality trends. When Black Friday traffic hits, compaction is throttled and quotas prevent any one service from exploding tag cardinality. Queries stay fast because the planner reads one-hour rollups for long ranges and only dips into raw data for the most recent hour.
Common Pitfalls and Trade-offs
- Cardinality explosion: Putting user ID, request ID, or URL in tags creates billions of series and blows up memory. Use fields, or sample to a separate analytics system.
- Unaligned windows: If rollups are not start-aligned, the same point may land in different buckets and sums will drift. Always align to absolute time boundaries.
- Late data pollution: Aggregating before the watermark leads to double counting once late points arrive. Wait for the watermark, then publish complete windows.
- Write amplification: Excessive small flushes and frequent rewrites waste I/O and storage. Tune flush size and compaction to keep segments large.
- Query fan out: Loose tag filters can fan out to millions of series and create slow scans. Keep an inverted index and add limit guards in the planner.
- One-size retention: Treating all metrics the same wastes money. Use per-measurement policies so high-value metrics live longer and low-value metrics expire quickly.
Interview Tip
A favorite prompt is: how will your design handle out-of-order writes while keeping rollups accurate? A strong answer mentions a bounded late window, a watermark, a reorder buffer, and delta blocks, plus idempotent rollup jobs that publish only after the watermark passes. If you also add per-tenant cardinality limits and partition-level drops for retention, you will stand out.
Key Takeaways
- Ingest is a pipeline: log, then buffer, then flush to immutable, compressed segments.
- Downsampling is tiered, with aligned windows and a watermark for correctness.
- Retention is cheap if you partition by time and drop whole partitions.
- Indexes and cardinality control protect memory and keep queries predictable.
- The query planner chooses the coarsest safe resolution and pushes down aggregations.
Comparison Table
| Store Type | Best For | Write Pattern | Typical Queries | Strengths | Limitations |
|---|---|---|---|---|---|
| Purpose-built Time Series DB | Metrics, events, sensors with strict retention and rollups | High-rate small points, append-only with bounded late arrival | Group by time and tags, rates, percentiles, top N | Fast ingest, compressed storage, time-aware query planner | Limited joins, not ideal for complex ad hoc analytics |
| Columnar Warehouse | Ad hoc analytics across large historical windows | Batch loads or micro-batches | Heavy aggregations, joins, multi-fact analysis | Vectorized scans, rich SQL support | Costly for continuous high-rate ingest and real-time freshness |
| Log Analytics Store | Text logs and traces with search | Append to shards by time | Full-text search, filters, error tracking | Great for unstructured events | Numeric aggregations slower and more costly |
| Row Store Relational DB | Small metric volumes or strict transactional use cases | Row inserts with indexes | Simple aggregates and dashboards | Easy operations, strong consistency | Poor compression, limited scale for high write rates |
FAQs
Q1 How do I pick the right shard key for a time series DB?
Hash the measurement and a small set of low-cardinality tags such as service and region. Include time only for partitioning, not for the shard hash.
Q2 What is a safe window for late data?
Start with one to five minutes for infrastructure metrics and longer for mobile or edge clients. Expose a watermark and tune it from real arrival distributions.
Q3 Should I compute rates at write time or query time?
Compute raw counters at ingest and calculate rate or delta at query or rollup time. This avoids precomputing the wrong window and keeps storage simpler.
Q4 How do I prevent cardinality explosions?
Set per tenant limits on new series per minute. Validate labels at ingest and drop or rewrite points that carry forbidden tag keys like user id.
Q5 What encodings work best for time series?
Delta-of-deltas for timestamps, Gorilla-style bit packing for floats, run-length encoding for repeated values, and dictionary encoding for tags. Combine them during compaction.
Q6 How should I plan capacity for ingest?
Estimate points per second times bytes per point after compression. Add headroom for compaction and rollups. Watch write amplification and compaction lag in dashboards.
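A back-of-the-envelope example with assumed numbers (one million points per second, roughly two bytes per point after compression):

```python
points_per_sec = 1_000_000
bytes_per_point = 2                    # assumed post-compression size per point
raw_retention_days = 7
headroom = 1.5                         # assumed overhead for compaction and rollups

raw_bytes = points_per_sec * bytes_per_point * 86_400 * raw_retention_days * headroom
print(f"{raw_bytes / 1e12:.1f} TB for the raw tier")   # ~1.8 TB
```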
Further Learning
Level up your thinking with targeted practice in our course Grokking the System Design Interview, where you work through trade-off-driven designs end to end.
If you want deeper implementation patterns for large scale pipelines explore Grokking Scalable Systems for Interviews and build confidence with production grade examples.