How would you build a metrics store (Prometheus‑like) with remote storage?
A metrics store that feels like Prometheus but keeps data in remote storage gives you the best of both worlds. You keep the simple pull based scrape model and familiar query language while unlocking long retention, elastic scale, and team level isolation. Think of it as three clear paths that work together: an ingestion path that accepts time series samples, a storage path that persists and compacts them in a cheap durable system, and a query path that fans out, merges, and downsamples results for fast dashboards.
Why It Matters
Prometheus on a single node is perfect for cluster level monitoring, but it struggles with long retention, multi tenant isolation, and large query fanout. A remote storage backed design solves those limits.
- Long retention without local disks filling up
- Horizontal ingestion capacity for bursty services
- Multi region durability and simple disaster recovery
- Centralized observability across many teams while keeping per tenant quotas
- Lower total cost by using object storage and compute that can scale independently
How It Works Step by Step
Below is a concrete path you can implement. It mirrors how mature metric systems work while staying compatible with Prometheus scrape and query patterns.
1. Data model and naming
- Each time series is a metric name with a set of labels, for example http_requests_total with labels job, instance, method, and status.
- Store samples as time, value, and series id.
- Enforce cardinality budgets per tenant: set limits on unique series, labels per series, and samples per second.
- Prefer counters and histograms for aggregation friendly metrics; gauges are fine for point in time values.
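Here is a minimal Go sketch of this data model. The `Series`, `Sample`, and `Fingerprint` names are illustrative, not taken from any particular codebase; the fingerprint is the series id mentioned above, a stable hash of the metric name plus the label pairs in sorted order.

```go
package metrics

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Sample is one observation of a series: a millisecond timestamp and a value.
type Sample struct {
	TimestampMs int64
	Value       float64
}

// Series identifies a metric by name plus a label set,
// e.g. http_requests_total{job="api", method="GET", status="200"}.
type Series struct {
	Name   string
	Labels map[string]string
}

// Fingerprint computes a stable 64-bit series id by hashing the metric name
// and the label pairs in sorted order, so the same label set always maps
// to the same id regardless of map iteration order.
func (s Series) Fingerprint() uint64 {
	keys := make([]string, 0, len(s.Labels))
	for k := range s.Labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := fnv.New64a()
	h.Write([]byte(s.Name))
	for _, k := range keys {
		fmt.Fprintf(h, "\x00%s\x01%s", k, s.Labels[k])
	}
	return h.Sum64()
}
```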
2. Ingestion from Prometheus via remote write
- Prometheus scrapes targets and batches samples into remote write requests.
- A stateless distributor receives requests, authenticates the tenant, validates label rules, and assigns each series to a partition using a consistent hash of labels.
- Before acknowledging, the distributor appends to a write ahead log so back pressure or restarts do not lose samples.
- Per partition queues buffer and retry with exponential backoff. Keep writes idempotent by using sequence numbers or a deduplication key of series id and timestamp.
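A simplified sketch of how a distributor might route a series fingerprint to a partition with a consistent hash. Production rings add virtual nodes and a replication factor; this version assumes one token per ingester just to show the lookup.

```go
package distributor

import "sort"

// Ring is a simplified consistent-hash ring: each ingester owns one token.
// Real systems add virtual nodes and replication; this sketch only shows
// the core idea of routing a series fingerprint to an owner.
type Ring struct {
	tokens []uint64          // sorted tokens
	owners map[uint64]string // token -> ingester address
}

func NewRing(ingesters map[string]uint64) *Ring {
	r := &Ring{owners: map[uint64]string{}}
	for addr, token := range ingesters {
		r.tokens = append(r.tokens, token)
		r.owners[token] = addr
	}
	sort.Slice(r.tokens, func(i, j int) bool { return r.tokens[i] < r.tokens[j] })
	return r
}

// Lookup returns the ingester whose token is the first one at or after the
// series fingerprint, wrapping around to the smallest token at the end.
func (r *Ring) Lookup(fingerprint uint64) string {
	i := sort.Search(len(r.tokens), func(i int) bool { return r.tokens[i] >= fingerprint })
	if i == len(r.tokens) {
		i = 0
	}
	return r.owners[r.tokens[i]]
}
```

Because the assignment depends only on the fingerprint and the ring, retries from the write ahead log land on the same partition, which keeps the deduplication key effective.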
3. Short term in memory buffering and chunk building
- Ingesters keep recent samples for a series in memory and build compressed chunks. Common encodings include delta of delta for timestamps and XOR based float encoding.
- When a chunk is sealed by size or time, the ingester writes it to object storage and updates the index store.
- Seal policy matters: longer windows shrink index writes but can increase memory pressure and recovery time.
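To make the chunk encoding concrete, here is a simplified delta of delta encoder for timestamps. Production formats bit pack the deltas and XOR encode the float values; this sketch keeps plain integers so the idea stays visible.

```go
package chunk

// EncodeTimestamps stores the first timestamp, the first delta, and then only
// the change in delta for each subsequent sample. With a regular scrape
// interval most delta-of-deltas are zero, which is what makes the real
// bit-packed encodings so compact.
func EncodeTimestamps(ts []int64) (first int64, firstDelta int64, dods []int64) {
	if len(ts) == 0 {
		return 0, 0, nil
	}
	first = ts[0]
	if len(ts) == 1 {
		return first, 0, nil
	}
	firstDelta = ts[1] - ts[0]
	prevDelta := firstDelta
	for i := 2; i < len(ts); i++ {
		delta := ts[i] - ts[i-1]
		dods = append(dods, delta-prevDelta)
		prevDelta = delta
	}
	return first, firstDelta, dods
}

// DecodeTimestamps reverses EncodeTimestamps for n original samples.
func DecodeTimestamps(first, firstDelta int64, dods []int64, n int) []int64 {
	if n == 0 {
		return nil
	}
	out := []int64{first}
	if n == 1 {
		return out
	}
	delta := firstDelta
	out = append(out, first+delta)
	for _, dod := range dods {
		delta += dod
		out = append(out, out[len(out)-1]+delta)
	}
	return out
}
```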
4. Index design for fast label queries
- Use an inverted index that maps label name and value to series ids.
- Store the index in a key value database that scales reads and writes independently from object storage. Options include column family stores and cloud scale key value services.
- Keep a small hot cache: label postings lists, frequent series metadata, and the last value for staleness checks.
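A toy in memory version of that inverted index. A real deployment keeps the postings sorted and compressed in the key value store, but the lookup shape, intersecting postings lists per matcher, is the same.

```go
package index

// Inverted maps "name=value" label pairs to the set of series ids that carry
// that pair. This in-memory sketch only shows the lookup shape.
type Inverted struct {
	postings map[string]map[uint64]struct{}
}

func New() *Inverted {
	return &Inverted{postings: map[string]map[uint64]struct{}{}}
}

// Add registers a series under each of its label pairs.
func (ix *Inverted) Add(seriesID uint64, labels map[string]string) {
	for name, value := range labels {
		key := name + "=" + value
		if ix.postings[key] == nil {
			ix.postings[key] = map[uint64]struct{}{}
		}
		ix.postings[key][seriesID] = struct{}{}
	}
}

// Select intersects the postings lists for all requested equality matchers,
// e.g. {job="api", status="500"}, and returns the matching series ids.
func (ix *Inverted) Select(matchers map[string]string) []uint64 {
	var result map[uint64]struct{}
	for name, value := range matchers {
		p := ix.postings[name+"="+value]
		if p == nil {
			return nil // one matcher has no series, so the intersection is empty
		}
		if result == nil {
			result = map[uint64]struct{}{}
			for id := range p {
				result[id] = struct{}{}
			}
			continue
		}
		for id := range result {
			if _, ok := p[id]; !ok {
				delete(result, id)
			}
		}
	}
	ids := make([]uint64, 0, len(result))
	for id := range result {
		ids = append(ids, id)
	}
	return ids
}
```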
5. Compaction and retention
- A background compactor merges many small chunks into larger blocks, reorders samples, removes duplicates, and rewrites index entries.
- Apply retention rules, for example keep raw ten second resolution for seven days, one minute rollups for ninety days, and one hour rollups for one year.
- Move older blocks to colder storage classes to save cost.
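One way to express those retention tiers in code. The tier values mirror the example in the text; the type and function names are illustrative.

```go
package retention

import "time"

// Tier pairs a resolution with how long data at that resolution is kept.
type Tier struct {
	Resolution time.Duration
	KeepFor    time.Duration
}

// DefaultTiers matches the example policy: raw 10s samples for 7 days,
// 1-minute rollups for 90 days, 1-hour rollups for 1 year.
var DefaultTiers = []Tier{
	{Resolution: 10 * time.Second, KeepFor: 7 * 24 * time.Hour},
	{Resolution: time.Minute, KeepFor: 90 * 24 * time.Hour},
	{Resolution: time.Hour, KeepFor: 365 * 24 * time.Hour},
}

// FinestRetained returns the finest tier that still covers a block whose data
// is `age` old, or false if the block is past every tier and can be deleted
// or moved to the coldest storage class. Tiers must be ordered finest first.
func FinestRetained(tiers []Tier, age time.Duration) (Tier, bool) {
	for _, t := range tiers {
		if age <= t.KeepFor {
			return t, true
		}
	}
	return Tier{}, false
}
```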
6. Downsampling and rollups
- Precompute rollups during compaction: sum, min, max, and average for gauges, and proper rate and increase for counters.
- For histograms, pre merge buckets so long range percentiles can read fewer blocks.
- The query planner chooses downsampled data when ranges are large and step is coarse.
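A small sketch contrasting gauge and counter rollups. It shows why counters need increase (or rate) computed before any aggregation; the reset handling is a simplification.

```go
package rollup

// Sample is a (millisecond timestamp, value) pair within one series.
type Sample struct {
	TimestampMs int64
	Value       float64
}

// GaugeRollup summarizes a window of gauge samples with min, max, and average,
// which is safe because each gauge sample is a point-in-time value.
func GaugeRollup(window []Sample) (min, max, avg float64) {
	if len(window) == 0 {
		return 0, 0, 0
	}
	min, max = window[0].Value, window[0].Value
	sum := 0.0
	for _, s := range window {
		if s.Value < min {
			min = s.Value
		}
		if s.Value > max {
			max = s.Value
		}
		sum += s.Value
	}
	return min, max, sum / float64(len(window))
}

// CounterIncrease returns how much a monotonically increasing counter grew
// over the window, treating a drop to a lower value as a counter reset.
// This is the quantity worth storing in rollups; averaging raw counter
// values would be meaningless.
func CounterIncrease(window []Sample) float64 {
	if len(window) < 2 {
		return 0
	}
	increase := 0.0
	prev := window[0].Value
	for _, s := range window[1:] {
		if s.Value >= prev {
			increase += s.Value - prev
		} else {
			increase += s.Value // counter reset: count from zero
		}
		prev = s.Value
	}
	return increase
}
```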
7. Query path with remote read
- A stateless gateway accepts PromQL queries. It parses, plans, and splits the request into subranges and label filters.
- A query coordinator fans out to storage nodes using the inverted index to find series quickly.
- Results stream back as typed vectors, then the gateway applies PromQL operators, joins, binary math, and aggregations.
- Response caching stores final results and partial range results keyed by query plus time window.
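A sketch of the time slicing step the planner performs so sub ranges can be executed in parallel and cached independently; the alignment logic here is a simplification.

```go
package query

import "time"

// Slice is one sub-range of a larger query, suitable for independent
// execution and caching keyed by (query, slice).
type Slice struct {
	Start, End time.Time
}

// SplitByInterval cuts [start, end) into slices aligned to interval
// boundaries (for example one day), so repeated dashboard loads hit the
// same cache keys for every slice except the most recent one.
func SplitByInterval(start, end time.Time, interval time.Duration) []Slice {
	var out []Slice
	cur := start
	for cur.Before(end) {
		next := cur.Truncate(interval).Add(interval)
		if next.After(end) {
			next = end
		}
		out = append(out, Slice{Start: cur, End: next})
		cur = next
	}
	return out
}
```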
8. High availability and duplicate handling
- Run Prometheus in an HA pair with different external labels such as replica A and replica B.
- Ingestion deduplicates by series fingerprint and timestamp, preferring one replica if values collide within a small window.
- Replicate chunks across zones asynchronously and verify with periodic scans.
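A read time dedup sketch for a single series coming from an HA pair. The `windowMs` tolerance and the replica preference are policy assumptions, not fixed rules.

```go
package dedupe

// Sample carries a timestamp, value, and which HA replica produced it
// (taken from the replica's external label).
type Sample struct {
	TimestampMs int64
	Value       float64
	Replica     string
}

// Merge deduplicates samples for one series coming from an HA pair.
// A sample whose timestamp falls within windowMs of an already accepted
// sample is treated as a duplicate; when both replicas collide, the sample
// from the preferred replica wins. Input must be sorted by timestamp.
func Merge(sorted []Sample, windowMs int64, preferred string) []Sample {
	var out []Sample
	for _, s := range sorted {
		if len(out) == 0 || s.TimestampMs-out[len(out)-1].TimestampMs > windowMs {
			out = append(out, s)
			continue
		}
		// Duplicate within the window: keep the preferred replica's sample.
		if out[len(out)-1].Replica != preferred && s.Replica == preferred {
			out[len(out)-1] = s
		}
	}
	return out
}
```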
9. Multitenancy and limits
- Partition by tenant id up front and propagate it through every pipeline stage.
- Enforce budgets: samples per second, active series, max label names, query time range, and query complexity.
- Offer soft limits with admission control and hard limits with request rejection and clear errors.
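A minimal admission check along these lines. The limit fields and the in memory `State` are illustrative; real systems track usage across distributors or approximate it per instance.

```go
package limits

import "fmt"

// TenantLimits holds the budgets enforced at ingestion time.
type TenantLimits struct {
	MaxActiveSeries  int
	MaxSamplesPerSec int
	MaxLabelNames    int
}

// State tracks current usage for one tenant.
type State struct {
	ActiveSeries  int
	SamplesPerSec int
}

// Admit decides whether a write adding newSeries new series, samples samples,
// and labelCount labels per series may proceed. Hard limits reject with a
// clear error; callers can layer soft limits (throttling, alerts) on top.
func Admit(lim TenantLimits, st State, newSeries, samples, labelCount int) error {
	if labelCount > lim.MaxLabelNames {
		return fmt.Errorf("series has %d labels, limit is %d", labelCount, lim.MaxLabelNames)
	}
	if st.ActiveSeries+newSeries > lim.MaxActiveSeries {
		return fmt.Errorf("active series limit %d exceeded", lim.MaxActiveSeries)
	}
	if st.SamplesPerSec+samples > lim.MaxSamplesPerSec {
		return fmt.Errorf("samples per second limit %d exceeded", lim.MaxSamplesPerSec)
	}
	return nil
}
```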
10. Backfill and repair
- Provide a backfill API that accepts historical blocks produced by a tool. Useful for import from another system or manual fixes.
- A repair job scans for missing or corrupt blocks, rebuilds indexes, and replays the write ahead log when needed.
11. Security and governance
- TLS in transit and encryption at rest.
- Per tenant API tokens, optional mTLS for Prometheus servers.
- Audit logs for configuration and data access.
Real World Example
Consider a platform team that runs a hundred Kubernetes clusters. Each cluster runs two Prometheus servers in HA mode that scrape the control plane and workloads. Remote write sends samples to a global ingestion layer. Distributors hash each series by its labels so load spreads evenly across ingesters. Ingesters form chunks over ten minute windows, seal them, and upload to object storage. A column family store holds the label index.
Grafana users send PromQL queries to the gateway. For a three hour dashboard, the planner hits raw chunks. For a six month report, it uses the one hour rollups. When a zone fails, the duplicate chunks in another zone keep queries healthy. Retention rules move older blocks to a colder storage class without changing the query API.
Common Pitfalls or Trade offs
- Exploding cardinality from unbounded labels like user id or request id: add rules to drop or hash such labels at ingestion.
- Duplicate samples from HA pairs that are not deduplicated: always tag replicas and dedupe by timestamp.
- Query fanout across too many shards: plan queries in time slices and add caches for postings and results.
- Wrong downsampling for counters: never average raw counter values; compute rate or increase first, then aggregate.
- Histograms with too many buckets: use bucket strategies that match expected ranges and quantiles.
- Oversized chunks that delay visibility or create large recovery windows: tune seal size and time based on write rate.
- Missing backpressure: ingestion should shed load gracefully instead of crashing.
Interview Tip
A favorite prompt: you operate Prometheus in two replicas and remote write into a central store; walk me through how you avoid duplicate samples and still stay available if one side is lagging. A crisp answer mentions external labels on each replica, a dedupe window during compaction or at read time, idempotent write keys, and a query layer that tolerates partial results with a clear stale marker.
Key Takeaways
- Keep the Prometheus pull model and PromQL while moving storage to a durable remote system.
- Separate compute roles (distributor, ingester, compactor, and query gateway) so each scales on its own.
- Use object storage for chunks and a key value index for labels; add caches to keep latency low.
- Control cardinality early and enforce budgets per tenant.
- Downsample for long range queries and deduplicate HA streams by fingerprint and timestamp.
Comparison Table
| Option | Scale and durability | Retention and cost | Query latency | Operational complexity | Best fit |
|---|---|---|---|---|---|
| Single node Prometheus TSDB | Limited to one server and disk | Days to weeks unless you grow disks | Very fast for local data | Simple to run | Small teams and cluster local dashboards |
| Prometheus with remote storage as described | Horizontal ingestion and multi region durability | Cheap long term storage with tiering | Low for recent data, moderate for very long ranges with rollups | Moderate due to extra components | Org wide metrics, mixed dashboards and audits |
| Fully managed clustered metrics database | Elastic at very large scale | Very long with automated tiering | Usually low with heavy caching and optimized query engines | Low for users, high for the provider | Very large enterprises and global platforms |
FAQs
Q1. Should I use pull or push for ingestion?
Pull keeps discovery simple and gives you backpressure by slowing scrape frequency. Use push only for components that cannot be scraped or for edge networks that cannot accept inbound connections.
Q2. How do I prevent label cardinality from exploding?
Define allowed label sets per metric, drop or hash risky labels, enforce per tenant budgets, and run a daily report that lists the top new series so you can fix sources quickly.
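A small sketch of that drop or hash step at ingestion; the label names and the bucket scheme are only examples.

```go
package relabel

import (
	"fmt"
	"hash/fnv"
)

// Policy lists which labels to drop entirely and which to hash into a small
// number of buckets, so dashboards can still group roughly without creating
// one series per user or request.
type Policy struct {
	Drop        map[string]bool
	HashBuckets map[string]int // label name -> bucket count
}

// Apply returns a copy of the label set with risky labels removed or replaced
// by a bucket value like "bucket-7".
func Apply(p Policy, labels map[string]string) map[string]string {
	out := make(map[string]string, len(labels))
	for name, value := range labels {
		if p.Drop[name] {
			continue
		}
		if buckets, ok := p.HashBuckets[name]; ok && buckets > 0 {
			h := fnv.New32a()
			h.Write([]byte(value))
			out[name] = fmt.Sprintf("bucket-%d", h.Sum32()%uint32(buckets))
			continue
		}
		out[name] = value
	}
	return out
}
```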
Q3. What is the difference between remote read and federation?
Remote read makes the query engine fetch series from a remote store while keeping the same PromQL surface. Federation pulls and aggregates pre evaluated metrics from other Prometheus servers and stores the results locally.
Q4. How do I deduplicate HA samples safely?
Tag each replica with a stable external label, set a small dedupe window, and keep idempotent keys based on series fingerprint and timestamp. Prefer server side dedupe during compaction so queries stay fast.
Q5. How do I choose chunk size and seal policy?
Balance write amplification and query latency: larger chunks reduce index updates but increase memory and recovery time, while many small chunks lower memory use but slow long range scans. Start with a target duration like ten minutes and adjust with real traffic.
Q6. How do I make long range queries fast?
Precompute downsampled rollups, cache postings and results, plan in time slices, and read from the nearest region with replicas in other regions as fallbacks.
Further Learning
Build a solid mental model of queues, caches, and storage patterns in our friendly primer. Explore the foundations in Grokking System Design Fundamentals. Ready to practice end to end design tradeoffs for real interview prompts? Tackle hands on case studies in Grokking the System Design Interview.