How do you capture high‑cardinality metrics without exploding cost?
High-cardinality metrics capture signals with many unique label combinations, such as user ID or device ID. They are invaluable for root-cause analysis, but they can overwhelm a time-series backend and your budget. The goal is not to drop visibility but to collect the right detail in the right place, with storage that matches query needs and cost constraints.
Why It Matters
In interviews and in production you will be judged on how well you balance insight with efficiency. Raw metrics that include every user or every request can create millions of series, which increases memory pressure, index size, and query latency. On-call teams need a design that keeps p95 and p99 visibility while avoiding runaway cost. Showing you can control cardinality while preserving debuggability is a strong system design interview signal.
How It Works Step by Step
Step 1. Set a cardinality budget. Decide the maximum number of series you will allow per service, per region, and per team, and treat it like an SLO. For example, a budget of 100,000 active series per service keeps head memory predictable. Exceeding the budget triggers the guardrails in the next steps.
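A minimal sketch of what a budget check could look like inside a collection agent. The 100,000-series limit, the `series_key` helper, and the return-value convention are illustrative assumptions, not part of any particular backend.

```python
# Sketch: track active series per service and flag budget overruns.
from collections import defaultdict

SERIES_BUDGET_PER_SERVICE = 100_000  # assumed budget, treated like an SLO

active_series: dict[str, set[str]] = defaultdict(set)

def series_key(metric_name: str, labels: dict[str, str]) -> str:
    """A series is one metric name plus one unique label combination."""
    return metric_name + "|" + ",".join(f"{k}={v}" for k, v in sorted(labels.items()))

def record_series(service: str, metric_name: str, labels: dict[str, str]) -> bool:
    """Return False once the service is over budget so callers can apply guardrails."""
    active_series[service].add(series_key(metric_name, labels))
    if len(active_series[service]) > SERIES_BUDGET_PER_SERVICE:
        # Over budget: apply the guardrails from the later steps
        # (drop high-entropy labels, audit the emitter) instead of ingesting blindly.
        return False
    return True
```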
Step 2. Allowlist dimensions that matter Pick a small set of labels that explain most incidents such as region az instance class api endpoint experiment group. Block or rename high entropy labels such as request id user id and build sha. When product needs per user analysis route those events to a log or trace store instead of inflating the metrics space.
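A small sketch of an allowlist filter, assuming the label names shown here; the right set is service-specific.

```python
# Sketch: keep only allowlisted labels and push identity-like labels out of metrics.
ALLOWED_LABELS = {"region", "az", "instance_class", "endpoint", "experiment_group"}
BANNED_LABELS = {"request_id", "user_id", "build_sha"}

def filter_labels(labels: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Split labels into a metric-safe set and a high-entropy remainder.

    The remainder should travel with the request's log line or trace span,
    not with the metric.
    """
    safe = {k: v for k, v in labels.items() if k in ALLOWED_LABELS}
    rejected = {k: v for k, v in labels.items() if k not in ALLOWED_LABELS or k in BANNED_LABELS}
    return safe, rejected
```

In practice the rejected labels would be attached to the structured log or trace for that request, so per-user analysis stays possible without creating new series.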
Step 3. Pre-aggregate at the edge. Use a sidecar or client library that batches and aggregates counters, timers, and gauges before sending them to the metrics backend. Pre-aggregation turns many granular updates into a single data point per minute per label set, which shrinks writes and index churn.
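A minimal sketch of an in-process aggregator; the one-minute flush interval and the `send` callback are assumptions standing in for whatever ships batches to your backend.

```python
# Sketch: aggregate counter increments in memory and flush once per minute,
# so many updates become one data point per label set.
import time
from collections import defaultdict

class EdgeAggregator:
    def __init__(self, flush_interval_s: int = 60):
        self.flush_interval_s = flush_interval_s
        self.counters: dict[tuple, float] = defaultdict(float)
        self.last_flush = time.monotonic()

    def incr(self, metric: str, labels: dict[str, str], value: float = 1.0) -> None:
        key = (metric, tuple(sorted(labels.items())))
        self.counters[key] += value

    def maybe_flush(self, send) -> None:
        """Call periodically; `send` is whatever ships batches to the backend."""
        if time.monotonic() - self.last_flush < self.flush_interval_s:
            return
        batch = [(metric, dict(labels), total) for (metric, labels), total in self.counters.items()]
        send(batch)  # one write per label set per interval
        self.counters.clear()
        self.last_flush = time.monotonic()
```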
Step 4. Use histograms and sketches for latency and size. Quantiles computed from raw series are costly. Prefer HDR Histogram or DDSketch to encode latency distributions within tight memory bounds. One sketch per allowlisted dimension gives p50, p90, p99, and max without creating a new series per request. For object sizes, request bytes, or payload samples, store histograms rather than raw values.
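Production code would normally use a maintained HDR Histogram or DDSketch library; the hand-rolled, log-bucketed sketch below is only meant to show why memory stays bounded: values land in relative-error buckets, and quantiles are read back from bucket counts instead of raw samples.

```python
# Sketch: a simplified DDSketch-style histogram with ~1% relative-error buckets.
import math

class LatencySketch:
    def __init__(self, relative_accuracy: float = 0.01):
        self.gamma = (1 + relative_accuracy) / (1 - relative_accuracy)
        self.buckets: dict[int, int] = {}
        self.count = 0

    def add(self, value_ms: float) -> None:
        if value_ms <= 0:
            return
        idx = math.ceil(math.log(value_ms, self.gamma))  # bucket index grows logarithmically
        self.buckets[idx] = self.buckets.get(idx, 0) + 1
        self.count += 1

    def quantile(self, q: float) -> float:
        """Return an approximate quantile, e.g. q=0.99 for p99."""
        target = q * (self.count - 1)
        seen = 0
        for idx in sorted(self.buckets):
            seen += self.buckets[idx]
            if seen > target:
                return self.gamma ** idx  # upper bound of the bucket
        return 0.0

# One sketch per allowlisted dimension, e.g. (region, endpoint),
# instead of one series per request or per user.
```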
Step 5. Apply dynamic sampling for spiky or unbounded keys. If you must observe keys with an unbounded space, such as user or tenant, apply head sampling during normal operation and raise the sampling rate during error spikes. Tail-based sampling is great for traces; for metrics you can approximate it by gating on error rate so that failing requests are always recorded in more detail.
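A sketch of an error-aware sampling decision. The base rate, elevated rate, and error-rate threshold are illustrative assumptions to be tuned per service.

```python
# Sketch: sample a small fraction of healthy traffic but always keep failures,
# and raise the rate when the recent error rate crosses a threshold.
import random

BASE_RATE = 0.01              # assumed: 1% of healthy requests
ELEVATED_RATE = 0.25          # assumed: 25% while the service looks unhealthy
ERROR_RATE_THRESHOLD = 0.05   # assumed trigger for elevated sampling

def should_record_detail(is_error: bool, recent_error_rate: float) -> bool:
    if is_error:
        return True  # failing requests always carry full detail
    rate = ELEVATED_RATE if recent_error_rate > ERROR_RATE_THRESHOLD else BASE_RATE
    return random.random() < rate
```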
Step 6. Detect heavy hitters instead of emitting per-key metrics. Use streaming heavy-hitter algorithms to track the top K users, endpoints, or queries in each window. Emit only the top keys as labeled metrics and export the full key list as a log blob to object storage for offline inspection. This preserves fast dashboards while keeping long-tail cost near zero.
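One common streaming heavy-hitter algorithm is Space-Saving. The sketch below tracks an approximate top K with a fixed number of counters, assuming the table is exported and reset at the end of each window.

```python
# Sketch: Space-Saving heavy hitters — at most `capacity` counters, no matter
# how many distinct keys (users, endpoints, queries) appear in the window.
class SpaceSaving:
    def __init__(self, capacity: int = 100):
        self.capacity = capacity
        self.counts: dict[str, int] = {}

    def offer(self, key: str, weight: int = 1) -> None:
        if key in self.counts:
            self.counts[key] += weight
        elif len(self.counts) < self.capacity:
            self.counts[key] = weight
        else:
            # Evict the current minimum and give its count to the newcomer;
            # this overestimates, never underestimates, the true frequency.
            min_key = min(self.counts, key=self.counts.get)
            self.counts[key] = self.counts.pop(min_key) + weight

    def top_k(self, k: int = 10) -> list[tuple[str, int]]:
        return sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Each window: emit top_k() as labeled metrics, dump the full counter table
# to object storage for offline inspection, then reset for the next window.
```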
Step 7. Apply tiered retention and downsampling. Keep detailed minute-level series for a short hot window such as seven days. Downsample to five-minute and one-hour rollups for the warm weeks and months that follow. For any dimension with very high cardinality, shorten its hot retention further while keeping service-level aggregates longer.
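A small sketch of a rollup step, assuming minute-resolution points and a five-minute target bucket; keeping min, max, sum, and count means averages and extremes survive downsampling.

```python
# Sketch: roll minute-resolution points up into coarser buckets.
from collections import defaultdict

def downsample(points: list[tuple[int, float]], bucket_s: int = 300) -> list[dict]:
    """`points` are (unix_timestamp, value) pairs at minute resolution."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_s].append(value)  # align to the bucket start
    return [
        {"ts": ts, "min": min(vs), "max": max(vs), "sum": sum(vs), "count": len(vs)}
        for ts, vs in sorted(buckets.items())
    ]
```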
Step 8. Separate paths for metrics, logs, and traces. Metrics serve fast numeric queries and alerting. Logs and traces carry the high-cardinality detail. Route that detail to a columnar or object-store-backed system and link from metric panels to the corresponding trace or log view. This keeps graphs fast and bills stable.
Step 9. Enforce guardrails in the write path. Introduce a label policy that rejects or rewrites metrics with banned keys and caps the number of series per metric name. Emit audit metrics when drops happen so developers can fix abusive emitters. Make the policy visible in documentation and code review templates.
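A sketch of a write-path policy; the banned keys, the per-metric cap, and the audit counter names are assumptions chosen for illustration.

```python
# Sketch: strip banned labels, cap series per metric name, and count drops
# so noisy emitters can be audited.
from collections import defaultdict

BANNED_KEYS = {"user_id", "request_id", "session_id", "build_sha"}
MAX_SERIES_PER_METRIC = 10_000  # assumed cap

series_per_metric: dict[str, set[frozenset]] = defaultdict(set)
audit_counters: dict[str, int] = defaultdict(int)

def admit(metric: str, labels: dict[str, str]) -> dict[str, str] | None:
    """Return the (possibly rewritten) labels to ingest, or None to drop."""
    cleaned = {k: v for k, v in labels.items() if k not in BANNED_KEYS}
    if len(cleaned) != len(labels):
        audit_counters[f"label_dropped:{metric}"] += 1
    key = frozenset(cleaned.items())
    known = series_per_metric[metric]
    if key not in known and len(known) >= MAX_SERIES_PER_METRIC:
        audit_counters[f"series_capped:{metric}"] += 1
        return None  # refuse brand-new series once the cap is reached
    known.add(key)
    return cleaned
```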
Step 10. Design for multi-tenant fairness. Allocate separate series budgets per tenant or team. If one tenant exceeds their share, throttle or drop only their high-cardinality labels rather than impacting everyone. This avoids noisy-neighbor problems inside the monitoring platform.
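A short sketch of that degrade-only-the-offender idea. The per-tenant budget and the set of labels considered safe to strip are assumptions.

```python
# Sketch: when a tenant is over budget, strip only the labels that multiply
# their series so core aggregates keep flowing for everyone.
DEGRADABLE_LABELS = {"endpoint", "experiment_group"}  # assumed high-fanout labels
TENANT_SERIES_BUDGET = 20_000                         # assumed per-tenant budget

def enforce_tenant_budget(tenant: str, labels: dict[str, str],
                          tenant_series_count: dict[str, int]) -> dict[str, str]:
    if tenant_series_count.get(tenant, 0) <= TENANT_SERIES_BUDGET:
        return labels
    # Noisy tenant: keep coarse labels only; other tenants are untouched.
    return {k: v for k, v in labels.items() if k not in DEGRADABLE_LABELS}
```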
Real World Example
Consider a large social feed service that previously labeled latency with user ID and post ID to track p99 outliers. This created tens of millions of series and frequent cardinality explosions during traffic spikes. The team introduced a plan:
- They set a series budget per service and added a label allowlist focused on region, endpoint, and experiment.
- They moved per user and per post detail to a trace store with aggressive tail sampling during errors.
- They replaced raw quantiles with timers backed by HDR Histogram.
- They added a streaming top K pipeline that emits the ten most expensive endpoints per region each minute as labeled metrics and sends a full key list to object storage.
- They changed retention to seven days for minute data and ninety days for hourly rollups.
Dashboards still show p50, p90, and p99 for every important dimension. When an outage hits, engineers click from a high-level panel into traces that carry user-level and post-level context. Ingest volume and memory footprint dropped by more than half without losing the signals that matter.
Common Pitfalls and Trade-offs
- Over-labeling. Attaching user ID, session ID, build ID, or random tokens to every metric gives an illusion of precision but wrecks performance and inflates cost. Use logs or traces for identities and keep metrics numeric and stable.
- Percentiles computed from raw series. Client-side percentiles per label can be inconsistent and heavy. Prefer server-side histograms or sketches that merge well and provide tight error bounds.
- No retention plan. Keeping minute resolution forever is rarely useful. Downsample and tier to align storage with query patterns.
- One tool for every job. Metrics, logs, and traces serve different access patterns. Pushing all detail into metrics is a common anti-pattern.
- Dropping everything under stress. Guardrails should be surgical. Reject only abusive labels and keep core service aggregates intact so alerting never goes dark.
Interview Tip
A favorite prompt is this: your payments service emits p99 latency with labels user ID and order ID, and your metrics bill tripled last month. What changes would you make in the next week to cut cost while preserving debuggability during incidents? A strong answer mentions allowlisting dimensions, switching to histogram-based timers, dynamic or error-aware sampling for spiky keys, and a clear retention and downsampling plan with pointers from metrics to logs or traces.
Key Takeaways
- Start with a written cardinality budget and enforce it in tooling.
- Keep metrics identity free and route identities to logs or traces.
- Use histograms or sketches for latency and sizes to avoid per request series.
- Detect heavy hitters and report only the top keys while exporting the long tail to cheap storage.
- Apply tiered retention and downsampling so cost matches query value.
Comparison Table
Below is a simple comparison of common strategies for high-cardinality signals.
| Strategy | Best for | Accuracy | Cost Control | Query Speed | Notes |
|---|---|---|---|---|---|
| Allowlisted metrics with counters, gauges, and histograms | Dashboards, alerting, SLO tracking | High for aggregates; high for percentiles with sketches | Strong with budgets and retention | Very fast | Keep labels stable and low entropy |
| Heavy hitter detection then emit top K | Finding worst users, endpoints, or tenants | High for top items | Excellent because long tail is not stored as series | Fast for dashboards | Export full list to cheap storage for deep dives |
| Dynamic and error-aware sampling | Spiky or unbounded keys | Approximate for normal traffic; exact for failures | Good because sampling ramps with need | Fast for aggregated views | Choose policies per service risk |
| Logs in columnar or object storage | Forensic analysis and ad hoc joins | Exact | Good if compressed and tiered | Slower than metrics | Link from metric panels to relevant log queries |
| Traces with tail-based sampling | Causal debugging across services | High for slow or failed requests | Good with sampling and retention tiers | Fast for sampled spans | Use traces for identity-rich context, not metrics |
FAQs
Q1. What is high cardinality in metrics and why is it hard?
High cardinality means a metric has many unique label combinations, such as user ID or item ID. It is hard because each unique combination becomes a separate series that consumes memory and storage and slows queries.
Q2. How can I keep p99 latency without storing per-user metrics?
Use histogram or sketch timers per safe dimension, such as region and endpoint. Link dashboards to traces, where you sample more aggressively during errors, so deep dives still see user context.
Q3. When should I prefer logs or traces over metrics for identity-heavy data?
Choose logs or traces when you need per-user or per-device analysis, or when keys are unbounded. Use metrics for fast aggregates, alerting, and SLOs.
Q4. Do approximate algorithms reduce accuracy too much for SLOs?
Sketches like HDR Histogram or DDSketch provide small, known error bounds and merge cleanly across nodes, which is ideal for SLO percentiles. They are accurate enough for alerting and capacity planning.
Q5. What guardrails stop a new deploy from exploding series count?
Add a server-side label policy that rejects banned keys, caps series per metric name, and emits audit metrics when drops happen. Pair this with code review checks and a published label allowlist.
Q6. How should retention be set for a cost-efficient monitoring stack?
Keep minute data for a short hot window, then downsample to five-minute and one-hour rollups for longer history. Shorten retention for any dimension that tends to explode and keep service-level aggregates longer.
Further Learning
Build strong foundations in our beginner-friendly course Grokking System Design Fundamentals, where you learn how to choose the right signal for the job and how to size storage with confidence.
For a deeper playbook on scaling observability inside large distributed systems, explore Grokking Scalable Systems for Interviews, which covers the budgets, retention, and trade-offs that come up in real interviews.
If you want practice applying these ideas to open-ended prompts, try the case studies in Grokking the System Design Interview and turn this material into crisp interview stories.