How would you design log ingestion & search (ELK/OpenSearch‑like)?
Log ingestion and search is the nervous system of a scalable platform. You collect high volume semi structured events from services and infrastructure, ship them reliably, index them for fast discovery, and retain them cost effectively. Think of it as a time ordered data pipeline that turns raw text into indexed facts you can filter, aggregate, and visualize during incidents and analytics. If you can explain this flow clearly in a system design interview, you signal strong knowledge of distributed systems and observability.
Why It Matters
- Faster incident response. Engineers need real time queries on recent errors, slow requests, and spikes across services and regions.
- Lower mean time to recovery. Rich filters and aggregates cut through noise to isolate the blast radius quickly.
- Cost and compliance controls. Hot, warm, and cold storage tiers plus redaction prevent runaway bills and data exposure.
- Product and security insights. The same pipeline supports feature telemetry, audit trails, access investigations, and anomaly detection.
- Interview relevance. This design touches ingestion pressure, backpressure, indexing strategy, retention policy, multi tenant isolation, and cost trade offs.
How It Works Step by Step
Step 1 Collect at the source
Install lightweight agents next to applications and nodes to capture stdout, files, syslog, and app spans that you choose to treat as logs. Normalize a minimal envelope across all sources such as timestamp, level, service, environment, region, host, request id, user id if allowed, and raw message. Emit structured JSON whenever possible so downstream parsing is cheap and reliable.
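A minimal emitter sketch for that envelope in Python; the field names and static values such as env, region, and host are illustrative, not a fixed standard:

```python
import json
import sys
import time
import uuid

def emit_log(level: str, service: str, message: str, **fields) -> None:
    """Write one structured JSON log line to stdout for an agent to pick up."""
    envelope = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "level": level,                       # e.g. DEBUG, INFO, WARN, ERROR
        "service": service,
        "env": "prod",                        # illustrative static enrichment
        "region": "us-east-1",
        "host": "web-42",
        "request_id": fields.pop("request_id", str(uuid.uuid4())),
        "message": message,
    }
    envelope.update(fields)                   # any extra structured context
    sys.stdout.write(json.dumps(envelope) + "\n")

emit_log("ERROR", "checkout", "payment authorization failed", order_id="A-1001")
```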
Step 2 Buffer and transport
Send events to a durable message bus to smooth bursts and protect against downstream outages. Kafka, Pulsar, or a managed stream serves as the write ahead log for the pipeline. Enable batching, compression, and retries with exponential backoff. Enforce backpressure so producers slow down instead of dropping data when the bus is full.
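A minimal producer sketch, assuming the kafka-python client and a topic named logs.raw (both illustrative). Batching, compression, and bounded blocking give you retries and backpressure rather than silent drops:

```python
import json
from kafka import KafkaProducer  # assumes kafka-python is installed

producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092", "kafka-2:9092"],
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
    acks="all",               # treat the bus as the durable write ahead log
    compression_type="gzip",  # smaller batches on the wire
    linger_ms=50,             # batch events for up to 50 ms before sending
    retries=5,                # retry transient broker errors
    retry_backoff_ms=200,     # back off between retries
    max_block_ms=60_000,      # block the producer (backpressure) when buffers are full
)

def ship(event: dict) -> None:
    # Raises KafkaTimeoutError after max_block_ms instead of dropping data silently.
    producer.send("logs.raw", value=event)
```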
Step 3 Ingest and transform
A fleet of stateless consumers reads from the bus and applies the following (a code sketch follows the list):
- Parsing and schema shaping. Prefer structured JSON. Use patterns for text lines.
- Enrichment. Add service taxonomy, geo attributes, build id, and Kubernetes metadata.
- PII redaction. Apply allow lists and tokenization before data leaves the secure boundary.
- Sampling. For very high volume services, sample debug level lines while keeping errors.
- Routing. Partition by time then tenant or index family to keep shards balanced.
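A stateless transform sketch tying those stages together; the redaction pattern, enrichment values, sampling rate, and routing scheme are illustrative assumptions:

```python
import json
import random
import re

TOKEN_PATTERN = re.compile(r"(?i)(bearer\s+)[a-z0-9._-]+")  # illustrative redaction rule

def transform(raw: bytes, tenant: str):
    event = json.loads(raw)                    # parse; send failures to a dead letter queue in practice

    # Enrichment: attach service taxonomy and build metadata (static here for illustration).
    event.setdefault("team", "payments")
    event.setdefault("build_id", "2024.06.1")

    # PII redaction before the event leaves the secure boundary.
    event["message"] = TOKEN_PATTERN.sub(r"\1[REDACTED]", event.get("message", ""))

    # Sampling: keep all errors, keep roughly 1% of debug lines.
    if event.get("level") == "DEBUG" and random.random() > 0.01:
        return None

    # Routing: time based index family, then tenant, to keep shards balanced.
    day = event["timestamp"][:10]
    return f"logs-{tenant}-{day}", event
```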
Step 4 Choose a storage strategy
You rarely use one store. Mix tiers to match query patterns; a sketch of the archive path follows the list.
- Inverted index store such as OpenSearch or Elasticsearch for fast keyword and nested field filters with aggregations.
- Columnar store such as ClickHouse or BigQuery for wide scans and heavy group by analytics with very high compression.
- Object storage as the long term archive holding raw or compacted parquet, with the ability to hydrate back into a hot tier for rare investigations.
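A sketch of the archive path, assuming pyarrow and boto3 are available and an acme-log-archive bucket exists (all illustrative):

```python
import boto3
import pyarrow as pa
import pyarrow.parquet as pq

def archive_batch(events: list, dataset: str, hour: str) -> None:
    """Write one hour of events as compressed Parquet and push it to object storage."""
    table = pa.Table.from_pylist(events)
    local_path = f"/tmp/{dataset}-{hour}.parquet"
    pq.write_table(table, local_path, compression="zstd")

    # A partitioned layout makes later hydration of a single day cheap.
    key = f"logs/{dataset}/dt={hour[:10]}/{hour}.parquet"
    boto3.client("s3").upload_file(local_path, "acme-log-archive", key)
```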
Step 5 Index and shard design
For an inverted index cluster (a template sketch follows the list):
- Time based indices. Daily or hourly per index family depending on volume.
- Shard sizing. Target shard sizes that keep heap usage healthy and file counts manageable. A common target is tens of gigabytes per shard after merge.
- Field mapping discipline. Only index fields you query. Use keyword for exact match and aggregations, text with an analyzer for full text search, and disable doc values on fields you never aggregate.
- Avoid unbounded cardinality in aggregations. Fields like user id or request id can explode memory during terms queries.
- Templates. Lock down dynamic fields and use strict templates to prevent field explosion.
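A strict index template sketch; field names and shard counts are illustrative, and the shape follows OpenSearch and Elasticsearch composable index templates, shown here as a Python dict:

```python
# PUT _index_template/logs-acme  (sent as JSON to the cluster)
logs_template = {
    "index_patterns": ["logs-acme-*"],
    "template": {
        "settings": {
            "number_of_shards": 4,          # sized so each shard lands in the tens of gigabytes
            "number_of_replicas": 1,
        },
        "mappings": {
            "dynamic": "strict",            # reject unknown fields instead of exploding mappings
            "properties": {
                "timestamp":  {"type": "date"},
                "level":      {"type": "keyword"},                      # exact match and aggregations
                "service":    {"type": "keyword"},
                "message":    {"type": "text"},                         # analyzed full text search
                "request_id": {"type": "keyword", "doc_values": False}, # filter only, never aggregate
            },
        },
    },
}
```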
Step 6 Lifecycle and retention
Apply an index lifecycle policy per dataset; an example policy follows the list.
- Hot. New data on fast storage with replication for search and aggregates.
- Warm. Older indices on cheaper storage with fewer replicas and merged segments.
- Cold. Searchable snapshot in object storage for rare queries.
- Frozen or archived. Not searchable by default. Restore on demand for incident reviews.
Automate rollover by size or age and snapshot continuously to object storage.
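A lifecycle policy sketch modeled on Elasticsearch-style ILM (OpenSearch ISM expresses the same idea as a state machine). The ages, sizes, and repository name are illustrative:

```python
# PUT _ilm/policy/logs-default  (sent as JSON to the cluster)
logs_lifecycle = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {"rollover": {"max_primary_shard_size": "50gb", "max_age": "1d"}},
            },
            "warm": {
                "min_age": "2d",
                "actions": {
                    "forcemerge": {"max_num_segments": 1},   # fewer segments, cheaper queries
                    "allocate": {"number_of_replicas": 1},   # drop extra replicas on warm nodes
                },
            },
            "cold": {
                "min_age": "7d",
                "actions": {"searchable_snapshot": {"snapshot_repository": "logs-archive"}},
            },
            "delete": {
                "min_age": "90d",
                "actions": {"delete": {}},
            },
        },
    },
}
```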
Step 7 Query path and performance
A search gateway receives queries, authenticates the user, and routes to the exact indices by time range, tenant, and dataset. A sample query body follows the list.
- Pre filter by time first. Use index naming that makes time routing trivial.
- Query patterns. Term filters, range queries, prefix matches, and aggregations like count, percentiles, and top terms.
- Pagination. Use search_after to avoid deep paging cost.
- Caching. Use query result cache for repeated dashboards and use field data cache carefully to avoid evictions.
- Guardrails. Set hard limits on time range, number of shards per query, and maximum buckets.
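A query sketch against a time routed index pattern; the field names and thresholds are illustrative, and the body is standard search DSL sent to the _search endpoint:

```python
# GET logs-acme-2024-06-*/_search  (index pattern pre-filters by tenant and time)
error_search = {
    "size": 100,
    "query": {
        "bool": {
            "filter": [
                {"term": {"service": "checkout"}},
                {"term": {"level": "ERROR"}},
                {"range": {"timestamp": {"gte": "now-15m", "lt": "now"}}},  # always bound by time
            ],
        },
    },
    "sort": [{"timestamp": "desc"}, {"_doc": "asc"}],   # stable sort needed for search_after
    "aggs": {
        "latency": {"percentiles": {"field": "duration_ms", "percents": [50, 95, 99]}},
    },
}
# Next page: copy the sort values of the last hit into "search_after"
# instead of using deep "from"/"size" offsets.
```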
Step 8 Multi tenant and security
Two common models (a gateway sketch follows the list):
- Index per tenant per time slice for strong isolation and simple deletion.
- Shared indices with tenant id as a filter plus row level security at the search layer.
Apply rate limits and quotas per tenant and per query, and audit all access.
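A gateway sketch for the shared-index model: it scopes the index pattern by time, injects the tenant filter regardless of what the caller sent, and applies a per tenant quota. The quota table and field names are illustrative:

```python
from datetime import date, timedelta

QUERIES_PER_MINUTE = {"acme": 60, "globex": 30}   # illustrative per tenant quotas

def scoped_search(tenant: str, user_query: dict, days: int, usage: dict) -> dict:
    if usage.get(tenant, 0) >= QUERIES_PER_MINUTE.get(tenant, 10):
        raise PermissionError("tenant query quota exceeded")
    if days > 7:
        raise ValueError("time range capped at 7 days on the shared cluster")

    # Route only to the indices the time range can touch.
    today = date.today()
    indices = ",".join(f"logs-shared-{today - timedelta(days=d):%Y-%m-%d}" for d in range(days))

    # Force the tenant filter onto the caller's query.
    body = {"query": {"bool": {"must": [user_query],
                               "filter": [{"term": {"tenant_id": tenant}}]}}}
    usage[tenant] = usage.get(tenant, 0) + 1
    return {"indices": indices, "body": body}   # audit log the call before executing it
```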
Step 9 Reliability and cost controls
- Dead letter queues for parse failures.
- Circuit breakers and admission control for expensive queries.
- Autoscaling on ingest and search tiers with queue depth and query latency as signals.
- Compression and down sampling for long retention. Store raw in object storage and only keep a compacted structured view hot.
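A reliability sketch combining a dead letter queue for parse failures with debug-first load shedding; the publish callable, topic names, and queue depth threshold are illustrative:

```python
import json

SHED_LOAD_QUEUE_DEPTH = 500_000   # illustrative threshold

def handle(raw: bytes, queue_depth: int, publish) -> None:
    """Process one raw event; publish(topic, payload) is whatever bus client you use."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        publish("logs.dlq", raw)              # keep the broken payload for later inspection
        return

    # Shed load gracefully: drop DEBUG first, never errors.
    if queue_depth > SHED_LOAD_QUEUE_DEPTH and event.get("level") == "DEBUG":
        return

    publish("logs.parsed", json.dumps(event).encode())
```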
Real World Example
Picture a company at the scale of a global video platform. Traffic peaks at five million events per second across regions. The pipeline ingests through a managed stream. A transform service parses JSON logs, adds region and service labels, redacts tokens, and routes to two destinations. Data from the most recent seven days lands in an inverted index cluster for fast interactive search used by on call engineers. The same events are batched into parquet hourly and written to object storage.
A daily job compacts into a columnar warehouse for product and security analytics. Dashboards mostly hit the hot cluster with strict limits on time window. When an incident requires a thirty day search, an operator hydrates one day of archived parquet back into the columnar store or spins up a temporary cluster that mounts a searchable snapshot.
Common Pitfalls or Trade offs
Unbounded cardinality fields
Using user id, session id, or request id as a terms aggregation can blow up memory and crash nodes. Cap the bucket count, use approximate top k, or aggregate on hashed buckets.
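A sketch of safer alternatives in the search DSL: an approximate cardinality aggregation (HyperLogLog based) for unique counts and a capped terms aggregation for top values. Field names are illustrative:

```python
safe_aggregations = {
    "size": 0,
    "query": {"range": {"timestamp": {"gte": "now-1h"}}},
    "aggs": {
        # Approximate distinct count instead of one bucket per user.
        "unique_users": {"cardinality": {"field": "user_id"}},
        # Bounded top-k instead of an unbounded terms aggregation.
        "top_services": {"terms": {"field": "service", "size": 20}},
    },
}
```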
Field explosion from dynamic mapping
When new fields appear freely, the cluster creates thousands of field entries and slows indexing and queries. Use strict templates and whitelist fields.
Over sharding or under sharding
Too many small shards overwhelm cluster metadata and too few large shards limit parallelism. Size shards based on bytes and query fan out.
Wide time range queries with heavy aggregations
Expensive scans across hundreds of shards cause timeouts. Enforce time range limits and pre aggregate where possible.
No backpressure on producers
Agents that keep retrying without flow control can add more load during an outage. Use bounded queues and drop debug level messages first if you must shed load.
Keeping everything hot
Hot storage for all history is the fastest way to overspend. Move to warm and cold tiers quickly and restore only when needed.
Weak multi tenant isolation
Shared indices without strict rate limits let one user cause cluster wide issues. Isolate and cap at the gateway.
Interview Tip
A favorite prompt: design a log platform that ingests one hundred terabytes per day with seven day hot retention and ninety day cold retention for a multi region service. Start with source agents and a durable bus. Propose a transform tier that enforces schema and redaction. Choose inverted index for hot search and a columnar or object store for longer analytics. Show shard and index sizing, lifecycle policy, and guardrails on query cost. End with tenant isolation, quotas, and on demand restore from archive.
Key Takeaways
- Treat the message bus as the write ahead log and the safety valve for bursts.
- Use structured JSON at the source to reduce parse errors and costs later.
- Separate hot interactive search from long term analytics to control cost.
- Design shards and indices by time and dataset and enforce strict field templates.
- Add guardrails for query cost and tenant isolation to protect shared clusters.
Table of Comparison
| Store | Best for | Query patterns | Compression and cost | Operational notes |
|---|---|---|---|---|
| Inverted index search cluster | Interactive search on recent data | Fast term filters, range on time, nested fields, aggregations with moderate cardinality | Moderate compression, higher hot storage cost | Shard and heap tuning required, strict templates, query guardrails |
| Columnar warehouse | Large scans and heavy group by analytics | Time range scans, joins with reference tables, robust SQL | High compression, low cost for cold and warm layers | Great for scheduled jobs and exploratory analysis, less ideal for sub second dashboards |
| Object storage archive | Very long retention and compliance | Restore a slice then query, or query via external table engines | Lowest cost per byte, excellent durability | Not interactive by default, requires hydrate on demand |
FAQs
Q1 What is the difference between logs, metrics, and traces?
Logs are verbose event records with rich context. Metrics are numeric time series for fast aggregates. Traces capture end to end request paths. A strong platform can correlate all three using shared ids.
Q2 How do I reduce the cost of hot storage without losing visibility?
Move older indices to warm and cold tiers quickly. Keep only the fields you search. Sample verbose logs while keeping all errors. Store raw events in object storage and hydrate rare days on demand.
Q3 How should I pick shard counts and index intervals?
Start with daily indices per dataset. Size shards to tens of gigabytes after merge. If a dataset is massive, switch to hourly indices. If it is small, combine multiple days into one index. Validate by measuring search fan out and heap use.
Q4 How do I handle schema drift?
Lock field templates and keep a versioned schema in your transform service. Add new fields through controlled changes. Avoid dynamic mapping and reject unknown fields at ingest.
Q5 What is the fastest way to make common dashboards snappy?
Pin time to recent windows, pre filter by dataset and tenant, cache frequent queries, and use search_after for pagination. Pre aggregate heavy metrics into a columnar store and show those in dashboards.
Q6 Should I send logs straight to the search cluster or through a bus?
Use a durable bus. It absorbs bursts, enables retries, and allows multiple consumers to enrich and route the same event to different stores.
Further Learning
Build a strong foundation with the practical lessons in Grokking System Design Fundamentals. For deeper scale patterns and storage trade offs, consider the project style syllabus in Grokking Scalable Systems for Interviews. If you want a full interview playbook with practice questions, explore Grokking the System Design Interview.