How would you design log ingestion & search (ELK/OpenSearch‑like)?

Log ingestion and search is the nervous system of a scalable platform. You collect high-volume, semi-structured events from services and infrastructure, ship them reliably, index them for fast discovery, and retain them cost-effectively. Think of it as a time-ordered data pipeline that turns raw text into indexed facts you can filter, aggregate, and visualize during incidents and analytics. If you can explain this flow clearly in a system design interview, you signal strong knowledge of distributed systems and observability.

Why It Matters

  • Faster incident response. Engineers need real time queries on recent errors, slow requests, and spikes across services and regions.
  • Lower mean time to recovery. Rich filters and aggregates cut through noise to isolate the blast radius quickly.
  • Cost and compliance controls. Hot/warm/cold storage tiers and redaction prevent runaway bills and data exposure.
  • Product and security insights. The same pipeline supports feature telemetry, audit trails, access investigations, and anomaly detection.
  • Interview relevance. This design touches ingestion pressure, backpressure, indexing strategy, retention policy, multi-tenant isolation, and cost trade-offs.

How It Works, Step by Step

Step 1 Collect at the source

Install lightweight agents next to applications and nodes to capture stdout, files, syslog, and any app spans you choose to treat as logs. Normalize a minimal envelope across all sources: timestamp, level, service, environment, region, host, request id, user id if allowed, and the raw message. Emit structured JSON whenever possible so downstream parsing is cheap and reliable.
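A minimal sketch of such an envelope, assuming JSON lines emitted from a Python application; the field names and the service label are illustrative, not a required schema.

```python
import json
import time
import uuid

def make_log_event(level: str, message: str, **context) -> str:
    """Build one structured log line with a minimal shared envelope.

    Every source emits the same small set of keys so downstream
    parsing stays cheap; extra context rides along as structured fields.
    """
    event = {
        "ts": time.time_ns() // 1_000_000,          # epoch milliseconds
        "level": level,
        "service": "checkout-api",                  # illustrative service name
        "env": "prod",
        "region": "us-east-1",
        "host": "node-42",
        "request_id": context.pop("request_id", str(uuid.uuid4())),
        "message": message,
    }
    event.update(context)                           # e.g. order_id, already redacted upstream
    return json.dumps(event, separators=(",", ":"))

print(make_log_event("ERROR", "payment gateway timeout", order_id="o-123"))
```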

Step 2 Buffer and transport

Send events to a durable message bus to smooth bursts and protect against downstream outages. Kafka, Pulsar, or a managed stream serves as the write-ahead log for the pipeline. Enable batching, compression, and retries with exponential backoff. Enforce backpressure so producers slow down instead of dropping data when the bus is full.
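A hedged sketch of the producer side, assuming the kafka-python client; the broker addresses, topic name, and exact settings are illustrative, but the batching, compression, retry, and bounded-buffer ideas carry over to any durable bus.

```python
from kafka import KafkaProducer            # assumes the kafka-python package
from kafka.errors import KafkaTimeoutError

producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092", "kafka-2:9092"],  # illustrative brokers
    acks="all",                      # wait for in-sync replicas: the bus is the write-ahead log
    compression_type="gzip",         # compress batches on the wire
    linger_ms=50,                    # wait briefly so batches fill up
    batch_size=256 * 1024,           # up to 256 KB per partition batch
    retries=5,                       # retry transient broker errors with backoff
    buffer_memory=64 * 1024 * 1024,  # bounded in-memory queue on the producer
    max_block_ms=10_000,             # block send() when the buffer is full...
)

def ship(log_line: bytes) -> None:
    try:
        producer.send("logs.raw", value=log_line)
    except KafkaTimeoutError:
        # ...so a full buffer surfaces as backpressure here instead of silent drops.
        # A real agent would pause its readers or shed debug-level lines at this point.
        raise
```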

Step 3 Ingest and transform

A fleet of stateless consumers reads from the bus and applies:

  • Parsing and schema shaping. Prefer structured JSON; fall back to grok-style patterns for plain text lines.
  • Enrichment. Add service taxonomy, geo attributes, build id, and Kubernetes metadata.
  • PII redaction. Apply allow lists and tokenization before data leaves the secure boundary.
  • Sampling. For very high volume services, sample debug-level lines while keeping all errors.
  • Routing. Partition by time, then by tenant or index family, to keep shards balanced.
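A compressed sketch of one consumer's transform function, assuming JSON input and the envelope from Step 1; the redaction pattern, sampling rate, and index naming are illustrative placeholders.

```python
import json
import re
import time
from datetime import datetime, timezone

TOKEN_PATTERN = re.compile(r"(?i)bearer\s+[a-z0-9._\-]+")   # illustrative redaction rule
ALLOWED_FIELDS = {"ts", "level", "service", "env", "region", "host",
                  "request_id", "tenant", "message"}         # strict allow list

def transform(raw: bytes, enrichment: dict):
    """Parse, shape, enrich, redact, sample, and route one event.

    Returns (destination, document), or None when the event is sampled away.
    """
    try:
        event = json.loads(raw)
        if not isinstance(event, dict):
            raise ValueError("not a JSON object")
    except (ValueError, UnicodeDecodeError):
        # Parse failures go to a dead letter destination for later inspection.
        return ("logs-dead-letter", {"raw": raw.decode("utf-8", "replace")})

    event = {k: v for k, v in event.items() if k in ALLOWED_FIELDS}   # schema shaping
    event.update(enrichment)                                          # e.g. build id, k8s labels
    event["message"] = TOKEN_PATTERN.sub("[REDACTED]", str(event.get("message", "")))

    # Keep roughly 10% of debug lines; keep everything else.
    if event.get("level") == "DEBUG" and hash(event.get("request_id", "")) % 10 != 0:
        return None

    ts_ms = event.get("ts") or int(time.time() * 1000)
    day = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).strftime("%Y.%m.%d")
    return (f"logs-{event.get('tenant', 'shared')}-{day}", event)     # time-then-tenant routing
```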

Step 4 Choose a storage strategy

You rarely use one store. Mix tiers to match query patterns.

  • Inverted index store such as OpenSearch or Elasticsearch for fast keyword and nested field filters with aggregations.
  • Columnar store such as ClickHouse or BigQuery for wide scans and heavy group by analytics with very high compression.
  • Object storage as the long term archive holding raw or compacted Parquet files, with the ability to hydrate back into a hot tier for rare investigations.

Step 5 Index and shard design

For an inverted index cluster:

  • Time-based indices. Roll daily or hourly per index family depending on volume.
  • Shard sizing. Target shard sizes that keep heap usage healthy and file counts manageable. A common target is tens of gigabytes per shard after merge.
  • Field mapping discipline. Only index fields you query. Use keyword for exact match and aggregations, text with an analyzer for full text search, and disable doc values on fields you never aggregate or sort on.
  • Avoid unbounded cardinality in aggregations. Fields like user id or request id can explode memory during terms aggregations.
  • Templates. Lock down dynamic fields and use strict templates to prevent field explosion.
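A sketch of what that discipline looks like as an index template, written as the Python dict you might send to the cluster's index template API; the field names, shard count, and the client call in the closing comment are illustrative assumptions.

```python
# Strict, time-routed index template: only mapped fields are accepted,
# keyword vs. text is deliberate, and doc values are dropped where we
# never aggregate or sort.
log_template = {
    "index_patterns": ["logs-*"],
    "template": {
        "settings": {
            "number_of_shards": 4,        # sized so shards land in the tens of GB after merge
            "number_of_replicas": 1,
        },
        "mappings": {
            "dynamic": "strict",          # reject unknown fields instead of exploding mappings
            "properties": {
                "ts":         {"type": "date"},
                "level":      {"type": "keyword"},
                "service":    {"type": "keyword"},
                "region":     {"type": "keyword"},
                "tenant":     {"type": "keyword"},
                "request_id": {"type": "keyword", "doc_values": False},  # filter only, never aggregate
                "message":    {"type": "text"},                          # full text search only
            },
        },
    },
}
# One possible way to install it, assuming the opensearch-py client:
# client.indices.put_index_template(name="logs", body=log_template)
```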

Step 6 Lifecycle and retention

Apply an index lifecycle policy per dataset.

  • Hot. New data on fast storage with replication for search and aggregates.
  • Warm. Older indices on cheaper storage with fewer replicas and merged segments.
  • Cold. Searchable snapshot in object storage for rare queries.
  • Frozen or archived. Not searchable by default; restore on demand for incident reviews.

Automate rollover by size or age and snapshot continuously to object storage.
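One hedged way to express such a policy, shaped like an Elasticsearch ILM policy body (OpenSearch ISM captures the same idea with states and transitions); the ages, sizes, and repository name are illustrative assumptions.

```python
lifecycle_policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    # Roll over by size or age so indices stay evenly sized.
                    "rollover": {"max_primary_shard_size": "50gb", "max_age": "1d"},
                },
            },
            "warm": {
                "min_age": "2d",
                "actions": {
                    "forcemerge": {"max_num_segments": 1},   # fewer, larger segments
                    "allocate": {"number_of_replicas": 1},   # cheaper nodes, fewer copies
                },
            },
            "cold": {
                "min_age": "7d",
                "actions": {
                    "searchable_snapshot": {"snapshot_repository": "logs-archive"},  # object storage repo
                },
            },
            "delete": {
                "min_age": "90d",
                "actions": {"delete": {}},
            },
        },
    },
}
```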

Step 7 Query path and performance

A search gateway receives queries, authenticates the user, and routes to the exact indices by time range, tenant, and dataset.

  • Pre-filter by time first. Use index naming that makes time routing trivial.
  • Query patterns. Term filters, range queries, prefix matches, and aggregations like count, percentiles, and top terms.
  • Pagination. Use search_after to avoid deep pagination cost.
  • Caching. Use query result cache for repeated dashboards and use field data cache carefully to avoid evictions.
  • Guardrails. Set hard limits on time range, number of shards per query, and maximum buckets.
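A sketch of the kind of search body the gateway might assemble under these guardrails; the field names and cursor value are illustrative and assume the mapping sketched in Step 5.

```python
search_body = {
    "size": 100,
    "query": {
        "bool": {
            "filter": [                                               # cheap, cacheable filters
                {"range": {"ts": {"gte": "now-15m", "lte": "now"}}},  # time pre-filter first
                {"term": {"service": "checkout-api"}},
                {"term": {"level": "ERROR"}},
            ],
        },
    },
    "aggs": {
        "by_region": {"terms": {"field": "region", "size": 20}},      # bounded bucket count
        "latency_p99": {"percentiles": {"field": "duration_ms", "percents": [99]}},  # illustrative field
    },
    "sort": [{"ts": "desc"}],                  # in practice, pair with a unique tiebreaker field
    "search_after": [1735689600000],           # cursor from the last hit of the previous page
}
# The gateway also restricts the index list by time and tenant,
# e.g. searching only "logs-tenant-a-2025.01.01" rather than "logs-*".
```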

Step 8 Multi tenant and security

Two common models:

  • Index per tenant per time slice for strong isolation and simple deletion.
  • Shared indices with tenant id as a filter plus row-level security at the search layer.

In both models, apply rate limits and quotas per tenant and per query, and audit all access.
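For the shared-index model, a minimal sketch of how a gateway might scope a user's query and apply per-tenant limits before forwarding it; the quota values and field names are illustrative.

```python
TENANT_LIMITS = {"tenant-a": {"max_range_hours": 24, "max_size": 500}}   # illustrative quotas

def scope_query(tenant_id: str, user_body: dict) -> dict:
    """Wrap a user query so it can only see its own tenant and stays within quota."""
    limits = TENANT_LIMITS.get(tenant_id, {"max_range_hours": 6, "max_size": 100})
    return {
        "size": min(int(user_body.get("size", 50)), limits["max_size"]),
        "query": {
            "bool": {
                "filter": [
                    {"term": {"tenant": tenant_id}},                              # row-level isolation
                    {"range": {"ts": {"gte": f"now-{limits['max_range_hours']}h"}}},
                ],
                "must": [user_body.get("query", {"match_all": {}})],              # the user's own query
            },
        },
    }
```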

Step 9 Reliability and cost controls

  • Dead letter queues for parse failures.
  • Circuit breakers and admission control for expensive queries.
  • Autoscaling on ingest and search tiers with queue depth and query latency as signals.
  • Compression and downsampling for long retention. Store raw events in object storage and only keep a compacted, structured view hot.

Real World Example

Picture a company at the scale of a global video platform. Traffic peaks at five million events per second across regions. The pipeline ingests through a managed stream. A transform service parses JSON logs, adds region and service labels, redacts tokens, and routes to two destinations. The most recent seven days of data land in an inverted index cluster for fast interactive search used by on-call engineers. The same events are batched into Parquet hourly and written to object storage.

A daily job compacts the data into a columnar warehouse for product and security analytics. Dashboards mostly hit the hot cluster with strict limits on the time window. When an incident requires a thirty day search, an operator hydrates one day of archived Parquet back into the columnar store or spins up a temporary cluster that mounts a searchable snapshot.

Common Pitfalls and Trade-offs

Unbounded cardinality fields

Aggregating on user id, session id, or request id with a terms aggregation can blow up memory and crash nodes. Cap the bucket count, use approximate top-k, or aggregate on hashed buckets.
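A small illustration of the bounded alternatives: estimate distinct counts instead of enumerating every value, and keep terms aggregations on low-cardinality fields with an explicit size; the field names are illustrative.

```python
# Instead of a raw terms aggregation on user_id (unbounded buckets),
# bound the work: an approximate distinct count plus a small top-N
# on a low-cardinality field.
safe_aggs = {
    "distinct_users": {"cardinality": {"field": "user_id"}},          # HyperLogLog-style estimate
    "noisiest_services": {"terms": {"field": "service", "size": 10}}, # capped bucket count
}
```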

Field explosion from dynamic mapping

When new fields appear freely, the cluster creates thousands of field entries and slows indexing and queries. Use strict templates and an explicit allow list of fields.

Over sharding or under sharding

Too many small shards overwhelm cluster metadata and too few large shards limit parallelism. Size shards based on bytes and query fan out.

Wide time range queries with heavy aggregations

Expensive scans across hundreds of shards cause timeouts. Enforce time range limits and pre-aggregate where possible.

No backpressure on producers

Agents that keep retrying without flow control can add more load during an outage. Use bounded queues and drop debug-level messages first if you must shed load.

Keeping everything hot

Hot storage for all history is the fastest way to overspend. Move to warm and cold tiers quickly and restore only when needed.

Weak multi tenant isolation

Shared indices without strict rate limits let one user cause cluster wide issues. Isolate and cap at the gateway.

Interview Tip

A favorite prompt: design a log platform that ingests one hundred terabytes per day, with seven day hot retention and ninety day cold retention, for a multi-region service. Start with source agents and a durable bus. Propose a transform tier that enforces schema and redaction. Choose an inverted index for hot search and a columnar or object store for longer analytics. Show shard and index sizing, lifecycle policy, and guardrails on query cost. End with tenant isolation, quotas, and on-demand restore from archive.
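A rough back-of-the-envelope for that prompt, with the index expansion factor, replica count, and shard target stated as assumptions; the point is the method, not the exact numbers.

```python
raw_per_day_tb = 100        # prompt: 100 TB ingested per day
index_expansion = 1.2       # assumed on-disk size after indexing relative to raw
replicas = 1                # assumed one replica on the hot tier
hot_days = 7
target_shard_gb = 50        # "tens of gigabytes per shard"

hot_primary_tb = raw_per_day_tb * index_expansion * hot_days
hot_total_tb = hot_primary_tb * (1 + replicas)
primary_shards_per_day = raw_per_day_tb * index_expansion * 1024 / target_shard_gb

print(f"hot tier on disk: ~{hot_total_tb:.0f} TB")               # ~1680 TB under these assumptions
print(f"primary shards per day: ~{primary_shards_per_day:.0f}")  # ~2458, which argues for hourly indices
```

With these assumptions the hot tier alone is well over a petabyte on disk, which is why the answer should lean on hourly indices, aggressive rollover, and warm and cold tiers.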

Key Takeaways

  • Treat the message bus as the write-ahead log and the safety valve for bursts.

  • Use structured JSON at the source to reduce parse errors and costs later.

  • Separate hot interactive search from long term analytics to control cost.

  • Design shards and indices by time and dataset and enforce strict field templates.

  • Add guardrails for query cost and tenant isolation to protect shared clusters.

Table of Comparison

| Store | Best for | Query patterns | Compression and cost | Operational notes |
| --- | --- | --- | --- | --- |
| Inverted index search cluster | Interactive search on recent data | Fast term filters, range on time, nested fields, aggregations with moderate cardinality | Moderate compression, higher hot storage cost | Shard and heap tuning required, strict templates, query guardrails |
| Columnar warehouse | Large scans and heavy group by analytics | Time range scans, joins with reference tables, robust SQL | High compression, low cost for cold and warm layers | Great for scheduled jobs and exploratory analysis, less ideal for sub second dashboards |
| Object storage archive | Very long retention and compliance | Restore a slice then query, or query via external table engines | Lowest cost per byte, excellent durability | Not interactive by default, requires hydrate on demand |

FAQs

Q1 What is the difference between logs, metrics, and traces?

Logs are verbose event records with rich context. Metrics are numeric time series for fast aggregates. Traces capture end-to-end request paths. A strong platform can correlate all three using shared ids.

Q2 How do I reduce the cost of hot storage without losing visibility?

Move older indices to warm and cold tiers quickly. Keep only the fields you search. Sample verbose logs while keeping all errors. Store raw events in object storage and hydrate rare days on demand.

Q3 How should I pick shard counts and index intervals?

Start with daily indices per dataset. Size shards to tens of gigabytes after merge. If a dataset is massive, switch to hourly indices. If it is small, combine multiple days into one index. Validate by measuring search fan-out and heap use.

Q4 How do I handle schema drift?

Lock field templates and keep a versioned schema in your transform service. Add new fields through controlled changes. Avoid dynamic mapping and reject unknown fields at ingest.

Q5 What is the fastest way to make common dashboards snappy?

Pin time to recent windows, pre-filter by dataset and tenant, cache frequent queries, and use search_after for pagination. Pre-aggregate heavy metrics into a columnar store and show those in dashboards.

Q6 Should I send logs straight to the search cluster or through a bus?

Use a durable bus. It absorbs bursts, enables retries, and allows multiple consumers to enrich and route the same event to different stores.

Further Learning

Build a strong foundation with the practical lessons in Grokking System Design Fundamentals. For deeper scale patterns and storage trade offs, consider the project style syllabus in Grokking Scalable Systems for Interviews. If you want a full interview playbook with practice questions, explore Grokking the System Design Interview.

