How would you design log ingestion & search (ELK/OpenSearch‑like)?
Log ingestion and search is the nervous system of a scalable platform. You collect high volume semi structured events from services and infrastructure, ship them reliably, index them for fast discovery, and retain them cost effectively. Think of it as a time ordered data pipeline that turns raw text into indexed facts you can filter, aggregate, and visualize during incidents and analytics. If you can explain this flow clearly in a system design interview, you signal strong knowledge of distributed systems and observability.
Why It Matters
- Faster incident response. Engineers need real time queries on recent errors, slow requests, and spikes across services and regions.
- Lower mean time to recovery. Rich filters and aggregates cut through noise to isolate the blast radius quickly.
- Cost and compliance controls. Hot, warm, and cold storage tiers plus redaction prevent runaway bills and data exposure.
- Product and security insights. The same pipeline supports feature telemetry, audit trails, access investigations, and anomaly detection.
- Interview relevance. This design touches ingestion pressure, backpressure, indexing strategy, retention policy, multi tenant isolation, and cost trade offs.
How It Works Step by Step
Step 1 Collect at the source
Install lightweight agents next to applications and nodes to capture stdout, files, syslog, and app spans that you choose to treat as logs. Normalize a minimal envelope across all sources such as timestamp, level, service, environment, region, host, request id, user id if allowed, and raw message. Emit structured JSON whenever possible so downstream parsing is cheap and reliable.
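A minimal emitter sketch for that envelope in Python; the field names and static values such as env, region, and host are illustrative, not a fixed standard:

```python
import json
import sys
import time
import uuid

def emit_log(level: str, service: str, message: str, **fields) -> None:
    """Write one structured JSON log line to stdout for an agent to pick up."""
    envelope = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "level": level,                       # e.g. DEBUG, INFO, WARN, ERROR
        "service": service,
        "env": "prod",                        # illustrative static enrichment
        "region": "us-east-1",
        "host": "web-42",
        "request_id": fields.pop("request_id", str(uuid.uuid4())),
        "message": message,
    }
    envelope.update(fields)                   # any extra structured context
    sys.stdout.write(json.dumps(envelope) + "\n")

emit_log("ERROR", "checkout", "payment authorization failed", order_id="A-1001")
```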
Step 2 Buffer and transport
Send events to a durable message bus to smooth bursts and protect against downstream outages. Kafka, Pulsar, or a managed stream serves as the write ahead log for the pipeline. Enable batching, compression, and retries with exponential backoff. Enforce backpressure so producers slow down instead of dropping data when the bus is full.
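A minimal producer sketch, assuming the kafka-python client and a topic named logs.raw (both illustrative). Batching, compression, and bounded blocking give you retries and backpressure rather than silent drops:

```python
import json
from kafka import KafkaProducer  # assumes kafka-python is installed

producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092", "kafka-2:9092"],
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
    acks="all",               # treat the bus as the durable write ahead log
    compression_type="gzip",  # smaller batches on the wire
    linger_ms=50,             # batch events for up to 50 ms before sending
    retries=5,                # retry transient broker errors
    retry_backoff_ms=200,     # back off between retries
    max_block_ms=60_000,      # block the producer (backpressure) when buffers are full
)

def ship(event: dict) -> None:
    # Raises KafkaTimeoutError after max_block_ms instead of dropping data silently.
    producer.send("logs.raw", value=event)
```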
Step 3 Ingest and transform
A fleet of stateless consumers reads from the bus and applies the following (a code sketch follows the list):
- Parsing and schema shaping. Prefer structured JSON. Use patterns for text lines.
- Enrichment. Add service taxonomy, geo attributes, build id, and Kubernetes metadata.
- PII redaction. Apply allow lists and tokenization before data leaves the secure boundary.
- Sampling. For very high volume services, sample debug level lines while keeping errors.
- Routing. Partition by time then tenant or index family to keep shards balanced.
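A stateless transform sketch tying those stages together; the redaction pattern, enrichment values, sampling rate, and routing scheme are illustrative assumptions:

```python
import json
import random
import re

TOKEN_PATTERN = re.compile(r"(?i)(bearer\s+)[a-z0-9._-]+")  # illustrative redaction rule

def transform(raw: bytes, tenant: str):
    event = json.loads(raw)                    # parse; send failures to a dead letter queue in practice

    # Enrichment: attach service taxonomy and build metadata (static here for illustration).
    event.setdefault("team", "payments")
    event.setdefault("build_id", "2024.06.1")

    # PII redaction before the event leaves the secure boundary.
    event["message"] = TOKEN_PATTERN.sub(r"\1[REDACTED]", event.get("message", ""))

    # Sampling: keep all errors, keep roughly 1% of debug lines.
    if event.get("level") == "DEBUG" and random.random() > 0.01:
        return None

    # Routing: time based index family, then tenant, to keep shards balanced.
    day = event["timestamp"][:10]
    return f"logs-{tenant}-{day}", event
```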
Step 4 Choose a storage strategy
You rarely use one store. Mix tiers to match query patterns; a sketch of the archive path follows the list.
- Inverted index store such as OpenSearch or Elasticsearch for fast keyword and nested field filters with aggregations.
- Columnar store such as ClickHouse or BigQuery for wide scans and heavy group by analytics with very high compression.
- Object storage as the long term archive holding raw or compacted parquet, with the ability to hydrate back into a hot tier for rare investigations.
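A sketch of the archive path, assuming pyarrow and boto3 are available and an acme-log-archive bucket exists (all illustrative):

```python
import boto3
import pyarrow as pa
import pyarrow.parquet as pq

def archive_batch(events: list, dataset: str, hour: str) -> None:
    """Write one hour of events as compressed Parquet and push it to object storage."""
    table = pa.Table.from_pylist(events)
    local_path = f"/tmp/{dataset}-{hour}.parquet"
    pq.write_table(table, local_path, compression="zstd")

    # A partitioned layout makes later hydration of a single day cheap.
    key = f"logs/{dataset}/dt={hour[:10]}/{hour}.parquet"
    boto3.client("s3").upload_file(local_path, "acme-log-archive", key)
```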
Step 5 Index and shard design
For an inverted index cluster (a template sketch follows the list):
- Time based indices. Daily or hourly per index family depending on volume.
- Shard sizing. Target shard sizes that keep heap usage healthy and file counts manageable. A common target is tens of gigabytes per shard after merge.
- Field mapping discipline. Only index fields you query. Use keyword for exact match and aggregations, text with an analyzer for full text search, and disable doc values on fields you never aggregate.
- Avoid unbounded cardinality in aggregations. Fields like user id or request id can explode memory during terms queries.
- Templates. Lock down dynamic fields and use strict templates to prevent field explosion.
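A strict index template sketch; field names and shard counts are illustrative, and the shape follows OpenSearch and Elasticsearch composable index templates, shown here as a Python dict:

```python
# PUT _index_template/logs-acme  (sent as JSON to the cluster)
logs_template = {
    "index_patterns": ["logs-acme-*"],
    "template": {
        "settings": {
            "number_of_shards": 4,          # sized so each shard lands in the tens of gigabytes
            "number_of_replicas": 1,
        },
        "mappings": {
            "dynamic": "strict",            # reject unknown fields instead of exploding mappings
            "properties": {
                "timestamp":  {"type": "date"},
                "level":      {"type": "keyword"},                      # exact match and aggregations
                "service":    {"type": "keyword"},
                "message":    {"type": "text"},                         # analyzed full text search
                "request_id": {"type": "keyword", "doc_values": False}, # filter only, never aggregate
            },
        },
    },
}
```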
Step 6 Lifecycle and retention
Apply an index lifecycle policy per dataset; an example policy follows the list.
- Hot. New data on fast storage with replication for search and aggregates.
- Warm. Older indices on cheaper storage with fewer replicas and merged segments.
- Cold. Searchable snapshot in object storage for rare queries.
- Frozen or archived. Not searchable by default. Restore on demand for incident reviews.
Automate rollover by size or age and snapshot continuously to object storage.
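A lifecycle policy sketch modeled on Elasticsearch-style ILM (OpenSearch ISM expresses the same idea as a state machine). The ages, sizes, and repository name are illustrative:

```python
# PUT _ilm/policy/logs-default  (sent as JSON to the cluster)
logs_lifecycle = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {"rollover": {"max_primary_shard_size": "50gb", "max_age": "1d"}},
            },
            "warm": {
                "min_age": "2d",
                "actions": {
                    "forcemerge": {"max_num_segments": 1},   # fewer segments, cheaper queries
                    "allocate": {"number_of_replicas": 1},   # drop extra replicas on warm nodes
                },
            },
            "cold": {
                "min_age": "7d",
                "actions": {"searchable_snapshot": {"snapshot_repository": "logs-archive"}},
            },
            "delete": {
                "min_age": "90d",
                "actions": {"delete": {}},
            },
        },
    },
}
```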
Step 7 Query path and performance
A search gateway receives queries, authenticates the user, and routes to the exact indices by time range, tenant, and dataset. A sample query body follows the list.
- Pre filter by time first. Use index naming that makes time routing trivial.
- Query patterns. Term filters, range queries, prefix matches, and aggregations like count, percentiles, and top terms.
- Pagination. Use search_after to avoid deep paging cost.
- Caching. Use query result cache for repeated dashboards and use field data cache carefully to avoid evictions.
- Guardrails. Set hard limits on time range, number of shards per query, and maximum buckets.
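A query sketch against a time routed index pattern; the field names and thresholds are illustrative, and the body is standard search DSL sent to the _search endpoint:

```python
# GET logs-acme-2024-06-*/_search  (index pattern pre-filters by tenant and time)
error_search = {
    "size": 100,
    "query": {
        "bool": {
            "filter": [
                {"term": {"service": "checkout"}},
                {"term": {"level": "ERROR"}},
                {"range": {"timestamp": {"gte": "now-15m", "lt": "now"}}},  # always bound by time
            ],
        },
    },
    "sort": [{"timestamp": "desc"}, {"_doc": "asc"}],   # stable sort needed for search_after
    "aggs": {
        "latency": {"percentiles": {"field": "duration_ms", "percents": [50, 95, 99]}},
    },
}
# Next page: copy the sort values of the last hit into "search_after"
# instead of using deep "from"/"size" offsets.
```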
Step 8 Multi tenant and security
Two common models (a gateway sketch follows the list):
- Index per tenant per time slice for strong isolation and simple deletion.
- Shared indices with tenant id as a filter plus row level security at the search layer.
Apply rate limits and quotas per tenant and per query, and audit all access.
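A gateway sketch for the shared-index model: it scopes the index pattern by time, injects the tenant filter regardless of what the caller sent, and applies a per tenant quota. The quota table and field names are illustrative:

```python
from datetime import date, timedelta

QUERIES_PER_MINUTE = {"acme": 60, "globex": 30}   # illustrative per tenant quotas

def scoped_search(tenant: str, user_query: dict, days: int, usage: dict) -> dict:
    if usage.get(tenant, 0) >= QUERIES_PER_MINUTE.get(tenant, 10):
        raise PermissionError("tenant query quota exceeded")
    if days > 7:
        raise ValueError("time range capped at 7 days on the shared cluster")

    # Route only to the indices the time range can touch.
    today = date.today()
    indices = ",".join(f"logs-shared-{today - timedelta(days=d):%Y-%m-%d}" for d in range(days))

    # Force the tenant filter onto the caller's query.
    body = {"query": {"bool": {"must": [user_query],
                               "filter": [{"term": {"tenant_id": tenant}}]}}}
    usage[tenant] = usage.get(tenant, 0) + 1
    return {"indices": indices, "body": body}   # audit log the call before executing it
```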
Step 9 Reliability and cost controls
- Dead letter queues for parse failures.
- Circuit breakers and admission control for expensive queries.
- Autoscaling on ingest and search tiers with queue depth and query latency as signals.
- Compression and down sampling for long retention. Store raw in object storage and only keep a compacted structured view hot.
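A reliability sketch combining a dead letter queue for parse failures with debug-first load shedding; the publish callable, topic names, and queue depth threshold are illustrative:

```python
import json

SHED_LOAD_QUEUE_DEPTH = 500_000   # illustrative threshold

def handle(raw: bytes, queue_depth: int, publish) -> None:
    """Process one raw event; publish(topic, payload) is whatever bus client you use."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        publish("logs.dlq", raw)              # keep the broken payload for later inspection
        return

    # Shed load gracefully: drop DEBUG first, never errors.
    if queue_depth > SHED_LOAD_QUEUE_DEPTH and event.get("level") == "DEBUG":
        return

    publish("logs.parsed", json.dumps(event).encode())
```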
Real World Example
Picture a company at the scale of a global video platform. Traffic peaks at five million events per second across regions. The pipeline ingests through a managed stream. A transform service parses JSON logs, adds region and service labels, redacts tokens, and routes to two destinations. Data from the most recent seven days lands in an inverted index cluster for fast interactive search used by on call engineers. The same events are batched into parquet hourly and written to object storage.
A daily job compacts into a columnar warehouse for product and security analytics. Dashboards mostly hit the hot cluster with strict limits on time window. When an incident requires a thirty day search, an operator hydrates one day of archived parquet back into the columnar store or spins up a temporary cluster that mounts a searchable snapshot.
Common Pitfalls or Trade offs
Unbounded cardinality fields
Using user id, session id, or request id as a terms aggregation can blow up memory and crash nodes. Cap the bucket count, use approximate top k, or aggregate on hashed buckets.
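A sketch of safer alternatives in the search DSL: an approximate cardinality aggregation (HyperLogLog based) for unique counts and a capped terms aggregation for top values. Field names are illustrative:

```python
safe_aggregations = {
    "size": 0,
    "query": {"range": {"timestamp": {"gte": "now-1h"}}},
    "aggs": {
        # Approximate distinct count instead of one bucket per user.
        "unique_users": {"cardinality": {"field": "user_id"}},
        # Bounded top-k instead of an unbounded terms aggregation.
        "top_services": {"terms": {"field": "service", "size": 20}},
    },
}
```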
Field explosion from dynamic mapping
When new fields appear freely, the cluster creates thousands of field entries and slows indexing and queries. Use strict templates and whitelist fields.
Over sharding or under sharding
Too many small shards overwhelm cluster metadata and too few large shards limit parallelism. Size shards based on bytes and query fan out.
Wide time range queries with heavy aggregations
Expensive scans across hundreds of shards cause timeouts. Enforce time range limits and pre aggregate where possible.
No backpressure on producers
Agents that keep retrying without flow control can add more load during an outage. Use bounded queues and drop debug level messages first if you must shed load.
Keeping everything hot
Hot storage for all history is the fastest way to overspend. Move to warm and cold tiers quickly and restore only when needed.
Weak multi tenant isolation
Shared indices without strict rate limits let one user cause cluster wide issues. Isolate and cap at the gateway.
Interview Tip
A favorite prompt: design a log platform that ingests one hundred terabytes per day with seven day hot retention and ninety day cold retention for a multi region service. Start with source agents and a durable bus. Propose a transform tier that enforces schema and redaction. Choose inverted index for hot search and a columnar or object store for longer analytics. Show shard and index sizing, lifecycle policy, and guardrails on query cost. End with tenant isolation, quotas, and on demand restore from archive.
Key Takeaways
- Treat the message bus as the write ahead log and the safety valve for bursts.
- Use structured JSON at the source to reduce parse errors and costs later.
- Separate hot interactive search from long term analytics to control cost.
- Design shards and indices by time and dataset and enforce strict field templates.
- Add guardrails for query cost and tenant isolation to protect shared clusters.
Table of Comparison
| Store | Best for | Query patterns | Compression and cost | Operational notes |
|---|---|---|---|---|
| Inverted index search cluster | Interactive search on recent data | Fast term filters, range on time, nested fields, aggregations with moderate cardinality | Moderate compression, higher hot storage cost | Shard and heap tuning required, strict templates, query guardrails |
| Columnar warehouse | Large scans and heavy group by analytics | Time range scans, joins with reference tables, robust SQL | High compression, low cost for cold and warm layers | Great for scheduled jobs and exploratory analysis, less ideal for sub second dashboards |
| Object storage archive | Very long retention and compliance | Restore a slice then query, or query via external table engines | Lowest cost per byte, excellent durability | Not interactive by default, requires hydrate on demand |
FAQs
Q1 What is the difference between logs, metrics, and traces?
Logs are verbose event records with rich context. Metrics are numeric time series for fast aggregates. Traces capture end to end request paths. A strong platform can correlate all three using shared ids.
Q2 How do I reduce the cost of hot storage without losing visibility?
Move older indices to warm and cold tiers quickly. Keep only the fields you search. Sample verbose logs while keeping all errors. Store raw events in object storage and hydrate rare days on demand.
Q3 How should I pick shard counts and index intervals?
Start with daily indices per dataset. Size shards to tens of gigabytes after merge. If a dataset is massive, switch to hourly indices. If it is small, combine multiple days into one index. Validate by measuring search fan out and heap use.
Q4 How do I handle schema drift?
Lock field templates and keep a versioned schema in your transform service. Add new fields through controlled changes. Avoid dynamic mapping and reject unknown fields at ingest.
Q5 What is the fastest way to make common dashboards snappy?
Pin time to recent windows, pre filter by dataset and tenant, cache frequent queries, and use search_after for pagination. Pre aggregate heavy metrics into a columnar store and show those in dashboards.
Q6 Should I send logs straight to the search cluster or through a bus?
Use a durable bus. It absorbs bursts, enables retries, and allows multiple consumers to enrich and route the same event to different stores.
Further Learning
Build a strong foundation with the practical lessons in Grokking System Design Fundamentals. For deeper scale patterns and storage trade offs, consider the project style syllabus in Grokking Scalable Systems for Interviews. If you want a full interview playbook with practice questions, explore Grokking the System Design Interview.