How do you choose compression (LZ4/ZSTD) for logs vs analytics?
Compression is a lever that shifts cost, latency, and CPU across your data pipeline. For logs you usually care about fast ingest and quick reads for debugging or alerting. For analytics you care about compact storage and efficient scans across large datasets. LZ4 and ZSTD are both modern general-purpose codecs that win in different parts of this space. This guide gives you a clear rule of thumb and a practical checklist so you can choose with confidence in a system design interview and in real production systems.
Why It Matters
Choosing the wrong codec can quietly tax your platform in three ways. First, CPU time spent compressing and decompressing competes with application work. Second, compression ratio drives network and storage costs, which dominate at scale. Third, decompression speed affects tail latency for search and query workloads. Interviewers love this topic because it reveals whether you see the pipeline as a whole and whether you can justify a trade-off with numbers. Strong answers reference throughput, ratio, and cost together, and explain how the choice differs for streaming logs versus offline analytics in distributed systems.
How It Works (Step-by-Step)
1. Clarify the workload
- Logs are append-only streams with high write rates, bursty traffic, and frequent small reads during incidents.
- Analytics is scan-heavy, often columnar, and read-dominated with large batch queries and scheduled jobs.
2. Decide the primary constraint
- Logs prioritize low CPU and low latency at producers and consumers.
- Analytics prioritizes compression ratio to cut storage and I/O and to shrink scan time.
3. Map the constraint to codec behavior
- LZ4 targets very high compression and decompression speed with a modest ratio. It is excellent when CPU is scarce or ingest is hot.
- ZSTD targets flexible ratio with adjustable levels from very fast to very compact. It shines when you can spend a little CPU to save a lot of bytes.
4. Choose level and framing
- For LZ4, pick the default level with frames sized to your batch payloads. Favor smaller flush intervals so log lines become available quickly.
- For ZSTD, pick low levels for streaming and brokers, and medium to high levels for cold analytics files. Dictionaries help a lot with repetitive schemas.
5. Validate with back-of-envelope math
- Compute daily raw size, divide by the expected ratio, price the storage and egress, and check CPU headroom (a worked example appears in the Real World Example section below).
- Confirm that consumers can decompress at peak read rate without increasing tail latency.
6. Plan lifecycle (a sketch of both stages follows this list)
- Use LZ4 in the hot path for producers, agents, and brokers.
- Recompress to ZSTD in your lake during batch compaction to maximize savings for long-term analytics.
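Here is a minimal sketch of that lifecycle split, assuming the `lz4` and `zstandard` Python packages; the function names and the sample payload are illustrative, not any specific product's API.

```python
# A minimal sketch of the two lifecycle stages, assuming the `lz4` and
# `zstandard` Python packages (pip install lz4 zstandard). Function names
# and the sample payload are illustrative.
import lz4.frame
import zstandard


def hot_path_compress(log_batch: bytes) -> bytes:
    """Producer/broker stage: favor speed and low CPU over ratio."""
    # Default LZ4 frame settings keep compression cheap even during bursts.
    return lz4.frame.compress(log_batch)


def compact_to_cold_storage(lz4_blob: bytes, level: int = 8) -> bytes:
    """Batch compaction stage: spend some CPU for a better long-term ratio."""
    raw = lz4.frame.decompress(lz4_blob)
    return zstandard.ZstdCompressor(level=level).compress(raw)


if __name__ == "__main__":
    batch = b'{"level":"INFO","msg":"request served","status":200}\n' * 10_000
    hot = hot_path_compress(batch)
    cold = compact_to_cold_storage(hot)
    print(f"raw={len(batch):,}B  lz4={len(hot):,}B  zstd-8={len(cold):,}B")
```

The same split applies whatever agent or broker you run: keep the cheap codec where latency matters and pay for ratio only once the data is cold.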
Real World Example
A product team ships ten terabytes of raw logs per day. The logging agents forward to a message broker, and a short-term store retains seven days of data. The data is then compacted into Parquet in a data lake for twelve months of analytics.
- Hot path choice: LZ4 on agents and brokers gives a ratio around two to three with very low CPU cost. Ingest keeps up during traffic spikes and on-call engineers can tail logs with minimal lag.
- Cold path choice: ZSTD at a moderate level on Parquet pages gives a ratio around three to five while keeping decompression fast. Queries scan fewer bytes, which cuts both cost and wall clock time.
- Monthly impact: If LZ4 yields three times and ZSTD yields four times, the lake stores about twenty-five percent fewer bytes with ZSTD (the sketch below checks the math). At multi-petabyte scale this is a large dollar difference while the hot path remains stable and efficient.
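A quick back-of-envelope check of those numbers, which is the same math as step 5 of the checklist; the storage price is an assumed placeholder, not a quote from any provider.

```python
# Back-of-envelope check of the example above. The $/TB-month figure is an
# assumed placeholder; plug in your provider's actual storage price.
RAW_TB_PER_DAY = 10
LZ4_RATIO, ZSTD_RATIO = 3.0, 4.0
RETENTION_DAYS = 365
PRICE_PER_TB_MONTH = 20.0  # assumption, not a real quote

lz4_lake = RAW_TB_PER_DAY / LZ4_RATIO * RETENTION_DAYS    # ~1,217 TB retained
zstd_lake = RAW_TB_PER_DAY / ZSTD_RATIO * RETENTION_DAYS  # ~913 TB retained
savings_pct = (lz4_lake - zstd_lake) / lz4_lake * 100     # 25%

print(f"LZ4 lake:  {lz4_lake:,.0f} TB")
print(f"ZSTD lake: {zstd_lake:,.0f} TB ({savings_pct:.0f}% fewer bytes)")
print(f"Monthly storage delta: ${(lz4_lake - zstd_lake) * PRICE_PER_TB_MONTH:,.0f}")
```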
Common Pitfalls or Trade-offs
1. Choosing one codec for all stages. What is optimal for a producer under burst load is rarely optimal for archived analytics. Mix codecs across lifecycle stages.
2. Setting ZSTD to a very high level everywhere. High levels improve ratio but can spike CPU and slow flushes, which hurts near real-time alerting. Use low levels for streaming and higher levels for batch.
3. Ignoring decompression speed. Reads dominate the cost for search and query. A codec with a great ratio but slow decompression increases tail latency.
4. Overlooking dictionaries. ZSTD dictionaries trained on representative samples can materially boost ratio for small messages. Keep them versioned and rotate them when schemas drift (see the sketch after this list).
5. Using gzip by habit. gzip often loses on both speed and ratio relative to modern codecs. Only keep it for legacy toolchains or when you need the exact format.
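Here is a minimal dictionary-training sketch, assuming the `zstandard` Python package; the sample messages and the 4 KB dictionary size are illustrative.

```python
# Training and using a ZSTD dictionary for small, repetitive log messages,
# assuming the `zstandard` package. Samples and dictionary size are illustrative.
import zstandard

# Train on a representative sample of real messages, not a synthetic one like this.
samples = [
    f'{{"service":"checkout","order_id":{i},"level":"INFO","msg":"order created"}}'.encode()
    for i in range(5_000)
]
dictionary = zstandard.train_dictionary(4_096, samples)  # version this artifact

compressor = zstandard.ZstdCompressor(level=3, dict_data=dictionary)
decompressor = zstandard.ZstdDecompressor(dict_data=dictionary)

msg = samples[0]
packed = compressor.compress(msg)
assert decompressor.decompress(packed) == msg
print(f"raw={len(msg)}B  with dictionary={len(packed)}B")
```

Whatever you train on, store the dictionary next to the data or in a registry and record its version, because a frame compressed with a dictionary cannot be decoded without it.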
Interview Tip
When asked which codec you would use, start with the requirement that drives the decision. For example, say that producer CPU is the bottleneck and latency for incident tailing must stay low, so you pick LZ4 in the hot path. Then add that you would recompress compacted files in the data lake with ZSTD for a better long-term ratio. Finish with a quick numeric sketch. Speak in terms of throughput, ratio, and dollars saved. This structure signals architect-level thinking.
Key Takeaways
- Use LZ4 for logs in the hot path where ingest speed and low CPU dominate
- Use ZSTD for analytics files where storage and scan cost dominate
- Tune ZSTD level based on stage (low for streaming, medium to high for batch)
- Train ZSTD dictionaries for small repetitive messages
- Recompress during compaction to move from LZ4 to ZSTD without slowing producers
Comparison Table
| Criterion | LZ4 | ZSTD | Snappy | gzip |
|---|---|---|---|---|
| Typical ratio | About 2–3× | About 3–5× | About 2× | About 2–3× |
| Compression speed | Very high | High at low levels | High | Low |
| Decompression speed | Very high | High | High | Medium |
| CPU cost | Very low | Low to medium (depends on level) | Low | Medium to high |
| Streaming suitability | Excellent for logs and brokers | Good at low levels | Good | Poor |
| Best fit | Hot path logs and tailing | Data lake and columnar analytics | Legacy pipelines | Legacy compatibility only |
FAQs
Q1. When should I use LZ4 for logs?
Use LZ4 when producers and consumers must keep latency low, when CPU is tight, and when you expect spikes in event volume. Its very fast decode makes incident tailing and search responsive.
Q2. When should I use ZSTD for analytics?
Use ZSTD in your lake or warehouse during batch compaction so you get strong ratio and faster scans with columnar formats. Pick a moderate level that keeps decode fast.
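As a concrete sketch, a compaction job can set the codec and level when it writes Parquet; this assumes the `pyarrow` package, and the table contents, file name, and level are placeholders.

```python
# Writing a compacted analytics file with ZSTD-compressed Parquet pages,
# assuming the `pyarrow` package. Table contents, path, and level are placeholders.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "ts": [1_700_000_000, 1_700_000_060],
    "service": ["checkout", "search"],
    "latency_ms": [12, 87],
})

# A moderate level keeps decode fast while still shrinking scan bytes.
pq.write_table(table, "events.zstd.parquet", compression="zstd", compression_level=7)
```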
Q3. Can I mix LZ4 and ZSTD in one pipeline?
Yes. Use LZ4 in the hot path for agents and brokers. During daily compaction or ETL, recompress artifacts to ZSTD for long-term storage and analytics.
Q4. Which ZSTD level should I start with?
Start low for streaming, such as level 1–3. For batch files, try level 5–8 and measure. Move up only if the ratio gains are meaningful compared to the CPU increase.
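A minimal measurement sketch, assuming the `zstandard` package; the payload here is synthetic, so run it against your own files before settling on a level.

```python
# Comparing ratio and compression time at a few ZSTD levels, assuming the
# `zstandard` package. The payload is synthetic; benchmark your own data.
import time
import zstandard

payload = b'{"level":"INFO","path":"/checkout","status":200,"latency_ms":42}\n' * 50_000

for level in (1, 3, 8, 15):
    start = time.perf_counter()
    packed = zstandard.ZstdCompressor(level=level).compress(payload)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"level {level:>2}: ratio {len(payload) / len(packed):5.1f}x  "
          f"compress {elapsed_ms:6.1f} ms")
```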
Q5. Do ZSTD dictionaries help with logs?
They can help when messages share many tokens such as repeated keys or templates. Train on a representative sample, version the dictionary, and roll updates as schemas change.
Q6. What metrics should I track while deciding?
Track ingest CPU, producer flush latency, broker throughput, compression ratio by topic or table, query scan bytes, and tail latency of common reads. Tie these to cost so the trade is visible.
Further Learning
Build your intuition for codec trade-offs inside complete architectures by exploring the foundational patterns in Grokking System Design Fundamentals.
For deeper end-to-end planning across streaming and lakes with performance tuning and cost modeling, level up with Grokking Scalable Systems for Interviews.