Parquet vs ORC vs Avro: Choosing Formats for Data Lakes
Choosing between Parquet, ORC, and Avro is a key decision in designing efficient and scalable data lakes. These formats determine how data is stored, compressed, and queried. Parquet and ORC are columnar, which makes them ideal for analytics, while Avro is row-based, optimized for streaming and data interchange. The right combination of these formats can make your data lake both fast and cost-effective.
Why It Matters
Data format directly impacts query speed, storage cost, and pipeline complexity. Columnar formats like Parquet and ORC excel at analytical workloads with selective queries. Avro, being row-oriented, is great for write-heavy streaming systems and schema evolution. In system design interviews, explaining why you’d choose one format over another demonstrates strong data architecture reasoning.
How It Works (Step-by-Step)
- Understand the access pattern
  - Use columnar (Parquet/ORC) for analytical queries that read a few columns across many rows.
  - Use row-based (Avro) for streaming or transactional writes.
- Lifecycle of data
  - Raw data often lands in Avro for flexible schema evolution.
  - Transformed, aggregated data moves to Parquet or ORC for efficient querying.
- Compression and encoding
  - Parquet and ORC compress and encode each column separately (e.g., dictionary or run-length encoding).
  - Avro applies compression at the block level, balancing size and CPU cost.
- Schema management
  - Avro supports strong forward and backward compatibility when schemas are tracked in a schema registry (see the schema-evolution sketch after this list).
  - Parquet and ORC support evolution, but changes like renames or type conversions require caution.
- Performance tuning
  - For Parquet: adjust row group size (128–512 MB on disk is typical; see the write sketch after this list).
  - For ORC: tune stripe size and enable Bloom filters.
  - For Avro: manage block size and compression codec to optimize streaming throughput.
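To ground the schema-management step, here is a minimal sketch of backward-compatible Avro evolution, assuming the fastavro library; the Event record, its fields, the default value, and the codec choice are illustrative, not a prescribed schema.

```python
import io
import fastavro

# Writer schema (v1): the layout old producers used.
schema_v1 = {
    "type": "record", "name": "Event",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "action", "type": "string"},
    ],
}

# Reader schema (v2): adds a field WITH a default, so data written
# under v1 can still be decoded; this is backward compatibility.
schema_v2 = {
    "type": "record", "name": "Event",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "action", "type": "string"},
        {"name": "device", "type": "string", "default": "unknown"},
    ],
}

buf = io.BytesIO()
# codec="deflate" compresses at the block level, as described above.
fastavro.writer(buf, schema_v1,
                [{"user_id": "u1", "action": "play"}],
                codec="deflate")
buf.seek(0)

# Decode v1 data through the v2 reader schema; "device" takes its default.
for record in fastavro.reader(buf, reader_schema=schema_v2):
    print(record)  # {'user_id': 'u1', 'action': 'play', 'device': 'unknown'}
```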
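Likewise, a sketch of the Parquet tuning knobs using pyarrow; the column names and parameter values are assumptions for illustration, and the right row group size depends on how large your rows are on disk.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Toy table standing in for a curated, analytics-ready dataset.
table = pa.table({
    "user_id": ["u1", "u2", "u3"],
    "watch_time_s": [120, 300, 45],
})

pq.write_table(
    table,
    "events.parquet",
    row_group_size=1_000_000,     # rows per row group; aim for ~128-512 MB on disk
    compression="zstd",           # codec applied column chunk by column chunk
    use_dictionary=["user_id"],   # dictionary-encode low-cardinality columns only
)
```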
Real-World Example
At Netflix, raw events (user activity, playback logs) are first ingested in Avro to handle high write rates and evolving schemas. Later, these events are converted into Parquet for analytical queries in Spark and Presto. This two-stage design enables flexibility during ingestion and high performance during analytics.
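A rough PySpark sketch of this two-stage pattern (an illustration of the general approach, not Netflix's actual pipeline; it assumes the spark-avro package is available, and the paths and partition column are hypothetical):

```python
from pyspark.sql import SparkSession

# Assumes spark-avro is on the classpath, e.g.
#   spark-submit --packages org.apache.spark:spark-avro_2.12:3.5.0 ...
spark = SparkSession.builder.appName("avro-to-parquet").getOrCreate()

# Stage 1 output: raw events that landed in Avro during ingestion.
raw = spark.read.format("avro").load("s3://lake/raw/events/")

# Stage 2: rewrite into a curated, partitioned Parquet zone for analytics.
(
    raw.write
    .mode("overwrite")
    .partitionBy("event_date")            # hypothetical partition column
    .option("compression", "snappy")
    .parquet("s3://lake/curated/events/")
)
```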
Common Pitfalls or Trade-offs
- Using Avro for analytics – since it’s row-based, it forces full-row scans and slow queries.
- Choosing Parquet/ORC for heavy real-time ingestion – their metadata overhead makes them a poor fit for frequent small writes.
- Mismatched schema evolution – renames or type changes can break downstream readers.
- Ignoring file size – too many small files (well below 128 MB) slow down query planning; compact them periodically (see the sketch after this list).
- Ecosystem mismatch – ORC performs best in Hive; Parquet is widely supported by modern engines like Athena and BigQuery.
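For the small-files pitfall, a minimal compaction sketch with pyarrow; the paths are hypothetical, and it assumes the dataset fits in memory (a real lake would compact partition by partition or use an engine like Spark):

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Treat a directory of many small Parquet files as one logical dataset.
small_files = ds.dataset("lake/curated/events_small/", format="parquet")

# Materialize and rewrite as one larger file with sizable row groups.
table = small_files.to_table()
pq.write_table(
    table,
    "lake/curated/events_compacted/part-00000.parquet",
    row_group_size=1_000_000,  # rows per group; target ~128-512 MB files
)
```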
Interview Tip
Interviewers may ask: “If your data lake handles streaming events and analytical queries, which format will you choose and why?” An ideal answer: “Avro for raw ingestion because of its schema flexibility, and Parquet for curated analytical datasets because of its columnar efficiency.”
Key Takeaways
- Parquet and ORC are columnar and optimized for read-heavy analytics.
- Avro is row-based and better suited to streaming and schema evolution.
- Columnar formats support predicate pushdown and column pruning, improving scan performance (see the sketch after this list).
- Avro maintains backward and forward compatibility when paired with a schema registry.
- In modern data lakes, combining Avro (raw) with Parquet or ORC (curated) yields the best results.
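The columnar-read benefit from the takeaways can be shown in a few lines of pyarrow; the column names and filter value are illustrative:

```python
import pyarrow.dataset as ds

# Column pruning: only the listed columns are read from disk.
# Predicate pushdown: row groups whose min/max statistics rule out
# the filter value are skipped without being decoded.
events = ds.dataset("lake/curated/events/", format="parquet")
table = events.to_table(
    columns=["user_id", "watch_time_s"],            # prune to two columns
    filter=ds.field("event_date") == "2024-01-01",  # pushed down to row groups
)
print(table.num_rows)
```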
Comparison Table
| Aspect | Parquet | ORC | Avro | When to Choose |
|---|---|---|---|---|
| Layout | Columnar | Columnar | Row-oriented | Analytics vs. Streaming |
| Predicate Pushdown | Yes | Yes | No | Needed for analytics |
| Column Pruning | Strong | Strong | Not supported | When queries use few columns |
| Compression | Per-column | Per-column | Per-block | Avro for streaming, Parquet/ORC for scans |
| Schema Evolution | Limited rename support | Limited rename support | Full forward/backward compatibility | Avro for changing schemas |
| Indexes/Stats | Min-max, count | Min-max, Bloom filters | None | ORC best for selective queries |
| Ecosystem Support | Wide (Spark, BigQuery, Trino) | Hive, Trino | Kafka, Schema Registry | Match ecosystem |
| Typical File Size | 128–512 MB | 128–512 MB | Variable (block-level) | Based on workload |
| Best Fit | Analytics and dashboards | Hive-based analytics | Streaming ingestion | Mix per lake zone |
FAQs
Q1. Which format is best for analytical queries?
Parquet is generally preferred for analytics due to its wide adoption, high compression, and efficient column pruning.
Q2. When should I use Avro?
Use Avro for streaming ingestion, log data, or when schemas evolve frequently.
Q3. Does ORC outperform Parquet?
In Hive-based ecosystems, yes. ORC’s internal indexes and Bloom filters can make certain queries faster.
Q4. Is Parquet good for real-time writes?
No. It’s optimized for batch writes, not for continuous small updates.
Q5. How do schema changes affect each format?
Avro supports full forward and backward compatibility when schemas are managed in a registry. Parquet and ORC allow adding columns but may break readers on renames or type changes.
Q6. Can I use multiple formats in the same data lake?
Yes. Many modern data lakes use Avro for raw storage and Parquet or ORC for processed analytical zones.
Further Learning
To deepen your understanding of scalable data formats and data lake design:
- Explore Grokking System Design Fundamentals for practical lessons on storage, compression, and partitioning.
- Master real-world case studies in Grokking Scalable Systems for Interviews to learn how data formats impact performance at scale.
- For interview preparation, check out Grokking the System Design Interview for structured frameworks and deep-dive examples.