How would you design blob compaction (merge small files) in object stores?
Blob compaction is the practice of merging many small objects into a few large ones inside an object store. Think of it as cleaning a cluttered drawer so you can find things faster and pay less for every open and close. In data platforms and content stores it is a foundational move that turns chatty tiny writes into smooth, high-throughput reads.
Why It Matters
Object stores charge per request and maintain metadata per object. Thousands of tiny files cost more, list slower, and throttle analytics jobs. Large sequential reads are faster on cloud storage because throughput per request is high while per-request overhead is fixed. In a system design interview, you are often asked to trade off request cost, durability, and freshness. Compaction sits at that intersection and shows how you think about scalable architecture and distributed systems under real operational limits.
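As a rough back-of-the-envelope illustration (the 10 GB scan size and the object sizes are made-up numbers, chosen only to show the shape of the trade-off):

```python
# How many GET requests does it take to scan 10 GB of data?
total_bytes = 10 * 2**30
for label, size in [("1 MB objects", 2**20), ("512 MB compacted objects", 512 * 2**20)]:
    print(f"{label}: {total_bytes // size} GET requests")
# -> 10240 requests versus 20: request charges and per-request overhead
#    drop by roughly 500x before any other tuning.
```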
How It Works Step by Step
1. Choose a partitioning scheme. Group files by a stable dimension such as date and hour, user shard, or topic. This bounds the working set and lets you compact in parallel without cross talk.
2. Ingest as append-only segments. Producers write small segment files quickly to avoid backpressure. They tag each segment with a partition and a monotonic sequence number or event time. Never modify segments in place.
3. Track a manifest per partition. Maintain a small manifest object that lists the current segments for that partition. Include a version number, a watermark that signals what time range is safe to compact, and checksums for sanity.
4. Schedule compaction intelligently. A lightweight coordinator selects partitions that exceed thresholds such as segment count, total bytes, or age. Apply rate limits so compaction never contends with live traffic.
5. Acquire exclusive work on a partition. Use a lease in a coordination store or a conditional update on the manifest version. This prevents two compactors from rewriting the same partition.
6. Build the compacted object with multipart upload. Stream segments in order, optionally sort by key, deduplicate by id, and compress. Write a single large object via multipart upload with a deterministic name that includes the target manifest version.
7. Publish a new manifest that points to the compacted object. Write a new manifest version that references the single big object and removes the small ones from the active set. Keep the previous manifest for rollback. Readers always fetch the latest manifest and then follow its pointers, which gives atomic visibility. A minimal sketch of steps 3 through 7 appears after this list.
8. Garbage collect with a safety delay. Delete old segments only after a grace period or when you know no reader is still on the prior manifest. A lifecycle rule can auto-expire files marked as obsolete.
9. Handle late or duplicate arrivals. If new segments arrive after compaction, either trigger a quick incremental compaction or add them to the next cycle. Use idempotent event ids to avoid double counting.
10. Observe and tune. Key metrics include average object size per partition, small file count, compaction backlog time, write amplification, and the rate of late arrivals. Alert when partitions remain uncompacted beyond target windows.
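Here is a minimal Python sketch of the manifest-driven compact-and-swap loop from steps 3 through 7, assuming an S3-compatible store accessed through boto3. The bucket name, partition layout, manifest schema, thresholds, and the IfMatch conditional-write precondition are all illustrative assumptions, not a prescribed implementation; if your store lacks conditional puts, take a lease in a coordination service instead.

```python
import gzip
import io
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "events-lake"                             # illustrative bucket
PART = "clicks/date=2025-11-11/hour=10"            # illustrative partition prefix
MANIFEST_KEY = f"{PART}/manifest/current.json"     # illustrative manifest layout


def load_manifest():
    obj = s3.get_object(Bucket=BUCKET, Key=MANIFEST_KEY)
    return json.loads(obj["Body"].read()), obj["ETag"]


def compact(segments, target_version):
    """Merge small NDJSON segments into one large gzip object via multipart upload."""
    out_key = f"{PART}/compacted/v{target_version:06d}.json.gz"   # deterministic name
    mpu = s3.create_multipart_upload(Bucket=BUCKET, Key=out_key)
    parts, part_no, buf = [], 1, io.BytesIO()
    gz = gzip.GzipFile(fileobj=buf, mode="wb")

    def flush_part():
        nonlocal part_no
        data = buf.getvalue()
        if not data:
            return
        resp = s3.upload_part(Bucket=BUCKET, Key=out_key, UploadId=mpu["UploadId"],
                              PartNumber=part_no, Body=data)
        parts.append({"ETag": resp["ETag"], "PartNumber": part_no})
        part_no += 1
        buf.seek(0)
        buf.truncate()

    for seg in segments:
        gz.write(s3.get_object(Bucket=BUCKET, Key=seg)["Body"].read())
        if buf.tell() > 64 * 1024 * 1024:   # ship ~64 MB parts (above the 5 MB minimum)
            gz.flush()
            flush_part()
    gz.close()                               # writes the gzip trailer into buf
    flush_part()
    s3.complete_multipart_upload(Bucket=BUCKET, Key=out_key, UploadId=mpu["UploadId"],
                                 MultipartUpload={"Parts": parts})
    return out_key


def run_once():
    manifest, etag = load_manifest()
    segments = manifest["segments"]
    if len(segments) < 20:                   # threshold: leave small backlogs alone
        return
    new_version = manifest["version"] + 1
    big_object = compact(segments, new_version)
    new_manifest = {
        "version": new_version,
        "segments": [big_object],            # the active set is now a single object
        "obsolete": segments,                # garbage collect later, after a grace period
        "watermark": manifest["watermark"],
    }
    # Publish atomically: the put succeeds only if nobody else changed the manifest
    # since we read it. IfMatch support is assumed here; without it, use a lease.
    s3.put_object(Bucket=BUCKET, Key=MANIFEST_KEY,
                  Body=json.dumps(new_manifest).encode(),
                  IfMatch=etag)
```

The deterministic output name and the version check on publish are what make a crashed or duplicated compactor safe to retry.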
Engineering Details That Impress Seniors
- Atomic swap is implemented by a manifest rewrite, not by rewriting many reader paths.
- Idempotency comes from deterministic output naming and from manifest version checks.
- Readers never scan the raw prefix. They only read the manifest and then the listed objects (a reader sketch follows this list).
- Target sizes matter. Analytics engines prefer 128 megabytes to one gigabyte per object. Archival workloads like even larger sizes. Streaming consumers prefer smaller increments.
- Cross-region replication doubles write work. Compacting before replication reduces replication churn.
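To make the point about never scanning the raw prefix concrete, here is a minimal reader sketch. It reuses the same illustrative manifest layout as the compaction sketch above; none of the key names are a fixed convention.

```python
import json

import boto3


def read_partition(bucket: str, partition: str):
    """Yield the raw bytes of every object the partition's manifest currently lists."""
    s3 = boto3.client("s3")
    # Readers never list the raw prefix; they fetch the small manifest object and
    # follow its pointers, which is what makes the compaction swap appear atomic.
    manifest_key = f"{partition}/manifest/current.json"   # illustrative layout
    manifest = json.loads(s3.get_object(Bucket=bucket, Key=manifest_key)["Body"].read())
    for key in manifest["segments"]:                      # only objects the manifest lists
        yield s3.get_object(Bucket=bucket, Key=key)["Body"].read()
```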
Real World Example
Imagine a social network that ingests click events and media moderation logs into cloud object storage. Writers drop small JSON segments every minute into prefixes like dataset/date=2025-11-11/hour=10. A compactor runs every fifteen minutes. For each partition whose segment count exceeds a threshold, it reads the segments, converts them to Parquet, and writes a single 512-megabyte object. It then atomically updates the manifest. Analytics jobs in Spark read only the compacted Parquet paths from the manifest and finish in a fraction of the time because they avoid small file overhead and enjoy column pruning and compression.
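A hedged sketch of the analytics side, assuming the same illustrative bucket, partition layout, and manifest schema as above (the page column is also made up): the Spark job reads only the Parquet objects the manifest lists instead of globbing the raw prefix.

```python
import json

import boto3
from pyspark.sql import SparkSession


def load_manifest_keys(bucket: str, partition: str):
    # Same illustrative manifest layout as the compaction sketch above.
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=bucket, Key=f"{partition}/manifest/current.json")["Body"].read()
    return json.loads(body)["segments"]


spark = SparkSession.builder.appName("clicks-hourly-report").getOrCreate()

keys = load_manifest_keys("events-lake", "clicks/date=2025-11-11/hour=10")
paths = [f"s3a://events-lake/{key}" for key in keys]

df = spark.read.parquet(*paths)        # a few large Parquet files => few, well-sized tasks
df.groupBy("page").count().show()      # column pruning reads only the columns the query needs
```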
Common Pitfalls or Trade-offs
- Non-atomic reader view: If readers discover data by listing prefixes, they may see both small and large files and double count. Always route reads through a manifest layer.
- Thrashing hot partitions: If you compact files that are still receiving writes, you will re-compact over and over. Require a watermark based on event time or a minimum file age.
- Write amplification and cost spikes: Large partitions take time to rewrite. Use leveling. For instance, merge twenty small files into one medium file, then later merge medium files into one large file (see the sketch after this list).
- Eventual list consistency surprises: Some stores may return stale listings. The manifest approach sidesteps this by relying on a single small object with strong read patterns.
- Replication and retention rules: Deleting old segments before replication completes can create gaps in a replica. Apply safety delays or check replication status. Also confirm that retention policies will not remove active data referenced by manifests.
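To make the leveling idea concrete, here is a small store-agnostic sketch; the size thresholds and batch counts are illustrative knobs, not recommended values.

```python
SMALL = 8 * 2**20       # objects under ~8 MB count as level 0 (illustrative threshold)
MEDIUM = 128 * 2**20    # objects under ~128 MB count as level 1; larger ones are level 2


def plan_leveled_merge(files):
    """files: list of (key, size_bytes) pairs. Return one batch to merge, or None."""
    level0 = [key for key, size in files if size < SMALL]
    level1 = [key for key, size in files if SMALL <= size < MEDIUM]
    if len(level0) >= 20:   # many tiny files: merge them into one medium file
        return level0
    if len(level1) >= 8:    # enough medium files: merge them into one large file
        return level1
    return None             # nothing worth rewriting yet, so avoid write amplification
```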
Interview Tip
Interviewers often ask how you make the cutover safe. A strong answer is to say that readers consult a manifest named current inside each partition. The compactor writes the new big object, writes a new manifest version, then atomically swaps the logical current pointer to the new manifest. Old files are kept until a grace period ends. Describe how you ensure idempotency with deterministic names and conditional writes on the manifest version.
Key Takeaways
- Blob compaction converts many small objects into a few large ones to reduce request overhead and speed up scans.
- Use partitioning, manifests, and atomic swaps to guarantee consistent reads during compaction.
- Choose target sizes by workload and use leveling to avoid long pause times.
- Delay deletion and handle late arrivals to prevent data loss or double counting.
- Measure small file ratio, backlog time, and write amplification to guide tuning.
Comparison Table
| Approach | When it fits | Strengths | Limitations |
|---|---|---|---|
| Blob compaction with manifest | Data lakes, logs, media metadata, batch analytics | Atomic reader view, low request cost, high scan speed | Background rewrite cost, needs coordinator and GC |
| Roll-over large objects | Streaming ingest where you can write directly to large files | Fewer rewrites, steady write pattern | Object stores lack true append, requires careful staging |
| Table formats with built-in metadata | Complex analytics with schema evolution and snapshots | Rich manifests, time travel, partition pruning | Extra metadata management and commit protocol |
| Keep many small files | Very low volume or archival where simplicity wins | No background jobs, trivial writes | Slow scans, high request cost, poor cache hit rates |
| Message log systems as primary store | Short retention streams with consumers downstream | Ordering, backpressure, built-in compaction options | Not ideal for long term cold storage or ad hoc analytics |
FAQs
Q1. What is blob compaction in object stores?
It is a background process that merges many small objects in a partition into one or a few large objects, then updates a manifest so readers see a single atomic view.
Q2. How big should compacted files be?
Pick sizes that match your engines and network. A common target is between one hundred megabytes and one gigabyte. Measure throughput and tune per workload.
Q3. How do I ensure readers do not double count during compaction?
Force all readers to fetch the partition manifest and read only the objects it lists. Update the manifest version atomically, then garbage collect old segments later.
Q4. How do I handle late arriving files?
Keep accepting segments but mark them as late. Either run an incremental mini compaction that merges them into a new compacted object or include them in the next scheduled pass.
Q5. Do I still need compaction if I already write Parquet?
Yes if you write many tiny Parquet files. The format helps with compression and column pruning but small file overhead still hurts. Merge them into fewer files.
Q6. What should I monitor to know if compaction is healthy?
Watch small file ratio, average object size, number of partitions awaiting compaction, compaction duration, read latency for analytics, and error rates in manifest updates.
Further Learning
Start with a solid foundation on partitioning, caching, and storage models in Grokking System Design Fundamentals.
If you want to practice trade-offs like atomic swaps and lifecycle policies at scale, enroll in Grokking Scalable Systems for Interviews and master the blueprint for production-grade data platforms.