How would you design CDC pipelines to sync OLTP→OLAP near real‑time?
Change Data Capture (CDC) lets you stream inserts, updates, and deletes from an OLTP database into an OLAP warehouse with seconds-to-minutes delay. Instead of running heavy queries against your production database, you subscribe to its commit log and continuously upsert changes into an analytics-optimized store such as BigQuery, Snowflake, Redshift, or ClickHouse. This pattern delivers fresh insights without hurting transaction performance, and designing it well is a key skill to demonstrate in system design interviews.
Why It Matters
Businesses rely on real-time analytics for monitoring performance, fraud detection, personalization, and operational dashboards. Batch ETL often introduces delays of hours, making insights stale and decision-making reactive. CDC-based OLTP→OLAP synchronization allows you to keep analytical systems almost up-to-date while maintaining the scalability and isolation of your transactional system. In interviews, this design showcases your ability to balance latency, data freshness, and consistency.
How It Works (Step-by-Step)
1. Capture from the Source Commit Log: Use log-based CDC rather than polling queries. For MySQL, read from the binlog; for Postgres, the write-ahead log (WAL); for MongoDB, the oplog. This keeps the impact on OLTP performance minimal while capturing every change event (a minimal connector-registration sketch follows these steps).
2. Represent Change Events: Each change event carries the operation type (insert/update/delete), the before and after row images, the commit timestamp, the transaction ID, and table metadata. This lets downstream systems apply the right transformations accurately (see the change-event sketch after this list).
3. Register and Manage Schemas: Integrate a schema registry (e.g., Confluent Schema Registry) with Avro or Protobuf serialization. Handle backward-compatible changes gracefully: add new columns with defaults and avoid breaking modifications (a schema-registration sketch follows the list).
4. Stream Through a Durable Bus: Publish events to Kafka, Pub/Sub, or Kinesis, partitioned by primary key. This preserves per-key ordering (all changes for a given row land in the same partition, in commit order) and provides fault tolerance and replayability.
5. Enrich and Validate in Stream: Apply lightweight transformations such as data type normalization, timezone alignment, or null handling. Avoid heavy joins or business logic here; push those to the OLAP stage.
6. Stage for Efficient OLAP Loads: Batch small CDC events into optimized columnar files such as Parquet. Store them temporarily in object storage (e.g., S3, GCS) before merging into the target OLAP system (see the Parquet staging sketch below).
7. Upsert into OLAP: Use efficient merge operations (MERGE INTO in Snowflake or BigQuery). Deduplicate using the source commit ID or CDC sequence number, and handle deletes with tombstone records or soft-delete flags.
8. Ensure Idempotency and Fault Recovery: Maintain an offset checkpoint for every micro-batch. If the pipeline crashes, resume from the last committed offset to avoid duplicate processing; idempotent upserts are critical for correctness (see the checkpointing sketch below).
9. Bootstrap with a Snapshot, Then Switch to Streaming: Start with a consistent snapshot of the OLTP data, note the binlog offset at which it was taken, and begin streaming from that position. This avoids data loss or duplication during the cutover.
10. Monitor, Validate, and Reprocess: Track metrics such as end-to-end latency, event lag, error rates, and data drift. Keep historical retention so you can replay the log and rebuild target tables if required.
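To make step 1 concrete, here is a minimal sketch of registering a log-based connector with Kafka Connect, assuming a Debezium MySQL setup. The hostnames, credentials, table list, and topic prefix are placeholders, and the config key names follow Debezium 2.x conventions, so check them against the version you deploy.

```python
# Register a Debezium MySQL connector with Kafka Connect (step 1).
# All endpoints, credentials, and names below are illustrative placeholders.
import requests

connector_config = {
    "name": "orders-mysql-cdc",                        # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql.internal",         # placeholder host
        "database.port": "3306",
        "database.user": "cdc_user",
        "database.password": "cdc_password",
        "database.server.id": "184054",                # unique ID for binlog replication
        "topic.prefix": "oltp",                        # topics become oltp.<db>.<table>
        "table.include.list": "shop.orders,shop.order_items",
        "snapshot.mode": "initial",                    # consistent snapshot, then stream (step 9)
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-history.oltp",
    },
}

resp = requests.post("http://connect.internal:8083/connectors", json=connector_config)
resp.raise_for_status()
print("Connector registered:", resp.json()["name"])
```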
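For steps 2 and 4, the sketch below shows what a change event might carry and how it could be published to Kafka keyed by primary key, so all changes for one row stay in one partition in commit order. The event fields, topic naming, and broker address are illustrative assumptions, not Debezium's exact wire format.

```python
# A CDC change event (step 2) published to Kafka keyed by primary key (step 4).
import json
from dataclasses import dataclass, asdict
from typing import Optional
from confluent_kafka import Producer

@dataclass
class ChangeEvent:
    op: str                      # "c" (insert), "u" (update), "d" (delete)
    table: str
    before: Optional[dict]       # row image before the change (None for inserts)
    after: Optional[dict]        # row image after the change (None for deletes)
    ts_ms: int                   # commit timestamp in epoch millis
    tx_id: str                   # source transaction ID
    source_offset: str           # binlog position, used for dedup downstream

producer = Producer({"bootstrap.servers": "kafka:9092"})  # placeholder brokers

def publish(event: ChangeEvent, primary_key: str) -> None:
    # Keying by primary key preserves per-row ordering within a partition.
    producer.produce(
        f"cdc.{event.table}",
        key=primary_key.encode(),
        value=json.dumps(asdict(event)).encode(),
    )

publish(
    ChangeEvent(op="u", table="orders",
                before={"id": 42, "status": "new"},
                after={"id": 42, "status": "paid"},
                ts_ms=1700000000000, tx_id="tx-981",
                source_offset="binlog.000012:45678"),
    primary_key="42",
)
producer.flush()
```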
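For step 3, a small sketch of registering an Avro schema and pinning backward compatibility, assuming Confluent Schema Registry and the SchemaRegistryClient bundled with confluent-kafka-python. The registry URL, subject name, and schema are made up for illustration.

```python
# Register a value schema for the orders CDC topic and enforce backward
# compatibility so adding an optional column does not break consumers.
import json
from confluent_kafka.schema_registry import SchemaRegistryClient, Schema

sr = SchemaRegistryClient({"url": "http://schema-registry.internal:8081"})  # placeholder URL

orders_value_schema = json.dumps({
    "type": "record",
    "name": "OrderChange",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "status", "type": "string"},
        # New column added with a default => backward-compatible change.
        {"name": "discount", "type": ["null", "double"], "default": None},
    ],
})

subject = "cdc.orders-value"                                  # hypothetical subject
sr.set_compatibility(subject_name=subject, level="BACKWARD")
schema_id = sr.register_schema(subject, Schema(orders_value_schema, "AVRO"))
print("registered schema id:", schema_id)
```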
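For steps 5 and 6, here is a sketch of flushing a micro-batch of enriched events into a single Parquet file in object storage using pyarrow. The bucket path, batch contents, and batching window are assumptions; the point is to write a few large columnar files rather than many tiny ones.

```python
# Stage micro-batches of CDC events as Parquet in S3 (steps 5-6).
import time
import uuid
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

BUCKET_PREFIX = "analytics-staging/cdc/orders"   # hypothetical S3 location
s3 = fs.S3FileSystem()                            # credentials/region from environment

def flush_batch(events: list[dict]) -> str:
    """Write one micro-batch of change events to a single Parquet file."""
    table = pa.Table.from_pylist(events)
    path = f"{BUCKET_PREFIX}/batch-{int(time.time())}-{uuid.uuid4().hex}.parquet"
    pq.write_table(table, path, filesystem=s3, compression="zstd")
    return path

# Accumulate events for ~30 seconds (or N events), then flush once:
batch = [
    {"op": "u", "id": 42, "status": "paid", "ts_ms": 1700000000000,
     "source_offset": "binlog.000012:45678"},
    {"op": "d", "id": 43, "status": None, "ts_ms": 1700000000500,
     "source_offset": "binlog.000012:45710"},
]
print("staged:", flush_batch(batch))
```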
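For step 8, a minimal checkpointing sketch: advance a durable offset marker only after the micro-batch has been merged into the OLAP table, so a restart resumes from the last committed offset. The file-based store here is an assumption; a metadata table or Kafka consumer-group offsets play the same role.

```python
# Offset checkpointing for fault recovery (step 8).
import json
import os
from typing import Optional

CHECKPOINT_FILE = "checkpoints/orders.json"   # hypothetical path

def load_checkpoint() -> Optional[str]:
    if not os.path.exists(CHECKPOINT_FILE):
        return None
    with open(CHECKPOINT_FILE) as f:
        return json.load(f)["last_offset"]

def save_checkpoint(offset: str) -> None:
    os.makedirs(os.path.dirname(CHECKPOINT_FILE), exist_ok=True)
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"last_offset": offset}, f)
    os.replace(tmp, CHECKPOINT_FILE)          # atomic swap: a crash never half-writes

def process_batch(events: list[dict]) -> None:
    # 1) Upsert the batch into the OLAP table (idempotent MERGE, step 7).
    # 2) Only after that merge commits, advance the checkpoint.
    save_checkpoint(events[-1]["source_offset"])

resume_from = load_checkpoint()
print("resuming after offset:", resume_from)
```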
Real-World Example
At Instagram, MySQL handles user activity (likes, comments, follows). Using Debezium, MySQL changes are streamed to Kafka. A Spark job enriches and validates the data, then writes micro-batches in Parquet to S3 every 30 seconds. ClickHouse ingests these micro-batches using ReplacingMergeTree, supporting low-latency analytics for dashboards and recommendation algorithms. Latency remains under one minute, and schema evolution is safely handled via the schema registry.
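As a hedged sketch of what the ClickHouse side of such a setup could look like: a ReplacingMergeTree table keyed by a version column, so replayed or duplicated CDC events collapse to the latest row on merge. The database, table, columns, and clickhouse-connect usage are illustrative assumptions, not the actual production schema.

```python
# Target table for deduplicated CDC ingestion in ClickHouse.
import clickhouse_connect

client = clickhouse_connect.get_client(host="clickhouse.internal")  # placeholder host

client.command("""
CREATE TABLE IF NOT EXISTS analytics.user_actions
(
    user_id     UInt64,
    action_id   UInt64,
    action_type LowCardinality(String),
    ts          DateTime64(3),
    is_deleted  UInt8,            -- tombstones arrive as soft-delete flags
    version     UInt64            -- CDC sequence / binlog offset as a number
)
ENGINE = ReplacingMergeTree(version)
ORDER BY (user_id, action_id)
""")

# Query with FINAL to see only the latest surviving version of each row:
rows = client.query(
    "SELECT count() FROM analytics.user_actions FINAL WHERE is_deleted = 0"
).result_rows
print(rows)
```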
Common Pitfalls or Trade-offs
- Polling instead of log-based CDC increases OLTP load and misses deletes.
- Unmanaged schema changes break pipelines unexpectedly.
- Writing too many small files hurts OLAP performance; batch properly.
- Doing complex joins in streaming causes slowdowns; move heavy transformations downstream.
- Lacking replay or audit mechanisms complicates recovery from bugs.
Interview Tip
A frequent interview question is: “How would you ensure exactly-once delivery when syncing MySQL to Snowflake?”
A solid answer: delivery from MySQL through Kafka is at-least-once, so you achieve effectively-once results by making the apply step idempotent. Carry a change ID (the binlog offset) with every event, land each micro-batch in a staging table, and apply it with a single atomic MERGE that deduplicates on that change ID, so re-delivered events are harmless (see the sketch below).
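As a sketch of that answer, here is what the staging-plus-MERGE step might look like in Snowflake SQL driven from Python. Table names, columns, and connection details are assumptions, and the change_seq column stands in for the binlog offset.

```python
# Idempotent apply: keep only the newest change per key in the batch,
# skip changes older than what the target already has, apply deletes explicitly.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="loader", password="***",   # placeholders
    warehouse="LOAD_WH", database="ANALYTICS", schema="PUBLIC",
)

MERGE_SQL = """
MERGE INTO orders AS t
USING (
    -- newest change per primary key within this micro-batch
    SELECT *
    FROM staging_orders_changes
    QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY change_seq DESC) = 1
) AS s
ON t.id = s.id
WHEN MATCHED AND s.op = 'd' THEN DELETE
WHEN MATCHED AND s.change_seq > t.change_seq THEN
    UPDATE SET t.status = s.status, t.amount = s.amount, t.change_seq = s.change_seq
WHEN NOT MATCHED AND s.op <> 'd' THEN
    INSERT (id, status, amount, change_seq) VALUES (s.id, s.status, s.amount, s.change_seq)
"""

cur = conn.cursor()
try:
    cur.execute(MERGE_SQL)   # re-running the same staged batch is a no-op
    conn.commit()
finally:
    cur.close()
    conn.close()
```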
Key Takeaways
- Log-based CDC ensures minimal OLTP load and high data fidelity.
- Upserts and idempotent processing keep OLAP data accurate.
- Schema evolution and replay capability are critical for robustness.
- Micro-batch writes improve efficiency for analytical stores.
- Observability ensures freshness and correctness of the pipeline.
Comparison of CDC Approaches
| Approach | Latency | OLTP Load | Captures Deletes | Schema Change Handling | Complexity | When to Use |
|---|---|---|---|---|---|---|
| Log-based CDC | Seconds to minutes | Low | Yes | Excellent (registry) | Medium | Default for scalable systems |
| Trigger-based CDC | Minutes | Medium | Yes | Manual | Medium | When log access is unavailable |
| Query Polling | Minutes to hours | High | Rarely | Manual diffs | Low | For small data or temporary use |
FAQs
Q1. What is Change Data Capture in OLTP→OLAP pipelines?
It is the technique of continuously capturing database changes from OLTP systems and syncing them to OLAP stores for near real-time analytics.
Q2. Why use log-based CDC instead of polling?
Log-based CDC reads database commit logs directly, ensuring low latency, minimal source impact, and accurate detection of deletes and updates.
Q3. How do you prevent duplicate processing?
Maintain checkpoints and use idempotent operations like MERGE or deduplication keys to ensure each change is applied exactly once.
Q4. How are deletes handled?
CDC emits tombstone events for deleted rows, which the target OLAP system interprets to remove or mark data as deleted.
Q5. What is the best format for OLAP ingestion?
Parquet or ORC formats are preferred for their compression, columnar storage, and efficient analytical reads.
Q6. What tools are commonly used for CDC pipelines?
Popular tools include Debezium, Kafka Connect, AWS DMS, and stream processors like Flink, Spark Structured Streaming, and Airbyte.
Further Learning
For deeper understanding of CDC, replication, and streaming architectures, explore Grokking Scalable Systems for Interviews.
To master data pipelines, consistency, and performance fundamentals, check out Grokking System Design Fundamentals.
You can also sharpen interview readiness with Grokking the System Design Interview, which is perfect for practicing real-world trade-offs like CDC, caching, and event-driven design.