How do you implement the Outbox pattern to guarantee delivery?
Reliable events are the glue between transactional data and async workflows. The Outbox pattern lets you save state and publish an event without losing either, even when networks are flaky or services crash. Below is a complete, interview-ready guide you can use to build and reason about guaranteed delivery in a scalable architecture.
Introduction
The Outbox pattern records a message in the same database transaction as your business update, then ships that message to a broker using a separate process. If the database commit succeeds, the event is durably captured in the outbox table and will be retried until it is delivered. If the commit fails, no event is produced. This gives you atomicity between data and side effects, with at-least-once delivery to downstream services.
Why It Matters
Distributed systems fail in creative ways. A service can write to the database but crash before it publishes to Kafka or another broker. Or it can publish to the broker and crash before writing to the database that represents truth. Outbox eliminates that split-brain scenario by making the database the source of truth for both the state and the intent to publish. In system design interviews, this pattern shows you can guarantee delivery, avoid dual writes, and reason clearly about exactly-once effects using idempotency.
How It Works Step by Step
- **Create an outbox table.** Add a table that holds events to publish. Typical columns include id, aggregate_id, event_type, payload, headers, created_at, available_at, published_at, attempts, last_error, and a status or version.
- **Write domain data and the outbox row in one transaction.** When processing a request, update your domain tables and insert one or more outbox rows inside the same database transaction. If the transaction commits, both are persisted; if it rolls back, nothing is saved.
- **Run a relay to publish events.** A background worker polls the outbox for unsent rows. It selects a batch, marks the rows in progress so other workers skip them, publishes each to the message broker, and then marks them sent by setting published_at and status.
- **Use idempotency keys.** Put a stable event id in the message and require consumers to process idempotently. A consumer can store processed ids in a table or dedupe on natural keys such as order_id plus a sequence number. This converts at-least-once delivery into exactly-once effects at the consumer boundary.
- **Handle retries with backoff.** If publishing fails, increment attempts and retry with exponential backoff. After N failures, move the row to a dead-letter state and alert. Keep the payload so the message can be replayed.
- **Preserve ordering where required.** If a consumer needs ordering per aggregate, publish with the aggregate id as the partition key. The relay should process rows for that key ordered by created_at.
- **Scale safely.** Multiple relays can run in parallel if they use row-level locks. For example, the relay can claim rows with `SELECT ... FOR UPDATE SKIP LOCKED` so workers do not fight over the same events.
- **Automate cleanup.** Once published, keep rows for a retention window for auditing, then archive or delete them. If a regulator requires long retention, compress payloads or move old records to cold storage.
- **Consider change data capture as an alternative to polling.** Instead of polling the table, tail the database write-ahead log with a CDC tool, filter for the outbox table, and publish from there. This reduces polling load and improves freshness.
- **Observe and alert.** Track metrics such as queue time in the outbox, publish latency, error rate, and dead-letter counts. Expose a dashboard and alarms.
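Under the assumption of a relational store, the first two steps can be sketched as follows. SQLite is shown as a stand-in for the service database, and the table, column, and function names are illustrative, not a prescribed schema:

```python
import json
import sqlite3
import uuid

# Illustrative schema: a domain table plus the outbox table described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    order_id TEXT PRIMARY KEY,
    total_cents INTEGER NOT NULL
);
CREATE TABLE outbox (
    id TEXT PRIMARY KEY,                      -- stable event id, doubles as idempotency key
    aggregate_id TEXT NOT NULL,               -- e.g. order_id, also the partition key
    event_type TEXT NOT NULL,
    payload TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',   -- pending | in_progress | sent | dead
    attempts INTEGER NOT NULL DEFAULT 0,
    created_at TEXT NOT NULL DEFAULT (datetime('now')),
    published_at TEXT
);
CREATE INDEX idx_outbox_status_created ON outbox (status, created_at);
""")

def place_order(order_id: str, total_cents: int) -> None:
    """Write the domain row and the outbox row in one transaction."""
    with conn:  # commits both inserts together, or rolls back both on error
        conn.execute(
            "INSERT INTO orders (order_id, total_cents) VALUES (?, ?)",
            (order_id, total_cents),
        )
        conn.execute(
            "INSERT INTO outbox (id, aggregate_id, event_type, payload) "
            "VALUES (?, ?, ?, ?)",
            (str(uuid.uuid4()), order_id, "OrderCreated",
             json.dumps({"order_id": order_id, "total_cents": total_cents})),
        )

place_order("order-1", 4999)  # order row and pending outbox row, one commit
```

The relay never sees a half-written pair: either both rows exist after commit, or neither does.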
Real World Example
Consider an e-commerce Order Service. A user places an order. The service must persist the order and notify the Inventory, Payment, and Email services.
- The handler writes an Order row in the database and inserts an outbox row with event_type OrderCreated and the order payload. This is a single transaction.
- A relay reads unsent rows and publishes OrderCreated to the broker with key equal to order_id so all events for the same order stay in order.
- Inventory and Payment consume the event idempotently. If the same OrderCreated arrives twice, they check a processed_events table and skip the duplicate.
- If the broker is down, the event remains in the outbox. The relay retries with backoff until the broker is back. Delivery is guaranteed because the intent is durably stored with the order.
This pattern appears in large social feeds, ride-sharing trip state, and media upload pipelines. It is a simple way to get strong reliability without distributed transactions.
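A minimal relay pass for the example above might look like the following sketch. The broker is replaced by a stand-in `publish` function so the outage-and-retry behavior is visible, and the backoff bookkeeping is deliberately simplified:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE outbox (
    id TEXT PRIMARY KEY, aggregate_id TEXT, payload TEXT,
    status TEXT DEFAULT 'pending', attempts INTEGER DEFAULT 0, last_error TEXT)""")
conn.execute("INSERT INTO outbox (id, aggregate_id, payload) "
             "VALUES ('e1', 'order-1', '{}')")
conn.commit()

MAX_ATTEMPTS = 5
broker_down = True  # toggled below to simulate a broker outage

def publish(aggregate_id: str, payload: str) -> None:
    """Stand-in for a real broker client; raises while the broker is down."""
    if broker_down:
        raise ConnectionError("broker unavailable")

def relay_once() -> None:
    """One polling pass: read pending rows, publish each, mark sent or retry."""
    rows = conn.execute(
        "SELECT id, aggregate_id, payload, attempts FROM outbox "
        "WHERE status = 'pending' LIMIT 100").fetchall()
    for event_id, agg_id, payload, attempts in rows:
        try:
            publish(agg_id, payload)
            conn.execute("UPDATE outbox SET status = 'sent' WHERE id = ?",
                         (event_id,))
        except ConnectionError as exc:
            attempts += 1
            status = "dead" if attempts >= MAX_ATTEMPTS else "pending"
            # A production relay would also set available_at to
            # now + 2 ** attempts seconds for exponential backoff.
            conn.execute(
                "UPDATE outbox SET attempts = ?, status = ?, last_error = ? "
                "WHERE id = ?", (attempts, status, str(exc), event_id))
    conn.commit()

relay_once()          # outage: the row stays pending with attempts = 1
broker_down = False
relay_once()          # broker back: the row is published and marked sent
```

The event survives the outage precisely because failure only updates bookkeeping columns; the row and payload stay in place until a pass succeeds.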
Common Pitfalls and Trade-offs
- **Dual writes without a transaction.** Writing the domain table and producing to the broker as two separate steps can lose or duplicate events. Always insert into the outbox in the same transaction as the domain write.
- **Deleting outbox rows too early.** If you delete rows immediately after publishing, you lose the ability to replay or audit. Keep a retention period and move poison events to a dead-letter state.
- **Non-idempotent consumers.** At-least-once delivery means duplicates will happen. If consumers are not idempotent, you cannot claim exactly-once effects. Use natural idempotency keys and conditional updates such as insert-if-not-exists or updates with version checks.
- **Global ordering expectations.** Outbox can preserve ordering per key, but global ordering across keys is not realistic at scale. Align ordering guarantees with real business needs.
- **Relay hot spots.** A large batch size or scanning the entire outbox can overload storage. Use paged queries on created_at and a composite index on status plus created_at. Consider CDC when write volume is high.
- **Transactional boundaries across services.** Outbox guarantees atomicity within one service. For multi-service sagas you still need compensating actions for cross-service failures.
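The insert-if-not-exists approach to consumer idempotency can be sketched as below. The `processed_events` and `reservations` table names are illustrative, and SQLite's `INSERT OR IGNORE` stands in for PostgreSQL's `INSERT ... ON CONFLICT DO NOTHING`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE processed_events (event_id TEXT PRIMARY KEY);
CREATE TABLE reservations (order_id TEXT PRIMARY KEY, qty INTEGER);
""")

def handle_order_created(event_id: str, order_id: str, qty: int) -> bool:
    """Apply the event's effect exactly once; return False for duplicates."""
    with conn:  # dedupe record and side effect commit in one transaction
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event_id,))
        if cur.rowcount == 0:
            return False  # event id already recorded: skip the duplicate
        conn.execute(
            "INSERT INTO reservations (order_id, qty) VALUES (?, ?)",
            (order_id, qty))
        return True

first = handle_order_created("evt-1", "order-1", 2)      # applied
duplicate = handle_order_created("evt-1", "order-1", 2)  # skipped
```

Committing the dedupe record and the side effect together is the key detail: if they were separate transactions, a crash between them would reopen the duplicate window.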
Interview Tip
Interviewers often ask how to avoid the classic dual-write problem. A crisp answer: put an outbox row in the same database transaction as the state change, then have a background relay publish and mark rows as sent, with idempotent consumers to handle duplicates. Mention row locking with `SKIP LOCKED`, a dead-letter state, and partition keys for per-aggregate ordering.
Key Takeaways
- Outbox makes data and event publishing atomic by using one database transaction.
- Delivery to the broker is at-least-once and becomes exactly-once at the consumer via idempotency.
- A relay polls or uses CDC to publish, retries on failure, and records publish state.
- Ordering is per key, such as aggregate id, not global.
- Monitoring, backoff, and dead-letter handling turn the pattern into a production-grade solution.
Comparison of Approaches
| Approach | Primary guarantee | Best fit | Main cost |
|---|---|---|---|
| Outbox with relay | Atomic state plus publish intent with at least once delivery | Service with a relational store that emits events | Needs a worker and retention management |
| Outbox with CDC | Same guarantees with lower polling overhead | High write volume or tighter freshness | Extra infra and operational work |
| Two-phase commit | Atomic commit across a few participants | Small sets of resources with strict atomicity | Blocking coordinator and complex recovery |
| Direct publish without outbox | No atomicity with the database | Low-value events where some loss is acceptable | Risk of lost or duplicated events |
| Saga choreography | Event-driven consistency with compensation | Long-running business workflows | More moving parts and careful failure design |
FAQs
Q1. How do I scale the relay safely?
Run multiple workers that claim rows with `SELECT ... FOR UPDATE SKIP LOCKED` so they work on disjoint batches. Use small batches, backoff, and partition-aware publishing to keep order per key.
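In PostgreSQL the claim step is typically one atomic statement. The sketch below shows that query as a comment, then simulates the no-overlap property in SQLite (which lacks `SKIP LOCKED`) with a single claiming `UPDATE`; all names are illustrative:

```python
import sqlite3

# A PostgreSQL claim query that competing relay workers could run (sketch):
#
#   UPDATE outbox SET status = 'in_progress'
#   WHERE id IN (SELECT id FROM outbox
#                WHERE status = 'pending'
#                ORDER BY created_at
#                LIMIT 100
#                FOR UPDATE SKIP LOCKED)
#   RETURNING id, payload;
#
# Below, one UPDATE flips rows to a per-worker status, so two workers can
# never claim the same row even when they poll the same table.

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox (id TEXT PRIMARY KEY, status TEXT DEFAULT 'pending')")
conn.executemany("INSERT INTO outbox (id) VALUES (?)",
                 [("e1",), ("e2",), ("e3",)])
conn.commit()

def claim_batch(worker: str, limit: int) -> list:
    """Atomically move up to `limit` pending rows into this worker's batch."""
    with conn:
        conn.execute(
            "UPDATE outbox SET status = 'claimed:' || ? WHERE id IN "
            "(SELECT id FROM outbox WHERE status = 'pending' ORDER BY id LIMIT ?)",
            (worker, limit))
    return [r[0] for r in conn.execute(
        "SELECT id FROM outbox WHERE status = 'claimed:' || ?", (worker,))]

batch_a = claim_batch("a", 2)  # claims e1 and e2
batch_b = claim_batch("b", 2)  # only e3 remains; no overlap with batch_a
```

`SKIP LOCKED` adds the property the simulation cannot show: a worker holding locked rows does not block the next worker, which simply skips past them.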
Q2. Do I need a separate database for the outbox table?
Usually not. Co-locate it with your service database so a single transaction covers both the domain write and the outbox insert. Use indexing and retention to keep it healthy.
Q3. When should I prefer CDC over polling?
Pick CDC when insert volume is high, latency needs are tight, or the poller starts to add load. Polling is simpler and works well for moderate traffic.
Q4. How do I test an Outbox implementation?
Test the happy path plus these cases: a transaction rollback leaves no outbox row; a broker outage keeps rows pending and retries them later; duplicate delivery does not break consumers; poison events move to the dead-letter state and trigger alerts.
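The rollback case is straightforward to pin down as a unit test. A sketch with SQLite standing in for the service database and a flag injecting a mid-transaction failure:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id TEXT PRIMARY KEY);
CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT);
""")

def place_order(order_id: str, fail_before_commit: bool = False) -> None:
    """Domain write plus outbox insert; the flag simulates a crash pre-commit."""
    with conn:  # one transaction covering both writes
        conn.execute("INSERT INTO orders (order_id) VALUES (?)", (order_id,))
        if fail_before_commit:
            raise RuntimeError("simulated crash before commit")
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (order_id,))

place_order("order-1")  # happy path: order row and outbox row both committed
try:
    place_order("order-2", fail_before_commit=True)
except RuntimeError:
    pass  # the whole transaction rolled back: no order row, no outbox row
```

Asserting afterwards that exactly one order and one outbox row exist verifies the atomicity claim; the same shape extends to the outage and duplicate-delivery cases.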
Q5. What problem does the Outbox pattern solve?
It prevents dual writes by making the database the source of truth for both the state change and the intent to publish. If the transaction commits, the event exists and will be delivered.
Q6. Is Outbox enough for exactly once processing?
Outbox gives at-least-once delivery. Combine it with idempotent consumers and deduplication to achieve exactly-once effects at the consumer.
Further Learning
- Build a strong foundation for reliable event driven design with Grokking System Design Fundamentals
- Turn these ideas into production ready skills with Grokking Scalable Systems for Interviews