Designing systems using event-driven architecture principles
Event-driven architecture (EDA) is a design pattern where system components communicate by producing and consuming events—notifications of state changes—rather than making direct synchronous calls to each other. When a customer places an order, the order service publishes an OrderPlaced event; the payment service, inventory service, notification service, and analytics service each consume that event independently, without the order service knowing they exist. This decoupling is EDA's core strength and its core complexity. In system design interviews, interviewers test whether you understand when EDA is the right choice—not just what it is. The best system designers know that most service communication should remain synchronous, and EDA should be reserved for specific scenarios where its benefits outweigh its significant complexity.
Key Takeaways
- EDA decouples producers from consumers: the service that publishes an event does not know—or care—which services consume it. This enables independent deployment, scaling, and evolution of services.
- Three core EDA patterns appear in interviews: pub/sub (one-to-many broadcast), event sourcing (storing events as the source of truth), and CQRS (separating read and write models). Know when to apply each.
- Kafka is the default event streaming platform for system design interviews in 2026. Mention it with specifics: topics, partitions, consumer groups, offset management, and retention policies.
- The saga pattern handles distributed transactions in EDA—coordinating multi-service operations through compensating transactions when a step fails.
- EDA is not always the answer. If a service needs to call another service and wait for the result, use a synchronous API call. If you have a fixed set of known integrations, direct calls are simpler. Reserve EDA for scenarios requiring true decoupling, fan-out to unknown consumers, event replay, or real-time stream processing.
Event-Driven vs Request-Response: The Core Trade-Off
| Dimension | Request-Response (Synchronous) | Event-Driven (Asynchronous) |
|---|---|---|
| Coupling | Tight—caller knows the callee | Loose—producer does not know consumers |
| Latency | Immediate response | No immediate response; eventual processing |
| Failure handling | Caller handles errors directly | Events are retried from the broker; consumer failures are isolated |
| Scaling | Both services must scale together | Producer and consumers scale independently |
| Complexity | Low—simple request/response flow | High—event ordering, idempotency, eventual consistency |
| Debugging | Linear request trace | Distributed event flow across multiple consumers |
| Best for | "Do this and tell me the result" | "Something happened—react as you see fit" |
Interview insight: Do not default to EDA for all communication. The order service charging a payment card needs a synchronous response—"Did the charge succeed?" That is a direct API call. The order service notifying the analytics service that an order was placed is a fire-and-forget event. Mix both patterns in the same system based on the specific interaction.
The Three Core EDA Patterns
1. Publish/Subscribe (Pub/Sub)
How it works: A producer publishes an event to a topic. Multiple consumers subscribe to the topic and each receives a copy of the event. The producer does not know how many consumers exist or what they do with the event.
Example: When a user uploads a photo, the upload service publishes a PhotoUploaded event. The thumbnail service generates thumbnails. The content moderation service scans for policy violations. The notification service alerts followers. The analytics service logs the upload. All four consumers operate independently—adding a fifth consumer requires zero changes to the upload service.
When to use: Fan-out scenarios where one event triggers multiple independent reactions. Systems where new consumers are frequently added. Integration platforms where external developers subscribe to events (Shopify publishes order events; thousands of merchant apps consume them).
Implementation: Kafka topics with consumer groups. Each consumer group receives all messages independently. Within a consumer group, messages are distributed across instances for parallel processing.
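The fan-out behavior can be sketched with a minimal in-memory event bus (in production, Kafka topics and consumer groups play this role; the class and handler names here are illustrative, not a real Kafka API):

```python
from collections import defaultdict
from typing import Callable

class EventBus:
    """Minimal in-memory pub/sub bus: every subscriber gets its own copy of each event."""
    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of handler callables

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # The producer does not know who (or how many) consumers exist.
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
processed = []

# Four independent consumers of the same PhotoUploaded event
bus.subscribe("photo-uploaded", lambda e: processed.append(("thumbnail", e["photo_id"])))
bus.subscribe("photo-uploaded", lambda e: processed.append(("moderation", e["photo_id"])))
bus.subscribe("photo-uploaded", lambda e: processed.append(("notify", e["photo_id"])))
bus.subscribe("photo-uploaded", lambda e: processed.append(("analytics", e["photo_id"])))

bus.publish("photo-uploaded", {"photo_id": 42})
# All four consumers saw the event; adding a fifth changes nothing in the producer.
```

Note that the publisher's code never changes as subscribers are added, which is exactly the decoupling property the pattern exists to provide.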
2. Event Sourcing
How it works: Instead of storing only the current state of an entity (the traditional CRUD model), event sourcing stores the full sequence of events that produced the current state. The current state is derived by replaying the event log.
Example: A banking account stores every transaction as an event: AccountOpened, DepositMade(500), WithdrawalMade(200), DepositMade(300). The current balance (600) is computed by replaying these events. The event log is immutable—no data is ever deleted or overwritten.
Benefits: Complete audit trail of every change. Ability to rebuild the current state at any point in time. Ability to replay events through new business logic—a fraud detection system can retroactively analyze historical transactions through updated rules. Investment banks replay months of trades through updated risk models.
Trade-offs: Event storage grows indefinitely (requires snapshotting for performance). Rebuilding state from millions of events is slow without periodic snapshots. Increased complexity compared to simple CRUD. Not appropriate for systems where the event history has no business value.
When to use: Financial systems requiring audit trails. Systems where temporal queries matter ("What was the state at time T?"). Scenarios where replaying history through new logic has business value.
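The banking example above can be expressed as a replay over the event log. This is a sketch: event names mirror the text, and a production system would add periodic snapshots so it does not replay millions of events from the beginning:

```python
# Event sourcing sketch: current state is derived by replaying an immutable event log.
events = [
    ("AccountOpened", 0),
    ("DepositMade", 500),
    ("WithdrawalMade", 200),
    ("DepositMade", 300),
]

def replay(event_log):
    """Rebuild the current balance by folding over the full event history."""
    balance = 0
    for event_type, amount in event_log:
        if event_type == "DepositMade":
            balance += amount
        elif event_type == "WithdrawalMade":
            balance -= amount
    return balance

print(replay(events))        # 600 -- the balance from the text
print(replay(events[:3]))    # 300 -- a temporal query: "what was the balance at time T?"
```

Replaying a prefix of the log is what makes temporal queries trivial; replaying the whole log through new logic is what makes retroactive analysis possible.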
3. CQRS (Command Query Responsibility Segregation)
How it works: Separate the write model (commands that modify state) from the read model (queries that return state). Write operations go to a write-optimized store. Events propagate changes to a read-optimized store (often denormalized for fast queries).
Example: In an e-commerce platform, the order service writes to a normalized PostgreSQL database (optimized for transactional integrity). An event stream propagates order data to an Elasticsearch index (optimized for search) and a Redis cache (optimized for dashboard queries). Each read model is tailored to its specific access pattern.
Benefits: Independent scaling of reads and writes. Each store is optimized for its access pattern. Read models can be rebuilt from the event stream if corrupted.
Trade-offs: Eventual consistency between write and read models. Increased operational complexity (multiple stores to maintain). Not justified for simple CRUD applications with balanced read/write ratios.
When to use: Systems with dramatically different read and write patterns (1000:1 read-to-write ratio). Systems requiring multiple query patterns over the same data. High-scale systems where read and write loads must scale independently.
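A toy projection makes the split concrete. In this sketch the dicts stand in for PostgreSQL (write side) and Redis/Elasticsearch (read side), and the projection runs synchronously; in a real CQRS system the event would flow through Kafka, making the read model eventually consistent:

```python
# CQRS sketch: commands mutate the write model; events project into a
# denormalized read model optimized for queries. All names are illustrative.
write_store = {}   # normalized system of record (stands in for PostgreSQL)
read_store = {}    # denormalized per-customer view (stands in for Redis or Elasticsearch)

def place_order(order_id, customer, items):
    """Command: write to the normalized store, then emit an event to the projector."""
    write_store[order_id] = {"customer": customer, "items": items}
    project_order_placed({"order_id": order_id, "customer": customer, "items": items})

def project_order_placed(event):
    """Event handler: maintain a per-customer summary the dashboard can read cheaply."""
    summary = read_store.setdefault(event["customer"], {"orders": 0, "item_count": 0})
    summary["orders"] += 1
    summary["item_count"] += len(event["items"])

place_order("o-1", "alice", ["book", "pen"])
place_order("o-2", "alice", ["mug"])
# The query side reads a precomputed summary without touching the write model:
print(read_store["alice"])  # {'orders': 2, 'item_count': 3}
```

If the read store were corrupted, it could be rebuilt by replaying the event stream through `project_order_placed`, which is the rebuild property mentioned above.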
Kafka: The Interview-Standard Event Platform
Apache Kafka is the default event streaming platform for system design interviews. Know these specifics.
Topics and partitions: Events are published to topics. Each topic is divided into partitions for parallelism. Messages within a partition are ordered; across partitions, ordering is not guaranteed. Choose the partition key carefully—user_id ensures all events for one user are ordered.
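The per-key ordering guarantee follows from deterministic key hashing. Kafka's actual default partitioner uses murmur2; this sketch uses MD5 purely to show the same property with the standard library:

```python
# Key-based partitioning sketch (Kafka's real partitioner uses murmur2;
# any deterministic hash demonstrates the same per-key ordering property).
import hashlib

NUM_PARTITIONS = 6  # illustrative partition count

def partition_for(key: str) -> int:
    """Map a key to a partition deterministically."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Every event keyed by the same user_id lands on the same partition,
# so Kafka's per-partition ordering becomes per-user ordering.
assert all(partition_for("user-123") == partition_for("user-123") for _ in range(10))
```

This is also why changing the partition count on a live topic is disruptive: the key-to-partition mapping shifts, and per-key ordering breaks across the boundary.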
Consumer groups: Each consumer group receives all messages from a topic. Partitions are distributed across consumers within a group. Adding consumers to a group increases parallelism (up to the number of partitions).
Retention and replay: Kafka retains messages for a configurable period (default 7 days, but many production systems use 30 days or indefinite). Consumers can replay from any offset, enabling reprocessing of historical events through updated logic.
Delivery semantics: At-most-once (messages may be lost), at-least-once (messages may be duplicated—requires idempotent consumers), exactly-once (highest guarantee, highest overhead, supported by Kafka transactions).
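At-least-once delivery means a consumer must tolerate duplicates. A common approach is deduplication by event ID, sketched here with an in-memory set (a real consumer would persist the processed IDs, e.g. in a database table, atomically with the state change):

```python
# Idempotent consumer sketch: at-least-once delivery can redeliver events,
# so the consumer records processed event IDs and makes reprocessing a no-op.
processed_ids = set()    # in production: a durable store, not process memory
balance = {"amount": 0}

def handle_payment_event(event):
    if event["event_id"] in processed_ids:
        return  # duplicate delivery: safely ignore
    processed_ids.add(event["event_id"])
    balance["amount"] += event["amount"]

handle_payment_event({"event_id": "evt-1", "amount": 100})
handle_payment_event({"event_id": "evt-1", "amount": 100})  # redelivered duplicate
print(balance["amount"])  # 100, not 200
```

Without the ID check, the redelivered event would double-charge; with it, processing the same event twice produces the same result, which is the definition of idempotency interviewers look for.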
Interview application: "I would use a Kafka topic called 'order-events' with 50 partitions, keyed by order_id for ordering guarantees within an order. Three consumer groups subscribe independently: payment-processor, inventory-updater, and notification-sender. Each consumer group processes events at its own pace. If the notification service goes down for 30 minutes, it resumes from its last committed offset when it recovers—no events are lost."
Kafka vs alternatives: RabbitMQ for traditional message queuing (point-to-point, complex routing). AWS SQS for managed queuing without operational overhead. AWS SNS for simple fan-out. Kafka for high-throughput event streaming with replay capability. "I chose Kafka over SQS because we need event replay for our analytics pipeline to reprocess historical data when we update our models."
The Saga Pattern: Distributed Transactions in EDA
In a monolith, placing an order is a single database transaction: deduct inventory, charge payment, create shipment record—all or nothing. In EDA with independent services, this atomicity is lost. The saga pattern provides an alternative.
How it works: A saga is a sequence of local transactions. Each service performs its operation and publishes an event. If a step fails, compensating transactions undo the previous steps.
Example flow:
1. Order service publishes OrderPlaced.
2. Payment service processes the payment and publishes PaymentSucceeded.
3. Inventory service reserves stock and publishes StockReserved.
4. Shipping service creates the shipment and publishes ShipmentCreated.
If payment fails: Payment service publishes PaymentFailed. Order service consumes it and executes a compensating transaction: cancel the order and publish OrderCancelled.
If shipping fails after payment succeeded: Shipping service publishes ShippingFailed. Payment service consumes it and issues a refund (compensating transaction). Order service cancels the order.
Interview application: "For the e-commerce order flow, I would use a choreography-based saga. Each service reacts to events and publishes its own. If any step fails, compensating events trigger rollbacks in preceding services. I would ensure all consumers are idempotent—processing the same event twice produces the same result—because at-least-once delivery means duplicates are possible."
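The saga's two failure paths can be sketched as a single flow that records which events fire. This is a simplification: each step would really be a separate service reacting to events over a broker, and the event names follow the text above:

```python
# Choreography saga sketch: each step is a local transaction; on failure,
# compensating events undo the work already done, in reverse.
def run_order_saga(payment_ok=True, shipping_ok=True):
    log = ["OrderPlaced"]

    if not payment_ok:
        log.append("PaymentFailed")
        log.append("OrderCancelled")     # compensation: order service cancels
        return log
    log.append("PaymentSucceeded")
    log.append("StockReserved")

    if not shipping_ok:
        log.append("ShippingFailed")
        log.append("PaymentRefunded")    # compensation: payment service refunds
        log.append("OrderCancelled")     # compensation: order service cancels
        return log
    log.append("ShipmentCreated")
    return log

print(run_order_saga())                   # happy path: all four events fire
print(run_order_saga(shipping_ok=False))  # compensations fire after the failure
```

Notice that there is no global transaction anywhere: consistency is restored by explicitly undoing completed steps, which is why every compensating action must itself be safe to retry.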
The Outbox Pattern: Reliable Event Publishing
A common failure mode: a service commits a database transaction but crashes before publishing the corresponding event, leaving the system in an inconsistent state.
Solution: Write the business data and the event to the same database in a single ACID transaction. The event is written to an "outbox" table. A separate process (CDC via Debezium, or a polling worker) reads from the outbox table and publishes events to Kafka. Once published, the outbox record is marked as processed.
Interview application: "To guarantee that every order is both persisted and published as an event, I would use the outbox pattern. The order service writes the order record and the OrderPlaced event to the same PostgreSQL database in a single transaction. Debezium captures changes to the outbox table via CDC and publishes them to Kafka. This eliminates the dual-write problem."
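The pattern's core mechanic, one ACID transaction covering both the business row and the outbox row, can be sketched with SQLite. The relay here is a simple poller; Debezium-style CDC serves the same role with lower latency, and the table schema is illustrative:

```python
# Outbox pattern sketch: the order row and its event are written in ONE
# transaction, so there is no window where one exists without the other.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
             "event_type TEXT, payload TEXT, published INTEGER DEFAULT 0)")

def place_order(order_id, total):
    with conn:  # single ACID transaction: both inserts commit or neither does
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        conn.execute("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
                     ("OrderPlaced", json.dumps({"order_id": order_id, "total": total})))

def relay_outbox(publish):
    """Poller sketch: publish unpublished events, then mark them processed."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))  # stands in for a Kafka produce call
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    conn.commit()

place_order("o-1", 99.0)
published = []
relay_outbox(lambda t, p: published.append((t, p)))
print(published)  # [('OrderPlaced', {'order_id': 'o-1', 'total': 99.0})]
```

If the service crashes after `place_order` commits but before the relay runs, the event is still sitting in the outbox and gets published on the next poll, which is precisely how the dual-write problem is eliminated.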
When NOT to Use Event-Driven Architecture
This is the section that earns the highest interview points. Knowing when to avoid EDA demonstrates mature judgment.
Do not use EDA when:
- A service needs a synchronous response (payment processing that needs an immediate success/failure).
- You have a fixed set of known integrations (three internal services consuming order data—just call their APIs).
- The system is a simple CRUD application with no fan-out, no replay, and no real-time processing requirements.
- The team lacks the operational maturity to manage Kafka, handle event ordering, and debug distributed event flows.
Interview phrasing: "I would not use EDA for the payment-to-order confirmation path because the user is waiting for a response. I would use a synchronous gRPC call. However, I would use EDA for post-order processing—notifying the warehouse, updating analytics, sending confirmation emails—because these are fire-and-forget operations where decoupling and independent scaling provide real value."
For structured practice applying EDA patterns across complete system design problems, Grokking the System Design Interview covers event-driven design as a core architectural pattern.
For advanced EDA patterns including event sourcing at production scale, distributed sagas, and stream processing architectures, Grokking the Advanced System Design Interview provides the depth required for L6+ interviews. The system design interview guide maps how EDA discussions fit into the overall interview framework.
Frequently Asked Questions
What is event-driven architecture in system design?
A design pattern where components communicate by producing and consuming events rather than direct synchronous calls. A producer publishes an event ("order placed"), and consumers react independently (process payment, update inventory, send notification). This decouples services, enabling independent scaling and deployment.
When should I use event-driven architecture in an interview?
When you need fan-out to multiple independent consumers, event replay for analytics or reprocessing, real-time stream processing, audit trails, or integration with unknown external systems. Do not use EDA when a synchronous response is needed or when direct API calls between known services are simpler.
What is the difference between pub/sub and event sourcing?
Pub/sub is a communication pattern—one producer broadcasts events to multiple consumers. Event sourcing is a storage pattern—storing the full sequence of events as the source of truth rather than just the current state. You can use pub/sub without event sourcing and event sourcing without pub/sub, though they often appear together.
How does the saga pattern work?
A saga coordinates multi-service transactions through a sequence of local transactions and compensating transactions. Each service performs its operation and publishes an event. If any step fails, compensating events trigger rollbacks in preceding services. This replaces distributed ACID transactions in event-driven microservices.
Why is Kafka the default for system design interviews?
Kafka provides high-throughput event streaming (millions of messages/second), durable message retention with replay capability, partitioned topics for parallelism, and consumer groups for independent consumption. These features map directly to interview scenarios: fan-out, replay, ordering, and independent scaling.
What is the outbox pattern?
A solution for reliable event publishing. The service writes business data and the event to the same database in one ACID transaction. A separate process (CDC or polling) reads from the outbox table and publishes to Kafka. This prevents the dual-write problem where a database commit succeeds but event publishing fails.
What is CQRS and when should I use it?
Command Query Responsibility Segregation separates write operations (commands) from read operations (queries) into different models and stores. Use it when read and write patterns differ dramatically (1000:1 read-to-write ratio) or when multiple query patterns require different data representations.
How do I handle event ordering in a distributed system?
Kafka guarantees ordering within a partition. Choose a partition key that groups related events—user_id ensures all events for one user are processed in order. Across partitions, ordering is not guaranteed. If global ordering is required, use a single partition (at the cost of throughput).
What are the main challenges of event-driven architecture?
Increased debugging complexity (distributed event flows vs linear request traces), eventual consistency between services, event ordering guarantees, idempotency requirements (at-least-once delivery means duplicates), schema evolution for event formats, and operational overhead of managing a message broker like Kafka.
Should I always choose EDA over synchronous communication?
No. Most service communication should be synchronous API calls. EDA adds significant complexity. Reserve it for scenarios where its benefits—decoupling, fan-out, replay, independent scaling—clearly outweigh the costs. Interviewers reward candidates who know when not to use a pattern.
TL;DR
Event-driven architecture decouples services through asynchronous event communication—producers publish events, consumers react independently. Three core patterns: pub/sub (fan-out to multiple consumers), event sourcing (events as the source of truth with full audit trail and replay), and CQRS (separate read and write models). Kafka is the interview-standard platform—know topics, partitions, consumer groups, offsets, and delivery semantics. The saga pattern handles distributed transactions through compensating events when steps fail. The outbox pattern ensures reliable event publishing by writing business data and events in a single database transaction. The highest-scoring interview insight: EDA is not always the answer. Use synchronous calls when a service needs an immediate response. Use EDA when you need fan-out, replay, real-time processing, or true service decoupling. Interviewers reward the judgment to choose the right pattern for each interaction, not the reflexive application of EDA everywhere.