On this page
What stream processing means (and why you need it)
The critical distinction: message brokers vs stream processors
The three stream processing paradigms
Kafka Streams
Apache Flink
Apache Storm
Side-by-side comparison
Decision framework: which stream processor when?
How this shows up in system design interviews
Common follow-up questions
Putting it all together
Keep learning
Kafka Streams vs Apache Flink vs Apache Storm: A Complete Guide to Stream Processing

On This Page
What stream processing means (and why you need it)
The critical distinction: message brokers vs stream processors
The three stream processing paradigms
Kafka Streams
Apache Flink
Apache Storm
Side-by-side comparison
Decision framework: which stream processor when?
How this shows up in system design interviews
Common follow-up questions
Putting it all together
Keep learning
A quick pop quiz: is Kafka a stream processor?
If you said yes, you're in the majority — and the majority is wrong. Kafka is a message broker. Kafka Streams is a stream processor that happens to read from and write to Kafka. They share a name but do completely different things, and conflating them is the single most common mistake engineers make when they start working with real-time data.
This guide walks through the three stream processors you'll actually evaluate in practice: Kafka Streams, Apache Flink, and Apache Storm. By the end, you'll know:
- What stream processing actually means — and why it's different from just using a message broker
- The critical broker-vs-processor distinction (this is the section that will save you in interviews)
- How Kafka Streams, Flink, and Storm each handle the hard problems: state, time, fault tolerance, exactly-once delivery
- Which one to pick in which situation, using a decision framework senior engineers use
- How this comes up in system design interviews, and what answers interviewers are listening for
Let's dig in.
What stream processing means (and why you need it)
Imagine you're building a fraud detection system for a payment platform. Every second, millions of transactions flow through. Your job: catch fraudulent ones before the money moves.
The old way — batch processing — pulls yesterday's transactions out of a database overnight and runs fraud rules on them. By the time you detect fraud, the money is already gone. Useful for reporting, useless for prevention.
The new way — stream processing — runs your fraud rules on each transaction as it happens, within milliseconds of the event. If the rule fires, you block the transaction in-flight. This is data in motion instead of data at rest.
Stream processing shows up everywhere real-time matters:
- Fraud detection (evaluate events as they happen)
- Feed ranking (compute ranked feeds based on the last few minutes of activity)
- IoT and telemetry (aggregate sensor readings continuously)
- Change data capture (turn database changes into event streams for downstream consumers — see change data capture)
- Real-time recommendations (update recommendations as the user clicks around)
- Analytics dashboards (keep metrics updated second-by-second rather than refreshing nightly)
- ETL pipelines (transform data continuously instead of in batch jobs)
The three tools in this guide all do stream processing. They differ in how they handle state, time, delivery guarantees, and operational complexity — which is what we'll unpack.
The critical distinction: message brokers vs stream processors
Before getting to the tools, get this framing right. This is the most important section in this post.
A message broker moves messages between producers and consumers. It's pipes. You write to it, something else reads from it, the broker doesn't care what's inside the messages. Kafka, RabbitMQ, and ActiveMQ are message brokers. If you want the deep dive on those, see Kafka vs RabbitMQ vs ActiveMQ.
A stream processor runs computation over messages as they flow. It's logic. You define transformations (filter, map, aggregate, join, window), and the processor runs them continuously as new data arrives. Kafka Streams, Apache Flink, and Apache Storm are stream processors.
In most real architectures, you use both together. Messages flow through a broker, and a processor reads from the broker, computes something, and writes results back (possibly to the same broker on a different topic):
producer → [Kafka broker] → [Flink processor] → [Kafka broker] → consumer
Here's where candidates trip up in interviews. Someone says "we'll use Kafka for real-time analytics" — and an interviewer who's paying attention asks "which Kafka — the broker or Kafka Streams?" If you can't answer that distinction cleanly, you've flagged yourself as junior.
The safe formulation in an interview: "I'll use Kafka as the broker to move events, and Kafka Streams (or Flink) as the processor to compute the aggregations." Name both components. Don't let one stand in for the other.
With that out of the way, let's look at the three processors.
The three stream processing paradigms
Like with brokers, the tools become easier to place once you name the underlying patterns. There are three ways to think about stream processing.
1. Micro-batching. Group incoming events into small batches (say, every 100ms) and process each batch as a unit. Simpler to reason about, easier to handle failures, but adds latency equal to the batch size. Apache Spark Streaming (not in this guide but worth knowing) uses this model. Storm's Trident API also uses micro-batching.
2. Record-at-a-time (true streaming). Process each event individually as it arrives. Lower latency than micro-batching — sub-millisecond if tuned right. But more complex for fault tolerance and state management. Kafka Streams and Flink both use this model as their native approach.
3. Windowed processing. Group events into time-based or count-based windows and compute over the window. "How many clicks happened in the last 5 minutes?" is a windowed query. All three processors support windowing, but with significantly different semantics around late-arriving events and event-time vs processing-time.
The last distinction — event time vs processing time — is the biggest technical divide in stream processing, and I'll come back to it because it's where Flink shines and the others struggle.
Kafka Streams
Kafka Streams is not a server. It's not a cluster. It's a client library — a JAR file you include in your Java application that turns your app into a stream processor.
Core model: You write a Java app that uses the Kafka Streams API to define a topology — a graph of stream operations (map, filter, aggregate, join, window). The library runs your topology, reading from Kafka topics and writing results back to Kafka topics. State (for aggregations, joins) is stored locally using RocksDB and replicated to Kafka for durability.
The "no separate cluster" part is Kafka Streams' signature move. Your app IS the processor. Scale by running more instances of your app; Kafka Streams handles partition assignment and state redistribution automatically.
Strengths:
- No separate cluster to operate. If you already run Kafka, you don't need new infrastructure to add stream processing. This is a massive operational win.
- Tight Kafka integration. Reading from Kafka and writing to Kafka is the native case, not an afterthought. Exactly-once semantics between Kafka and Kafka Streams are first-class.
- Stateful operations out of the box. Aggregations, joins, and windowed computations all work with local state + Kafka-backed durability.
- Straightforward scaling. Run more app instances; Kafka Streams rebalances partitions and state automatically.
- Lightweight mental model for simple use cases. "Read topic, transform, write topic" is easy to reason about.
Weaknesses:
- Kafka-only. Your input has to be in Kafka. Your output goes to Kafka. If you need to read from Kinesis, a database CDC feed, or files, you're either piping into Kafka first or using a different tool.
- JVM-only. Kafka Streams is Java/Scala. If your team lives in Python, Go, or Rust, you're out of luck (Python has
faust-streamingand similar libraries, but they're not official Kafka Streams). - Out-of-order event handling is limited. Late-arriving events are harder to handle than in Flink. For most use cases this is fine; for complex event processing it's a real limitation.
- Less feature-rich than Flink. No SQL interface (the broader Kafka ecosystem has ksqlDB, but that's a separate product). No web UI for visualization. No advanced windowing semantics.
- Limited to Kafka-scale. Kafka Streams inherits Kafka's architectural assumptions. For truly massive state (terabytes), Flink scales further.
Best use cases:
- Enriching or transforming Kafka topics — "for every event in topic A, join with reference data and write to topic B"
- Real-time aggregations within a Kafka-centric architecture — "count events per user per 5-minute window"
- Microservices that consume and produce events with some processing logic
- Teams that already run Kafka and want stream processing without standing up a new cluster
- Use cases where exactly-once semantics end-to-end with Kafka matter
In an interview: "For the feed enrichment pipeline in this design, I'd use Kafka Streams. The data is already flowing through Kafka, the transformations are straightforward (joining user events with profile data), and I'd rather not stand up a separate Flink cluster just for this. If the complexity grows — sophisticated windowing, late events, multi-source joins — I'd migrate to Flink."
Apache Flink
Apache Flink is the most powerful stream processor in common use. It's a full distributed cluster — you stand up a Flink cluster separately from your application code, submit jobs to it, and it runs them.
Core model: Flink is a true streaming engine with first-class support for event time, watermarks, stateful operations at massive scale, and sophisticated windowing semantics. Jobs are defined in Java, Scala, Python, or SQL (Flink SQL) and submitted to a cluster that manages execution, state, and fault tolerance.
Flink's killer feature is event-time processing with watermarks. Most streams have events that arrive out of order — a mobile app buffers events offline and sends them when the user reconnects, a network hiccup delays one partition, etc. Flink lets you reason about event time (when the event actually happened) separately from processing time (when the system saw it), handle late arrivals with watermarks, and produce correct results even when events arrive badly out of order. This is hard, and Flink does it better than anyone else.
Strengths:
- The most powerful windowing and time handling of any stream processor. Event-time windows, session windows, processing-time windows, watermarks, allowed lateness. If your use case has real event-time complexity, Flink is the only serious choice.
- Exactly-once processing semantics across diverse sources and sinks, not just Kafka.
- Massive stateful scale. Flink can manage terabytes of state via RocksDB-backed state backends and incremental checkpointing.
- Polyglot. Java, Scala, Python (PyFlink), and SQL (Flink SQL).
- Rich ecosystem of connectors. Kafka, Kinesis, Pulsar, Cassandra, Elasticsearch, JDBC, file systems, and more. You're not locked into Kafka like with Kafka Streams.
- Web UI for monitoring and a mature tooling ecosystem.
- CEP (complex event processing) library for pattern detection — "did user A click X, then click Y, then not click Z within 30 seconds?" is a one-liner in Flink CEP.
Weaknesses:
- Operational complexity. Running a Flink cluster in production is real work. You need to understand JobManagers, TaskManagers, checkpointing, state backends, and how to tune memory for stateful jobs. Managed offerings (AWS Kinesis Data Analytics for Flink, Ververica, Confluent Cloud's Flink) have softened this, but it's still more operationally involved than Kafka Streams.
- Steeper learning curve. The API is more powerful, which means more concepts to internalize — event time, watermarks, state backends, checkpoints vs savepoints, side outputs.
- Resource-heavy. A Flink cluster consumes more baseline resources than a Kafka Streams app. For small workloads it's overkill.
- More moving parts to reason about when things go wrong.
Best use cases:
- Complex event processing — detecting patterns across streams ("suspicious login pattern," "abandoned cart sequence," fraud rings).
- Real-time analytics on diverse sources — reading from Kafka, Kinesis, and a JDBC source, joining them, and writing to Elasticsearch.
- Fraud detection at scale where event-time correctness matters.
- Large-scale stateful processing — aggregations and joins over windows that span hours or days, with state sizes in the hundreds of gigabytes.
- Streaming ETL pipelines that need to handle late-arriving data correctly.
- Teams that already use Flink elsewhere — there's significant learning investment, so consolidating on one tool pays off.
In an interview: "For the fraud detection system, I'd use Flink. We need to detect patterns across events ('three failed logins followed by a successful one from a different IP within 10 minutes'), which is Flink CEP's sweet spot. We also need event-time correctness because the mobile SDKs batch and buffer events, so they arrive out of order. Flink's watermark-based windowing handles that cleanly. Kafka Streams would struggle with the out-of-order semantics."
Apache Storm
Apache Storm was the first widely-adopted open-source stream processor (Twitter open-sourced it in 2011). For years it was the default choice. In 2026, it's legacy — Flink has largely superseded it for new projects, but Storm still runs in production at many companies and deserves understanding.
Core model: Storm processes streams as tuples flowing through a topology of spouts (data sources) and bolts (processing units). Each tuple is processed independently. State management isn't built in — you roll it yourself or use Trident (the higher-level API).
Strengths:
- Low latency for stateless processing. Sub-millisecond end-to-end for simple transformations.
- Language-agnostic. Storm can run bolts written in any language via its multi-language protocol (though Java/Clojure are native).
- Mature and battle-tested. It's been in production for over a decade. The bugs are known.
- Simple mental model. Spouts emit tuples, bolts transform them, topology describes the DAG. Easy to reason about for basic use cases.
Weaknesses:
- No native state management. Want to do a windowed aggregation? You're writing your own state code or using Trident (which is slower and less maintained). This alone disqualifies Storm for most modern use cases.
- At-least-once processing by default. Exactly-once requires Trident, which has performance overhead and is less common.
- Processing-time only (mostly). Event-time semantics are weak compared to Flink. Out-of-order events are hard.
- Declining community. Fewer releases, smaller community, fewer recent tutorials. Most new stream processing projects pick Flink or Kafka Streams.
- Operational complexity similar to Flink, but for less capability.
Best use cases:
- Legacy Storm deployments — you have Storm running, and migrating isn't worth it.
- Simple, stateless, low-latency transformations where Storm's model is sufficient and you're already operating it.
- Language-polyglot shops where Storm's multi-language support matters more than state management.
In an interview: Be careful here. If you name Storm as your primary choice for a new project, you're signaling "I haven't kept up with the space." The correct framing is: "Storm is still used in production, but for a new project I wouldn't pick it. Flink has surpassed it for stateful and event-time use cases, and Kafka Streams has surpassed it for Kafka-centric cases. Storm's niche these days is mostly inertia." That answer shows you know the history AND the current state of the ecosystem.
Side-by-side comparison
| Feature | Kafka Streams | Apache Flink | Apache Storm |
|---|---|---|---|
| Deployment model | Library embedded in your app | Separate cluster | Separate cluster |
| Processing model | Record-at-a-time | Record-at-a-time | Tuple-at-a-time |
| Typical latency | <10ms | 10–100ms (tunable) | <10ms (stateless) |
| State management | Native (RocksDB + Kafka) | Native (RocksDB + checkpoints) | Manual or via Trident |
| Event-time support | Basic | Excellent (watermarks) | Weak |
| Out-of-order handling | Limited | Robust | Poor |
| Exactly-once semantics | Yes (with Kafka) | Yes (with many sinks) | Via Trident only |
| Data sources | Kafka only | Kafka, Kinesis, Pulsar, JDBC, files, etc. | Many (spouts) |
| Language support | Java/Scala (JVM) | Java, Scala, Python, SQL | Java + multi-lang bolts |
| SQL interface | No (ksqlDB is separate) | Yes (Flink SQL) | No |
| CEP support | No | Yes (Flink CEP) | No (limited) |
| Operational overhead | Low (no cluster) | High (cluster + tuning) | Medium–High |
| Community health (2026) | Active | Very active, growing | Declining |
| Best for | Kafka-native transformations | Complex, stateful, event-time processing | Legacy deployments |
Decision framework: which stream processor when?
Here's the sequence I use to pick cleanly, in interviews or in real decisions:
1. Is your input data already in Kafka (or will be)?
- If yes and the processing is straightforward: Kafka Streams. You skip an entire cluster's worth of operational work.
- If no: Kafka Streams is off the table. Move to question 2.
2. Do you need sophisticated event-time semantics or complex event processing?
- If yes (patterns across events, late-arriving data, event-time windows): Flink. This is its sweet spot and nothing else comes close.
- If no: continue.
3. How much state do you need to manage?
- Megabytes to a few gigabytes: Kafka Streams is fine.
- Tens of gigabytes to terabytes: Flink. Its state backends and incremental checkpointing are designed for this.
- None (stateless): any of the three, or even simpler tools like a Kafka consumer with direct logic.
4. What does your team's experience look like?
- Strong JVM team already running Kafka: Kafka Streams reduces cognitive overhead.
- Team with Flink experience or willingness to invest: Flink gives you more headroom.
- Team running Storm already: keep Storm if it works; migrate when you hit its limits.
5. Do you need polyglot support (Python, SQL, etc.)?
- Python for ML engineers, SQL for analysts: Flink (PyFlink, Flink SQL).
- Pure JVM shop: any works.
6. What's your latency requirement?
- Sub-millisecond for stateless workloads: Storm, or Kafka Streams.
- Single-digit-ms to tens-of-ms: any of the three can hit this.
- Hundreds of ms is fine: any.
In practice, most modern greenfield decisions come down to Kafka Streams (when you're Kafka-native and the use case is straightforward) vs Flink (when it's not). Storm is a legacy consideration.
How this shows up in system design interviews
Three canonical questions and the stream processing reasoning I'd use:
Q: "Design a real-time leaderboard for an online game."
"The events are game scores flowing through Kafka. I'd use Kafka Streams for the leaderboard computation — windowed aggregation by game mode and region, emitting updates every second. The state is small (top N players per region), we need exactly-once semantics so scores aren't double-counted, and running Kafka Streams alongside our existing Kafka cluster avoids spinning up Flink for something this contained."
Q: "Design a fraud detection pipeline for a payment platform."
"This is a Flink problem. We need to detect patterns across transaction events — 'three transactions from geographically impossible locations within 30 seconds,' 'a card used in a new country within an hour of its previous use' — which is exactly what Flink CEP is built for. We also need event-time correctness because transactions can arrive out of order (retries, mobile app buffering), and Flink's watermark-based windowing handles that. I'd use Kafka as the broker for the event stream and Flink as the processor. For the blocking decisions, I'd write back to a Kafka topic that the payment authorization service consumes synchronously."
Q: "Design a trending topics feature for a social platform."
"Events — likes, shares, comments — flow through Kafka. Flink computes windowed counts per topic with an event-time tumbling window of 5 minutes. Results go to a Kafka topic and from there to Redis for serving. I'd use Flink over Kafka Streams here because the state can grow large (millions of topics across all windows), and Flink's state backend handles that more gracefully."
Notice in each answer I name the broker (Kafka) and the processor (Kafka Streams or Flink) separately. That reinforces the distinction from the beginning of this post — and it's exactly the phrasing interviewers want to hear.
Common follow-up questions
- "How does Flink guarantee exactly-once processing?" (Distributed snapshots via the Chandy-Lamport algorithm, called checkpoints. Combined with transactional sinks.)
- "What's the difference between a watermark and a window?" (Window = a bucket of events. Watermark = a signal that says "we don't expect events earlier than this time anymore.")
- "What happens in Kafka Streams when you add a new instance of your app?" (Consumer group rebalancing redistributes partitions and state. See idempotency for the correctness story during rebalancing.)
- "How do you handle backpressure?" (Flink's dataflow-based backpressure propagates automatically. Kafka Streams uses consumer lag as the signal. Storm has built-in backpressure via its tuple tree.)
- "What's the failure recovery story for each?" (Kafka Streams: restart the app, it rebuilds state from Kafka-backed changelogs. Flink: restart from the last checkpoint. Storm: replay tuples from spouts.)
- "Why does exactly-once cost more than at-least-once?" (Transactions, two-phase commits, coordinated checkpoints. Compare with eventual vs strong consistency — same family of trade-off.)
Putting it all together
Here's the takeaway in one sentence: Kafka Streams is the right answer when your data already lives in Kafka and the processing is straightforward. Flink is the right answer when anything about your requirements is complex — event-time, large state, multiple sources, CEP, or SQL. Storm is the right answer when you already have Storm and it works.
In interviews, the thing that separates senior from junior is the framing, not the tool name. Say "this is a stream processing problem, and specifically it's [record-at-a-time | windowed | stateful | event-time sensitive], which points me toward [tool]." That structure signals that you understand the space, not just the vocabulary.
And keep the broker-vs-processor distinction sharp. Kafka moves messages. Kafka Streams processes them. Name both components separately when you describe your architecture. It's a small phrasing change that communicates a large amount of understanding.
Good luck with your next interview.
Keep learning
- Kafka vs RabbitMQ vs ActiveMQ — the sibling post for message brokers. Read this first if you're fuzzy on the broker-vs-processor distinction.
- Messaging Patterns — the pillar post for messaging patterns in Silo 5.
- Observer vs Pub-Sub Pattern — the conceptual pattern underlying event streaming.
- Change Data Capture — a canonical stream processing use case: turning database changes into event streams.
- How to Design a Recommendation System — recommendation pipelines are a stream processing use case.
- How to Design a Social Media News Feed — feed ranking and trending computation typically use stream processors.
- Grokking Idempotency — critical for correctly consuming messages in a stream processor.
- Eventual vs Strong Consistency — the trade-off that underlies delivery-guarantee choices.
- High Availability in System Design — relevant when configuring processor cluster resilience.
For the full system design interview roadmap, start with my complete system design interview guide.
What our users say
Brandon Lyons
The famous "grokking the system design interview course" on http://designgurus.io is amazing. I used this for my MSFT interviews and I was told I nailed it.
Arijeet
Just completed the “Grokking the system design interview”. It's amazing and super informative. Have come across very few courses that are as good as this!
Eric
I've completed my first pass of "grokking the System Design Interview" and I can say this was an excellent use of money and time. I've grown as a developer and now know the secrets of how to build these really giant internet systems.
Access to 50+ courses
New content added monthly
Certificate of completion
$29.08
/month
Billed Annually
Recommended Course

Grokking the Object Oriented Design Interview
59,389+ students
3.9
Learn how to prepare for object oriented design interviews and practice common object oriented design interview questions. Master low level design interview.
View CourseRead More
Distributed System Design: A Complete Guide for Beginners (Concepts, Patterns, Examples)
Arslan Ahmad
System Design Interview Guide – 7 Steps to Ace It
Arslan Ahmad
What Happens When You Type a URL? (Step-by-Step Explanation)
Arslan Ahmad
System Design Fundamentals: The Load Balancing Algorithms Guide
Arslan Ahmad