How to Design a Real-Time Chat Application (WhatsApp/Slack) — System Design Interview

Arslan Ahmad
How to design a real-time chat app like WhatsApp: presence, message ordering, group chats, push notifications, and delivery guarantees with interview-ready trade-offs.
"Design a chat application like WhatsApp" is one of those system design questions that sounds simple but hides genuine depth. Every FAANG company asks some version of it because it touches real-time communication, persistent connections, delivery guarantees, fan-out, consistency trade-offs, and distributed state. Get it right and you've demonstrated fluency across half the system design curriculum in a single interview.

The trap most candidates fall into: they start drawing WebSocket connections and skip the requirements. Then the interviewer asks "how do you guarantee message ordering in a group chat?" and the candidate has no answer because they never laid the foundation.

This guide walks through the full design, step by step, so that every follow-up has an answer already built into your architecture. By the end, you'll know:

  • How to frame requirements so the interviewer knows you're designing for the right problem
  • The math on how many connections and how much storage WhatsApp-scale actually requires
  • Why WebSockets beat everything else for this use case (and the one fallback you should name)
  • The message delivery pipeline: sent → delivered → read, with retry logic
  • How group chats work — fan-out on write, and why large groups are a different problem
  • The data model, the storage trade-offs, and the caching strategy
  • How presence ("online" indicators) actually works at scale
  • The follow-up questions interviewers ask and what they're listening for

Let's get into it.

Requirements gathering

Start here. Every time. See how to gather and prioritize requirements for the full technique.

Functional requirements:

  • One-to-one messaging between users (text, images, video, files)
  • Group messaging (up to ~1,000 members per group)
  • Online/offline presence indicators ("last seen")
  • Delivery status: sent, delivered, read
  • Message history (persistent, queryable, synced across devices)
  • Push notifications for offline users
  • Typing indicators (ephemeral, real-time)

Non-functional requirements:

  • Real-time delivery. Messages should appear within 100-200ms of being sent.
  • Message ordering. Messages in a conversation must appear in the order they were sent.
  • Reliability. No messages lost, even during server crashes, client disconnections, or network partitions.
  • Scale. Hundreds of millions of concurrent connections.
  • Low latency worldwide. Users in different geographies should experience similar performance.

Explicitly out of scope (for initial design):

  • End-to-end encryption (mention it, don't design it)
  • Voice and video calls (separate system — see designing a video conferencing system)
  • Stories / status updates
  • Bot integrations (Slack-specific feature)

Capacity estimation

Interviewers want numbers, not hand-waving. Here's the math for a WhatsApp-scale system. For the full estimation technique, see back-of-the-envelope estimation.

Users:

  • 2 billion registered users
  • 500M daily active users (DAU)
  • Each active user sends ~40 messages/day on average

Messages:

  • 500M × 40 = 20B messages/day
  • 20B / 86,400 ≈ 230K messages/sec average; peak ~700K/sec (assuming peak traffic is roughly 3× the average)

Connections:

  • 500M DAU, but not all online simultaneously
  • Assume 100M concurrent WebSocket connections at peak
  • Each connection: ~10 KB memory on the server → ~1 TB total memory for connections across the fleet

Storage (messages):

  • Average message size: ~100 bytes (text) + metadata ≈ 200 bytes
  • 20B messages/day × 200 bytes = ~4 TB/day of new message data
  • 5 years retention: ~7 PB

Media:

  • 10% of messages include images/video → 2B media messages/day
  • Average media: 200 KB → ~400 TB/day of media (stored in object storage + CDN, not in the message DB)

Key insight: The system is write-heavy (230K writes/sec of messages) AND connection-heavy (100M concurrent WebSockets). Both constraints shape the architecture.
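The arithmetic above is easy to re-derive in a few lines. This is only a sketch of the estimate: the variable names are illustrative, the 3× peak factor is an assumption that reproduces the ~700K/sec figure, and units are decimal (1 TB = 10^12 bytes).

```python
# Back-of-the-envelope numbers from the capacity estimate.
SECONDS_PER_DAY = 86_400

dau = 500_000_000
messages_per_user_per_day = 40
peak_factor = 3                        # assumption: peak is ~3x the average
concurrent_connections = 100_000_000
memory_per_connection_bytes = 10_000   # ~10 KB per WebSocket
message_size_bytes = 200               # text plus metadata
retention_years = 5
media_percent = 10                     # share of messages carrying media
media_size_bytes = 200_000             # ~200 KB per media item

messages_per_day = dau * messages_per_user_per_day
avg_messages_per_sec = messages_per_day / SECONDS_PER_DAY
peak_messages_per_sec = avg_messages_per_sec * peak_factor

connection_memory_tb = concurrent_connections * memory_per_connection_bytes / 1e12
message_storage_tb_per_day = messages_per_day * message_size_bytes / 1e12
message_storage_pb_total = message_storage_tb_per_day * 365 * retention_years / 1000

media_messages_per_day = messages_per_day * media_percent // 100
media_tb_per_day = media_messages_per_day * media_size_bytes / 1e12

if __name__ == "__main__":
    print(f"messages/day:        {messages_per_day:,}")
    print(f"avg messages/sec:    {avg_messages_per_sec:,.0f}")
    print(f"peak messages/sec:   {peak_messages_per_sec:,.0f}")
    print(f"connection memory:   {connection_memory_tb:.1f} TB")
    print(f"message storage/day: {message_storage_tb_per_day:.1f} TB")
    print(f"5-year messages:     {message_storage_pb_total:.1f} PB")
    print(f"media/day:           {media_tb_per_day:.0f} TB")
```

Keeping the inputs as named variables makes it easy to redo the math live if the interviewer changes an assumption, such as DAU or retention.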

API design

Three core APIs:

POST /v1/messages
  Body: { conversation_id, content, media_url?, message_type }
  Returns: { message_id, timestamp, status: "sent" }

GET /v1/conversations/{id}/messages?cursor={cursor}&limit=50
  Returns: { messages: [...], next_cursor }

GET /v1/conversations
  Returns: { conversations: [...with last_message, unread_count...] }

Plus the real-time channel (not REST):

WebSocket /v1/ws
  Client → Server: { type: "send_message", conversation_id, content }
  Server → Client: { type: "new_message", message }
  Server → Client: { type: "delivery_ack", message_id, status }
  Server → Client: { type: "typing", conversation_id, user_id }
  Server → Client: { type: "presence", user_id, status }

Important design choice: live messages go through the WebSocket, not REST. The REST API exists for history and conversation listing, and POST /v1/messages serves as the send path for clients that can't hold a WebSocket open; everything else flows over the persistent connection. This avoids HTTP overhead on every single message.
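To make the frame shapes concrete, here is a minimal sketch of how a client might encode an outgoing frame and validate an incoming one. It assumes the frames are JSON text, which the design above implies but does not state; the function names are hypothetical.

```python
import json

# Server -> client frame types listed in the WebSocket channel above.
SERVER_FRAME_TYPES = {"new_message", "delivery_ack", "typing", "presence"}


def encode_send_message(conversation_id: str, content: str) -> str:
    """Build the client -> server frame for sending a message."""
    return json.dumps({
        "type": "send_message",
        "conversation_id": conversation_id,
        "content": content,
    })


def decode_server_frame(raw: str) -> dict:
    """Parse a server -> client frame and reject unknown frame types."""
    frame = json.loads(raw)
    frame_type = frame.get("type")
    if frame_type not in SERVER_FRAME_TYPES:
        raise ValueError(f"unknown frame type: {frame_type!r}")
    return frame
```

Rejecting unknown types at the edge keeps the rest of the client simple: every handler can assume it received one of the four known frame shapes.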

For API design patterns, see mastering the API interview.

Data model

Three main entities:

Users table:

user_id (PK)
username
last_seen_at
device_tokens[]     -- for push notifications
created_at

Conversations table:

conversation_id (PK)
type: "one_to_one" | "group"
member_ids[]
group_name?
created_at

Messages table (sharded by conversation_id):

message_id (PK, generated with time-sortable IDs like Snowflake)
conversation_id (partition key)
sender_id
content
media_url?
message_type: "text" | "image" | "video" | "file"
status: "sent" | "delivered" | "read"
created_at

Why shard by conversation_id? Because the dominant query is "get messages in conversation X sorted by time." Sharding by conversation keeps all messages in a single conversation on the same shard, making that query a local scan instead of a scatter-gather.
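The time-sortable message_id mentioned above can be sketched as follows. This uses the commonly described Snowflake layout (41 bits of millisecond timestamp, 10 bits of worker ID, 12 bits of per-millisecond sequence). The class name, the custom epoch, and the choice to borrow from the next millisecond when the sequence overflows are assumptions made for illustration, not a description of any production generator.

```python
import threading
import time

EPOCH_MS = 1_700_000_000_000  # hypothetical custom epoch (milliseconds)


class SnowflakeLikeIdGenerator:
    """Layout: 41-bit timestamp | 10-bit worker ID | 12-bit sequence."""

    def __init__(self, worker_id: int, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < 1024:
            raise ValueError("worker_id must fit in 10 bits")
        self._worker_id = worker_id
        self._clock = clock          # injectable so tests can control time
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self) -> int:
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Clock went backwards: never emit a smaller timestamp.
                now = self._last_ms
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & 0xFFF
                if self._sequence == 0:
                    # 4,096 IDs used in this millisecond: move to the next one.
                    now = self._last_ms + 1
            else:
                self._sequence = 0
            self._last_ms = now
            return ((now - EPOCH_MS) << 22) | (self._worker_id << 12) | self._sequence


def timestamp_ms_of(message_id: int) -> int:
    """Recover the creation time embedded in an ID."""
    return (message_id >> 22) + EPOCH_MS
```

Because the timestamp occupies the high bits, sorting by message_id within a partition approximates sorting by creation time without a separate index.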

Database choice: For the messages table, a wide-column store like Cassandra or HBase is the standard choice. Messages are time-series data (ordered, append-heavy, read by range), which is exactly what wide-column stores optimize for. For the conversations and users tables, a relational database (PostgreSQL with sharding, or Vitess) works well — the data is smaller and queries are simple.

For the deeper database comparison, see NoSQL databases for system design and database sharding guide.

High-level architecture

The system breaks into these services:

  • Chat Service — manages WebSocket connections, routes messages between users
  • Message Service — persists messages to the database, handles retries
  • Presence Service — tracks online/offline status, broadcasts presence updates
  • Notification Service — sends push notifications to offline users
  • Group Service — manages group membership, handles group fan-out
  • Media Service — handles file uploads, generates thumbnails, writes to object storage
  • Cache Layer (Redis) — caches recent messages, conversation metadata, online status
  • Message Queue (Kafka) — decouples message ingestion from delivery and persistence
  • Object Storage + CDN — images, video, files

The message flow for a 1:1 message:

  1. Alice's device sends the message over her WebSocket connection to Chat Service
  2. Chat Service publishes the message to Kafka (topic: messages)
  3. Chat Service immediately sends a "sent" ack back to Alice over her WebSocket
  4. Message Service consumes from Kafka, writes to the messages database
  5. Chat Service looks up Bob's connection. Two cases:
    • Bob is online: Chat Service pushes the message over Bob's WebSocket. Bob's device sends a "delivered" ack back. Chat Service forwards the ack to Alice.
    • Bob is offline: Notification Service sends a push notification (APNs/FCM). The message sits in the database. When Bob reconnects, Chat Service delivers all pending messages.
  6. When Bob opens the conversation and reads the message, his device sends a "read" ack. Chat Service forwards it to Alice.
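The sent → delivered → read progression in the steps above can be sketched as a small tracker. This is an in-memory illustration only: the class and method names are hypothetical, and a real system would keep this state in the message store rather than in process memory.

```python
from enum import IntEnum


class Status(IntEnum):
    SENT = 1
    DELIVERED = 2
    READ = 3


class DeliveryTracker:
    """Tracks per-message status and which messages still await delivery."""

    def __init__(self):
        self.status = {}    # message_id -> Status
        self.pending = {}   # recipient_id -> message_ids not yet acknowledged

    def record_sent(self, message_id: str, recipient_id: str) -> None:
        self.status[message_id] = Status.SENT
        self.pending.setdefault(recipient_id, []).append(message_id)

    def record_ack(self, message_id: str, recipient_id: str, ack: Status) -> None:
        # Acks may arrive late or twice, so status only ever moves forward.
        self.status[message_id] = max(self.status[message_id], ack)
        queue = self.pending.get(recipient_id, [])
        if message_id in queue:
            queue.remove(message_id)

    def pending_for(self, recipient_id: str) -> list:
        """Messages to redeliver when the recipient reconnects."""
        return list(self.pending.get(recipient_id, []))
```

Anything still in `pending_for(user)` when the user reconnects is what the Chat Service would redeliver; the forward-only status update is what makes retried acks harmless.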

The message flow for a group message:

  1. Alice sends a message to the group conversation
  2. Chat Service publishes to Kafka
  3. Chat Service looks up all group members
  4. Fan-out on write: the message is pushed to every online member's WebSocket connection. For offline members, push notifications are sent.
  5. Each recipient's device sends a delivery ack independently

This fan-out pattern is the same core concept behind designing a social media news feed. The key difference: news feeds can tolerate seconds of delay; chat messages can't.
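A minimal sketch of the group fan-out step, assuming the Chat Service can look up a send function for each online member. The function name, the `online_connections` mapping, and the `push_notify` callback are all hypothetical stand-ins for the connection registry and Notification Service.

```python
def fan_out(message, member_ids, sender_id, online_connections, push_notify):
    """Push a group message to online members; notify offline ones.

    online_connections maps user_id -> callable that writes to that
    user's WebSocket. push_notify(user_id, message) stands in for the
    Notification Service.
    """
    delivered, notified = [], []
    for member_id in member_ids:
        if member_id == sender_id:
            continue  # the sender already has the message
        send = online_connections.get(member_id)
        if send is not None:
            send(message)
            delivered.append(member_id)
        else:
            push_notify(member_id, message)
            notified.append(member_id)
    return delivered, notified
```

The cost of this loop grows linearly with group size, which is the write amplification that matters once groups get large.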

Deep dive: WebSockets vs long polling

Chat is the canonical use case for WebSockets. Here's why:

WebSockets establish a single persistent TCP connection between client and server. After the initial HTTP upgrade handshake, both sides can send frames at any time without per-message HTTP headers. The connection stays alive until either side closes it or the network drops.

Long polling is the fallback. The client sends an HTTP request, the server holds it open until there's a new message (or a timeout), then responds. The client immediately sends another request. Near-real-time, but each message cycle carries HTTP overhead (full request and response headers, plus connection setup whenever a kept-alive connection can't be reused).

Server-Sent Events (SSE) is the third option. Server pushes data over a single long-lived HTTP connection. One-directional only (server → client), so the client still needs a separate channel to send messages. Less suitable for chat than WebSockets.

The clear winner: WebSockets. Bidirectional, low overhead, true real-time. Long polling is the fallback for environments where WebSockets aren't available (corporate firewalls, specific network configurations).

In an interview, name both. "I'd use WebSockets for the persistent connection. If WebSockets are blocked by the client's network, the system falls back to long polling. The server-side logic is identical either way — only the transport layer differs." That answer shows you know the trade-offs.

Deep dive: message ordering

This is the subtlety that separates a good interview answer from a great one.

In a 1:1 conversation: ordering is straightforward. Each message gets a timestamp from the server (not the client — client clocks can't be trusted). Messages are stored and displayed sorted by server timestamp.

In a group conversation: ordering gets harder. If Alice and Bob both send a message to the group at the same millisecond, which one appears first? The answer depends on which message reaches the server first. Both are "correct" orderings — but all group members need to see the same ordering.

The solution: server-assigned sequence numbers per conversation. Every message in a conversation gets a monotonically increasing sequence number. The server (or, more precisely, the shard responsible for that conversation) assigns the number atomically. All clients display messages sorted by sequence number, not by their local timestamp.

This is why sharding by conversation_id matters — all messages in a conversation hit the same shard, so a single atomic counter can assign sequence numbers without coordination across shards.
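A per-conversation sequencer can be sketched in a few lines. This is an in-process illustration with a hypothetical class name; on a real shard the counter would live in the shard's own storage so it survives restarts.

```python
import threading
from collections import defaultdict


class ConversationSequencer:
    """Assigns monotonically increasing sequence numbers per conversation."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last = defaultdict(int)   # conversation_id -> last number issued

    def assign(self, conversation_id: str) -> int:
        # The lock makes increment-and-read atomic, so two concurrent
        # senders can never receive the same number.
        with self._lock:
            self._last[conversation_id] += 1
            return self._last[conversation_id]
```

If Alice's and Bob's messages arrive in the same millisecond, whichever acquires the lock first gets the lower number, and every client renders them in that order.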

What about cross-conversation ordering? You don't need it. Each conversation is its own independent ordered stream. The conversation list on the home screen is sorted by "last message timestamp," which doesn't require global ordering.

For the deeper treatment of consistency guarantees, see eventual vs strong consistency.

Deep dive: presence and "online" indicators

Presence sounds simple ("green dot means online") but it's surprisingly hard at scale.

The naive approach: every client sends a heartbeat every 5 seconds. The server stores {user_id: last_heartbeat_at} in Redis. A user is "online" if their last heartbeat was within the last 10 seconds.

Why this is expensive: 100M concurrent users × heartbeat every 5 seconds = 20M writes/sec to Redis just for presence. That's a lot.

Optimization 1: coarser granularity. Heartbeat every 30 seconds instead of 5. Presence changes are less snappy but the write load drops by 6x.
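The heartbeat scheme and its write load can be sketched as below. The tracker is an in-memory stand-in for the Redis map described above; the class name and the `grace_factor` parameter (2×, matching the 5-second heartbeat and 10-second window in the naive approach) are illustrative assumptions.

```python
class PresenceTracker:
    """A user is online if their last heartbeat is recent enough."""

    def __init__(self, heartbeat_interval_s: float = 5.0, grace_factor: float = 2.0):
        self._timeout_s = heartbeat_interval_s * grace_factor
        self._last_heartbeat = {}   # user_id -> timestamp in seconds

    def heartbeat(self, user_id: str, now: float) -> None:
        self._last_heartbeat[user_id] = now

    def is_online(self, user_id: str, now: float) -> bool:
        last = self._last_heartbeat.get(user_id)
        return last is not None and now - last <= self._timeout_s


def heartbeat_writes_per_sec(concurrent_users: int, interval_s: float) -> float:
    """Steady-state write rate generated by heartbeats alone."""
    return concurrent_users / interval_s
```

Plugging in the numbers from the text: 100M users at a 5-second interval gives 20M writes/sec, and moving to 30 seconds cuts that to roughly 3.3M writes/sec.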

Optimization 2: presence is only sent to people who care. You don't need to know the presence of every user — only the ones in your open conversation. The client subscribes to presence for a handful of users at a time. The server only pushes presence updates to subscribers.

Optimization 3: presence as pub/sub. Use a pub/sub pattern — each user's presence is a "channel." Users subscribe when they open a conversation, unsubscribe when they navigate away. The presence service publishes to subscribed clients only, not to everyone.
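The subscription model in optimizations 2 and 3 can be sketched as a small in-memory hub. The class and method names are hypothetical, and a real deployment would back this with a pub/sub system rather than a dictionary of callbacks.

```python
from collections import defaultdict


class PresenceHub:
    """Each user's presence is a channel; only subscribers get updates."""

    def __init__(self):
        # watched_user_id -> {subscriber_id: callback(user_id, status)}
        self._subscribers = defaultdict(dict)

    def subscribe(self, subscriber_id: str, watched_user_id: str, callback) -> None:
        """Called when a client opens a conversation with watched_user_id."""
        self._subscribers[watched_user_id][subscriber_id] = callback

    def unsubscribe(self, subscriber_id: str, watched_user_id: str) -> None:
        """Called when the client navigates away."""
        self._subscribers.get(watched_user_id, {}).pop(subscriber_id, None)

    def publish(self, user_id: str, status: str) -> int:
        """Notify current subscribers; returns how many were notified."""
        callbacks = list(self._subscribers.get(user_id, {}).values())
        for callback in callbacks:
            callback(user_id, status)
        return len(callbacks)
```

The return value of `publish` makes the saving visible: a status change for a user nobody is currently viewing costs zero pushes.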

A WhatsApp-style approach: presence updates are pushed only to users who have a 1:1 conversation open with the person. If you're on the home screen, you don't get presence updates; you see "last seen at" from the most recently stored heartbeat.

Deep dive: group chats at scale

Groups under ~200 members are handled by direct fan-out — the Chat Service pushes the message to each member's WebSocket in a loop. At 200 members, that's 200 writes, which takes a few milliseconds.

Larger groups, especially those near or above ~1,000 members (like Slack channels or Discord servers), need a different approach:

Fan-out via message queue. Instead of the Chat Service pushing to each member synchronously, it publishes the message to a Kafka topic partitioned by group_id. Consumer workers pick up the event and handle delivery to each member asynchronously. This decouples the sender's experience from the delivery fan-out.
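The decoupling described above can be sketched with a plain in-process queue standing in for the Kafka topic. The function names are hypothetical, and the worker here drains synchronously for clarity; real consumers would run continuously on separate machines.

```python
import queue


def publish_group_message(events: queue.Queue, group_id: str, message: dict) -> None:
    """Sender path: enqueue one event and return immediately."""
    events.put({"group_id": group_id, "message": message})


def drain_and_deliver(events: queue.Queue, members_by_group: dict, deliver) -> int:
    """Worker path: expand each queued event into per-member deliveries.

    deliver(member_id, message) stands in for the WebSocket push or
    push notification. Returns the number of deliveries performed.
    """
    deliveries = 0
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return deliveries
        for member_id in members_by_group[event["group_id"]]:
            deliver(member_id, event["message"])
            deliveries += 1
```

The sender's cost is a single enqueue regardless of group size; the per-member work moves to workers that can be scaled independently.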

Read receipts at scale. In a 5-person group, showing "Alice, Bob, and Carol have read this" is useful. In a 500-person group, it's noise. Large groups typically aggregate read receipts ("34 members have read this") or disable them entirely.

Muted groups. Users who've muted a group don't get push notifications, but the message still lands in their chat history for when they open it.

For the deeper treatment of fan-out patterns, see messaging patterns and Kafka vs RabbitMQ vs ActiveMQ.

Caching strategy

At 700K peak messages/sec and 100M concurrent users, caching is what keeps the system responsive.

Recent messages cache (Redis). The last 50 messages per active conversation, stored in Redis sorted sets. When a user opens a conversation, the recent messages come from Redis (sub-millisecond) instead of Cassandra (~5ms). For the full caching breakdown, see caching for system design interviews.
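The read-through behavior of the recent-messages cache can be sketched with a bounded deque per conversation. This is an in-memory stand-in for the Redis sorted sets, with hypothetical names; `load_from_db` represents the Cassandra range query and is assumed to return messages oldest first.

```python
from collections import deque


class RecentMessagesCache:
    """Keeps the newest N messages per conversation, oldest first."""

    def __init__(self, max_per_conversation: int = 50):
        self._max = max_per_conversation
        self._recent = {}   # conversation_id -> deque of messages

    def get_recent(self, conversation_id: str, load_from_db) -> list:
        cached = self._recent.get(conversation_id)
        if cached is None:
            # Cache miss: load once from the database and keep the tail.
            cached = deque(load_from_db(conversation_id), maxlen=self._max)
            self._recent[conversation_id] = cached
        return list(cached)

    def append(self, conversation_id: str, message) -> None:
        # Only update conversations that are already warm; a cold one
        # will be loaded in full on its next read.
        cached = self._recent.get(conversation_id)
        if cached is not None:
            cached.append(message)   # the deque drops the oldest entry
```

Skipping cold conversations on `append` avoids a subtle bug where the cache holds only the newest message and a reader mistakes it for the full recent history.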

Conversation metadata cache. The conversation list ("last message, unread count, group name") is read every time the user opens the app. Cache it in Redis with TTLs of a few minutes, invalidated on new messages.

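Presence cache. Online/offline status in Redis, with each key's TTL set somewhat longer than the heartbeat interval (for example 2×, matching the 10-second window for 5-second heartbeats above) so one late heartbeat doesn't flip a user to offline. This IS the primary store for presence; no database is involved.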

User profile cache. Profile photos and display names change rarely. Cache with long TTLs.

CDN for media. Images and video URLs in messages resolve to CDN edge nodes. The CDN handles the heavy bandwidth. See CDN system design basics.

Scaling strategy

Connection servers scale horizontally. Each Chat Service instance holds ~50K-100K WebSocket connections. At 100M concurrent users, you need ~1,000-2,000 Chat Service instances behind a load balancer. Use consistent hashing (or sticky sessions) to route a user's WebSocket to the same server for the session duration.

Cross-server message routing. Alice is on Server A, Bob is on Server B. When Alice sends Bob a message, Server A publishes to Kafka, and Server B (which subscribes for Bob's conversations) picks it up and delivers. The message queue is the glue that makes multi-server routing work.

Database sharding. Messages are sharded by conversation_id. New conversations get assigned to shards using consistent hashing. Hot conversations (popular group chats) may need their own dedicated shard. For the sharding deep-dive, see database sharding guide.
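Consistent hashing, used above for both connection routing and shard assignment, can be sketched as a hash ring with virtual nodes. The class name, the shard names, and the choice of SHA-256 with 100 virtual nodes per shard are assumptions for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Maps keys to nodes so that adding a node moves only some keys."""

    def __init__(self, nodes, vnodes: int = 100):
        # Each node gets many points on the ring to even out the load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first point after the key's hash,
        # wrapping around at the end of the ring.
        index = bisect.bisect_right(self._points, self._hash(key)) % len(self._points)
        return self._ring[index][1]
```

When a fourth shard joins a three-shard ring, the only conversations that change owner are the ones the new shard takes over, which is why rebalancing stays cheap compared with `hash(key) % num_shards`.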

Geo-distributed deployment. Deploy Chat Service clusters in multiple regions (US-East, EU-West, AP-Southeast). Users connect to the nearest cluster. Cross-region messages route through the message queue — slightly higher latency (50-100ms) but correct. For the availability discussion, see high availability in system design.

Idempotent message delivery. Network issues can cause retries. The same message might be pushed to a client's WebSocket twice. The client deduplicates by message_id. See idempotency in system design for the pattern.
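Client-side deduplication by message_id is small enough to show in full. This is a sketch with a hypothetical class name; a real client would persist the seen set alongside its local message store and bound its size.

```python
class ClientInbox:
    """Accepts each message_id once, no matter how often it is pushed."""

    def __init__(self):
        self._seen_ids = set()
        self.messages = []

    def receive(self, message: dict) -> bool:
        """Returns True if the message was new, False if it was a duplicate."""
        message_id = message["message_id"]
        if message_id in self._seen_ids:
            return False
        self._seen_ids.add(message_id)
        self.messages.append(message)
        return True
```

With this in place the server is free to retry aggressively: a duplicate push costs a set lookup on the client and nothing else.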

Trade-offs to name explicitly

Interviewers listen for whether you can articulate trade-offs, not just describe the happy path. Three big ones in this design:

WebSocket vs long polling. We picked WebSockets for latency and efficiency, but we accepted that some clients behind corporate firewalls can't use them. The fallback adds complexity but keeps the system universally accessible.

Consistency model. We're using eventual consistency for message delivery: the message is durably logged in Kafka first, then persisted to the database and pushed to recipients asynchronously. There's a brief window where the sender sees "sent" but the recipient hasn't received the message yet. For chat, this is fine. For a banking application, it wouldn't be. The framework for reasoning about this: CAP theorem vs PACELC.

Fan-out on write for groups. We push messages to all group members immediately, which means write amplification proportional to group size. The alternative — fan-out on read — would mean every group member queries for new messages on each open, which is slower for the reader. We accepted write amplification because chat is latency-sensitive and users expect instant delivery.

A note on end-to-end encryption

WhatsApp uses the Signal Protocol for end-to-end encryption. The server never sees plaintext — it only stores encrypted blobs. This means the server can't index message content, can't search messages for the user, and can't do server-side abuse detection on message bodies.

In an interview, acknowledge E2E encryption exists and name one architectural implication: "With E2E encryption, the server stores encrypted blobs it can't read. This means features like server-side search and content moderation need to work on metadata only, not message content." That one sentence shows awareness without derailing the design into a cryptography lecture.

Follow-up questions interviewers ask

  • "How do you handle message delivery when both users are on different servers?" (Kafka routes the message. Server A publishes, Server B consumes and delivers via WebSocket.)
  • "What happens if the chat server holding a user's connection crashes?" (The user's client detects the broken WebSocket and reconnects to a different server. On reconnect, the server fetches pending messages from the database and delivers them.)
  • "How do you sync messages across multiple devices (phone + laptop)?" (Each device has its own WebSocket. Messages fan out to all of a user's connected devices. A sync cursor per device tracks what's been delivered to each.)
  • "How do you handle a user who sends 1,000 messages in a second?" (Client-side throttling as the first line; server-side rate limiting as the enforcement layer.)
  • "How do you implement 'delete for everyone'?" (Publish a delete event to the conversation. All members' clients remove the message from their local display. The server marks the message as deleted in the database but may retain it for compliance/legal hold.)
  • "How would you add reactions (emoji responses) to messages?" (Reactions are lightweight events — store as a separate reactions table keyed by message_id, fan out like messages. Small payload, no need to modify the message itself.)
  • "How is this different from designing Slack?" (Slack adds: channels with topic-based organization, threaded replies, integrations/bots, file sharing with previews, searchable message history. The core message delivery pipeline is similar; the data model and features layer on top.)

If you can handle five of those without stumbling, you've aced the follow-ups.
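For the rate-limiting follow-up, one common server-side enforcement mechanism is a token bucket, sketched below. The article does not prescribe a specific algorithm, so this is a swapped-in illustration; the class name and the rate and capacity values in the usage note are assumptions.

```python
class TokenBucket:
    """Allows bursts up to `capacity`, then a steady `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: int, now: float = 0.0):
        self._rate = rate_per_sec
        self._capacity = float(capacity)
        self._tokens = float(capacity)   # start full
        self._last = now

    def allow(self, now: float) -> bool:
        # Refill based on elapsed time, capped at the bucket capacity.
        elapsed = max(0.0, now - self._last)
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        self._last = max(self._last, now)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

With one bucket per user (say 5 messages/sec with a burst of 10), a client that fires 1,000 messages in one second gets the first 10 through and the rest rejected until tokens refill.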

Putting it all together

The one-sentence version: A chat application is a real-time, write-heavy system where the job is to deliver messages to the right recipients within 200ms, guarantee no messages are lost, and scale the connection layer to hundreds of millions of concurrent users.

In an interview, the structure that signals seniority is:

  1. Clarify requirements (1:1 vs group, scale, latency, reliability)
  2. Do the capacity estimation (connections, messages/sec, storage)
  3. Name WebSockets as the transport, with long polling as fallback
  4. Walk through the message flow end-to-end (sent → delivered → read, with retry)
  5. Explain group chat fan-out and why large groups differ
  6. Describe the data model and sharding strategy
  7. Name the trade-offs explicitly (consistency, fan-out amplification, WebSocket vs polling)
  8. Handle the follow-ups — which are mostly about failure modes

Good luck with your next interview.

For the full system design interview roadmap, start with my complete system design interview guide.

FAQs

Q1: Why use WebSockets for chat instead of HTTP long polling?
WebSockets allow the server to send data to the client as soon as new information is available, over a single always-open connection. This makes them ideal for real-time chat since messages appear almost instantly without the client constantly requesting updates. Long polling also achieves near-instant updates but with more overhead – the client has to repeatedly ask the server for new messages. Thus, WebSockets are preferred for efficient, low-latency communication, with long polling as a backup when WebSockets aren’t feasible.

Q2: How do chat apps handle offline users to ensure no message is lost?
If a user is offline, the chat server stores their incoming messages. When the user comes back online, the server delivers all pending messages to their device. No messages get lost this way. Plus, delivery acknowledgments tell the server which messages were delivered, so it will keep retrying any that remain unacknowledged.

Q3: How can a chat application scale to millions of concurrent users without lag?
The application must scale horizontally. That means using multiple servers and load balancing rather than relying on one big server. User connections are spread across many servers, and data (like messages and user info) is sharded across databases to distribute the load. Caching frequently accessed data helps reduce database work. By horizontally adding capacity and optimizing each layer (servers and databases), the chat app can handle millions of users in real time without noticeable lag.

Copyright © 2026 Design Gurus, LLC. All rights reserved.