System Design Interview Crash Course

Design WhatsApp


Here is a detailed system design for WhatsApp.

1. Problem Definition and Scope

We are designing a global, real-time messaging application where users can exchange text and media messages instantly.

The goal is to build a highly reliable system that delivers messages with low latency while handling billions of active users.

  • Main User Groups: Mobile users (iOS and Android).

  • Main Actions: Sending text messages, sharing images, and seeing message delivery status.

  • Scope:

    • Included: 1-on-1 text chat, basic Group chat, Message receipts (Sent/Delivered/Read), Offline support, and Online status.
    • Excluded: Voice/Video calling (VoIP), Status/Stories, Web client specifics, and deep cryptographic details of End-to-End Encryption (E2EE).

2. Clarify functional requirements

Must Have:

  • 1-on-1 Messaging: Users can send and receive text messages in real-time.

  • Group Messaging: Users can broadcast messages to a group of people (up to 256 members).

  • Message Receipts:

    • Sent: Reached the server (1 gray tick).
    • Delivered: Reached the recipient's device (2 gray ticks).
    • Read: User opened the chat (2 blue ticks).
  • Offline Support: If a recipient is offline, the message is queued and delivered immediately when they return.

  • Media Sharing: Users can upload and send images/videos.

Nice to Have:

  • Last Seen: Users can see when a contact was last online.
  • Push Notifications: Wake up the device when a new message arrives.

3. Clarify non-functional requirements

  • Target Users: 1 Billion Daily Active Users (DAU).

  • Message Volume: 50 Billion messages per day (approx. 50 messages per user per day).

  • Latency: Extremely low (Real-time). Messages should arrive within 200ms if both users are online.

  • Availability: High (99.99%). The service is a utility; it must essentially never be down.

  • Consistency: Message ordering must be strict (FIFO) within a single chat.

  • Data Retention: Messages are stored permanently to support restoring history on new devices.


4. Back of the envelope estimates

  • Traffic (QPS):
    • Total Messages = 50 Billion / day.
    • Seconds in a day = 86,400.
    • Average QPS = 50,000,000,000 / 86,400 ≈ 580,000 requests/sec.
    • Peak QPS (e.g., New Year's) ≈ 3× Average ≈ 1.7 Million requests/sec.
  • Storage (Text):
    • Average message size ≈ 100 Bytes.
    • Daily Text Storage = 50 Billion × 100 Bytes = 5 TB / day.
    • Yearly Text Storage = 5 TB × 365 ≈ 1.8 PB (Petabytes).
    • Conclusion: We need a database that scales horizontally (Sharding).
  • Storage (Media):
    • Assume 10% of messages are media. Average size ≈ 200 KB.
    • Daily Media Storage = 5 Billion × 200 KB = 1 PB / day.
    • Conclusion: Media must use Object Storage (S3), not the database.
  • Bandwidth:
    • Ingress (Media) ≈ 1 PB / day ≈ 11.5 GB/sec. Massive bandwidth usage; requires a CDN.
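
To sanity-check this arithmetic, the short TypeScript sketch below recomputes the same figures from the assumptions stated above (message count, message size, media fraction). The constants are this design's estimates, not measured numbers.

```typescript
// Recompute the back-of-the-envelope numbers from the assumptions above.
const MESSAGES_PER_DAY = 50e9;      // 50 Billion messages/day
const SECONDS_PER_DAY = 86_400;
const AVG_MSG_BYTES = 100;          // average text message size
const MEDIA_FRACTION = 0.1;         // 10% of messages carry media
const AVG_MEDIA_BYTES = 200_000;    // ~200 KB per media file

const avgQps = MESSAGES_PER_DAY / SECONDS_PER_DAY;                        // ~580,000 msg/sec
const peakQps = 3 * avgQps;                                               // ~1.7 Million msg/sec
const textTBPerDay = (MESSAGES_PER_DAY * AVG_MSG_BYTES) / 1e12;           // ~5 TB/day
const textPBPerYear = (textTBPerDay * 365) / 1000;                        // ~1.8 PB/year
const mediaPBPerDay =
  (MESSAGES_PER_DAY * MEDIA_FRACTION * AVG_MEDIA_BYTES) / 1e15;           // ~1 PB/day
const mediaIngressGBps = (mediaPBPerDay * 1e15) / SECONDS_PER_DAY / 1e9;  // ~11.5 GB/sec

console.log({ avgQps, peakQps, textTBPerDay, textPBPerYear, mediaPBPerDay, mediaIngressGBps });
```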

5. API design

For real-time chat, plain HTTP request/response is a poor fit: the client would have to poll repeatedly (wasting bandwidth on headers), and the server has no way to push new messages to the client.

We will use WebSockets (or another persistent TCP-based protocol) to keep a connection open between the client and the server.

1. Connect (WebSocket Handshake)

  • GET /ws/connect
  • Headers: Authorization: Bearer <token>
  • Response: 101 Switching Protocols (Connection Established).

2. Send Message (WebSocket Event)

  • Client sends:
{ "type": "send_message", "data": { "to_user_id": "user-uuid-2", "content": "encrypted-blob", "type": "TEXT" } }

3. Upload Media (REST Endpoint)

  • POST /v1/media
  • Request: Multipart file (image/video).
  • Response:
{ "media\_id": "media-uuid-99", "url": "https://s3.region.amazonaws.com/bucket/media-uuid-99.jpg" }

4. Acknowledge Message (WebSocket Event)

  • Client sends:
{ "type": "ack", "data": { "message\_id": "msg-123", "status": "DELIVERED" } }

6. High level architecture

We will decouple the layer that holds the connections from the layer that processes the logic.

Client → Load Balancer → Gateway Service → Message Service → Database

  1. Client: The mobile app. Maintains a local SQLite database for chat history and a persistent WebSocket connection to the server.

  2. Load Balancer: Distributes millions of WebSocket connections across the Gateway servers.

  3. Gateway Service (Connection Holder):

    • A stateful server that holds the persistent TCP/WebSocket connection for users.
    • It maintains a map of UserID → Socket Connection.
    • It acts as a pipe: pushes data to clients and forwards requests to the Message Service.
  4. Session Cache (Redis): A global lookup that tells us "Which Gateway server is User B connected to?".

  5. Message Service: Stateless microservice. It handles message validation, persistence to the DB, and routing logic.

  6. Group Service: Manages group members and fan-out logic.

  7. Database (Cassandra): Stores chat history and offline message queues.

  8. Object Storage (S3): Stores images and videos.


7. Data model

We need a database optimized for extremely high write throughput. SQL databases struggle to scale to billions of writes/day without complex sharding. We will use a Wide-Column NoSQL store (Cassandra).

1. Messages Table (Chat History)

  • Partition Key: chat_id (Unique ID for 1-on-1 or Group conversation). This keeps all messages for a chat on the same database node, making "Load History" extremely fast.
  • Clustering Key: message_id (TimeUUID). This sorts messages chronologically.

Column      Type      Description
chat_id     UUID      Partition Key.
message_id  TIMEUUID  Clustering Key.
sender_id   UUID      Who sent it.
content     BLOB      Encrypted payload.
status      INT       Sent, Delivered, Read.

2. User_Unread_Queue (Offline Optimization)

  • Partition Key: user_id.
  • Clustering Key: message_id.
  • Purpose: When a user comes online, we query this table to quickly find only the messages they missed, rather than scanning the entire chat history.

3. Group_Members (SQL or NoSQL)

  • group_id (PK), user_id (PK). Used to look up "Who do I send this group message to?".
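
For reference, the row shapes above can also be written out as TypeScript types. This is illustrative only (the actual store is Cassandra, so each field maps to a CQL column); the chat_id field on the unread queue is an assumption, added so the full message can later be fetched from the Messages table.

```typescript
// Row shapes for the three tables described above (illustrative types).
interface MessageRow {
  chat_id: string;      // Partition Key: all messages of a chat live together
  message_id: string;   // Clustering Key: TimeUUID, sorts chronologically
  sender_id: string;
  content: Uint8Array;  // encrypted payload (BLOB)
  status: 0 | 1 | 2;    // INT: 0 = Sent, 1 = Delivered, 2 = Read
}

interface UserUnreadQueueRow {
  user_id: string;      // Partition Key
  message_id: string;   // Clustering Key
  chat_id: string;      // assumption: lets us fetch the full row from the Messages table
}

interface GroupMemberRow {
  group_id: string;     // Partition Key
  user_id: string;      // Clustering Key
}
```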

8. Core flows end to end

This section details exactly how data moves through the system. We will trace the path of a message step-by-step to understand how the components (Gateways, Database, Caches) interact.

Flow 1: User A sends a message to User B (Both Online)

This is the "happy path" where both users have the app open and are connected to the internet.

  1. Connection Established (Pre-requisite)
    • Both User A and User B have already established a long-lived WebSocket connection with the Gateway Service.
    • User A is connected to Gateway Node 1.
    • User B is connected to Gateway Node 2.
    • Redis (Session Cache) stores this mapping: User A -> Gateway 1 and User B -> Gateway 2.
  2. Message Sending (User A)
    • User A types "Hello" and presses send.
    • The mobile app generates a message_id locally (for deduplication) and sends a JSON payload over the existing WebSocket connection to Gateway Node 1.
    • Note: The UI optimistically shows the message with a pending indicator (e.g., a clock icon) to make the app feel responsive, even before the server confirms.
  3. Processing and Persistence
    • Gateway Node 1 receives the payload. It is just a pipe, so it forwards the request to the Message Service (the brain of the system).
    • The Message Service first persists the message to the Cassandra Database (Messages Table) with a status of SENT.
    • Why Persist First? We save to disk before attempting delivery to ensure we never lose the message, even if the server crashes milliseconds later.
  4. Acknowledgment to Sender
    • Once Cassandra confirms the write, the Message Service tells Gateway Node 1 to send an ACK back to User A.
    • User A’s screen updates: The pending indicator becomes a single grey tick (Sent: the server has received the message).
  5. Routing and Delivery
    • The Message Service needs to find User B. It queries the Redis Session Cache: "Where is User B connected?"
    • Redis replies: "User B is on Gateway Node 2."
    • The Message Service forwards the message payload to Gateway Node 2.
    • Gateway Node 2 pushes the message down the open WebSocket connection to User B's device.
  6. Delivery Receipt
    • User B's app receives the message and immediately sends a background ACK: DELIVERED packet back to Gateway Node 2.
    • This Ack travels back: Gateway 2 -> Message Service -> Update Cassandra Status -> Gateway 1 -> User A.
    • User A’s screen updates: The tick becomes a double grey tick (Delivered).
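
A minimal sketch of this server-side send path is shown below. The Db, SessionCache, and Gateways interfaces are hypothetical stand-ins for Cassandra, Redis, and the Gateway fleet; the names and shapes are illustrative, not a real client library.

```typescript
// Message Service send path for Flow 1: persist first, ack the sender,
// then look up the recipient's gateway and push the message.
interface Db {
  insertMessage(row: {
    chat_id: string;
    message_id: string;
    sender_id: string;
    content: string;
    status: "SENT";
  }): Promise<void>;
}
interface SessionCache {
  get(userId: string): Promise<string | null>; // Gateway node id, or null if offline
}
interface Gateways {
  push(gatewayId: string, userId: string, payload: unknown): Promise<void>;
}

async function handleSend(
  db: Db,
  sessions: SessionCache,
  gateways: Gateways,
  msg: { chat_id: string; message_id: string; sender_id: string; to_user_id: string; content: string },
): Promise<"DELIVERED_TO_GATEWAY" | "RECIPIENT_OFFLINE"> {
  // 1. Persist first, so a crash a moment later cannot lose the message.
  await db.insertMessage({
    chat_id: msg.chat_id,
    message_id: msg.message_id,
    sender_id: msg.sender_id,
    content: msg.content,
    status: "SENT",
  });

  // 2. Ack the sender via their own gateway (single grey tick).
  const senderGateway = await sessions.get(msg.sender_id);
  if (senderGateway) {
    await gateways.push(senderGateway, msg.sender_id, {
      type: "ack",
      data: { message_id: msg.message_id, status: "SENT" },
    });
  }

  // 3. Route to the recipient: which gateway holds User B's socket?
  const recipientGateway = await sessions.get(msg.to_user_id);
  if (!recipientGateway) return "RECIPIENT_OFFLINE"; // handled in Flow 2 below

  await gateways.push(recipientGateway, msg.to_user_id, { type: "message", data: msg });
  return "DELIVERED_TO_GATEWAY";
}
```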

Flow 2: User A sends a message to User B (User B is Offline)

This flow handles the case where the recipient has no internet connection or has the app closed.

  1. Send and Persist (Same as Flow 1)
    • User A sends the message.
    • The Message Service saves it to Cassandra as SENT.
    • User A receives the single grey tick (Server Confirmation).
  2. Routing Failure
    • The Message Service queries Redis for User B.
    • Redis returns NULL or an "Offline" status.
  3. Queuing for Later
    • The Message Service writes a new entry into the User_Unread_Queue (a specific table in Cassandra designed for fast lookups).
    • It stores user_id: B and message_id: XYZ.
  4. External Notification
    • Since the WebSocket path is dead, the Message Service calls the Push Notification Service.
    • This service sends a payload to Apple (APNS) or Google (FCM) to wake up User B’s phone with a system notification: "New message from User A."
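
Continuing the sketch from Flow 1, the offline branch only needs two extra calls: record the missed message_id and wake the device. UnreadQueue and PushService are hypothetical stand-ins for the Cassandra table and the APNS/FCM integration.

```typescript
// Offline branch: queue the message id for later sync and send a push.
interface UnreadQueue {
  enqueue(userId: string, messageId: string): Promise<void>;
}
interface PushService {
  notify(userId: string, text: string): Promise<void>; // calls APNS or FCM
}

async function handleOfflineRecipient(
  unread: UnreadQueue,
  push: PushService,
  toUserId: string,
  messageId: string,
  senderName: string,
): Promise<void> {
  // The full message is already in the Messages table (written in Flow 1);
  // here we only record *which* messages User B has not yet received.
  await unread.enqueue(toUserId, messageId);
  await push.notify(toUserId, `New message from ${senderName}`);
}
```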

Flow 3: User B comes Online (Synchronization)

This flow explains how User B catches up on missed messages when they open the app.

  1. Reconnection
    • User B opens the app. The client establishes a new WebSocket connection through the Load Balancer to an available Gateway Node (e.g., Gateway Node 3).
    • Gateway Node 3 updates Redis: User B -> Gateway 3.
  2. Checking the Queue
    • Upon connection, Gateway Node 3 (or the Message Service) automatically queries the User_Unread_Queue table in Cassandra for User B.
  3. Bulk Delivery
    • If the queue has messages (e.g., 5 unread messages), the service fetches the full message content from the Messages Table.
    • It sends all 5 messages down the WebSocket to User B in a batch.
    • User B's client sends ACK: DELIVERED for all of them.
  4. Cleanup
    • Once the specific ACKs are received, the system deletes those entries from the User_Unread_Queue so they aren't redelivered later.
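
The synchronization step can be sketched the same way. Store and ClientSocket are again hypothetical interfaces; the key detail is that queue entries are deleted only after the client acknowledges the batch, which preserves at-least-once delivery.

```typescript
// Sync on reconnect: read the unread queue, push everything in one batch,
// and clear the queue only once the device has confirmed delivery.
interface Store {
  listUnreadIds(userId: string): Promise<string[]>;
  fetchMessages(messageIds: string[]): Promise<unknown[]>;
  deleteUnread(userId: string, messageIds: string[]): Promise<void>;
}
interface ClientSocket {
  send(payload: unknown): void;
  waitForDeliveredAcks(messageIds: string[]): Promise<void>;
}

async function syncOnReconnect(store: Store, socket: ClientSocket, userId: string): Promise<void> {
  const missedIds = await store.listUnreadIds(userId);
  if (missedIds.length === 0) return;

  const messages = await store.fetchMessages(missedIds);
  socket.send({ type: "batch", data: messages });

  // At-least-once: if the client never acks, the entries stay and are retried.
  await socket.waitForDeliveredAcks(missedIds);
  await store.deleteUnread(userId, missedIds);
}
```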

9. Caching and read performance

  • Session Cache (Redis):
    • Data: Key: UserID, Value: Gateway_ID.
    • Why: This is hit for every message sent. It must be blazing fast (latency < 5ms); a short sketch of this cache follows this list.
    • TTL: Keys expire if the WebSocket connection is closed or heartbeats fail.
  • User Profile Cache:
    • Data: Display Name, Avatar URL, Status.
    • Why: Profile data rarely changes but is read often.
  • Chat History:
    • We do not cache chat history in Redis. It is too large (Petabytes).
    • Cassandra is optimized for sequential reads. When a user opens a chat, we fetch the last 50 messages from Cassandra efficiently using the chat_id.
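
Here is a sketch of the session cache, assuming the ioredis client. The key prefix, TTL value, and heartbeat convention are assumptions; the point is that entries expire on their own when a gateway stops refreshing them.

```typescript
// Session cache: UserID -> Gateway_ID, kept alive by gateway heartbeats.
import Redis from "ioredis";

const redis = new Redis(); // connection details omitted

const SESSION_TTL_SECONDS = 60; // assumption: roughly 2-3 missed heartbeats

// Called by a Gateway node when a user connects, and again on each heartbeat.
async function registerSession(userId: string, gatewayId: string): Promise<void> {
  await redis.set(`session:${userId}`, gatewayId, "EX", SESSION_TTL_SECONDS);
}

// Called by the Message Service for every outgoing message.
async function lookupGateway(userId: string): Promise<string | null> {
  return redis.get(`session:${userId}`); // null means offline -> Flow 2 path
}
```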

10. Storage, indexing and media

  • Database (Cassandra):
    • Used for all text messages and metadata.
    • Indexing: The primary key (chat_id, message_id) allows us to query "Get messages for Chat X where time > Y".
  • Media Storage (S3 + CDN):
    • We never store binary images in Cassandra.
    • Flow:
      1. User uploads image to API Server → S3.
      2. API Server returns a URL.
      3. User sends a text message containing the URL.
      4. Receiver's phone sees the URL and downloads the image from a CDN (like CloudFront) to ensure fast download speeds globally.
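
The upload half of this flow might look like the sketch below, assuming the AWS SDK v3 S3 client. The bucket name and CDN hostname are placeholders, and in production the object key and extension would be derived from the upload rather than hard-coded.

```typescript
// Media upload: store the binary in S3, return an id + CDN URL for the chat message.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { randomUUID } from "node:crypto";

const s3 = new S3Client({ region: "us-east-1" });

async function uploadMedia(
  fileBytes: Uint8Array,
  contentType: string,
): Promise<{ media_id: string; url: string }> {
  const mediaId = `media-${randomUUID()}`;
  await s3.send(
    new PutObjectCommand({
      Bucket: "chat-media-example", // placeholder bucket
      Key: mediaId,
      Body: fileBytes,
      ContentType: contentType,
    }),
  );
  // Receivers download through the CDN, never from S3 directly.
  return { media_id: mediaId, url: `https://cdn.example.com/${mediaId}` };
}
```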

11. Scaling strategies

  • Gateway Scaling:
    • To handle 1 Billion users, we need thousands of Gateway servers.
    • We use a Load Balancer to route new users to the least busy server.
    • If a Gateway crashes, the 100k users on it disconnect and automatically reconnect to a new Gateway.
  • Database Partitioning:
    • Cassandra partitions data automatically based on the chat_id.
    • This ensures even distribution of data across the cluster.
  • Group Chat Fan-out:
    • For a group with 200 people, the server must generate 200 outgoing messages.
    • To prevent blocking the sender, we use a Message Queue (Kafka). The Message Service publishes a "Group Message" event to Kafka. Background workers consume this event, look up the 200 members, and process the delivery for each member in parallel.
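
A sketch of this fan-out, assuming the kafkajs client, is shown below. The topic name, broker address, and the listMembers / deliverTo helpers (standing in for the Group Service lookup and the per-user delivery logic from Flows 1 and 2) are illustrative assumptions.

```typescript
// Group fan-out: the Message Service publishes one event; workers expand it.
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "chat", brokers: ["kafka-1:9092"] });
const producer = kafka.producer();

// Producer side: the sender is not blocked on 200 individual deliveries.
async function publishGroupMessage(groupId: string, messageId: string): Promise<void> {
  await producer.connect(); // in practice, connected once at startup
  await producer.send({
    topic: "group-messages",
    messages: [{ key: groupId, value: JSON.stringify({ groupId, messageId }) }],
  });
}

// Worker side: consume events and deliver to every member in parallel.
async function runFanoutWorker(
  listMembers: (groupId: string) => Promise<string[]>,
  deliverTo: (userId: string, messageId: string) => Promise<void>,
): Promise<void> {
  const consumer = kafka.consumer({ groupId: "fanout-workers" });
  await consumer.connect();
  await consumer.subscribe({ topics: ["group-messages"] });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const { groupId, messageId } = JSON.parse(message.value!.toString());
      const members = await listMembers(groupId);
      await Promise.all(members.map((userId) => deliverTo(userId, messageId)));
    },
  });
}
```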

12. Reliability, failure handling and backpressure

  • No Single Point of Failure:
    • Gateways, Services, and Databases are all replicated across multiple Availability Zones (AZs).
  • Store and Forward:
    • We guarantee At-Least-Once delivery. If we are not sure if User B got the message (e.g., network timeout during Ack), we resend it. The client app handles deduplication using the message_id.
  • Backpressure:
    • If a user is being spammed (e.g., receiving 1000 messages/sec), the Gateway buffer will fill up. The Gateway can stop reading from the socket (TCP flow control), forcing the sender to slow down.
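
One way to sketch this: each connection keeps an outbound buffer with a high-water mark, and when a recipient's buffer is over the limit, the gateway stops reading from the sender's underlying TCP socket (net.Socket pause/resume) until the buffer drains. The thresholds and the Connection shape are assumptions; a real gateway fleet would also need cross-gateway signalling when sender and recipient sit on different nodes.

```typescript
// Backpressure via TCP flow control: stop reading from a sender whose
// recipient cannot keep up, resume once the recipient's buffer drains.
import type { Socket } from "node:net";

const HIGH_WATER_MARK = 1_000; // queued frames per recipient (assumption)
const LOW_WATER_MARK = 100;

interface Connection {
  socket: Socket;      // the sender's underlying TCP socket
  outbound: unknown[]; // frames waiting to be flushed to the recipient
}

function enqueueForRecipient(sender: Connection, recipient: Connection, frame: unknown): void {
  recipient.outbound.push(frame);
  if (recipient.outbound.length >= HIGH_WATER_MARK) {
    sender.socket.pause(); // stop reading; the sender's TCP send window fills up
  }
}

function onRecipientDrained(sender: Connection, recipient: Connection): void {
  if (recipient.outbound.length <= LOW_WATER_MARK) {
    sender.socket.resume(); // safe to read from the sender again
  }
}
```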

13. Security, privacy and abuse

  • End-to-End Encryption (E2EE):
    • This is the gold standard (Signal Protocol).
    • Keys are generated on the user's device. The server only holds the Public Key.
    • The server processes encrypted blobs (content). It cannot read the messages.
  • Authentication:
    • Phone number verification via SMS OTP.
    • WebSocket connections are authenticated via a secure, short-lived session token.
  • Abuse:
    • Rate limiting on the "Send Message" API to prevent bots (see the token-bucket sketch after this list).
    • Ability to Block users (the server drops messages from blocked users before delivery).
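
Rate limiting the send path is commonly done with a token bucket per user; the capacity and refill rate below are illustrative assumptions, not WhatsApp's actual limits.

```typescript
// Per-user token bucket for the "send_message" event.
const CAPACITY = 20;          // allow short bursts of 20 messages
const REFILL_PER_SECOND = 5;  // sustained rate of 5 messages/sec

interface Bucket { tokens: number; lastRefill: number }
const buckets = new Map<string, Bucket>();

function allowSend(userId: string, now: number = Date.now()): boolean {
  const bucket = buckets.get(userId) ?? { tokens: CAPACITY, lastRefill: now };
  // Refill proportionally to elapsed time, capped at capacity.
  const elapsedSeconds = (now - bucket.lastRefill) / 1000;
  bucket.tokens = Math.min(CAPACITY, bucket.tokens + elapsedSeconds * REFILL_PER_SECOND);
  bucket.lastRefill = now;

  if (bucket.tokens < 1) {
    buckets.set(userId, bucket);
    return false; // reject or drop the message
  }
  bucket.tokens -= 1;
  buckets.set(userId, bucket);
  return true;
}
```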

14. Bottlenecks and next steps

  • Bottleneck: Group Fan-out:
    • For very large groups (e.g., 10,000 users), the loop-based fan-out is too slow.
    • Next Step: Switch to a "Pull" model for large groups, where users poll for the latest messages when they open the group, rather than the server pushing every message.
  • Bottleneck: Global Latency:
    • A user in Australia messaging a user in the UK might experience lag if the data center is in the US.
    • Next Step: Deploy Gateways in multiple regions (Edge locations). Connect these regions via a high-speed private backbone network to reduce latency.

Summary

  1. Architecture: Stateful Gateways (WebSockets) + Stateless Services.

  2. Storage: Cassandra for write-heavy chat logs; S3 for media.

  3. Reliability: "Unread Queue" ensures messages are never lost if a user is offline.

  4. Performance: Routing via the Redis Session Cache keeps the "where is this user connected?" lookup to a few milliseconds, enabling near real-time delivery.
