00 — Quick Orientation
Chat looks like one problem from the outside but decomposes into several distinct sub-problems once you start designing it. Persistent connections drive the architecture. Delivery semantics drive the reliability story. Group chat is its own fan-out problem. Presence has its own scaling characteristics. Each of these is a depth probe; strong walkthroughs treat them as separate beats rather than blending them into one continuous discussion.
The four framework steps from the methodology page structure the walkthrough:
- Clarify (5 min) — Scope, scale, edge cases. One-on-one only, or also groups? Encryption? What's the latency target?
- Decompose (10 min) — High-level architecture. Where do persistent connections live, how does a message flow from sender to recipient.
- Deep-dive (25-30 min) — Message delivery semantics, group fan-out, brief notes on presence and end-to-end encryption.
- Evaluate (5 min) — Failure modes, what we'd add at the next scale, where the design would break.
Pattern recognition first
This is a chat / messaging pattern. Naming it out loud in the first minute is the senior signal: "This is a chat messaging pattern. The dominant decisions are how we handle persistent connections at scale, what delivery guarantees we offer, and how group chat fans out. End-to-end encryption is in scope for some variants. Where would you like me to focus deepest?" That sentence does the same work as in any walkthrough: shows pattern fluency, sets up the conversation, gives the interviewer a chance to redirect.
Step 1 — Clarify (5 min)
Chat has a wide scope by default. The clarification step narrows the scope and surfaces assumptions that will drive every later decision.
The conversation
You
Before I dive in, let me confirm the scope. When you say "WhatsApp," are we focused on the core messaging experience — sending and receiving text messages, one-on-one and groups — or do you also want me to cover voice/video calls, status updates, payments?
Interviewer
Stay focused on text messaging for the core. Cover both one-on-one and group chats. We can talk about other features briefly if there's time.
You
What scale? Day-one launch, or established WhatsApp scale with billions of users?
Interviewer
Mature scale. Assume two billion users globally with about a billion daily actives.
You
A few more clarifying questions. What's our latency target for message delivery — from the moment the sender hits send to the moment the recipient sees it? And is end-to-end encryption in scope?
Interviewer
Sub-second p99 for online recipients. End-to-end encryption is in scope but I'd like you to focus on it briefly rather than spend the whole interview on key exchange.
You
Group sizes? And do we need read receipts and typing indicators, or just delivery?
Interviewer
Cap groups at a few hundred members; you don't need to design for broadcast channels of millions. Read receipts yes, typing indicators yes but lower priority.
Functional requirements
- Send / receive messages. Text only for the core design.
- One-on-one chat. Two users exchanging messages.
- Group chat. Multiple users (capped at a few hundred members) sharing a conversation.
- Presence. Online / offline / last-seen indicators.
- Read receipts. Sender sees when recipient has read the message.
- Message persistence. Full conversation history, retrievable.
- Push notifications. Wake up offline clients when messages arrive.
- End-to-end encryption. Server can route but not read message content.
Non-functional requirements
- Scale. Two billion users, one billion DAU.
- Latency. Sub-second p99 message delivery for online recipients.
- Reliability. No lost messages. At-least-once delivery is acceptable; client deduplicates.
- Availability. Highly available. Brief degradation acceptable; permanent message loss not.
- Mobile-friendly. Intermittent connectivity, low data usage, battery-conscious.
Quick scale estimation
| Quantity | Estimate |
|---|---|
| Daily active users | ~1B DAU |
| Messages per user per day | ~80 (sent and received combined) |
| Total messages per day | ~80B / day |
| Average message rate | ~1M messages / second |
| Peak message rate (3x average) | ~3M messages / second |
| Concurrent online users (peak) | ~300M (30% of DAU online at peak) |
| Persistent connections at peak | ~300M concurrent |
| Average message size | ~200 bytes (text) |
| Daily message storage | ~16 TB / day · ~6 PB / year |
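The table's figures are worth being able to reproduce on a whiteboard. A quick sketch of the arithmetic, using the rounded assumptions from the clarification step (all inputs are assumptions, not measurements):

```python
# Back-of-envelope estimates from the clarification step (assumed inputs).
dau = 1_000_000_000            # daily active users
msgs_per_user = 80             # sent + received, per user per day
msg_size_bytes = 200           # average text message

msgs_per_day = dau * msgs_per_user                        # 80B / day
avg_rate = msgs_per_day / 86_400                          # ~0.9M / sec -> round to ~1M
peak_rate = 3 * avg_rate                                  # ~2.8M / sec -> round to ~3M
concurrent = int(dau * 0.30)                              # ~300M online at peak
daily_storage_tb = msgs_per_day * msg_size_bytes / 1e12   # ~16 TB / day
yearly_storage_pb = daily_storage_tb * 365 / 1000         # ~5.8 PB / year

print(f"{avg_rate / 1e6:.1f}M msg/s avg, {peak_rate / 1e6:.1f}M msg/s peak")
print(f"{concurrent / 1e6:.0f}M concurrent, {daily_storage_tb:.0f} TB/day, "
      f"{yearly_storage_pb:.1f} PB/yr")
```

The value of doing this out loud is that every later sizing decision (gateway count, storage tiering) traces back to one of these numbers.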
Three things this tells us. First, the message rate (3M/sec peak) is high but not exotic. Second, the concurrent connection count (300M) is the hard number — keeping persistent connections to hundreds of millions of clients is what distinguishes chat from request-response systems. Third, storage at 6 PB/year compounds — we need to think about retention and tiering.
The connection number is also what makes the gateway tier the load-bearing layer. A typical server handles ~50K-100K WebSocket connections; 300M concurrent connections means several thousand gateway servers, geographically distributed.
What good clarification looks like
The questions that mattered: scope (core messaging vs all features), scale (two billion vs day-one), latency (sub-second), encryption (yes but brief), groups (capped at a few hundred). Each one had a clear effect on the design. Notice what didn't get asked: "what does the UI look like?" or "do users have profile pictures?" — both real concerns but they don't drive the architecture.
Step 2 — Decompose (10 min)
The high-level architecture. The defining feature: persistent connections from clients to the gateway tier, which is what enables real-time push from server to client without polling.
High-Level Architecture
A message flows: sender's persistent connection to Gateway A, into Message Service, persisted in Message Store and queued for routing. Delivery Workers route the message either to the recipient's Gateway B (if online) or to the Push Service for mobile push notification (if offline). Both clients hold persistent connections; they're long-lived TCP sessions, not request-response.
Walking through the components
- Clients (sender / recipient). Mobile or web. Maintain a persistent WebSocket connection to a gateway server. The connection stays open as long as the client is online; messages can be pushed in either direction without polling.
- Gateway tier. Terminates client connections. Each gateway server holds tens of thousands of open WebSocket connections. The gateway translates between the client wire protocol and internal message events. It's also the hardest tier to operate because of stateful long-lived connections — see the rate limiting deep-dive on the connection-stickiness problem.
- Message Service. The central authority. Validates the message, generates a server-assigned message ID and timestamp, persists the message to the store, publishes a routing event to the queue. Returns ACK to the sender's gateway, which forwards it to the sender client.
- Message Store. The system of record. Sharded by conversation ID (one-on-one or group). Cassandra, DynamoDB, or similar — the workload is append-heavy with point lookups, exactly what these stores are good at. Database selection and sharding cover the underlying patterns.
- Routing Queue. Kafka or equivalent. Decouples the sender's write from the recipient's delivery. Each message published once; downstream Delivery Workers consume.
- Delivery Workers. Consume from the routing queue. For each recipient, look up whether they're online (presence service) and which gateway holds their connection. If online: route the message event to that gateway. If offline: hand off to the push service.
- Push Service. Talks to APNs (Apple) and FCM (Google) to send notifications to mobile devices that don't have an active connection. This is the "you have a new message" notification on the lock screen.
Why the persistent connection is load-bearing
The most important architectural fact in chat: clients hold persistent connections, not request-response sessions. This single decision drives several downstream concerns:
- Stateful gateways. Gateway servers are not stateless; each one tracks which users' connections it holds. Load balancing is not just round-robin; the recipient's gateway location must be discoverable.
- Connection scaling. Each gateway handles ~50K-100K open connections. 300M concurrent users means thousands of gateway servers. Operationally non-trivial.
- Failure handling. If a gateway crashes, all of its connections drop. Clients reconnect (often to a different gateway), and the system must catch up on undelivered messages.
- Geographic distribution. Gateways are deployed regionally; clients connect to the nearest one. The Message Service stays mostly central but sees regional traffic.
Many candidates skip this architectural commitment and end up with a polling-based design that doesn't meet the latency target. Naming the persistent connection explicitly is what distinguishes a chat design from a generic request-response design.
What you're not building yet
Voice/video calls, status updates, payments, sticker packs, message reactions, message search, archives, multi-device sync. All real chat features but not in our scope. Mention them if asked, but don't add them to the diagram.
Step 3a — Deep-Dive: Delivery Semantics (12 min)
The single most-probed depth question in chat interviews: how do you guarantee message delivery? This is where many candidates fall apart by promising exactly-once or by handwaving "we'll just use a queue." The right answer is at-least-once delivery with idempotent client handling and an explicit message lifecycle.
The delivery semantics choice
Three options; only two are practical, and only one is the canonical answer.
Decision
At-most-once, exactly-once, or at-least-once?
At-most-once. Send the message; if anything fails, drop it. No duplicates ever, but messages can be lost. Unacceptable for chat — losing messages silently is the failure mode users complain about most.
Exactly-once. Each message delivered exactly one time, no duplicates, no losses. Theoretically perfect. In practice, achieving exactly-once across a distributed system requires distributed transactions or sophisticated idempotency tokens; the cost is high and the failure modes are worse than at-least-once with deduplication.
At-least-once. The message is delivered one or more times. Duplicates are possible (network glitch causes a redelivery, for example). The client deduplicates based on message ID. This is what production chat systems use.
The message lifecycle
A message moves through three states as it traverses the system. Each transition involves an explicit acknowledgement that flows back to the sender, which is what powers the WhatsApp-style checkmark UI ("sent" / "delivered" / "read").
Message Lifecycle: Three States, Three ACKs
Three states, three rounds of ACK. SENT means the server has persisted the message. DELIVERED means the recipient's device has it. READ means the user actually viewed it. Each stage is its own ACK that flows back to the sender. The dashed lines are ACKs returning; the solid lines are message movements forward.
Walking through each stage
SENT: client → server, server persists
Client sends the message with a client-side ID (typically a UUID or monotonic counter). The server receives it, generates a server-assigned message ID, persists to the message store, and returns ACK with the server ID. Until the client receives this ACK, the message shows "sending..." in the UI. After ACK, it shows the single check ("sent").
The retry behavior here matters: if the client doesn't receive ACK within a timeout (a few seconds), it retries with the same client_id. The server uses the client_id for idempotency — if it's already seen this client_id, return the same server_id rather than creating a duplicate.
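A minimal sketch of that server-side idempotency check. The names (`MessageService`, `client_id`, `server_id`) come from the walkthrough; the in-memory dicts are a stand-in for the real message store:

```python
import uuid

class MessageService:
    """Sketch: idempotent message acceptance keyed on the client-assigned ID."""
    def __init__(self):
        self._by_client_id = {}   # client_id -> server_id (stands in for a real store)
        self._messages = {}       # server_id -> message body

    def send(self, client_id: str, body: str) -> str:
        # A retry with the same client_id returns the original server_id;
        # no duplicate row is ever created.
        if client_id in self._by_client_id:
            return self._by_client_id[client_id]
        server_id = uuid.uuid4().hex
        self._messages[server_id] = body
        self._by_client_id[client_id] = server_id
        return server_id          # this ACK flips the sender's UI to "sent"

svc = MessageService()
first = svc.send("client-abc", "hello")
retry = svc.send("client-abc", "hello")   # network glitch -> client retried
assert first == retry and len(svc._messages) == 1
```

In production the `client_id -> server_id` mapping would live in the message store itself (or a short-TTL cache in front of it), not process memory.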
DELIVERED: server → recipient, recipient ACKs
The Delivery Worker (consuming from the routing queue) checks recipient presence. If online, the worker hands the message event to the recipient's gateway, which pushes over the open WebSocket. The recipient's client receives, persists locally, and sends a delivery ACK back. The ACK flows server-side back to the sender's session, which updates the sender's UI to two checks ("delivered").
If the recipient is offline, this stage is delayed. The message sits in the recipient's pending queue (or is fetched on next reconnect). The push notification fires concurrently to wake the device.
READ: recipient opens chat, sends read receipt
When the recipient actually opens the conversation and sees the message, the client sends a read receipt. Server forwards to sender, sender's UI updates to two blue checks ("read"). This stage is only relevant if read receipts are enabled; some users disable them for privacy.
Idempotency: making at-least-once safe
At-least-once means duplicates can happen. The client must handle them safely. Two common dedupe strategies:
- Server-assigned IDs. The server stamps each message with a unique server_id. When the client receives a message, it checks against locally-stored recent IDs. If already seen, drop the duplicate silently.
- Client-assigned IDs. The client_id (sent with the message originally) is also used for dedupe. Useful for sender-side: if the network glitches and the client retries the send, the server uses client_id to avoid creating two copies.
Most production systems use both: server_id for downstream deduplication, client_id for retry safety on the sending side. The cost is small (a few-hundred-ID rolling buffer per client) and the reliability gain is large.
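The recipient-side rolling buffer is small enough to sketch directly. This is an illustrative implementation under the assumptions above (a bounded buffer of recent `server_id`s; capacity is a tunable, not a prescribed value):

```python
from collections import OrderedDict

class DedupeBuffer:
    """Rolling buffer of recently seen server_ids; drops duplicate deliveries."""
    def __init__(self, capacity: int = 512):
        self._seen = OrderedDict()
        self._capacity = capacity

    def accept(self, server_id: str) -> bool:
        """Return True if the message is new and should be displayed."""
        if server_id in self._seen:
            return False                       # duplicate delivery: drop silently
        self._seen[server_id] = True
        if len(self._seen) > self._capacity:   # evict oldest to bound memory
            self._seen.popitem(last=False)
        return True

buf = DedupeBuffer(capacity=3)
assert buf.accept("m1") and buf.accept("m2")
assert not buf.accept("m1")                    # redelivered: suppressed
```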
Ordering
The other delivery question: what order do messages appear in the conversation? Two layers of ordering:
- Within a conversation, server timestamps order messages. The server stamps each message at receipt time; messages are stored in timestamp order per conversation. Client UIs sort by this timestamp.
- Across conversations (the user's chat list), last-message-time orders. The recent-conversations view sorts by the most recent message in each conversation.
Clock skew between users is a non-issue because the server is the single source of timestamps. Clock skew between server replicas can cause minor reordering at the boundary, which is usually acceptable; if strict ordering is required (rare in chat), use sequencers per conversation rather than wall clocks.
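One small refinement worth naming: sort by `(timestamp, server_id)` rather than timestamp alone, so messages with equal timestamps order deterministically across renders and devices. A sketch (field names are illustrative):

```python
# Deterministic conversation ordering: server timestamp first, server_id as
# tiebreak, so two messages stamped in the same millisecond never flip order.
messages = [
    {"server_id": "b", "ts": 100, "text": "second?"},
    {"server_id": "a", "ts": 100, "text": "first?"},
    {"server_id": "c", "ts": 99,  "text": "earliest"},
]
ordered = sorted(messages, key=lambda m: (m["ts"], m["server_id"]))
assert [m["server_id"] for m in ordered] == ["c", "a", "b"]
```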
At-least-once is the answer. Exactly-once is a trap. Idempotent clients turn duplicate handling into a non-event.
Step 3b — Deep-Dive: Group Chat Fan-Out (8 min)
Group chat looks like one-on-one chat but it isn't. The architectural decisions diverge at the moment of "send a message to a group of N people." Strong walkthroughs treat group chat as its own sub-problem rather than handwaving it as the same plus more.
The choice: store-once or store-per-recipient?
Two ways to think about a group message: as one logical message that belongs to a conversation, or as N copies (one for each recipient).
- Store-once-per-conversation. The message is persisted once in the conversation's storage. Each member's view of the conversation reads from the same shared log. Storage is cheap.
- Store-per-recipient. The message is copied N times, once per recipient member. Each member has their own message log. Storage is N times more expensive but per-recipient state (delivery, read) is simpler.
The right answer for a chat platform with capped group sizes is store-once-per-conversation, plus per-recipient delivery state tracked separately. We don't pay N-times storage; we do pay the small overhead of tracking delivery/read state per member.
The fan-out itself
Once persisted, the message must reach every member. The fan-out is similar to the social feed pattern but smaller in scale (a few hundred members vs millions of followers).
Group Chat Fan-Out: One Message, N Delivery Paths
A group message is stored once in the conversation but routed individually to each member. Online members get the message over their open WebSocket. Offline mobile members get a push notification that wakes the device. Offline non-mobile members pull on their next reconnect from the pending queue.
Per-recipient delivery state
Each member tracks their own delivery state for the message: not-yet-delivered, delivered, read. Storing per-recipient state separately from the message itself keeps the storage layout clean: one row in the message table (the message), N rows in a delivery state table (one per recipient member).
For groups of a few hundred members, this is bounded and cheap. For broadcast channels (millions of subscribers), this would explode — which is why we limited group size to a few hundred during clarification. Broadcast channels are a different pattern with different decisions.
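The storage layout described above — one message row, N delivery-state rows — can be sketched in a few lines. Dicts stand in for the two tables; all names are illustrative:

```python
# Store-once-per-conversation, with per-recipient delivery state tracked
# separately. One row for the message; N small rows for member state.
message_table = {}    # (conversation_id, server_id) -> body
delivery_state = {}   # (server_id, member_id) -> "pending" | "delivered" | "read"

def send_group_message(conversation_id, server_id, body, members, online):
    message_table[(conversation_id, server_id)] = body   # stored exactly once
    for member in members:
        # Online members get a WebSocket push; offline members a push
        # notification plus a pending-queue entry for their next reconnect.
        delivery_state[(server_id, member)] = (
            "delivered" if member in online else "pending")

send_group_message("conv1", "m1", "hi all",
                   members=["alice", "bob", "carol"], online={"alice"})
assert len(message_table) == 1                    # no N-times message copies
assert delivery_state[("m1", "bob")] == "pending"
```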
The depth-probe response
"What about read receipts in a group?" The strong response: "Each member has their own read state. The sender's UI typically shows aggregated state — 'read by 5 of 10' — rather than tracking each individual. We'd compute the aggregation when the sender opens the message details, not push it to them in real time. Real-time per-recipient read tracking would multiply the read-receipt traffic by the group size; not worth the cost."
Why broadcast channels are different
WhatsApp's broadcast lists, Telegram channels with millions of subscribers — these aren't groups in the chat sense. They're closer to the social feed pattern: one author, many subscribers, no individual delivery tracking. The architectural decisions (push vs pull, hybrid for celebrities) come from the social feed walkthrough, not from chat. Naming this distinction is a senior signal.
Step 3c — Deep-Dive: Presence, Read Receipts, Encryption (5 min)
Three sub-problems that show up in nearly every chat interview as depth probes. Each one is its own design decision; naming the canonical answer is enough to demonstrate awareness without burning interview time on full deep-dives.
Presence: heartbeats with eventual consistency
Presence (online / offline / last-seen) sounds simple but has subtle scaling problems. Naive presence (every status change is broadcast immediately to every interested party) creates massive write amplification: when a popular user comes online, hundreds of conversations need to know.
The canonical approach:
- Clients heartbeat to their gateway every ~30 seconds. The gateway forwards the heartbeat to a presence service, which updates the user's last-seen timestamp.
- Presence is eventually consistent. A user's status might be a few seconds stale. That's fine for chat; nobody's billing decisions depend on chat presence.
- Subscribers pull rather than get pushed. When a client opens a conversation, it queries the presence of the participants. The query is cheap; pushing every status change to every interested client would not be.
- Typing indicators are ephemeral. They don't persist. The typing-start event flows over the same WebSocket, broadcast to participants, expires in a few seconds.
The key insight: presence is high-volume but low-value-per-update. Optimize for cheap reads (pull on demand) rather than expensive pushes (every status change broadcast).
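The heartbeat-plus-pull model reduces to a last-seen timestamp and a staleness threshold. A sketch — the 90-second threshold (three missed ~30-second heartbeats) is an illustrative choice, not a prescribed constant:

```python
# Presence via heartbeats: a user is "online" if we heard from them recently.
last_seen = {}   # user_id -> last heartbeat time (epoch seconds)

def heartbeat(user_id, now):
    """Forwarded by the gateway roughly every 30 seconds per connected client."""
    last_seen[user_id] = now

def presence(user_id, now, threshold=90):
    """Pulled on demand when a client opens a conversation; a few seconds
    stale is acceptable (eventual consistency)."""
    ts = last_seen.get(user_id)
    if ts is None or now - ts > threshold:
        return ("offline", ts)   # ts doubles as "last seen"; may be None
    return ("online", ts)

heartbeat("alice", now=1000)
assert presence("alice", now=1030)[0] == "online"
assert presence("alice", now=1200)[0] == "offline"   # missed heartbeats -> stale
```

Note that "offline" here really means "we haven't heard from them" — the cheap, eventually consistent answer the section argues for.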
Read receipts: optional per-message acknowledgements
The READ stage in the message lifecycle. When the user actually opens the conversation, the client sends a "read up to message_id X" signal. Server marks all messages in the conversation up to that ID as read for this user, forwards to the sender's session.
For groups, the per-recipient read state is stored per (message_id, recipient_id) tuple. Aggregation ("3 of 5 read") happens at query time, not at write time.
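That query-time aggregation is a single scan over the per-recipient rows. An illustrative sketch, with a dict standing in for the `(message_id, recipient_id)` table:

```python
# Group read receipts: per-(message_id, recipient_id) rows, aggregated at
# query time when the sender opens message details -- never pushed live.
read_state = {
    ("m1", "bob"):   "read",
    ("m1", "carol"): "read",
    ("m1", "dave"):  "delivered",
}

def read_summary(message_id, members):
    read = sum(1 for m in members
               if read_state.get((message_id, m)) == "read")
    return f"read by {read} of {len(members)}"

assert read_summary("m1", ["bob", "carol", "dave"]) == "read by 2 of 3"
```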
Read receipts are user-toggleable in production systems. When disabled by either side, the receipt is suppressed end-to-end. The sender's UI shows "delivered" only, not "read."
End-to-end encryption: the server can't read
E2EE changes what the server can see. Three layers of impact:
- Message content is encrypted client-to-client. The server stores ciphertext, routes ciphertext, never sees plaintext. Keys are negotiated client-side via Signal protocol or similar (X3DH for initial key exchange, Double Ratchet for forward secrecy).
- Server can still see metadata. Who sent to whom, when, message size. The conversation graph is visible even when content isn't. This is a real privacy limitation; users sometimes assume E2EE means nothing is visible, which isn't true.
- Some features become impossible. Server-side message search can't work over encrypted content. Server-side spam filtering can only use metadata. Cross-device sync requires careful key management. These tradeoffs are real and should be acknowledged.
For groups, E2EE is more complex: each member has their own keys, so the message must be encrypted N times (once per member) or use a group key shared via secure channels. WhatsApp uses the second approach (sender keys); Signal uses pairwise. Both work; pairwise is more secure but more expensive.
The interview move on E2EE: name the canonical pattern (Signal protocol or pairwise / sender keys for groups) and acknowledge the tradeoffs (no server-side search, metadata still visible). Don't try to walk through the cryptography in detail unless the interviewer specifically asks; that's a different interview.
Step 4 — Evaluate (5 min)
The closing move. What we built, what we skipped, what would break under stress, what we'd add at the next scale.
What we got right
- Persistent connections named explicitly. The gateway tier with WebSocket termination is the architectural primitive. Without naming this, the design defaults to polling and misses the latency target.
- At-least-once delivery with idempotent clients. The canonical pattern. Avoids the exactly-once trap; pairs reliable retry with safe deduplication.
- Group chat as its own sub-problem. Fan-out to members, per-recipient delivery state, store-once-in-conversation. Recognized as distinct from one-on-one chat rather than blended.
- Three-stage message lifecycle (SENT, DELIVERED, READ). Mirrors what users see in WhatsApp / iMessage / similar; explicit ACKs at each stage.
What we'd add at the next scale
- Geographic distribution of gateways. 300M concurrent connections demand regional gateway pools. Connect users to the nearest gateway; route the routing queue regionally where possible. Replication and consistency covers the cross-region tradeoffs.
- Multi-device sync. WhatsApp now supports up to four linked devices per account; each must see the same conversation state. Adds state synchronization complexity (when a new device joins, it needs the recent message history; when one device reads, the others should reflect it).
- Tiered storage for old messages. Most queries hit recent messages. Old messages (months or years old) are rarely read. Move them to colder, cheaper storage tiers; keep recent messages in fast storage. Database selection covers the tier choices.
What we explicitly didn't cover
- Voice and video calls (whole separate signaling and media architecture, WebRTC).
- Media messages (images, videos, files) — separate object store, transcoding, link previews.
- Status updates and stories (more like a social feed pattern).
- Backups, archives, message export.
- Spam filtering and abuse detection (especially hard with E2EE).
- Account migration, phone number changes, device replacements.
Where this design would break
- Mass connection drops (gateway crash). A single gateway crash drops 50K-100K connections. Clients reconnect (often to different gateways), and the system must catch up on undelivered messages quickly. Mitigation: persistent client state (last-seen message ID per conversation), efficient reconnect-and-fetch paths.
- Routing queue backlog. If Delivery Workers fall behind during a traffic spike, message delivery latency spikes. The 1-second target is missed; users see noticeable delay. Mitigation: provision for peak, autoscale workers, and run separate queues per priority tier (interactive vs background).
- Presence storms. When network conditions cause many users to disconnect simultaneously (bad WiFi, mobile dead zones), the presence service sees a write storm. Mitigation: rate limit presence updates per gateway, batch updates, accept brief inconsistency.
- Push notification rate limits. APNs and FCM have per-app rate limits. A message storm to many offline users could hit those limits. Mitigation: prioritize, batch where possible, accept best-effort for non-critical notifications.
The evaluate step is what separates senior from staff. Naming the failure modes and what you'd add at the next scale is the operational thinking that distinguishes someone who's run systems from someone who's only designed them.
04 — Common Follow-Up Probes
The most common follow-up questions in chat messaging interviews. The strong responses below assume the design above is on the whiteboard.
Probe
"How does the recipient's gateway know it should receive this user's messages?"
Two layers. First, the user's connection to a specific gateway is registered in a presence/routing service when the connection opens. Second, the Delivery Worker queries this service to learn which gateway holds the recipient's connection. Routing service is sharded by user_id; updates are write-on-connect, write-on-disconnect, with TTLs to handle gateway crashes (stale entries expire). For high volume, the routing service is heavily cached. Observability matters here — alerting on routing-service hit rate and stale entries.
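The write-on-connect, TTL-expiry pattern described above can be sketched directly. The TTL value and names are illustrative assumptions:

```python
# Gateway routing registry: write-on-connect, TTL-refreshed, so entries for
# crashed gateways expire instead of going permanently stale.
routes = {}   # user_id -> (gateway_id, expires_at)

def register(user_id, gateway_id, now, ttl=60):
    """Called on connect; refreshed periodically while the connection lives."""
    routes[user_id] = (gateway_id, now + ttl)

def lookup(user_id, now):
    """Delivery Worker asks: which gateway holds this user's connection?"""
    entry = routes.get(user_id)
    if entry is None or now > entry[1]:
        return None          # no live connection known -> offline delivery path
    return entry[0]

register("alice", "gw-7", now=0)
assert lookup("alice", now=30) == "gw-7"
assert lookup("alice", now=120) is None   # gateway died; entry expired
```

The TTL is what bounds the damage of a gateway crash: within one TTL window, deliveries fall back to the offline path rather than routing into a dead server forever.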
Probe
"What happens when a client reconnects after being offline for hours?"
The client knows the last server_id it acknowledged before disconnecting. On reconnect, it sends "give me everything after server_id X" to the server. The server queries the message store for recent messages in each of the user's conversations after that ID and streams them back. Bounded by recent history (we don't replay months of messages on every reconnect; old messages are pulled lazily when the user opens the conversation). The pending queue acts as a fast path for very recent messages; longer gaps fall back to the message store directly.
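A sketch of that bounded catch-up query. Integer `server_id`s are used here purely so "after X" is orderable; real systems use comparable opaque IDs or per-conversation sequence numbers:

```python
# Reconnect catch-up: client asks for everything after its last ACKed server_id.
store = [  # per-conversation log, ordered by server_id
    {"server_id": 1, "text": "old, already acked"},
    {"server_id": 2, "text": "missed while offline"},
    {"server_id": 3, "text": "also missed"},
]

def catch_up(last_acked_id, limit=100):
    """Bounded replay: the recent gap only. Deeper history is pulled lazily
    when the user actually opens the conversation."""
    return [m for m in store if m["server_id"] > last_acked_id][:limit]

missed = catch_up(last_acked_id=1)
assert [m["server_id"] for m in missed] == [2, 3]
```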
Probe
"How do you handle a client sending the same message twice (network glitch)?"
The client_id makes this idempotent. The first send creates a server-assigned message_id and persists. The retry (with the same client_id) is recognized as a duplicate by the message service; it returns the existing server_id rather than creating a new message. The client sees a single ACK (just delayed). On the recipient side, dedupe by server_id ensures they only see one message. The cost is a small lookup per send; the reliability gain is significant.
Probe
"Two users in different regions, talking to each other. How does the cross-region path look?"
Sender's gateway is regional; the message goes to the regional Message Service. Persistence to the global message store happens once. The routing event is published to a regional or global queue. The Delivery Worker resolves the recipient's gateway (which is in a different region). The message event crosses regions to reach the recipient's gateway. End-to-end latency adds the cross-region hop, typically 50-200ms depending on regions. For sub-second p99, this works; for sub-100ms, you'd need same-region routing only or asymmetric architecture. Replication and consistency covers the tradeoffs.
Probe
"How would this design change for Slack or Discord, which have channels with thousands of members?"
Slack and Discord are closer to the social feed pattern at the high-membership end. A channel with 10K members is more broadcast than chat. The architecture needs hybrid push/pull (push to active members, pull-on-read for less-active members), similar to the celebrity solution from the social feed walkthrough. Persistence and routing are similar; the fan-out scaling is different. Slack's actual architecture sits between chat and feeds, with optimizations for the moderate-channel case.
Probe
"What's the failure mode if the presence service goes down?"
Delivery Workers can't determine if recipients are online, so they default to offline-path delivery (push to mobile, queue for non-mobile). Online recipients miss real-time delivery and get push notifications instead — degraded experience but not message loss. Heartbeats also fail to update; status indicators across the system go stale. Clients still send and receive messages over their established WebSocket connections; the failure is bounded to presence-driven routing decisions. Mitigation: replicated presence service with regional failover, accept degraded routing during outages.
05 — How This Walkthrough Composes Concepts
Chat messaging composes a slightly different set of concepts than social feed. Here's the explicit map:
- Message queues. The routing queue (Kafka) decouples Message Service writes from Delivery Worker consumption. At-least-once delivery semantics, partitioned by conversation or recipient. The same patterns from the message queues deep-dive apply directly.
- Database selection. Message store is Cassandra-class (append-heavy, point lookups by conversation, eventual consistency tolerable). Postgres works at smaller scale; dedicated KV at our scale.
- Sharding. Message store sharded by conversation ID. Gateway-routing service sharded by user ID. Each partition decision driven by access pattern.
- Replication and consistency. Cross-region replication of the message store and presence service. Eventual consistency on presence; stronger consistency on message persistence.
- Load balancing. Stateful load balancing for gateway connections (sticky to the same gateway for the duration of a connection). This is a less common pattern than stateless LB; the gateway-routing service is what makes it work.
- Rate limiting. Per-user message-send rate limits (spam prevention). Per-IP connection-establishment limits (DOS protection). Rate limiting covers the per-user vs per-IP patterns.
- Observability. Connection counts per gateway, message delivery latency p99, queue depth, presence service health, push notification success rate. Standard SLO targets around delivery latency.
- Caching. Gateway-routing cache (which gateway holds which user's connection). Recent-message cache per conversation for fast scrollback. Read-receipt aggregation cache for groups.
If you've worked through the concept library, you've already seen each of these primitives. The walkthrough composes them into a chat-specific solution; the concepts are the toolkit.
06 — Walkthrough FAQ
WebSockets, long polling, or Server-Sent Events?
WebSockets for production chat in 2026. They're bidirectional, low-overhead, well-supported across mobile and web. Long polling is the fallback when WebSockets aren't available (some corporate networks block them); your client library should handle the fallback gracefully. Server-Sent Events (SSE) is unidirectional (server-to-client only), so you'd still need a separate channel for client-to-server messages — workable but operationally more complex than WebSockets. The interview move: name WebSockets as the primary, mention long polling as the fallback, skip SSE unless the interviewer specifically asks.
How is this different from designing Discord or Slack?
Discord and Slack add channels with very high membership counts (thousands or tens of thousands). Once channels approach broadcast scale, the architecture shifts toward social-feed patterns: hybrid push/pull, materialized views per active member, less per-recipient delivery state. The chat fundamentals (persistent connections, message lifecycle, at-least-once + idempotent) carry over; the fan-out math changes. WhatsApp-style groups (capped at hundreds) stay in pure chat-pattern territory.
What about iMessage's specific architecture?
iMessage uses Apple Push Notification service (APNs) as the persistent-connection layer for many iOS clients, rather than maintaining its own gateway tier. The architectural difference is whether you own the connection layer (WhatsApp, Signal) or delegate it to the OS-level push system (iMessage). Owning your own gateway gives more control and lower latency; delegating saves significant infrastructure. Both are valid choices; the interview move is to name the tradeoff if the interviewer probes.
Should I memorize the message lifecycle diagram?
Internalize the structure (three states, three rounds of ACK), not the specific drawing. In an interview, you'd sketch a similar diagram on the whiteboard during the deep-dive. The point is being able to walk through "what happens when the sender hits send, what happens server-side, what happens at the recipient, what comes back." Memorizing the visual won't help; understanding the flow will.
How do I handle message ordering across multiple devices?
Multi-device sync is one of the harder parts of modern chat. Each device has its own connection, its own local state, its own pending acknowledgements. The canonical approach: the server is the source of truth for ordering (server timestamps), each device pulls from a shared per-conversation log, devices sync read-state to each other through the server. WhatsApp added multi-device support relatively recently (2021) for exactly this reason; it required significant architectural changes. For most interview discussions, single-device is in scope by default and multi-device is a follow-up.
What about end-to-end encryption breaking server-side features?
Real tradeoff. With E2EE, the server can't search messages, can't filter spam by content, can't moderate harmful content automatically. Workarounds exist: client-side search (each device searches its own decrypted messages), reporting flows (users flag messages, the act of reporting reveals content to the server), metadata-based moderation (suspicious patterns in who messages whom). None of these is as effective as server-side content access. Some platforms accept the limitation as the price of privacy; others (like Telegram) have non-encrypted modes for this reason. The honest answer is that it's a real product-level decision, not a technical one to handwave.
How do I reason about the gateway-to-message-service ratio?
Gateways are connection-bound (each holds 50K-100K connections). Message Service is throughput-bound (each handles thousands of messages per second). The ratios come out roughly: thousands of gateways for connection scale, hundreds of message-service instances for write throughput. The two scale independently; you can't infer one from the other. This is also a real interview probe — "how many of each component would you provision?" — and it's worth being able to walk through the math.
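Being able to walk through that math is the point of the probe. A sketch using the walkthrough's connection numbers plus an assumed per-instance write throughput (the 5K msgs/sec figure is illustrative, not sourced):

```python
import math

# Gateways are connection-bound; message-service instances are throughput-bound.
concurrent_connections = 300_000_000
conns_per_gateway = 75_000             # midpoint of the 50K-100K range
peak_msgs_per_sec = 3_000_000
msgs_per_service_instance = 5_000      # ASSUMED per-instance write throughput

gateways = math.ceil(concurrent_connections / conns_per_gateway)
service_instances = math.ceil(peak_msgs_per_sec / msgs_per_service_instance)
print(f"~{gateways} gateways, ~{service_instances} message-service instances")
```

With these inputs the ratio lands around thousands of gateways to hundreds of service instances — and changing either assumed capacity moves only its own tier, which is the "they scale independently" point.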
What if the interview is at a non-chat company but about a chat-shaped feature?
Same pattern, smaller scale. A customer-support chat feature on an e-commerce site uses the same primitives (persistent connections, message lifecycle, at-least-once delivery) at a fraction of the scale. The clarification step ("what's the scale?") drives whether you build the full architecture or simplify aggressively. For a feature serving thousands of concurrent users, you can probably skip the dedicated gateway tier and use a single-process WebSocket server. For millions, you can't. Pattern recognition tells you which primitives to use; scale tells you how heavily to invest in each.