How to Design a Video Conferencing System (Zoom Architecture)


This blog breaks down how to design a video conferencing system like Zoom, focusing on the backend architecture that makes real-time audio/video streaming possible. It explores how Zoom’s system delivers low-latency streams (with WebRTC protocols), coordinates group calls via media servers, and scales globally while keeping video call quality high.
Ever wondered what happens behind the scenes when you click “Join Meeting” on Zoom?
Designing a video conferencing system is like pulling off a live group chat with video and audio at lightning speed — one tiny delay, and the whole experience feels off.
From capturing your voice and video to delivering them to others around the world in milliseconds, a Zoom-like app requires clever engineering.
In this blog, we’ll walk through how to design a video conferencing system (using Zoom’s architecture as a guide).
Whether you’re curious about the tech or preparing for a system design interview, read on – you’ll learn how Zoom achieves real-time communication, how group calls are managed, and how the system scales to millions of users without skipping a beat.
Real-Time Streaming and Low Latency
A key requirement for any video call system is ultra-low latency.
In a normal streaming video (like YouTube), a few seconds of buffering is okay.
But in a live video chat, even a half-second delay is noticeable and makes conversation awkward.
The goal is to keep end-to-end latency (the time for your voice to travel from your device, through the server, to the other person) under roughly 150–300 ms for a natural, real-time feel.
How can we achieve this?
By favoring speed over perfection.
UDP over TCP
Zoom and similar apps send media over UDP (User Datagram Protocol) instead of TCP.
UDP is a “fire-and-forget” protocol – it sends packets without waiting to ensure each one arrives. This might sound risky, but it’s actually perfect for live media.
If a few video frames or audio packets get lost, it’s not a big deal; the call continues.
TCP, by contrast, tries to guarantee delivery by re-sending lost packets and delivering everything in order – great for web pages, terrible for real-time video (imagine a pause every time a packet is lost).
By not waiting for acknowledgments or retransmissions, UDP minimizes delay.
Zoom defaults to UDP for streaming since it provides the lowest latency and doesn’t mind a bit of packet loss.
In fact, Zoom only falls back to TCP (or even TLS/SSL tunnels) if it has to – for example, if you’re on a network that blocks UDP. This tiered fallback ensures that even behind strict firewalls, the call can connect (albeit with a bit more latency).
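To make the trade-off concrete, here is a minimal Python sketch of "fire-and-forget" media sending over UDP. The packet layout (a sequence number and timestamp prefix) is a simplified stand-in for a real RTP header, and the media server address is hypothetical; the key point is that nothing here ever waits for an acknowledgment.

```python
import socket
import struct
import time

MEDIA_SERVER = ("203.0.113.10", 8801)  # hypothetical media router address

def make_media_packet(seq: int, payload: bytes) -> bytes:
    """Prefix the payload with a 16-bit sequence number and a 32-bit
    millisecond timestamp so the receiver can detect loss and reordering,
    without the sender ever waiting for an ACK."""
    ts_ms = int(time.time() * 1000) & 0xFFFFFFFF
    return struct.pack("!HI", seq & 0xFFFF, ts_ms) + payload

def send_frames(frames):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for seq, frame in enumerate(frames):
        # sendto() returns immediately; a lost packet is simply gone,
        # which is fine for live media (the next frame supersedes it).
        sock.sendto(make_media_packet(seq, frame), MEDIA_SERVER)
    sock.close()
```

If a packet never arrives, the receiver notices the gap in sequence numbers and just moves on, rather than stalling the stream the way TCP retransmission would.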
WebRTC
Under the hood, Zoom leverages technologies from WebRTC (Web Real-Time Communication), which is a standard protocol suite for real-time audio/video in browsers and apps.
WebRTC handles things like NAT traversal – most users are behind routers or firewalls, so how do two devices send data directly?
WebRTC uses STUN servers to help clients discover their public IP/port and establish a peer-to-peer path for UDP.
If direct peer-to-peer is blocked, WebRTC falls back to using a relay server (via a TURN server) to carry the data. These mechanisms are crucial for getting audio/video streams through the internet’s obstacles quickly.
Zoom uses its own infrastructure for this, but the concepts (STUN, TURN, ICE) are similar.
In short, the system must establish a fast, direct path for media between participants, using UDP and clever networking tricks to keep latency incredibly low.
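The NAT-traversal handshake begins with a STUN Binding Request. The sketch below builds that request and decodes the XOR-MAPPED-ADDRESS attribute of a Binding Response, following the RFC 5389 wire format; it only constructs and parses packets locally rather than talking to a real STUN server.

```python
import os
import struct

MAGIC_COOKIE = 0x2112A442  # fixed value defined by RFC 5389

def build_binding_request() -> bytes:
    """STUN Binding Request: type 0x0001, zero-length body,
    the magic cookie, and a random 96-bit transaction ID."""
    txn_id = os.urandom(12)
    return struct.pack("!HHI", 0x0001, 0, MAGIC_COOKIE) + txn_id

def decode_xor_mapped_address(attr_value: bytes):
    """Decode the XOR-MAPPED-ADDRESS attribute (type 0x0020) from a
    Binding Response. The port and IPv4 address are XORed with the magic
    cookie so NATs that rewrite literal addresses don't corrupt them."""
    _, family, xport = struct.unpack("!BBH", attr_value[:4])
    assert family == 0x01  # IPv4
    port = xport ^ (MAGIC_COOKIE >> 16)
    xaddr = struct.unpack("!I", attr_value[4:8])[0] ^ MAGIC_COOKIE
    ip = ".".join(str((xaddr >> shift) & 0xFF) for shift in (24, 16, 8, 0))
    return ip, port
```

The decoded address is the client's public IP and port as seen from the internet, which is exactly what a peer needs to attempt a direct UDP connection.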
Coordinating Group Calls (P2P vs SFU vs MCU)
Handling a one-on-one video call is one thing – you can often connect the two clients directly (peer-to-peer) for media exchange if conditions allow.
But what about group calls with 5, 50, or 500 participants?
A naive approach would be a full mesh of peer-to-peer connections (each user sends their video to every other user).
That doesn’t scale at all: in a 10-person call, each user would have to upload 9 copies of their video and download 9 incoming streams, for 90 media flows in total!
Instead, Zoom uses media servers to make group calling efficient.
Zoom’s architecture relies on what’s known as an SFU (Selective Forwarding Unit) architecture.
In simple terms, each participant sends their audio/video stream to a server (let’s call it a media router).
That server then forwards the streams to all other participants in the meeting.
So each user only has to upload one stream (to the server), and the server takes care of copying/distributing it to others.
This approach is far more scalable than a peer mesh.
It’s also lighter than an MCU (Multipoint Control Unit) approach. An MCU actually decodes and mixes all the video streams into one combined picture (like a mosaic) and sends that out; each user then receives a single stream, but at a huge CPU cost on the server (and often lower quality).
Zoom avoids that heavy mixing.
In fact, evidence from network measurements suggests Zoom uses an SFU, not an MCU, for its meetings.
The media server doesn’t transcode everyone’s video together; it simply routes each stream to others.
By skipping the expensive mixing step, an SFU-based system can support many more participants on the same server hardware (one analysis noted Zoom’s design can handle up to 15x more people than a typical MCU approach by offloading work to the clients).
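The routing behavior of an SFU fits in a few lines. This toy sketch (in-memory queues standing in for network sockets, all names invented) shows the key property: the server copies packets to other participants without ever decoding or mixing them.

```python
class SelectiveForwardingUnit:
    """Toy SFU: each participant uploads one stream, and the server
    fans it out to everyone else. No decoding, no mixing, just routing."""

    def __init__(self):
        self.participants = {}  # participant_id -> outbound packet queue

    def join(self, pid):
        self.participants[pid] = []

    def on_media_packet(self, sender_id, packet):
        # Forward the packet untouched to every other participant.
        for pid, queue in self.participants.items():
            if pid != sender_id:
                queue.append((sender_id, packet))
```

Because the forwarding step is just a copy, the per-meeting cost grows with the number of streams routed, not with the cost of video encoding, which is what lets one server host far more participants than an MCU could.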
To coordinate a call, Zoom employs a signaling service (often via WebSockets) that helps set up and manage the call.
When you start or join a meeting, a signaling server authenticates you and tells your app which media server to connect to.
Zoom has the concept of a Meeting Zone – essentially a cluster of servers (one or more Multimedia Routers plus a controller) that will host your meeting.
Once you’re connected to the assigned media router, that server will receive your AV stream and forward it to all other participants’ devices, and vice versa.
If it’s just two participants, sometimes a direct peer-to-peer connection is used instead for efficiency, but generally the Zoom backend tries to optimize so that each meeting’s data flows through an efficient route.
TL;DR for Group Calls
Rather than everyone sending video to everyone (inefficient) or the server mixing it all (expensive), Zoom uses a selective forwarding server.
Each user sends one high-quality stream to the server, and the server selectively forwards streams to each participant.
This keeps bandwidth manageable and the server load reasonable, enabling large Brady Bunch-style meetings without melting down.
Scaling the Video Conferencing Infrastructure
Building one meeting server is nice, but Zoom operates at a massive scale – think millions of meetings and hundreds of millions of participants.
How does the system scale to handle that load while keeping latency low?
Distributed Data Centers
The secret sauce is to deploy servers all over the world.
Zoom has data centers across many regions (reportedly 19 interconnected data centers pre-2020) and uses geolocation to connect each user to the nearest data center.
This dramatically reduces latency – the audio/video only has to travel to a nearby Zoom server, not halfway around the globe, before being distributed to others.
If you’re in London and chatting with someone in New York, ideally Zoom will use a European server for you and a US server for them, and those data centers will relay the call over Zoom’s backbone.
In fact, Zoom’s infrastructure is designed so that it can even deploy on-premise servers for enterprise customers (meeting servers inside a company’s own network) to further reduce latency and comply with security needs.
A central Zone Controller keeps an eye on each cluster’s load and manages connections within that zone.
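As a toy illustration of geo-routing, the sketch below picks the closest region by great-circle distance. The coordinates and region names are made up; a production system would combine GeoDNS/anycast with live load data rather than a static table.

```python
import math

# Hypothetical coordinates (lat, lon) for a few regions.
DATA_CENTERS = {
    "us-east": (39.0, -77.5),
    "eu-west": (53.3, -6.3),
    "ap-southeast": (1.35, 103.8),
}

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def nearest_data_center(user_location):
    return min(DATA_CENTERS,
               key=lambda dc: haversine_km(user_location, DATA_CENTERS[dc]))
```

So a caller in London lands on the European zone while their counterpart in New York lands on the US one, and the two data centers bridge the call over the backbone.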
Cloud Burstability
Zoom primarily uses its own data centers for the heavy lifting of video routing, but it also leverages public cloud providers (like AWS and Oracle Cloud) for flexibility.
Non-real-time services (scheduling meetings, user management, chat, etc.) can run in the cloud, and if Zoom’s servers are nearing capacity (think sudden surge in usage), they can offload some meetings to cloud servers on demand.
This hybrid approach gives Zoom a safety valve to scale quickly.
During peak hours or unexpected spikes (for example, the 2020 pandemic boom), Zoom could spin up extra media servers in AWS/Oracle cloud to handle overflow, then wind them down when demand drops.
The result is a highly elastic system ready to serve hundreds of millions of daily meeting participants.
Load Balancing
At the entrance to Zoom’s backend, load balancers and intelligent routing decide which data center and which server should host your meeting.
The goal is to avoid overloading any single server.
If one region’s servers are full, Zoom can redirect new meetings to another region or the cloud.
Each media router server can only handle so many concurrent streams (since it has to receive and forward for everyone in those meetings), so Zoom runs many of them and distributes meetings among them.
In practice, horizontal scaling (adding more servers) is the name of the game.
As user count grows, Zoom adds more media servers and more datacenters.
This distributed design not only scales capacity but also increases reliability – if one server or data center goes down, others can take over.
The system is built with high availability in mind (calls should continue even if a server fails). Zoom achieves this by isolating meetings in clusters and maintaining backup paths (the TCP/TLS fallback tunnels mentioned earlier) in case the direct UDP path fails.
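A least-loaded assignment with cross-zone spillover might look like the following sketch. The zone and router names are hypothetical, and real placement logic would also weigh latency, reserved capacity, and cloud burst cost.

```python
def assign_meeting(zones):
    """Pick the least-loaded media router in the user's preferred zone;
    if every router there is full, spill over to the next zone (or, in
    Zoom's case, burst into public-cloud capacity).
    `zones` is an ordered list of (zone_name, [(router, load, capacity), ...]),
    with the geographically preferred zone first."""
    for zone_name, routers in zones:
        candidates = [(load / cap, r) for r, load, cap in routers if load < cap]
        if candidates:
            _, router = min(candidates)
            return zone_name, router
    raise RuntimeError("all zones at capacity; burst to cloud")
```

Ranking routers by load ratio rather than absolute load keeps a mix of big and small machines evenly utilized.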
Maintaining Call Quality at Scale
Scaling to millions of users is pointless if call quality degrades.
A standout feature of Zoom’s design is how it maintains good Quality of Service (QoS) for each call, even on subpar networks or crowded meetings.
Here are some techniques used to keep video and audio clear:
Adaptive Bitrate Streaming
The network conditions between each user and the server can vary wildly – one person might be on gigabit fiber, another on weak Wi-Fi or cellular data.
Zoom addresses this by using adaptive bitrate and multi-bitrate encoding.
In essence, the Zoom client and server negotiate the video quality on the fly.
Each participant’s app constantly monitors metrics like packet loss, latency, and available bandwidth.
If your connection slows down, Zoom will automatically reduce the video resolution or frame rate to avoid lagging.
It might drop you from HD to SD temporarily, or switch to an audio-only mode if bandwidth is really low.
When your network improves, Zoom raises the quality back up. This dynamic adjustment happens continuously and is why Zoom calls often remain smooth even if someone’s network gets spotty.
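The control loop can be sketched as a simple AIMD (additive-increase, multiplicative-decrease) rule keyed off packet loss, loosely modeled on how real-time congestion controllers behave. The thresholds and step sizes below are illustrative, not Zoom's actual values.

```python
def adapt_bitrate(current_kbps, loss_rate, min_kbps=60, max_kbps=2500):
    """Back off multiplicatively when the network shows loss,
    probe upward additively when it looks clean."""
    if loss_rate > 0.10:        # heavy loss: cut the rate sharply
        new_rate = current_kbps * 0.5
    elif loss_rate > 0.02:      # mild loss: trim a little
        new_rate = current_kbps * 0.95
    else:                       # clean network: probe for more bandwidth
        new_rate = current_kbps + 50
    return max(min_kbps, min(max_kbps, new_rate))
```

Running this every second or so produces exactly the behavior described above: quality drops fast when your connection degrades and creeps back up as it recovers.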
Multiple Stream Layers
Modern video codecs and Zoom’s architecture allow sending multiple layers of video quality, either as scalable layers within one stream (SVC, scalable video coding) or as parallel streams at different resolutions (simulcast).
For example, your Zoom app might encode your webcam at 1080p, 720p, and 360p simultaneously.
These layers are sent to the server, and the media router can then deliver the appropriate layer to each participant.
So a user on a smartphone with a slow connection might only get the 360p layer, while a user on a desktop with good Wi-Fi gets 720p or 1080p. This is far more efficient than having the server transcode video down for each user.
Zoom’s Multimedia Router (MMR) essentially helps with this selection, ensuring each client receives a stream quality that matches their device and network, thereby maintaining the best possible quality for everyone. It’s a smart balance between quality and performance.
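On the server side, layer selection reduces to a budget check. This sketch assumes three hypothetical simulcast layers and a fixed bandwidth headroom factor; both are made-up illustrations, not Zoom's real encoding ladder.

```python
# Hypothetical simulcast layers the client encodes in parallel,
# ordered highest quality first.
LAYERS = [
    {"name": "1080p", "kbps": 2500},
    {"name": "720p",  "kbps": 1200},
    {"name": "360p",  "kbps": 400},
]

def select_layer(available_kbps, headroom=0.8):
    """Pick the highest layer that fits within a fraction of the
    receiver's estimated bandwidth; fall back to the lowest layer
    so the viewer always gets something."""
    budget = available_kbps * headroom
    for layer in LAYERS:
        if layer["kbps"] <= budget:
            return layer["name"]
    return LAYERS[-1]["name"]
```

The headroom factor leaves slack for audio, screen share, and bandwidth estimation error, so the selected video layer doesn't saturate the link.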
Forward Error Correction (FEC)
To combat inevitable internet hiccups, Zoom uses forward error correction – a technique where extra redundant data is sent so that the receiver can reconstruct lost packets without needing a retransmit.
For example, Zoom might add a little recovery data every so many packets; if one packet drops, your app can repair it using the FEC data. This happens behind the scenes, and it helps reduce audio glitches or video artifacts when network packets go missing.
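The simplest single-loss FEC scheme is XOR parity: send one extra packet that is the byte-wise XOR of a group, and any one lost packet in that group can be rebuilt from the rest. Production systems use more sophisticated codes (e.g. Reed-Solomon variants), but the principle is the same.

```python
def xor_parity(packets):
    """Build one parity packet as the byte-wise XOR of a group of
    equal-length media packets."""
    parity = bytearray(len(packets[0]))
    for pkt in packets:
        for i, byte in enumerate(pkt):
            parity[i] ^= byte
    return bytes(parity)

def recover_lost(received, parity):
    """If exactly one packet of the group was lost, XORing the parity
    with every received packet reproduces the missing one."""
    missing = bytearray(parity)
    for pkt in received:
        for i, byte in enumerate(pkt):
            missing[i] ^= byte
    return bytes(missing)
```

The cost is a fixed bandwidth overhead (one parity packet per group), traded for never having to pause and wait for a retransmission.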
Application-Level QoS
Zoom even implements some QoS control at the application level (in the Zoom client itself) rather than relying purely on network routers.
This means the app is smart about prioritizing audio over video (so your voice doesn’t cut out), managing congestion, and adapting encoding settings in real time to match the network conditions of your device.
All these optimizations ensure that even as a call scales up in participants or someone’s network link degrades, the overall meeting experience remains as clear and uninterrupted as possible.
Conclusion
Designing a video conferencing system like Zoom is a balancing act between speed, scalability, and quality.
We need real-time performance (using UDP and WebRTC tricks for low latency), a robust architecture for group calls (media servers with an SFU approach to route streams), and global scaling (distributed servers, cloud fallback) – all while dynamically maintaining high call quality through adaptive techniques.
Zoom’s architecture shows how it’s done: a distributed network of multimedia routers coordinates huge calls, and smart software ensures you see and hear everyone with minimal lag.
For those preparing for a system design interview, “Design Zoom or a video chat system” is a popular question.
If you walk through the points we discussed – from using UDP/WebRTC for real-time streaming, to employing media servers (SFUs) for multiparty calls, to scaling out servers worldwide – you’ll cover the key aspects interviewers look for.
Understanding these concepts not only helps in interviews but also gives an appreciation for the engineering behind everyday tools like Zoom.
FAQs (Frequently Asked Questions)
Q1: How does Zoom achieve low latency in video calls?
Zoom achieves low latency by using UDP for most audio/video data instead of TCP. UDP doesn’t wait for lost packets to be re-sent, so it keeps the conversation flowing with minimal delay. Zoom also connects users to the nearest data center and uses efficient codecs, so the travel time for data is short and real-time interaction feels smooth.
Q2: Does Zoom use peer-to-peer or servers for its video conferencing?
For one-on-one calls, Zoom can use a peer-to-peer connection if possible. However, for group meetings Zoom relies on servers. It uses a Selective Forwarding Unit (SFU) architecture, where each participant sends their stream to a server that then forwards it to others. This server-based approach scales much better than a pure peer-to-peer mesh for large calls.
Q3: How does Zoom maintain video quality on slow networks?
Zoom dynamically adjusts to your network. It uses adaptive bitrate streaming, meaning if your internet is slow or experiencing packet loss, Zoom will reduce the video resolution or frame rate to prevent buffering. It also sends multiple quality layers for each video (high, medium, low) and can drop to a lower layer as needed. Techniques like forward error correction add redundancy to fix lost data without re-sending. All of this helps Zoom calls stay clear and smooth, even on less-than-ideal connections.