System Design Library

Video Calling (Zoom)

Low-latency multi-party audio/video calls.


Requirements

Functional

Join room
Audio/video streams
Screen share
Mute/active speaker

Non-functional

<150ms latency
Scales with participants

Scale

Large meetings/webinars

The approach

WebRTC for media; an SFU (Selective Forwarding Unit) receives each participant's stream once and forwards to others (vs mesh which is O(n²)); signaling server for setup; simulcast for adaptive quality.

Key components

Signaling server · SFU media servers · TURN for NAT

Numbers that matter

An SFU for a 1,000-person webinar receives 1 video stream from the presenter and forwards it to ~1,000 subscribers. The presenter's uplink carries one stream instead of the 999 copies a peer-to-peer mesh would require, so the fan-out cost moves from the presenter's connection to the server.
Zoom targets audio end-to-end latency under 150ms (one-way) for conversational quality; above 300ms one-way, users begin talking over each other.
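Video encoding at 720p30 with H.264 consumes roughly 1-3 Mbps per stream; a 25-person meeting with all cameras on sends 25-75 Mbps into a single SFU node. Outbound traffic can be many times that, because each stream may be forwarded to up to 24 receivers; simulcast and tile limits keep the real figure well below that worst case.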
Opus audio codec at 48 kHz / 64 kbps is Zoom's default; at 20ms packet intervals, packet loss recovery via FEC adds roughly 30-50% overhead to the audio bitrate.
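
The figures above can be checked with a few lines of arithmetic. This is a rough sketch, assuming every stream is forwarded at full bitrate to every receiver (a worst case that simulcast normally avoids); the function name and the 2 Mbps midpoint are illustrative choices, not values from any Zoom documentation.

```python
def sfu_bandwidth_mbps(senders: int, receivers_per_stream: int,
                       mbps_per_stream: float) -> tuple[float, float]:
    """Return (ingress, egress) in Mbps for an SFU that forwards every
    stream at full bitrate to every receiver (worst case, no simulcast)."""
    ingress = senders * mbps_per_stream
    egress = senders * receivers_per_stream * mbps_per_stream
    return ingress, egress


# 25-person meeting, all cameras on, 2 Mbps each, 24 receivers per stream.
meeting_in, meeting_out = sfu_bandwidth_mbps(25, 24, 2.0)

# 1,000-person webinar with a single presenter.
webinar_in, webinar_out = sfu_bandwidth_mbps(1, 1000, 2.0)

print(meeting_in, meeting_out)   # 50.0 1200.0
print(webinar_in, webinar_out)   # 2.0 2000.0
```

The gap between ingress and egress is why layer selection and tile limits matter far more for SFU capacity than the raw number of senders.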

Senior deep-dive

The SFU (Selective Forwarding Unit) is the scaling breakthrough — instead of each participant sending video to every other participant (O(n²) streams), each sends once to the SFU, which forwards selectively; this makes large meetings feasible.

Simulcast is the quality mechanism — each sender encodes at multiple bitrates (e.g. 1080p, 360p, 180p); the SFU selects which layer to forward to each receiver based on their bandwidth, so bad connections degrade gracefully without affecting others.

Signaling and media are deliberately separated: signaling (who's in the call, mute state, screen share) goes over WebSocket/HTTPS and can tolerate 100-500ms latency; media (audio/video) goes over SRTP on UDP, where low latency and fast recovery from packet loss are the primary concerns.

SFU vs MCU vs mesh: the architecture choice

In a mesh (pure WebRTC P2P), each participant sends N-1 video streams and receives N-1. Per-client CPU and bandwidth grow as O(n) and the total stream count as O(n²), which in practice breaks at ~5 participants. An MCU (Multipoint Control Unit) decodes every incoming stream, composites them, and re-encodes a single stream per participant: client bandwidth stays low, but server-side CPU scales with N and the decode/encode step adds latency. The SFU receives each stream once and forwards it without decoding: server CPU per stream is near-constant, and although a client can receive up to N-1 streams, it only needs to decode the visible ones (active speaker detection limits rendering to 4-9 tiles in practice).
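
The comparison can be made concrete by counting streams per topology. The sketch below is a simplified model under the assumptions in this section (every participant sends one camera stream, the MCU produces one composite per participant); the `StreamLoad` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamLoad:
    client_up: int       # streams each client uploads
    client_down: int     # streams each client downloads
    server_decodes: int  # streams the server decodes (0 if it only forwards)
    server_encodes: int  # streams the server encodes


def mesh(n: int) -> StreamLoad:
    # No server: every client sends to and receives from all n-1 peers.
    return StreamLoad(n - 1, n - 1, 0, 0)


def mcu(n: int) -> StreamLoad:
    # Server decodes all n inputs and encodes one composite per participant.
    return StreamLoad(1, 1, n, n)


def sfu(n: int) -> StreamLoad:
    # Server forwards packets without decoding; clients receive up to n-1.
    return StreamLoad(1, n - 1, 0, 0)


def mesh_total_streams(n: int) -> int:
    # Directed peer-to-peer streams across the whole call: O(n^2).
    return n * (n - 1)


for n in (5, 25):
    print(n, mesh(n), mcu(n), sfu(n), mesh_total_streams(n))
```

At 25 participants the mesh needs 600 directed streams in total and 24 uploads per client, while the SFU keeps each client at a single upload.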

Simulcast: quality without negotiation

Each sender encodes the same video at 3 resolutions (e.g. 1080p, 360p, 180p) and transmits all three simulcast layers to the SFU. The SFU forwards to each receiver the layer that fits its available downlink bandwidth (estimated via RTCP feedback). A receiver on a bad connection gets the lowest layer; a widescreen desktop viewer on fiber gets the highest. Layer switching is fast (it takes effect at the next keyframe of the target layer) and transparent to the receiver, with no renegotiation required. The sender's CPU cost is ~1.4× a single encode rather than 3×, largely because the lower-resolution layers contain far fewer pixels.
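
A minimal sketch of the per-receiver layer choice described above. The per-layer bitrates and the 80% headroom factor are assumptions chosen for illustration; a real SFU would take its bandwidth estimate from a congestion controller fed by RTCP feedback.

```python
from typing import NamedTuple


class Layer(NamedTuple):
    name: str
    kbps: int


# Hypothetical simulcast ladder, highest quality first.
LAYERS = [Layer("1080p", 3000), Layer("360p", 600), Layer("180p", 150)]


def select_layer(estimated_downlink_kbps: float, headroom: float = 0.8) -> Layer:
    """Pick the best layer that fits within a fraction of the estimated
    downlink, falling back to the lowest layer when nothing fits."""
    budget = estimated_downlink_kbps * headroom
    for layer in LAYERS:
        if layer.kbps <= budget:
            return layer
    return LAYERS[-1]


print(select_layer(5000).name)  # 1080p
print(select_layer(1000).name)  # 360p
print(select_layer(100).name)   # 180p
```

Keeping some headroom below the estimate leaves room for audio and for estimate error, at the cost of occasionally picking a lower layer than the link could carry.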

Audio: the hardest latency budget

Audio has a tighter tolerance than video: humans detect audio/video desync above ~80ms and find conversations unnatural once one-way delay exceeds ~150ms. Zoom uses Opus with 20ms packets and NACK + FEC for loss recovery: FEC embeds a redundant, lower-bitrate copy of the previous frame in each packet, so a single lost packet can be recovered without a round-trip retransmission. Jitter buffers (typically 20-60ms, adaptive) absorb network jitter at the cost of added latency; the balance between jitter tolerance and latency is the key tuning parameter for each network condition.
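
One way to picture an adaptive jitter buffer is sketched below. The jitter estimate follows the running interarrival-jitter formula from RFC 3550 (J += (|D| - J) / 16); the mapping from jitter to buffer depth (a 3× multiplier clamped to the 20-60ms range mentioned above) is an assumption for illustration, not a documented Zoom algorithm.

```python
class JitterEstimator:
    """Running interarrival jitter in the style of RFC 3550, in milliseconds."""

    def __init__(self) -> None:
        self.jitter_ms = 0.0
        self._prev_transit = None

    def update(self, send_ms: float, recv_ms: float) -> float:
        transit = recv_ms - send_ms
        if self._prev_transit is not None:
            d = abs(transit - self._prev_transit)
            # Exponential smoothing with gain 1/16, as in RFC 3550.
            self.jitter_ms += (d - self.jitter_ms) / 16.0
        self._prev_transit = transit
        return self.jitter_ms


def target_buffer_ms(jitter_ms: float, multiplier: float = 3.0,
                     lo: float = 20.0, hi: float = 60.0) -> float:
    """Hypothetical policy: buffer a multiple of the jitter, clamped to [lo, hi]."""
    return max(lo, min(hi, multiplier * jitter_ms))


est = JitterEstimator()
est.update(0, 50)     # first packet: transit 50ms, no jitter sample yet
est.update(20, 70)    # same transit: jitter stays 0
est.update(40, 106)   # transit jumps by 16ms: jitter becomes 1.0
print(est.jitter_ms, target_buffer_ms(est.jitter_ms))  # 1.0 20.0
```

A larger multiplier tolerates burstier networks but adds delay directly to the one-way latency budget.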

Signaling plane: coordination without media coupling

Meeting state (participants, mute flags, screen share token, chat) is managed by a signaling cluster separate from media servers. Clients connect to signaling via WebSocket; the signaling server pushes participant list updates, mute events, and hand-raises. This separation means a media server failover doesn't require re-negotiating meeting state — clients reconnect media to the backup SFU while the signaling connection stays alive. Signaling state is replicated across signaling nodes for the meeting, with one leader authoritative for serializing events.
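
The leader's role in serializing events can be sketched as a sequence-numbered log. This is an illustrative model only: the class, event names, and catch-up method are hypothetical, and replication to follower nodes is left out.

```python
from dataclasses import dataclass, field

VALID_EVENTS = {"join", "leave", "mute", "unmute"}


@dataclass
class MeetingLeader:
    """Authoritative node for one meeting: assigns a total order to events."""
    seq: int = 0
    muted: dict = field(default_factory=dict)  # participant -> mute flag
    log: list = field(default_factory=list)    # (seq, kind, participant)

    def apply(self, kind: str, participant: str) -> int:
        if kind not in VALID_EVENTS:
            raise ValueError(f"unknown event: {kind}")
        self.seq += 1
        if kind == "join":
            self.muted[participant] = False
        elif kind == "leave":
            self.muted.pop(participant, None)
        elif participant in self.muted:
            self.muted[participant] = (kind == "mute")
        self.log.append((self.seq, kind, participant))
        return self.seq

    def events_since(self, last_seen_seq: int) -> list:
        # A client whose media reconnected can catch up without a full resync.
        return [e for e in self.log if e[0] > last_seen_seq]


leader = MeetingLeader()
leader.apply("join", "alice")
leader.apply("join", "bob")
leader.apply("mute", "bob")
print(leader.muted)            # {'alice': False, 'bob': True}
print(leader.events_since(2))  # [(3, 'mute', 'bob')]
```

Because media servers never appear in this state, an SFU failover leaves the log and the client's position in it untouched.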

End-to-end encryption: the SFU compromise

Standard Zoom encryption (AES-256 per hop) lets the SFU decrypt and re-encrypt to inspect/forward packets — the SFU has access to the media. End-to-end encryption (E2EE) requires the SFU to forward encrypted packets without decrypting them, which means the SFU loses the ability to do server-side audio mixing (it can't hear what to mix) and active speaker detection (it can't analyze audio energy). Zoom E2EE works by having clients signal which spatial/temporal layer to subscribe to, and the SFU forwards blindly — at the cost of these server-side features.

What breaks at scale

SFU CPU saturation: all-hands meetings with many simultaneous screen shares (high-resolution, high-bitrate) can overload the SFU until it drops packets instead of forwarding them, which shows up as frozen video for all participants.
Network path asymmetry: a participant with good download but poor upload receives fine but can't send video; naive diagnostics blame Zoom when the bottleneck is the participant's ISP uplink.
Reconnect storms: a rolling SFU upgrade during peak hours forces every meeting on the upgraded nodes to reconnect at once; staggered rollouts with meeting-aware drain are required.
Clock drift: drift across SFU nodes corrupts RTP timestamp sequences, causing audio/video desync that is extremely difficult to diagnose.
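
A staggered, meeting-aware drain could be planned roughly as follows. This is a sketch under assumed inputs (a map of node name to active meeting count) with hypothetical function names; it only orders the rollout and does not model how new meetings are steered away from draining nodes.

```python
def plan_drain_batches(active_meetings: dict, max_fraction: float = 0.1) -> list:
    """Group SFU nodes into upgrade batches, emptiest nodes first, so that
    at most max_fraction of the fleet is draining at any one time."""
    nodes = sorted(active_meetings, key=lambda n: (active_meetings[n], n))
    batch_size = max(1, int(len(nodes) * max_fraction))
    return [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]


def can_restart(meetings_on_node: int) -> bool:
    # Meeting-aware: restart only once the node's meetings have ended or moved.
    return meetings_on_node == 0


fleet = {"sfu-a": 5, "sfu-b": 0, "sfu-c": 2, "sfu-d": 9}
print(plan_drain_batches(fleet, max_fraction=0.5))
# [['sfu-b', 'sfu-c'], ['sfu-a', 'sfu-d']]
```

Upgrading the emptiest nodes first gives the busiest nodes the longest time for their meetings to finish naturally.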

In production

Zoom runs a global network of SFU data centers (called Media Servers internally) co-located with ISP peering points to minimize Internet transit latency. The signaling cluster (ZooKeeper-backed for membership) assigns each meeting to a media server cluster based on geography and load. The real challenge at scale is cascading SFU failure — if a media server dies mid-meeting, all participants must reconnect to a new SFU simultaneously, causing a reconnect storm. Zoom mitigates this with SFU-to-SFU cascading (a backup SFU receives from the primary and can take over with minimal disruption). Audio mixing for large calls (all-hands with many unmuted speakers) is CPU-intensive; Zoom uses VAD (Voice Activity Detection) to only mix active speakers.
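
The idea of mixing only active speakers can be sketched as a filter plus a top-N pick. Real VAD uses more than an energy threshold; the threshold, the cap of three speakers, and the function name here are all assumptions for illustration.

```python
def speakers_to_mix(levels_dbfs: dict, vad_threshold_dbfs: float = -50.0,
                    max_speakers: int = 3) -> list:
    """Return the loudest participants whose audio level exceeds a simple
    energy threshold (a stand-in for real voice activity detection)."""
    active = [(level, pid) for pid, level in levels_dbfs.items()
              if level > vad_threshold_dbfs]
    # Loudest first; break ties by participant id for a stable result.
    active.sort(key=lambda item: (-item[0], item[1]))
    return [pid for _, pid in active[:max_speakers]]


levels = {"a": -20, "b": -70, "c": -30, "d": -25, "e": -45}
print(speakers_to_mix(levels))  # ['a', 'd', 'c']
```

Capping the mix keeps the cost roughly constant regardless of how many participants are unmuted.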

Common mistakes

Mesh topology for large calls
Transcoding every stream (use SFU forwarding)
Ignoring NAT/TURN
