How Zoom and Google Meet Actually Work: Real-Time Communication and WebRTC
Click a Zoom or Google Meet link and your laptop and camera quietly walk into a meeting room. You end up face to face with people you have never met, and the conversation flows almost without a hitch. It is easy to forget how recent this is. Ten years ago, video conferencing meant installing a separate program, fiddling with camera settings, and still getting cut off mid-sentence.
This post walks through what video conferencing actually does behind that one short click, and explains WebRTC, the standard that codified that work.
Three things video conferencing always does
A video conferencing tool ultimately does three things at the same time. It captures video and audio from the camera and microphone, it encodes that signal — slicing and compressing it so a network can carry it — and it transmits the result fast enough to reach the other side.
A single uncompressed frame runs from several megabytes at 1080p to tens of megabytes at 4K. Send those directly and an ordinary connection gives up within a second. So video gets squeezed through codecs like H.264, VP9, or AV1 until it is a few hundred times smaller than the original, and those compressed frames are delivered to the other side roughly thirty times a second.
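To put rough numbers on that, here is a small back-of-the-envelope sketch. It assumes 3 bytes per pixel (plain 24-bit color) and an encoded stream of about 3 Mbit/s; both figures are illustrative assumptions, not measurements from any particular service.

```python
# Back-of-the-envelope sizes for uncompressed video.
# Assumption: 3 bytes per pixel (24-bit color, no chroma subsampling).

def raw_frame_bytes(width: int, height: int, bytes_per_pixel: float = 3.0) -> int:
    """Size of one uncompressed frame in bytes."""
    return int(width * height * bytes_per_pixel)


def raw_bitrate_mbps(width: int, height: int, fps: int,
                     bytes_per_pixel: float = 3.0) -> float:
    """Bitrate of an uncompressed stream in megabits per second."""
    return raw_frame_bytes(width, height, bytes_per_pixel) * 8 * fps / 1_000_000


frame_1080p = raw_frame_bytes(1920, 1080)   # about 6.2 MB per frame
frame_4k = raw_frame_bytes(3840, 2160)      # about 24.9 MB per frame
raw_mbps = raw_bitrate_mbps(1920, 1080, 30) # about 1,493 Mbit/s at 30 fps

# Hypothetical encoded bitrate for a 1080p call; real values vary widely.
ASSUMED_ENCODED_MBPS = 3.0
ratio = raw_mbps / ASSUMED_ENCODED_MBPS     # compression factor of roughly 500

print(f"1080p frame: {frame_1080p / 1e6:.1f} MB, raw stream: {raw_mbps:.0f} Mbit/s")
print(f"Compression factor at {ASSUMED_ENCODED_MBPS} Mbit/s: about {ratio:.0f}x")
```

Under those assumptions, uncompressed 1080p video at thirty frames a second needs well over a gigabit per second, which is why the codec step is not optional.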
Why “real time” is hard
The toughest constraint in video conferencing is not picture quality or audio quality. It is latency. A movie can arrive a minute late and you simply press play from the start. A meeting falls apart the moment someone’s words reach you even a second late. As a rough rule, a one-way delay under about 100–200 ms feels fine, and beyond 400 ms people start talking over each other.
Inside that tiny window the camera has to grab a frame, the encoder has to compress it, the internet has to carry it across to the other side, and the decoder has to unpack it and draw it on the screen. So conferencing codecs deliberately trade a bit of picture quality for being fast to compress and decode. Movie codecs strike the opposite balance.
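A latency budget makes that window concrete. The sketch below adds up the four stages named above and compares the total with the 200 ms and 400 ms rules of thumb from this post. The per-stage numbers are hypothetical placeholders chosen for illustration; real values depend on hardware, codec settings, and distance.

```python
# A toy one-way latency budget. Stage timings are assumed values, not measurements.
ASSUMED_STAGE_MS = {
    "capture": 33,          # roughly one frame interval at 30 fps
    "encode": 10,
    "network": 80,
    "decode_and_draw": 20,
}

FEELS_FINE_MS = 200   # upper end of the "feels fine" range in the text
TALK_OVER_MS = 400    # beyond this, people start talking over each other


def classify(total_ms: int) -> str:
    """Map a one-way delay onto the rough comfort bands described in the text."""
    if total_ms <= FEELS_FINE_MS:
        return "comfortable"
    if total_ms <= TALK_OVER_MS:
        return "noticeable"
    return "people talk over each other"


total_ms = sum(ASSUMED_STAGE_MS.values())
print(f"one-way delay: {total_ms} ms -> {classify(total_ms)}")
```

The point of the exercise is that every stage spends from the same small budget, so a slow encoder leaves less room for a long network path.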
WebRTC: the agreement that lets browsers talk directly
For a long time, video conferencing meant downloading a dedicated program. Today a Zoom or Google Meet link in a browser drops you straight into the meeting, and behind that convenience sits a standard called WebRTC. You can think of WebRTC as “the agreement that lets browsers exchange video, audio, and data directly with each other.” It bundles the pieces you need — asking for camera permission, capturing video, compressing it, sending it to the other browser — into a single standard.
Before WebRTC, every conferencing company shipped its own protocol and its own client. WebRTC made the media-handling parts inside browsers the same, but how a call is set up and who is in it is still left to each service, which is why Zoom and Meet still cannot call each other today.
In transit, the video and audio are automatically encrypted to block eavesdropping. The same encryption idea, applied across the wider web, is covered in The Padlock in the Address Bar — What HTTPS Actually Protects.
Why servers are still in the picture
“Browsers talk directly” makes it sound like there are no servers involved, but in practice there always are. Connect a laptop to Wi-Fi and other devices on the same network can see it right away, but a friend’s laptop on a different office’s Wi-Fi cannot, because both machines are tucked behind routers that hide their local addresses. So before a meeting can start, something has to announce that the two of you are in the same room and pass each other’s addresses along. That role is played by the signaling server.
After that, two more servers show up — STUN and TURN. STUN acts as a mirror that tells you “this is the address you appear to have on the open internet.” TURN steps in when two browsers cannot reach each other directly even after the introductions, and quietly relays the video through itself. In environments with strict corporate firewalls, the share of calls that end up routed through TURN is higher than people expect.
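One way to see why TURN is the last resort is the ranking used by ICE, the procedure WebRTC relies on to choose a path. The sketch below uses the priority formula and the recommended type preferences from the ICE specification (RFC 8445). It is a simplification: real ICE ranks pairs of candidates and tests each pair for connectivity, and `pick_path` is a hypothetical helper written for this post.

```python
# Simplified ICE candidate ranking, following the RFC 8445 priority formula.
# "host"  = the device's own local address
# "srflx" = the public address learned from a STUN server
# "relay" = an address on a TURN server that forwards the media
TYPE_PREFERENCE = {"host": 126, "srflx": 100, "relay": 0}


def candidate_priority(candidate_type: str,
                       local_preference: int = 65535,
                       component_id: int = 1) -> int:
    """priority = 2^24 * type_pref + 2^8 * local_pref + (256 - component_id)."""
    return ((TYPE_PREFERENCE[candidate_type] << 24)
            + (local_preference << 8)
            + (256 - component_id))


def pick_path(reachable_types: list[str]) -> str:
    """Hypothetical helper: choose the best candidate type that actually works."""
    return max(reachable_types, key=candidate_priority)


# Behind a strict firewall only the relayed path may survive the checks.
print(pick_path(["host", "srflx", "relay"]))
print(pick_path(["srflx", "relay"]))
print(pick_path(["relay"]))
```

Relayed candidates get the lowest type preference because every packet takes a detour through the TURN server, which costs both latency and server bandwidth.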
Structure changes once the group grows
A one-to-one call is straightforward — the two browsers connect, and that is it. The moment a meeting has five people, things shift. Each person has to send their own video to four other people, so the upload load on every single participant grows in proportion to the headcount. By eight or ten people, a normal home internet connection already struggles.
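The growth is easy to tabulate. This sketch assumes every outgoing video stream costs about 1.5 Mbit/s, a made-up round number for illustration, and counts how much each participant must upload when everyone sends directly to everyone else.

```python
# Upload cost per participant when every browser sends directly to every other.
# Assumption: each outgoing video stream costs about 1.5 Mbit/s.
ASSUMED_STREAM_MBPS = 1.5


def mesh_upload_streams(participants: int) -> int:
    """Each participant sends one copy of their video to everyone else."""
    return max(participants - 1, 0)


def mesh_upload_mbps(participants: int,
                     per_stream_mbps: float = ASSUMED_STREAM_MBPS) -> float:
    """Total upload bandwidth one participant needs in a full mesh."""
    return mesh_upload_streams(participants) * per_stream_mbps


for n in (2, 5, 8, 10):
    print(f"{n:>2} people: {mesh_upload_streams(n)} uploads each, "
          f"{mesh_upload_mbps(n):.1f} Mbit/s per person")
```

Under that assumption a ten-person call asks each participant for 13.5 Mbit/s of sustained upload, which is more than many home connections can offer.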
That is why most conferencing services place a media server called an SFU (Selective Forwarding Unit) in the middle. Each participant uploads their video to the SFU once, and the SFU selectively forwards it to the others. It skips the video of people whose camera is off, and sends low-quality versions to the people whose tiles are small on screen, to save data. Zoom, Google Meet, and Teams are ultimately each company’s client and meeting features layered on top of an SFU architecture.
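The forwarding decision can be sketched as a small function. Everything here is hypothetical and written for illustration: the `Publisher` class, the 360-pixel threshold, and the "high"/"low" layer names are assumptions, not any vendor's actual logic.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed cutoff: tiles at least this tall get the high-quality version.
ASSUMED_LARGE_TILE_PX = 360


@dataclass
class Publisher:
    name: str
    camera_on: bool


def choose_layer(publisher: Publisher, tile_height_px: int) -> Optional[str]:
    """Decide which version of a publisher's video to forward, if any."""
    if not publisher.camera_on or tile_height_px <= 0:
        return None  # nothing to forward
    return "high" if tile_height_px >= ASSUMED_LARGE_TILE_PX else "low"


def forwarding_plan(publishers: list[Publisher], viewer: str,
                    tile_heights: dict[str, int]) -> dict[str, Optional[str]]:
    """What the SFU would send to one viewer, keyed by publisher name."""
    return {
        p.name: choose_layer(p, tile_heights.get(p.name, 0))
        for p in publishers
        if p.name != viewer  # never send a viewer their own video back
    }


people = [Publisher("Ana", True), Publisher("Ben", False),
          Publisher("Chloe", True), Publisher("Dev", True)]
plan = forwarding_plan(people, viewer="Dev",
                       tile_heights={"Ana": 720, "Ben": 180, "Chloe": 180})
print(plan)
```

Each participant still uploads only once; the per-viewer choices happen on the server, which is what keeps the upload cost flat as the meeting grows.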
A coordination you do not see
In the tiny moment between clicking a link and finding yourself in a meeting, the camera, codec, signaling, STUN, TURN, and SFU shake hands in sequence. From the user’s perspective it is one click, but behind that click several kinds of servers and standards are coordinating step by step. Next time a call drops out or the picture quality dips, do not blame the camera right away — it is worth wondering whether somewhere along that chain a path has narrowed.