How Do Siri and Alexa Understand What I Say? From Speech to Text
Say “Hey Siri” and your phone wakes up and tells you the weather. For you it takes one breath; for the machine it’s a relay: turn sound waves into letters, read intent from the letters, compose an answer, and turn it back into a voice. In this post, I’ll follow that relay leg by leg, and use the structure to answer two common reactions: the worry that “it’s always listening, isn’t it?” and the complaint of “why does it sometimes hear me so wrong?”
Leg 1: A small ear that waits only for the wake word
A voice assistant’s first secret is that it has two ears. What’s switched on all the time is the small ear dedicated to the wake word. A tiny recognizer running entirely on the device waits for exactly one fixed sound pattern, like “Hey Siri” or “Alexa”. At this stage it doesn’t understand your conversation, and it doesn’t send it anywhere. Sound that doesn’t match the pattern is discarded on the spot.
Only when the wake word is detected does the big ear open. This is the stage that records what you say next and hands it to full recognition; usually an indicator appears on the screen from this point, and much of the processing moves to a server. So “it’s always listening” is true of the small ear and false of the big one. That said, the small ear misfires now and then, waking up on its own to a TV in the background. A detector that has to react to one short sound pattern will occasionally react to something that merely resembles it, so this side effect is hard to eliminate.
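To make the two-ear idea concrete, here is a minimal sketch in Python. It is an illustration under loud assumptions, not how Siri or Alexa is actually built: frames are plain strings instead of audio, and `gate_audio`, `toy_score`, the threshold, and the window sizes are all hypothetical names and values invented for this example.

```python
from collections import deque
from typing import Callable, Iterable, List


def gate_audio(
    frames: Iterable[str],
    wake_score: Callable[[List[str]], float],
    threshold: float = 0.8,
    window: int = 2,
    capture: int = 3,
) -> List[str]:
    """Return only the frames that follow a detected wake word."""
    # Small ear: a short rolling buffer. Older frames fall out and are gone.
    recent: deque = deque(maxlen=window)
    captured: List[str] = []
    remaining = 0  # frames still to hand to the big ear
    for frame in frames:
        if remaining > 0:
            # Big ear is open: keep this frame for full recognition.
            captured.append(frame)
            remaining -= 1
            continue
        recent.append(frame)
        if wake_score(list(recent)) >= threshold:
            remaining = capture
            recent.clear()
    return captured


def toy_score(frames: List[str]) -> float:
    # Hypothetical detector: fires only on the exact pattern "hey siri".
    return 1.0 if frames == ["hey", "siri"] else 0.0


stream = ["tv", "noise", "hey", "siri", "umbrella", "tomorrow", "?", "more", "chatter"]
print(gate_audio(stream, toy_score))  # ['umbrella', 'tomorrow', '?']
```

Everything before the wake word, and everything after the capture window closes, never reaches the returned list. A misfire in this picture is simply `wake_score` crossing the threshold on sound that only resembles the pattern.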
Leg 2: Turning sound into letters
What comes in through the big ear isn’t words but vibrations in the air — a waveform. The technology that turns this into text is speech recognition (STT, Speech-to-Text). It slices your speech into tiny segments, estimates by probability which sound each segment is closest to, and stitches those pieces into the most plausible sentence.
The key word is probability. The machine doesn’t “hear and know” the sound; it picks the most plausible candidate among possible sentences. That’s why it wavers between phrases that sound alike. The classic example is “recognize speech” versus “wreck a nice beach”: nearly the same sounds, completely different sentences. Two things raise the odds of an off-the-wall sentence getting picked: more close-sounding candidates (a person’s name versus a common noun, say), and noisier surroundings that blur the sound itself. When a voice assistant seems to turn dumb in a loud car, the problem isn’t bad hearing so much as a harder choice among candidates.
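One common textbook way to write “pick the most plausible candidate” is the formula below. It is a simplification that this post does not depend on, and modern end-to-end recognizers only roughly follow it. The symbols are assumptions introduced here: X is the recorded audio and W is a candidate sentence.

```latex
\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W)
```

P(X | W) says how well the sounds match the candidate, and P(W) says how plausible the candidate is as a sentence. The second equality holds because P(X) is the same for every candidate and drops out of the comparison. Noise makes P(X | W) nearly equal across similar-sounding candidates, so a small difference can tip the choice the wrong way.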
Context is the referee in this probability fight. After “set an alarm for,” the sound “ate” reads far more plausibly as “eight”. Much of why recognizers have gotten better in recent years comes from using context more deeply like this.
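Here is a toy sketch of context acting as referee. All the numbers are made up for illustration, and `ACOUSTIC`, `CONTEXT`, and `pick_word` are hypothetical names, not part of any real recognizer. The sound alone cannot separate “ate” from “eight”, so the preceding words decide.

```python
import math

# Hypothetical acoustic scores: the two words sound identical, so they tie.
ACOUSTIC = {"ate": 0.5, "eight": 0.5}

# Hypothetical context scores: how likely each word is after the previous two.
CONTEXT = {
    ("alarm", "for"): {"eight": 0.30, "ate": 0.001},
    ("i", "already"): {"ate": 0.20, "eight": 0.002},
}


def pick_word(previous, candidates, floor=1e-6):
    """Pick the candidate with the best combined sound + context score."""

    def score(word):
        # Unknown contexts fall back to a tiny floor probability.
        context_p = CONTEXT.get(previous, {}).get(word, floor)
        # Adding logs is the same as multiplying the two probabilities.
        return math.log(ACOUSTIC[word]) + math.log(context_p)

    return max(candidates, key=score)


print(pick_word(("alarm", "for"), ["ate", "eight"]))   # eight
print(pick_word(("i", "already"), ["ate", "eight"]))   # ate
```

The same sound comes out as a different word depending on what came before it, which is the whole job of the referee.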
Leg 3: Reading intent, executing, and answering out loud
Getting the text isn’t the finish line. For “do I need an umbrella tomorrow?”, what the machine actually has to do is look up the weather. A step is needed that pulls the intent (a weather question) and the ingredients (tomorrow, current location) out of the sentence and turns them into an executable command.
Older voice assistants were weak at this step. Step outside the pre-registered command patterns and you got “I’m not sure I understand”, a structure close to the rule-based chatbots we saw earlier. This is exactly where the recent change happened. Now that LLMs, which generate text by predicting the next word by probability, have moved into this slot, a much wider range of off-script phrasing still gets its intent read.
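The older, pattern-based style can be sketched in a few lines. This is an illustration only: the two patterns, the `Command` type, and the `parse` and `reply` helpers are hypothetical, and no real assistant is claimed to work exactly this way. It shows both halves of the story: pulling an intent and its ingredients out of a sentence, and falling over as soon as the phrasing leaves the script.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Command:
    intent: str
    slots: Dict[str, str] = field(default_factory=dict)


# Hypothetical pre-registered patterns: (regex, intent name).
PATTERNS = [
    (re.compile(r"\bweather\b.*\b(?P<day>today|tomorrow)\b"), "get_weather"),
    (re.compile(r"\bset an alarm for (?P<time>\w+)\b"), "set_alarm"),
]


def parse(text: str) -> Optional[Command]:
    """Return the first matching command, or None if nothing matches."""
    lowered = text.lower()
    for pattern, intent in PATTERNS:
        match = pattern.search(lowered)
        if match:
            return Command(intent, match.groupdict())
    return None


def reply(text: str) -> str:
    command = parse(text)
    if command is None:
        return "I'm not sure I understand"
    return f"{command.intent} {command.slots}"


print(reply("What's the weather tomorrow?"))      # get_weather {'day': 'tomorrow'}
print(reply("Do I need an umbrella tomorrow?"))   # I'm not sure I understand
```

The umbrella question is a weather question to any human, but it contains no word the patterns know. That gap is the slot a language model now fills.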
Once the intent is clear, execution happens (a weather API lookup, setting an alarm, sending a message), and finally the answer sentence is synthesized into a human voice (TTS, Text-to-Speech) and spoken back. The increasingly human sound of synthetic voices is another advance from the same era.
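As a rough sketch of this last handoff, the code below routes an intent to a handler and passes the answer to a stand-in for speech synthesis. Every name in it is hypothetical, the handlers are stubs that call no real service, and `synthesize` returns encoded text rather than audio.

```python
from typing import Callable, Dict


def get_weather(slots: Dict[str, str]) -> str:
    # Stub: a real assistant would query a weather service here.
    return f"Rain is likely {slots['day']}, so take an umbrella."


def set_alarm(slots: Dict[str, str]) -> str:
    # Stub: a real assistant would schedule the alarm on the device.
    return f"Alarm set for {slots['time']}."


# Dispatch table: intent name -> function that carries it out.
HANDLERS: Dict[str, Callable[[Dict[str, str]], str]] = {
    "get_weather": get_weather,
    "set_alarm": set_alarm,
}


def synthesize(text: str) -> bytes:
    # Stand-in for TTS: returns UTF-8 bytes instead of audio samples.
    return text.encode("utf-8")


def respond(intent: str, slots: Dict[str, str]) -> bytes:
    """Execute the intent, then turn the answer sentence into 'speech'."""
    handler = HANDLERS.get(intent)
    answer = handler(slots) if handler else "I'm not sure I understand"
    return synthesize(answer)


print(respond("get_weather", {"day": "tomorrow"}))
print(respond("set_alarm", {"time": "eight"}))
```

Keeping execution and synthesis as separate steps mirrors the relay: the handler only has to produce a sentence, and the voice is added afterwards.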
On the phone, or on the server?
Where each leg of the relay runs depends on the device and its settings. It used to be that only wake-word detection ran on the device and everything else went to a server, but as phone chips have improved, more of the pipeline, up to and including speech recognition, can run on the device. More on-device processing helps in two ways: some features work without the internet, and less of your voice leaves the phone.
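The placement decision can be pictured as a small lookup. This is a sketch under assumptions: the stage names, the `plan` function, and the idea that a device simply declares which stages it can run are all invented here for illustration.

```python
from typing import Dict, Set

# Hypothetical stage names, following the relay described in this post.
STAGES = ["wake_word", "speech_recognition", "intent", "speech_synthesis"]


def plan(on_device_capable: Set[str], online: bool) -> Dict[str, str]:
    """Decide where each stage runs for one device and network state."""
    placement = {}
    for stage in STAGES:
        if stage in on_device_capable:
            placement[stage] = "device"       # audio or text stays on the phone
        elif online:
            placement[stage] = "server"       # data leaves the phone
        else:
            placement[stage] = "unavailable"  # no chip support, no network
    return placement


older_phone = plan({"wake_word"}, online=True)
newer_phone_offline = plan({"wake_word", "speech_recognition"}, online=False)
print(older_phone)
print(newer_phone_offline)
```

Each stage that moves into the `on_device_capable` set is one that keeps working offline and one less thing sent to a server.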
Here’s the practical privacy summary. The everyday small ear sends nothing; what you say after the wake word may go to a server, depending on the service. If that bothers you, check the assistant’s privacy settings: services commonly let you turn off storage and human review of voice recordings. In places where wake-word misfires are frequent (a meeting room, say), turning the wake-word feature off entirely is also an option.
Wrapping up
Behind one sentence, “Hey Siri, do I need an umbrella tomorrow?”, there is a three-leg relay. The small ear that was waiting for the wake word wakes the big one; speech recognition assembles the waveform into the most plausible sentence; and the last leg reads the intent, turns the sentence into a command, executes it, and sends the result back as a synthesized voice. Most wrong answers come from a similar-sounding candidate winning the probability fight in leg 2, and the “always listening” worry shrinks by half once you separate the small ear from the big one. Next time your voice assistant answers something completely off, instead of getting mad, take a look around and check whether it was a bit too noisy.