How Do Google Translate and DeepL Work? Three Generations of Machine Translation
At a restaurant abroad, you point your translation app’s camera at the menu, and the foreign words on the screen turn into your own language. Same with street signs and notices. Barely fifteen years ago this meant flipping through a paper dictionary page by page; now it takes nothing more than raising your phone. And yet the app was never taught English grammar, or French grammar, or any grammar at all. So what is it actually translating with? In this post, we will follow the three big shifts machine translation has gone through and unpack how it works.
Generation one: humans typing in the rules
The earliest approach was simple and intuitive. Linguists fed the computer grammar rules and dictionaries wholesale. You write conversion rules like “in English the verb comes after the subject; in Japanese the verb goes at the end of the sentence,” and swap words one-for-one using a dictionary. This is rule-based translation.
The problem is that language is made of exceptions more than rules. The English word “bank” can be a place for money or the edge of a river, and “miss the boat” has nothing to do with boats — it means missing an opportunity. Every rule added to handle an exception collided with other rules, and humans typing rules by hand could never keep pace with how fast language changes. After decades of polishing, the quality was still stuck at “the words are right, but the sentence isn’t.”
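To make the idea concrete, here is a minimal sketch of a rule-based translator. The tiny English-to-Spanish dictionary, the adjective list, and the single reordering rule are all invented for illustration; real systems of that era had thousands of hand-written rules. It handles a phrase the rules anticipate, then stumbles on the two meanings of “bank.”

```python
# Toy English -> Spanish rule-based translator (illustrative assumptions only).
DICTIONARY = {
    "the": "el",
    "red": "rojo",
    "car": "coche",
    "river": "río",
    "bank": "banco",  # a one-to-one dictionary can hold only the "money" sense
}
ADJECTIVES = {"red"}


def translate_rule_based(sentence: str) -> str:
    words = sentence.lower().split()
    # Rule 1: English puts the adjective before the noun, Spanish usually after.
    reordered = []
    i = 0
    while i < len(words):
        if words[i] in ADJECTIVES and i + 1 < len(words):
            reordered += [words[i + 1], words[i]]
            i += 2
        else:
            reordered.append(words[i])
            i += 1
    # Rule 2: swap each word one-for-one; unknown words pass through unchanged.
    return " ".join(DICTIONARY.get(w, w) for w in reordered)


print(translate_rule_based("the red car"))     # el coche rojo
print(translate_rule_based("the river bank"))  # el río banco
```

The first output is fine. The second should have been “la orilla del río,” and fixing it means adding yet another rule that has to coexist with every rule already in place.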
Generation two: learning probabilities from millions of answer sheets
In the 1990s, the idea got flipped on its head. Instead of humans teaching rules, researchers collected huge piles of sentence pairs that humans had already translated, and let the computer work out the statistics on its own. Documents recorded in two languages — the proceedings of international bodies, for instance — made excellent raw material. Compare millions of pairs and patterns emerge on their own: “this word usually translates to that word,” “this phrase is often followed by that phrase.”
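A small sketch shows how such patterns can fall out of the data. It follows the spirit of a classic textbook word-alignment model (IBM Model 1) on a three-sentence corpus made up for this post; it is not a reconstruction of any production system, and the function and variable names are our own.

```python
from collections import defaultdict

# Tiny hand-made parallel corpus (English, Spanish); an assumption for the demo.
PAIRS = [
    ("the house", "la casa"),
    ("the flower", "la flor"),
    ("a house", "una casa"),
]


def train_word_probabilities(pairs, iterations=10):
    corpus = [(e.split(), s.split()) for e, s in pairs]
    target_vocab = {w for _, tgt in corpus for w in tgt}
    # Start with no preference: every target word is equally likely.
    # prob[(target_word, source_word)] estimates P(target_word | source_word).
    prob = defaultdict(lambda: 1.0 / len(target_vocab))
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for src_words, tgt_words in corpus:
            for t in tgt_words:
                # Split one unit of credit for t among the source words,
                # in proportion to the current probabilities.
                norm = sum(prob[(t, s)] for s in src_words)
                for s in src_words:
                    share = prob[(t, s)] / norm
                    count[(t, s)] += share
                    total[s] += share
        # Re-estimate each probability from the credit collected this round.
        for (t, s), c in count.items():
            prob[(t, s)] = c / total[s]
    return prob


prob = train_word_probabilities(PAIRS)
for candidate in ("la", "casa", "una"):
    print(candidate, round(prob[(candidate, "house")], 3))
```

Nobody told the program that “house” means “casa.” The word simply shows up next to “casa” in both sentences that contain it, and repeated re-counting pushes the probability toward that pairing.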
Google Translate worked this way until around 2016. If you used it back then, you will remember the awkwardness — each word correct on its own, but the sentence somehow wrong when read as a whole. Statistical translation chops a sentence into word- and phrase-sized pieces, translates the pieces, and stitches them back together, so the seams between pieces showed. This is the era that earned machine output the mocking label “translation-ese.”
Generation three: turning a whole sentence’s meaning into numbers
Many people remember translation quality suddenly getting much better around 2016. That is exactly when Google Translate switched to a neural network approach, and DeepL launched the following year built on neural networks from day one. The core idea was to give up on word-for-word correspondence entirely.
A neural translator first reads the entire sentence and compresses its meaning into a bundle of numbers. Then it unfolds that bundle of numbers into a sentence in the other language. Think of it less like swapping French words in for English ones, and more like holding the meaning in your head for a moment and then saying it again in French from scratch. That is why the word order can change completely and the output still reads as one natural sentence, with no visible seams.
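What does “a bundle of numbers” look like in practice? The sketch below is only the reading half of such a system, written as a tiny recurrent encoder with random, untrained weights. The sizes, names, and weights are assumptions chosen for the demo, so the numbers it produces carry no real meaning. It shows just the shape of the idea: a sentence of any length becomes a fixed-size list of numbers, and changing the word order changes the result.

```python
import math
import random

HIDDEN_SIZE = 4
rng = random.Random(0)  # fixed seed so the run is repeatable
word_vectors = {}       # word -> list of HIDDEN_SIZE floats, created on demand
# Untrained recurrent weights; a real system learns these from data.
W = [[rng.uniform(-1, 1) for _ in range(HIDDEN_SIZE)] for _ in range(HIDDEN_SIZE)]


def embed(word):
    # Give each new word its own random vector the first time it is seen.
    if word not in word_vectors:
        word_vectors[word] = [rng.uniform(-1, 1) for _ in range(HIDDEN_SIZE)]
    return word_vectors[word]


def encode(sentence):
    state = [0.0] * HIDDEN_SIZE  # running summary of the sentence so far
    for word in sentence.lower().split():
        x = embed(word)
        # Mix the running summary with the new word, then squash into (-1, 1).
        state = [
            math.tanh(sum(W[i][j] * state[j] for j in range(HIDDEN_SIZE)) + x[i])
            for i in range(HIDDEN_SIZE)
        ]
    return state


short = encode("dog bites man")
longer = encode("the dog that lives next door bites the man")
swapped = encode("man bites dog")
print(len(short), len(longer))  # 4 4
print(short != swapped)         # True
```

A real translator pairs an encoder like this with a decoder that turns the numbers back into words one at a time, and both halves are trained together on millions of sentence pairs.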
Here are the three generations side by side.
| Generation | How it learns | Weakness |
|---|---|---|
| Rule-based | Humans type in grammar rules and dictionaries | Collapses under exceptions and idioms |
| Statistical | Learns word and phrase probabilities from translation pairs | Stitched-together pieces read awkwardly |
| Neural | Turns a whole sentence’s meaning into numbers | Struggles with languages that have little translation data |
So what’s different about ChatGPT’s translations?
These days plenty of people hand their translation work to an AI like ChatGPT. As it turns out, LLMs and neural translation are close relatives. The Transformer architecture, introduced in 2017 in neural translation research, went on to become the foundation of today’s LLMs. We covered earlier that an LLM is a machine that predicts the next word by probability. Since the text it learned from mixes many languages, predicting “the most plausible French sentence to follow this English sentence” is the same trick applied across languages. In other words, translation comes along almost for free.
The difference is that an LLM takes instructions. A dedicated translator returns one translation for one input, but an LLM accepts conditions like “make it sound like a formal business email” or “explain it so a ten-year-old understands it.” The trade-off: it is typically slower than a dedicated translator, and it occasionally adds its own commentary instead of just translating.
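In practice, “taking instructions” just means the conditions travel in the same text as the sentence to translate. Here is a minimal sketch of how such a request might be assembled. The helper name and the wording of the prompt are our own invention, and sending the text to an actual model is left out because that step depends on the service you use.

```python
def build_translation_prompt(text, target_language, style=None):
    """Assemble a plain-text instruction for an LLM (hypothetical helper)."""
    lines = [f"Translate the following text into {target_language}."]
    if style:
        lines.append(f"Style: {style}.")
    # Ask for the translation alone to discourage added commentary.
    lines.append("Return only the translation, with no commentary.")
    lines.append("")
    lines.append(text)
    return "\n".join(lines)


prompt = build_translation_prompt(
    "Thanks for the quick reply. Let's meet on Friday.",
    "French",
    style="formal business email",
)
print(prompt)
```

The last instruction line is there because of the trade-off mentioned above: without it, a chatty model may wrap the translation in explanations.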
Why some language pairs are so much harder
The same translator performs differently depending on the language pair. Between closely related languages like Spanish and Portuguese, quality was decent even back in the rule-based days, while pairs like English and Japanese only became usable once the neural era arrived. The reason is the distance between the languages.
Spanish and Portuguese share nearly the same word order, similar grammar, and a huge stock of Latin-rooted vocabulary. Swapping words in sequence gets you surprisingly far. Japanese and English, on the other hand, disagree starting with word order. Japanese puts the verb at the end of the sentence and routinely drops the subject; English puts the verb right after the subject and demands a subject almost every time. A pair like that requires tearing the sentence down completely and rebuilding it, which is why the quality just wasn’t there until the neural approach arrived, with its trick of packing meaning into numbers and unpacking it on the other side.
What camera and voice translation really are
Pointing your camera at a menu and speaking into a microphone look like two separate marvels, but peek inside and both are bundle deals. Camera translation is three steps: character recognition that reads the text out of the image, translation of that text, and screen compositing that paints the translated text back over the original spot. Voice translation is the combination of speech recognition that transcribes your words, translation, and speech synthesis that reads the result out loud.
The translation engine sitting in the middle is the very same one we have been discussing. So when the result comes out strange for a handwritten menu or on a noisy street, the translation probably isn’t what failed — odds are the first step, recognition, misread the letters or the speech.
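The bundle can be sketched as three functions chained together. Everything below is a stand-in: the fake character recognizer, the one-entry phrasebook, and the function names are placeholders invented for this post, not any app's real components. The point is only to show how a misread in step one travels through the rest of the chain.

```python
# Stand-in components; every function here is a hypothetical placeholder.
PHRASEBOOK = {"soup of the day": "sopa del día"}


def recognize_text(image_label: str) -> str:
    # Pretend OCR: clean print is read correctly, handwriting is misread.
    readings = {
        "printed_menu": "soup of the day",
        "handwritten_menu": "soap of the day",
    }
    return readings[image_label]


def translate(text: str) -> str:
    return PHRASEBOOK.get(text, f"[no translation for: {text}]")


def overlay(translated: str) -> str:
    # Pretend screen compositing: report what would be painted over the image.
    return f"<screen shows: {translated}>"


def camera_translate(image_label: str) -> str:
    # Step 1: read the text. Step 2: translate it. Step 3: paint it back.
    return overlay(translate(recognize_text(image_label)))


print(camera_translate("printed_menu"))      # <screen shows: sopa del día>
print(camera_translate("handwritten_menu"))  # <screen shows: [no translation for: soap of the day]>
```

The translation step behaves identically in both runs. Only its input changed.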
Why it still gets things wrong
After all this progress, why do translators still slip up? First, they lack context. A translator mostly works one sentence at a time, so it has no idea whether the previous sentence was about a bird or a construction site when it hits the word “crane.” Ambiguous words cause accidents constantly. Second, formality. An English sentence usually carries no explicit marker of the relationship between speaker and listener, so when translating into French the sentence alone does not say whether to pick tu or vous. Languages like Japanese and Korean have entire honorific systems hanging on that missing information. Third, slang and brand-new coinages that never appeared in the training data come out as bizarre literal translations.
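The “crane” problem can be shown in a few lines. This is a deliberately crude sketch with a hand-written list of cue words, made up for this post; real systems do not work from such lists. It only illustrates why the sentence on its own is not enough and why the neighboring sentence settles the question.

```python
# Hypothetical sense table for one ambiguous word, English -> Spanish.
SENSES = {
    "crane": {
        "grulla": {"bird", "wings", "nest", "flew"},        # the bird
        "grúa": {"construction", "site", "steel", "lift"},  # the machine
    }
}


def pick_sense(word, context_sentence=""):
    context = set(context_sentence.lower().replace(".", "").split())
    options = SENSES[word]
    # Score each candidate by how many of its cue words appear in the context.
    scores = {translation: len(cues & context) for translation, cues in options.items()}
    best = max(scores, key=scores.get)
    # No cue word found means there is nothing to base a decision on.
    return best if scores[best] > 0 else None


print(pick_sense("crane"))                                           # None
print(pick_sense("crane", "We walked past the construction site."))  # grúa
print(pick_sense("crane", "A large bird flew over the lake."))       # grulla
```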
To sum up: machine translation moved from the era of humans typing in rules, through the era of learning statistics from translation pairs, to the era of turning a whole sentence’s meaning into numbers and unfolding it in another language. Camera and voice translation are recognition technologies bundled with a translation engine, and the weaknesses that remain mostly come from context that lives outside the sentence. So for an important document, supply the missing context and give the output one more pass — and on your travels, raise that camera without a second thought. Behind that one small gesture, seventy years of research is hard at work.