It Doesn't Think, It Predicts: The Inner Mechanism of Generative AI

When you chat with ChatGPT or any other large language model, the experience is so fluid that it is almost impossible not to attribute some kind of intelligence to it. It answers questions, writes poems, explains complex concepts, and even seems to reason. But on the inside, the mechanism is far simpler than the appearance suggests: generative artificial intelligence, at its core, is an extraordinarily sophisticated autocomplete system. It does not think. It does not understand. It has no consciousness. It simply predicts the next word.

To grasp this, you have to start with tokens. Language models do not work directly with words, but with units called tokens, which are often smaller than a word. A token can be an entire word like “cat,” a word fragment like “ca,” or even a single character. As a rule of thumb for English text, a token corresponds to about three-quarters of a word. The model receives a sequence of tokens (the context so far) and calculates which one should come next.
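To make the idea of tokens concrete, here is a minimal sketch of splitting text into fragments. It assumes a tiny hand-written vocabulary (VOCAB) and a greedy longest-match rule; both are illustrative inventions. Real tokenizers learn vocabularies of tens of thousands of entries from data and use more refined schemes.

```python
# Hypothetical toy vocabulary, written by hand for illustration only.
VOCAB = ["the", "cat", "sat", "un", "believ", "able", "a", "c", "s", "t", " "]

def tokenize(text, vocab=VOCAB):
    """Split text into tokens by always taking the longest known fragment."""
    tokens = []
    i = 0
    while i < len(text):
        # Collect every vocabulary entry that matches at position i, keep the longest.
        match = max((v for v in vocab if text.startswith(v, i)), key=len, default=None)
        if match is None:
            match = text[i]  # unknown character: fall back to a one-character token
        tokens.append(match)
        i += len(match)
    return tokens

print(tokenize("the cat sat"))    # whole words plus spaces
print(tokenize("cats"))           # a word plus a single character
print(tokenize("unbelievable"))   # a long word split into fragments
```

Joining the tokens back together always reproduces the original text, which is the property that lets a model work on fragments without losing information.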

That calculation is made possible by the Transformer, a neural network architecture proposed in 2017 by a team of Google researchers whose first listed author is Ashish Vaswani. The paper, titled “Attention Is All You Need,” built the architecture around a mechanism called attention, which allows the model to weigh the importance of each previous token when deciding the next one. When the model processes the phrase “The cat sat on the ...,” attention can learn to treat “cat” and “sat” as more relevant for predicting the next word than “The” or “on.” Attention had already been used alongside recurrent networks; what set the Transformer apart from previous architectures was relying on it alone, which lets the model look back over the whole context at once and decide what matters.
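The weighing step can be sketched in a few lines. The following is a simplified, single-head version of scaled dot-product attention; the vectors in KEYS and QUERY are made-up numbers chosen for illustration, whereas a real Transformer learns them during training and runs many such heads in parallel.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Score each key against the query, then mix the values by those weights."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    mixed = [sum(w * value[j] for w, value in zip(weights, values))
             for j in range(len(values[0]))]
    return weights, mixed

# Hypothetical 2-dimensional vectors, one per context token (not learned values).
TOKENS = ["The", "cat", "sat", "on", "the"]
KEYS = [[0.1, 0.0], [1.0, 0.5], [0.8, 0.9], [0.0, 0.2], [0.1, 0.1]]
VALUES = KEYS            # assumption: reuse the keys as values to keep the toy small
QUERY = [1.0, 1.0]       # stands for the position whose next word is being predicted

weights, mixed = attend(QUERY, KEYS, VALUES)
for token, weight in zip(TOKENS, weights):
    print(f"{token:>4}: {weight:.2f}")
```

With these invented numbers, “cat” and “sat” receive the largest weights, so they dominate the mixed vector that is passed on to the rest of the network.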

Once the Transformer has processed the full context, it outputs a probability distribution over every token in its vocabulary. Some tokens receive a high probability; others, near zero. If the context is “The cat sat on the ...,” the highest probabilities will go to words like “floor,” “sofa,” “chair,” or “mat.” The model does not always choose the most probable option: it can sample from that distribution, which introduces variability in the responses. The same context can produce different text each time.
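As an illustration of that last step, the sketch below turns raw scores into probabilities and then either picks the top token or samples. The scores in SCORES are invented for the example; a real model would compute one score for every token in its vocabulary.

```python
import math
import random

# Hypothetical raw scores for the context "The cat sat on the ..."
SCORES = {"mat": 3.0, "floor": 2.5, "sofa": 2.2, "chair": 2.0, "moon": -1.0}

def to_probabilities(scores):
    """Softmax: exponentiate the scores and normalize so they sum to 1."""
    top = max(scores.values())
    exps = {token: math.exp(s - top) for token, s in scores.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}

def sample(probs, rng):
    """Draw one token at random, in proportion to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

probs = to_probabilities(SCORES)
greedy = max(probs, key=probs.get)        # always the single most probable token
rng = random.Random(0)                    # seeded only to make the demo repeatable
draws = [sample(probs, rng) for _ in range(10)]

for token, p in probs.items():
    print(f"{token:>5}: {p:.3f}")
print("greedy choice:", greedy)
print("ten samples:  ", draws)
```

Picking the top token every time gives the same continuation on every run, while sampling spreads the choices across the plausible candidates and almost never lands on an unlikely one such as “moon.”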

This process is called autoregressive generation: receive tokens, process them with attention, predict a probability distribution, sample the next token, add it to the context, and repeat. Each new token becomes part of the context for the next step. The model advances token by token, building the response incrementally, much like the autocomplete on your phone’s keyboard, but with a far larger context and immensely greater computing power.
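The loop itself is short. In the sketch below, a hand-written lookup table (NEXT) stands in for the trained network, which is a strong simplification: the toy looks only at the last token, whereas a real model conditions on the entire context. The markers "<start>" and "<end>" are also assumptions made for the example.

```python
import random

# Hypothetical next-token probabilities standing in for a trained model.
NEXT = {
    "<start>": {"The": 1.0},
    "The": {"cat": 0.7, "dog": 0.3},
    "cat": {"sat": 0.8, "slept": 0.2},
    "dog": {"sat": 0.5, "slept": 0.5},
    "sat": {"on": 1.0},
    "slept": {"on": 1.0},
    "on": {"the": 1.0},
    "the": {"mat": 0.6, "sofa": 0.4},
    "mat": {"<end>": 1.0},
    "sofa": {"<end>": 1.0},
}

def predict(context):
    """Return a distribution over the next token (toy: uses the last token only)."""
    return NEXT[context[-1]]

def generate(rng, max_tokens=20):
    """Autoregressive loop: predict, sample, append, repeat."""
    context = ["<start>"]
    for _ in range(max_tokens):
        probs = predict(context)
        tokens = list(probs)
        chosen = rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]
        if chosen == "<end>":
            break
        context.append(chosen)  # the new token becomes context for the next step
    return context[1:]

print(" ".join(generate(random.Random(1))))
print(" ".join(generate(random.Random(2))))
```

Different seeds can produce different sentences from the same table, for the same reason a chatbot can answer the same prompt in different ways.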

And how does the model learn to make these predictions? Through training at massive scale. During training, the model is shown trillions of tokens extracted from the internet: web pages, books, scientific articles, forums, social media, source code. For each text fragment, the model is asked to predict every token from the ones that precede it, without seeing the token itself. The gap between the model’s prediction and the actual token is an error. A procedure called backpropagation works out how much each internal parameter (the so-called neural weights) contributed to that error, and the weights are then nudged slightly in the direction that reduces it. Repeated millions of times over enormous amounts of data, this cycle of prediction and adjustment produces models that generate coherent, grammatically correct, and surprisingly nuanced text.
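One cycle of prediction and adjustment can be sketched as follows. This is a deliberately reduced setting: the only parameters are one score per candidate token for a single fixed context, so the gradient can be written by hand instead of being backpropagated through many layers. The vocabulary, the learning rate, and the choice of “mat” as the correct next token are all assumptions made for the example.

```python
import math

VOCAB = ["mat", "floor", "sofa", "moon"]   # hypothetical four-token vocabulary

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(scores, target, learning_rate=0.5):
    """Predict, measure the error, and nudge the parameters to reduce it."""
    probs = softmax(scores)
    loss = -math.log(probs[target])  # cross-entropy: small when the target is likely
    # For softmax with cross-entropy, the gradient with respect to each score
    # is (predicted probability - 1) for the target and (predicted probability) otherwise.
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    updated = [s - learning_rate * g for s, g in zip(scores, grads)]
    return updated, loss

scores = [0.0] * len(VOCAB)                # start with no preference at all
target = VOCAB.index("mat")                # the token that actually came next
losses = []
for _ in range(20):
    scores, loss = train_step(scores, target)
    losses.append(loss)

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
print(f"probability of 'mat' after training: {softmax(scores)[target]:.2f}")
```

The error starts at the value for a uniform guess over four tokens and shrinks as the score for the correct token rises. A real model does the same thing across billions of weights and examples.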

But here is the crucial point: coherence does not imply understanding. The model has no internal model of the world. It does not know what a cat is, nor what it means to sit, nor what a sofa is. It has seen those words appear together so many times in its training data that it has learned the statistical correlations between them. When it answers a question correctly, it is not because it understands the question, but because it has seen similar patterns of questions and answers in its training. Its knowledge is borrowed: it reflects what human beings have written on the internet, not a direct experience of the world. In the words of linguist Emily Bender and her colleagues, the model is a “stochastic parrot”: it stitches together sequences of words it has observed in its training data, guided by how probable they are rather than by what they mean, in ways that appear novel.

Understanding this mechanism radically changes how we should evaluate and use these tools. If we know the model only predicts the next word, we stop seeing it as an infallible oracle and start treating it as what it is: a statistical machine that can produce both well-grounded truths and utter nonsense with the same fluency. Hallucinations — those confident but completely false responses — cease to be mysterious: they are simply the model predicting probable tokens according to its training, with no ability to check against reality. The biases and prejudices it reflects are not malice, but the byproduct of having learned from an internet full of human contradictions.

None of this means generative AI is useless. It is useful, and extraordinarily so. But its usefulness depends on us understanding its limits. It does not think. It does not reason. It does not comprehend. It predicts the next word. And that prediction, when executed at the scale of hundreds of billions of parameters trained on a vast share of what humanity has written, produces results that seem like magic. But it is not magic. It is statistics. It is autocomplete. It is the simplest mechanism executed at the largest scale.

Main source: Attention Is All You Need — Vaswani et al., NIPS 2017.