Self-Attention and Reasoning
When I talk, my wife sometimes pays attention only to what she thinks is important… which is not how arguments (or relationships) work. But it is exactly how transformers (the models behind Generative AI) work!
In 2017, researchers published a paper with a title that’s basically legendary now: “Attention Is All You Need.” No exaggeration—it changed the course of modern AI. Why? Because before transformers, neural networks had a serious weakness: they struggled to hold onto context when sentences got long or complicated.
Take this example:
“The cat that chased the mouse that stole the cheese was in the kitchen and escaped.” Who escaped?
A human gets it immediately: the cat escaped. But for older models, this kind of sentence was a mess. Why? Because recurrent neural networks (RNNs) processed text one word at a time, in sequence. By the time they reached “escaped,” they often had already lost track of who the main subject was.
That’s where transformers come in.
Instead of reading the sentence like a single-file line, transformers look at all the words at once. And that’s where the magic lies: the attention mechanism.
The core idea is simple (even if the math behind it isn’t): for each word, the model asks:
“Which other words in this sentence matter for understanding this word?”
Back to the example. When the model processes “escaped,” it estimates how strongly that word is related to:
cat → highly related
mouse → less related
cheese → barely related
kitchen → somewhat related (context)
And it doesn’t do this vaguely—it computes these relationships for every word with every other word, assigning weights that represent importance. That’s attention.
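To make “weights that represent importance” a little more concrete, here’s a tiny NumPy sketch. The relatedness scores below are invented purely for illustration; the real mechanic shown is the softmax, which turns raw scores into attention weights that sum to 1.

```python
# Toy sketch: hypothetical relevance scores for the word "escaped"
# against the other words, normalized into attention weights with a softmax.
# The score values are made up purely for illustration.
import numpy as np

words  = ["cat", "mouse", "cheese", "kitchen"]
scores = np.array([4.0, 1.5, 0.5, 2.0])          # invented raw "relatedness" scores

weights = np.exp(scores) / np.exp(scores).sum()   # softmax: the weights sum to 1

for word, w in zip(words, weights):
    print(f"{word:8s} -> {w:.2f}")
# cat gets by far the largest share of attention; cheese gets almost none
```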
So how does it compute those relationships? With three components known as:
Query (Q): what a word is “asking” (“what should I look for?”)
Key (K): what each word “offers” to be matched (“I might be relevant”)
Value (V): the information carried by that word if it turns out to matter
Put simply: the Query from “escaped” is compared against the Keys of all the other words to decide what deserves attention, and the model then blends the corresponding Values to build the meaning of the word in context.
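Here’s what that comparison looks like as code: a minimal NumPy sketch of scaled dot-product attention, the formula at the heart of “Attention Is All You Need.” The embeddings and the Q/K/V projection matrices are random placeholders here; in a real transformer they are learned during training.

```python
# Minimal scaled dot-product attention in NumPy (a sketch of the Q/K/V idea).
# Shapes and values are illustrative only.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # compare each Query with every Key
    weights = softmax(scores, axis=-1)         # one attention distribution per word
    return weights @ V, weights                # blend the Values by those weights

# Toy "sentence" of 5 words, each embedded in 4 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))

# In a real transformer these projections are learned; random here
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

output, weights = attention(Q, K, V)
print(weights.round(2))   # each row sums to 1: how much each word attends to the others
```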
This happens across multiple layers, which is why transformers are:
better at capturing context,
easier to parallelize (they process all the words at once),
and highly scalable (you can make them huge—like GPT).
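To give a feel for that stacking and parallelism, here’s a deliberately stripped-down NumPy sketch: a couple of attention “heads” each process the whole toy sentence at once, and layers simply feed into one another. Real transformers add learned projections, residual connections, and feed-forward blocks on top of this.

```python
# Sketch of multi-head self-attention stacked over a few layers (NumPy).
# Simplified on purpose: no learned projections, residuals, or feed-forward blocks.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    d = X.shape[-1]
    weights = softmax(X @ X.T / np.sqrt(d))    # all word-to-word scores in one matrix op
    return weights @ X

def multi_head(X, num_heads=2):
    chunks = np.split(X, num_heads, axis=-1)   # each head sees a slice of the embedding
    return np.concatenate([self_attention(c) for c in chunks], axis=-1)

rng = np.random.default_rng(1)
H = rng.normal(size=(5, 4))    # 5 toy "words", 4-dimensional embeddings

for layer in range(3):         # stacking layers: each pass mixes in more context
    H = multi_head(H)

print(H.shape)                 # (5, 4): same shape, progressively richer in context
```

Every step in there is a plain matrix multiplication over all the words at once, which is exactly what makes transformers easy to parallelize on modern hardware and to scale up.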
That’s why since 2017, OpenAI, Google, Meta, Anthropic, and basically everyone in the space has relied on transformers. You’re using them when Google Translate translates, when GitHub Copilot autocompletes code, or when a voice assistant understands what you’re asking.
And they didn’t just reshape language. Transformers also show up in vision (image generation) and science, including protein modeling systems like AlphaFold.
So what changed after that?
In 2024, OpenAI pushed hard toward models that reason better—models that don’t just predict the next word, but can plan, verify, and “think” before answering. Because next-word prediction is great for writing, summarizing, and conversation… but if you want to solve hard math problems, build complex software systems, or handle multi-step tasks reliably, you need more than “smart autocomplete.”
You need reasoning, not just attention.