Self-Attention
The Core Idea Behind Transformers
Difficulty
Beginner
Duration
10-12 min
Prerequisites
Neural networks basics
Step
1 / 7
Why Sequences Need Attention
Language is sequential: the order and context of words matter. "The cat sat down" is meaningful because each word relates to the others in the sentence.
To process language, a model needs to understand relationships between words:
- •"sat" relates to "cat" (who sat?) and "down" (how?)
- •"The" modifies "cat" (which cat?)
- •"cat" is the subject that governs the verb "sat"
The challenge: how do we build a model that captures these relationships? We need a mechanism that lets each word "look at" other words in the sequence to gather context. This is the core problem that attention was designed to solve.
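To make the idea concrete, here is a rough sketch of one word "looking at" the others: compare embeddings with a dot product, turn the scores into weights with a softmax, and mix. This is only a preview, not the full attention mechanism this lesson builds toward, and the embedding values below are invented purely for illustration.

```python
import numpy as np

# Toy 2-dimensional "embeddings" for the sentence "The cat sat down".
# The numbers are made up for illustration only.
words = ["The", "cat", "sat", "down"]
embeddings = np.array([
    [0.1, 0.9],   # The
    [0.8, 0.3],   # cat
    [0.7, 0.5],   # sat
    [0.6, 0.1],   # down
])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Let "sat" look at every word (including itself):
# dot product = similarity score, softmax = weights that sum to 1.
query = embeddings[words.index("sat")]
scores = embeddings @ query          # one score per word in the sentence
weights = softmax(scores)            # attention weights over the sentence

for word, w in zip(words, weights):
    print(f"sat -> {word}: {w:.2f}")

# The new representation of "sat" is a weighted mix of all the words.
context = weights @ embeddings
print("context vector for 'sat':", context)
```

The weights say how much each word contributes to the new representation of "sat"; the remaining steps of this lesson refine this raw dot-product idea.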
Before attention, models processed words one at a time and tried to compress everything into a single fixed-size vector. As we'll see, this created a serious bottleneck.
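To see why a single fixed-size summary is a bottleneck, consider a deliberately exaggerated sketch: sum a sentence's word embeddings into one vector. Real pre-attention encoders (recurrent networks) were more capable than this, but the failure mode has the same flavor: different sentences can collapse into near-identical summaries. The embedding values here are again invented for illustration.

```python
import numpy as np

# Hypothetical word embeddings, invented for illustration only.
vocab = {
    "dog":   np.array([1.0, 0.0, 0.2]),
    "bites": np.array([0.0, 1.0, 0.5]),
    "man":   np.array([0.3, 0.2, 1.0]),
}

def encode(sentence):
    """Compress a whole sentence into ONE fixed-size vector by summing
    its word embeddings (a crude stand-in for a fixed-size encoding)."""
    return sum(vocab[w] for w in sentence)

a = encode(["dog", "bites", "man"])
b = encode(["man", "bites", "dog"])

print(a)                  # [1.3 1.2 1.7]
print(b)                  # [1.3 1.2 1.7]
print(np.allclose(a, b))  # True: opposite meanings, identical summary
```

"Dog bites man" and "Man bites dog" end up with the same vector, which is exactly the word-order problem listed in the table further below.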
Word Dependencies in "The cat sat down"
| Word | Needs Context From | Why |
|---|---|---|
| cat | The | "The" specifies which cat |
| sat | cat | "cat" is the subject performing the action |
| sat | down | "down" modifies how the sitting happened |
| down | sat | "down" describes the manner of sitting |
Why Context Matters in Language
| Property | Why It Matters |
|---|---|
| Word order | "Dog bites man" vs "Man bites dog" — same words, opposite meanings |
| Long-range deps | "The cat that I saw yesterday sat down" — "cat" and "sat" are far apart |
| Ambiguity | "Bank" means different things near "river" vs "money" |
| Coreference | "She picked up the ball and threw it" — "it" refers to "ball" |