Self-Attention

The Core Idea Behind Transformers

Difficulty: Beginner · Duration: 10-12 min · Prerequisites: Neural networks basics · Step: 1/7

Why Sequences Need Attention

Language is sequential: the order and context of words matter. "The cat sat down" is meaningful because each word relates to the others in the sentence.

To process language, a model needs to understand relationships between words:

  • "sat" relates to "cat" (who sat?) and "down" (how?)
  • "The" modifies "cat" (which cat?)
  • "cat" is the subject that governs the verb "sat"

The challenge: how do we build a model that captures these relationships? We need a mechanism that lets each word "look at" other words in the sequence to gather context. This is the core problem that attention was designed to solve.

Before attention, models processed words one at a time and tried to compress everything into a single fixed-size vector. As we'll see, this created a serious bottleneck.
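As a preview of the mechanism, here is a toy sketch of letting each word "look at" every other word by scoring pairwise similarity and normalizing the scores into weights. The word vectors below are hand-picked for illustration (real models learn embeddings and use separate query/key projections, which later steps cover):

```python
import math

# Toy word vectors, hand-picked for illustration -- not learned embeddings.
words = ["The", "cat", "sat", "down"]
vecs = {
    "The":  [0.1, 0.0, 0.2],
    "cat":  [0.9, 0.1, 0.3],
    "sat":  [0.8, 0.2, 0.4],
    "down": [0.2, 0.7, 0.5],
}

def dot(a, b):
    """Similarity score between two word vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# For each word, score every word in the sequence, then normalize.
# The weights say how much context each word gathers from the others;
# with these toy vectors, "sat" puts its largest weight on "cat".
for w in words:
    weights = softmax([dot(vecs[w], vecs[other]) for other in words])
    print(w, [round(x, 2) for x in weights])
```

Each word ends up with a full distribution over the sequence, so no information has to be squeezed through a single fixed-size vector.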

Word Dependencies in "The cat sat down"

| Word | Needs Context From | Why |
| --- | --- | --- |
| cat | The | "The" specifies which cat |
| sat | cat | "cat" is the subject performing the action |
| sat | down | "down" modifies how the sitting happened |
| down | sat | "down" describes the manner of sitting |

Why Context Matters in Language

| Property | Why It Matters |
| --- | --- |
| Word order | "Dog bites man" vs "Man bites dog" — same words, opposite meanings |
| Long-range dependencies | "The cat that I saw yesterday sat down" — "cat" and "sat" are far apart |
| Ambiguity | "Bank" means different things near "river" vs "money" |
| Coreference | "She picked up the ball and threw it" — "it" refers to "ball" |