Query, Key, Value
The Three Projections That Drive Attention
Difficulty: Intermediate
Duration: 12-15 min
Prerequisites: Self-Attention
Step 1 of 7
Recap: From Words to Vectors
Before diving into Query, Key, and Value, let's recall where we are. Each word in "The cat sat down" has been converted to an 8-dimensional embedding vector and combined with positional encodings.
These input vectors contain information about:
- What the word means (from the learned embedding)
- Where the word sits in the sequence (from the positional encoding)
But these vectors don't yet know about context. The embedding for "cat" is the same whether the sentence is "The cat sat down" or "The cat chased the mouse." Self-attention will fix this by letting each token gather information from all other tokens.
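To make that concrete, here is a minimal sketch showing the context-independence of raw embeddings. The vocabulary, the table values, and the `embed` helper are hypothetical stand-ins, not the values of any real model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy learned embedding table: one 8-dim vector per vocabulary word (random stand-ins)
vocab = {"the": 0, "cat": 1, "sat": 2, "down": 3, "chased": 4, "mouse": 5}
embedding_table = rng.normal(size=(len(vocab), 8))

def embed(sentence):
    """Look up the raw (pre-attention) embedding for each word."""
    return embedding_table[[vocab[w] for w in sentence.lower().split()]]

e1 = embed("The cat sat down")
e2 = embed("The cat chased the mouse")

# The raw embedding for "cat" is identical in both sentences:
print(np.array_equal(e1[1], e2[1]))  # True
```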
The question is: how exactly does a token decide what to look for and where to find it? The answer is three learned projections: Query, Key, and Value.
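Before walking through each projection in detail, here is a minimal sketch of the idea, assuming 8-dimensional vectors to match the table below; `X` and the weight matrices `W_q`, `W_k`, and `W_v` are random placeholders rather than trained values:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 8  # matches the 8-dimensional input vectors below

# Input vectors for "The cat sat down": one row per token
# (random stand-ins; the real rows come from embedding + positional encoding)
X = rng.normal(size=(4, d_model))

# Three learned projection matrices (randomly initialized placeholders)
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

# Each token's input vector is projected three different ways
Q = X @ W_q  # queries: what each token is looking for
K = X @ W_k  # keys: what each token offers to be matched against
V = X @ W_v  # values: the information each token passes along

print(Q.shape, K.shape, V.shape)  # (4, 8) (4, 8) (4, 8)
```

Note the design choice: all three projections start from the same input vectors. A single token's representation plays three different roles depending on which learned matrix is applied to it.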
Input Vectors: Embedding + Positional Encoding
| Token | Position | Input Vector (first 4 dims) |
|---|---|---|
| The | 0 | [0.12, 0.66, 0.56, 1.78, ...] |
| cat | 1 | [1.75, 0.42, 0.44, 0.44, ...] |
| sat | 2 | [1.36, 0.25, -0.69, 1.10, ...] |
| down | 3 | [-0.20, -0.43, 1.08, 0.84, ...] |
What Each Input Vector Contains
| Component | Source | Encodes |
|---|---|---|
| Embedding | Learned lookup table | Semantic meaning of the word |
| Positional Encoding | Sinusoidal function | Position in the sequence |
| Input Vector | Embedding + Positional | Both meaning AND position |
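For reference, here is one way the rows of the first table could be assembled, assuming the standard sinusoidal encoding from "Attention Is All You Need"; the embedding values here are random stand-ins, not the actual numbers shown above:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2)
    angles = positions / (10000 ** (dims / d_model))  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

# Stand-in learned embeddings for "The cat sat down"
embeddings = np.random.default_rng(0).normal(size=(4, 8))

# Input vector = embedding + positional encoding: meaning AND position
inputs = embeddings + sinusoidal_positional_encoding(4, 8)
```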