Positional Encoding
Injecting Order Into Parallel Processing
Why Word Order Matters
Self-attention has a remarkable property: it's permutation equivariant. If you scramble the order of tokens in a sequence, the attention mechanism produces exactly the same outputs, just reordered to match — no part of the computation depends on where a token sits. It treats the input as a set, not a sequence.
But word order is crucial for meaning! Consider these scrambled versions of our sentence:
| Original | Scrambled | Same meaning? |
|---|---|---|
| "The cat sat down" | "down sat cat The" | No! |
| "Dog bites man" | "Man bites dog" | Opposite meaning! |
| "I saw her duck" | "Duck her saw I" | Nonsensical |
Without positional information, self-attention cannot distinguish "The cat sat down" from "down sat The cat" — the same attention weights would be computed regardless of order.
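This property is easy to verify numerically. The sketch below implements a minimal single-head self-attention with identity query/key/value projections (an illustrative simplification, not any particular library's implementation) and shows that permuting the input rows merely permutes the output rows:

```python
import numpy as np

def self_attention(X):
    """Minimal self-attention with identity Q/K/V projections, no position info."""
    scores = X @ X.T / np.sqrt(X.shape[1])               # query-key similarity
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)        # row-wise softmax
    return weights @ X                                   # weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))        # 4 "tokens", embedding dim 8
perm = [2, 0, 3, 1]                # scramble the token order

out = self_attention(X)
out_perm = self_attention(X[perm])

# Scrambling the input only scrambles the output in the same way:
assert np.allclose(out[perm], out_perm)
```

Because the same token vectors produce the same attention weights regardless of order, the model by itself has no way to prefer "The cat sat down" over "down sat The cat".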
The solution: explicitly inject position information into the token representations. Before the first attention layer, we add a positional encoding to each token's embedding. This breaks the permutation invariance and lets the model know where each token sits in the sequence.
The original Transformer paper (Vaswani et al., 2017) used sinusoidal positional encodings — elegant mathematical functions that encode position without any learned parameters.
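The sinusoidal scheme assigns each position a vector of sines and cosines at geometrically spaced frequencies: PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). A minimal NumPy sketch, following the formulas from the paper:

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same)."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    div = 10000.0 ** (np.arange(0, d_model, 2) / d_model)   # (d_model/2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)   # even dims: sine
    pe[:, 1::2] = np.cos(positions / div)   # odd dims: cosine
    return pe

pe = sinusoidal_encoding(seq_len=50, d_model=16)
# Before the first attention layer, this is simply added to the embeddings:
#   x = token_embeddings + pe[:num_tokens]
```

Each position gets a unique, bounded pattern (all values lie in [-1, 1]), and because the frequencies are fixed, the encoding extends to sequence lengths never seen during training without any extra parameters.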
The Scrambled Sentence Problem
| Sentence | Order | Meaning | Without Position Info |
|---|---|---|---|
| "The cat sat down" | Original | A cat sits down | Same attention as any permutation |
| "down sat cat The" | Reversed | Nonsensical | Identical attention scores! |
| "cat The down sat" | Shuffled | Nonsensical | Identical attention scores! |
| "Dog bites man" | Original | Dog attacks man | Same as "Man bites dog" |
| "Man bites dog" | Swapped | Man attacks dog | Same as "Dog bites man" |
Approaches to Encoding Position
| Approach | How Position Is Encoded | Used By |
|---|---|---|
| No encoding | Model is blind to order | Bag-of-words models |
| Sinusoidal | Fixed mathematical functions of position | Original Transformer (2017) |
| Learned | Trainable embedding per position | BERT, GPT-2 |
| Rotary (RoPE) | Rotate query/key vectors by position so attention depends on relative distance, not absolute position | LLaMA, GPT-NeoX |
| ALiBi | Bias attention scores by distance | BLOOM, MPT |
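ALiBi is the simplest of these to sketch: instead of modifying the embeddings at all, it subtracts a per-head penalty proportional to the query-key distance directly from the attention logits. A rough illustration of the causal bias matrix, using the geometric head slopes described in the ALiBi paper (the exact slope schedule here is a common choice, not the only one):

```python
import numpy as np

def alibi_bias(seq_len, n_heads):
    """Per-head causal bias: logits[h, i, j] += -slope_h * (i - j) for j <= i."""
    # Geometric slope per head, e.g. 1/2, 1/4, ..., 1/256 for 8 heads
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    distance = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]
    return -slopes[:, None, None] * np.maximum(distance, 0)  # (heads, q, k)

bias = alibi_bias(seq_len=6, n_heads=8)
# bias is added to the raw attention scores before the softmax;
# distant keys are penalized more, and more steeply in high-slope heads.
```

Because the penalty grows linearly with distance, attention naturally favors nearby tokens, which is one reason ALiBi extrapolates well to sequences longer than those seen in training.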