Positional Encoding

Injecting Order Into Parallel Processing

Difficulty: Intermediate · Duration: 10-12 min · Prerequisites: Self-Attention · Step 1 of 7

Why Word Order Matters

Self-attention has a remarkable property: it's permutation invariant. If you scramble the order of tokens in a sequence, the attention mechanism produces the same outputs (just reordered). It treats the input as a set, not a sequence.
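To see this concretely, here is a minimal NumPy sketch (with identity Q/K/V projections for simplicity; the function and array names are illustrative, not any library's API). Permuting the input tokens simply permutes the attention outputs:

```python
import numpy as np

def self_attention(X):
    """Single-head attention with identity Q/K/V projections, for illustration."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                                     # (seq, seq) similarities
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # row-wise softmax
    return weights @ X                                                # weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))        # 4 tokens, 8-dim embeddings
perm = np.array([2, 0, 3, 1])      # scramble the token order

out = self_attention(X)
out_perm = self_attention(X[perm])

# The output for the scrambled input is exactly the scrambled original output:
print(np.allclose(out[perm], out_perm))  # True
```

No matter how the rows are shuffled, each token's output depends only on the set of tokens present, not on where they sit.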

But word order is crucial for meaning! Consider what scrambling does to these sentences:

| Original | Scrambled | Same meaning? |
|---|---|---|
| "The cat sat down" | "down sat cat The" | No! |
| "Dog bites man" | "Man bites dog" | Opposite meaning! |
| "I saw her duck" | "Duck her saw I" | Nonsensical |

Without positional information, self-attention cannot distinguish "The cat sat down" from "down sat The cat" — the same attention weights would be computed regardless of order.

The solution: explicitly inject position information into the token representations. Before the first attention layer, we add a positional encoding to each token's embedding. This breaks the permutation invariance and lets the model know where each token sits in the sequence.
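Concretely, the injection is a single elementwise addition. In this hedged sketch, `token_emb` and `pos_enc` are placeholder arrays standing in for a model's embedding lookup and position vectors:

```python
import numpy as np

seq_len, d_model = 4, 8
rng = np.random.default_rng(1)
token_emb = rng.normal(size=(seq_len, d_model))  # one row per token
pos_enc = rng.normal(size=(seq_len, d_model))    # one distinct vector per position

# The same word at two different positions would share its token_emb row;
# adding pos_enc makes the two occurrences distinguishable.
x = token_emb + pos_enc   # input to the first attention layer
```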

The original Transformer paper (Vaswani et al., 2017) used sinusoidal positional encodings — elegant mathematical functions that encode position without any learned parameters.
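The paper defines PE(pos, 2i) = sin(pos / 10000^(2i / d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model)): each position gets a vector of sines and cosines at geometrically spaced frequencies. Here is a sketch of that computation (the function name is ours, not the paper's):

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    """Fixed sinusoidal positional encodings (Vaswani et al., 2017). No learned parameters."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # even dimension indices 2i
    angles = positions / (10000 ** (dims / d_model))   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions: cosine
    return pe

pe = sinusoidal_encoding(seq_len=50, d_model=16)
print(pe.shape)  # (50, 16) — one fixed, unique vector per position
```

Because the frequencies range from high to low across the dimensions, every position receives a distinct vector, and the pattern extends to any sequence length without retraining.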

The Scrambled Sentence Problem

| Sentence | Order | Meaning | Without Position Info |
|---|---|---|---|
| "The cat sat down" | Original | A cat sits down | Same attention as any permutation |
| "down sat cat The" | Reversed | Nonsensical | Identical attention scores! |
| "cat The down sat" | Shuffled | Nonsensical | Identical attention scores! |
| "Dog bites man" | Original | Dog attacks man | Same as "Man bites dog" |
| "Man bites dog" | Swapped | Man attacks dog | Same as "Dog bites man" |

Approaches to Encoding Position

| Approach | How Position Is Encoded | Used By |
|---|---|---|
| No encoding | Model is blind to order | Bag-of-words models |
| Sinusoidal | Fixed mathematical functions of position | Original Transformer (2017) |
| Learned | Trainable embedding per position | BERT, GPT-2 |
| Relative (RoPE) | Encodes distance between tokens, not absolute position | LLaMA, GPT-NeoX |
| ALiBi | Biases attention scores by distance | BLOOM, MPT |
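
To make the last row concrete, here is a rough, simplified sketch of the ALiBi idea: rather than adding position vectors to embeddings, subtract a distance-proportional penalty from the attention scores. The slope value is an illustrative choice (real ALiBi fixes one slope per attention head, drawn from a geometric sequence), and the causal masking used in practice is omitted:

```python
import numpy as np

seq_len, slope = 5, 0.5
i = np.arange(seq_len)
distance = np.abs(i[None, :] - i[:, None])  # |query position - key position|
bias = -slope * distance                    # penalty grows with distance

scores = np.zeros((seq_len, seq_len))       # stand-in for Q @ K.T / sqrt(d)
scores_with_alibi = scores + bias           # added before the softmax
print(scores_with_alibi[0])                 # nearby tokens are penalized least
```

Because the bias depends only on relative distance, it carries positional information without any position vectors at all, which is part of why ALiBi extrapolates to sequence lengths longer than those seen in training.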