Positional Encoding
Injecting Order Into Parallel Processing
Why Word Order Matters
Self-attention has a remarkable property: it's permutation equivariant. If you scramble the order of tokens in a sequence, the attention mechanism produces exactly the same outputs, just reordered to match — no part of the computation depends on where a token sits. It treats the input as a set, not a sequence.
But word order is crucial for meaning! Consider these scrambled versions of our sentence:
| Original | Scrambled | Same meaning? |
|---|---|---|
| "The cat sat down" | "down sat cat The" | No! |
| "Dog bites man" | "Man bites dog" | Opposite meaning! |
| "I saw her duck" | "Duck her saw I" | Nonsensical |
Without positional information, self-attention cannot distinguish "The cat sat down" from "down sat The cat" — the same attention weights would be computed regardless of order.
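This property is easy to verify numerically. The sketch below implements a minimal single-head self-attention with identity query/key/value projections (an illustrative simplification, not any particular library's implementation) and shows that permuting the input rows merely permutes the output rows:

```python
import numpy as np

def self_attention(X):
    """Minimal self-attention with identity Q/K/V projections, no position info."""
    scores = X @ X.T / np.sqrt(X.shape[1])               # query-key similarity
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)        # row-wise softmax
    return weights @ X                                   # weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))        # 4 "tokens", embedding dim 8
perm = [2, 0, 3, 1]                # scramble the token order

out = self_attention(X)
out_perm = self_attention(X[perm])

# Scrambling the input only scrambles the output in the same way:
assert np.allclose(out[perm], out_perm)
```

Because the same token vectors produce the same attention weights regardless of order, the model by itself has no way to prefer "The cat sat down" over "down sat The cat".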
The solution: explicitly inject position information into the token representations. Before the first attention layer, we add a positional encoding to each token's embedding. This breaks the permutation invariance and lets the model know where each token sits in the sequence.
The original Transformer paper (Vaswani et al., 2017) used sinusoidal positional encodings — elegant mathematical functions that encode position without any learned parameters.
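The sinusoidal scheme assigns each position a vector of sines and cosines at geometrically spaced frequencies: PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). A minimal NumPy sketch, following the formulas from the paper:

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same)."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    div = 10000.0 ** (np.arange(0, d_model, 2) / d_model)   # (d_model/2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)   # even dims: sine
    pe[:, 1::2] = np.cos(positions / div)   # odd dims: cosine
    return pe

pe = sinusoidal_encoding(seq_len=50, d_model=16)
# Before the first attention layer, this is simply added to the embeddings:
#   x = token_embeddings + pe[:num_tokens]
```

Each position gets a unique, bounded pattern (all values lie in [-1, 1]), and because the frequencies are fixed, the encoding extends to sequence lengths never seen during training without any extra parameters.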
The Scrambled Sentence Problem
| Sentence | Order | Meaning | Without Position Info |
|---|---|---|---|
| "The cat sat down" | Original | A cat sits down | Same attention as any permutation |
| "down sat cat The" | Reversed | Nonsensical | Identical attention scores! |
| "cat The down sat" | Shuffled | Nonsensical | Identical attention scores! |
| "Dog bites man" | Original | Dog attacks man | Same as "Man bites dog" |
| "Man bites dog" | Swapped | Man attacks dog | Same as "Dog bites man" |
Approaches to Encoding Position
| Approach | How Position Is Encoded | Used By |
|---|---|---|
| No encoding | Model is blind to order | Bag-of-words models |
| Sinusoidal | Fixed mathematical functions of position | Original Transformer (2017) |
| Learned | Trainable embedding per position | BERT, GPT-2 |
| Rotary (RoPE) | Rotate query/key vectors by position so attention depends on relative distance, not absolute position | LLaMA, GPT-NeoX |
| ALiBi | Bias attention scores by distance | BLOOM, MPT |
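ALiBi is the simplest of these to sketch: instead of modifying the embeddings at all, it subtracts a per-head penalty proportional to the query-key distance directly from the attention logits. A rough illustration of the causal bias matrix, using the geometric head slopes described in the ALiBi paper (the exact slope schedule here is a common choice, not the only one):

```python
import numpy as np

def alibi_bias(seq_len, n_heads):
    """Per-head causal bias: logits[h, i, j] += -slope_h * (i - j) for j <= i."""
    # Geometric slope per head, e.g. 1/2, 1/4, ..., 1/256 for 8 heads
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    distance = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]
    return -slopes[:, None, None] * np.maximum(distance, 0)  # (heads, q, k)

bias = alibi_bias(seq_len=6, n_heads=8)
# bias is added to the raw attention scores before the softmax;
# distant keys are penalized more, and more steeply in high-slope heads.
```

Because the penalty grows linearly with distance, attention naturally favors nearby tokens, which is one reason ALiBi extrapolates well to sequences longer than those seen in training.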