Query, Key, Value
The Three Projections That Drive Attention
Difficulty: Intermediate
Duration: 12-15 min
Prerequisites: Self-Attention
Step 1 of 7
Recap: From Words to Vectors
Before diving into Query, Key, and Value, let's recall where we are. Each word in "The cat sat down" has been converted to an 8-dimensional embedding vector and combined with positional encodings.
These input vectors contain information about:
- What the word means (from the learned embedding)
- Where the word sits in the sequence (from the positional encoding)
But these vectors don't yet know about context. The embedding for "cat" is the same whether the sentence is "The cat sat down" or "The cat chased the mouse." Self-attention will fix this by letting each token gather information from all other tokens.
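To make that concrete, here is a minimal sketch showing the context-independence of raw embeddings. The vocabulary, the table values, and the `embed` helper are hypothetical stand-ins, not the values of any real model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy learned embedding table: one 8-dim vector per vocabulary word (random stand-ins)
vocab = {"the": 0, "cat": 1, "sat": 2, "down": 3, "chased": 4, "mouse": 5}
embedding_table = rng.normal(size=(len(vocab), 8))

def embed(sentence):
    """Look up the raw (pre-attention) embedding for each word."""
    return embedding_table[[vocab[w] for w in sentence.lower().split()]]

e1 = embed("The cat sat down")
e2 = embed("The cat chased the mouse")

# The raw embedding for "cat" is identical in both sentences:
print(np.array_equal(e1[1], e2[1]))  # True
```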
The question is: how exactly does a token decide what to look for and where to find it? The answer is three learned projections: Query, Key, and Value.
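Before walking through each projection in detail, here is a minimal sketch of the idea, assuming 8-dimensional vectors to match the table below; `X` and the weight matrices `W_q`, `W_k`, and `W_v` are random placeholders rather than trained values:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 8  # matches the 8-dimensional input vectors below

# Input vectors for "The cat sat down": one row per token
# (random stand-ins; the real rows come from embedding + positional encoding)
X = rng.normal(size=(4, d_model))

# Three learned projection matrices (randomly initialized placeholders)
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

# Each token's input vector is projected three different ways
Q = X @ W_q  # queries: what each token is looking for
K = X @ W_k  # keys: what each token offers to be matched against
V = X @ W_v  # values: the information each token passes along

print(Q.shape, K.shape, V.shape)  # (4, 8) (4, 8) (4, 8)
```

Note the design choice: all three projections start from the same input vectors. A single token's representation plays three different roles depending on which learned matrix is applied to it.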
Input Vectors: Embedding + Positional Encoding
| Token | Position | Input Vector (first 4 dims) |
|---|---|---|
| The | 0 | [0.12, 0.66, 0.56, 1.78, ...] |
| cat | 1 | [1.75, 0.42, 0.44, 0.44, ...] |
| sat | 2 | [1.36, 0.25, -0.69, 1.10, ...] |
| down | 3 | [-0.20, -0.43, 1.08, 0.84, ...] |
What Each Input Vector Contains
| Component | Source | Encodes |
|---|---|---|
| Embedding | Learned lookup table | Semantic meaning of the word |
| Positional Encoding | Sinusoidal function | Position in the sequence |
| Input Vector | Embedding + Positional | Both meaning AND position |
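For reference, here is one way the rows of the first table could be assembled, assuming the standard sinusoidal encoding from "Attention Is All You Need"; the embedding values here are random stand-ins, not the actual numbers shown above:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2)
    angles = positions / (10000 ** (dims / d_model))  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

# Stand-in learned embeddings for "The cat sat down"
embeddings = np.random.default_rng(0).normal(size=(4, 8))

# Input vector = embedding + positional encoding: meaning AND position
inputs = embeddings + sinusoidal_positional_encoding(4, 8)
```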