Now that we have Query and Key matrices, how do we compute attention scores? The answer is the dot product, a simple and computationally cheap measure of similarity between two vectors.

For each pair of tokens (i, j), the raw attention score is:

score(i, j) = Q_i · K_j = Sum(Q_i[d] × K_j[d]) for all dimensions d

In matrix form, we compute all scores at once:

Scores = Q × K^T

This is a single matrix multiplication. For our four-token sentence it produces a 4×4 matrix, with one score for every pair of tokens. The score at position (i, j) tells us how well token i's Query matches token j's Key.
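As a minimal sketch of the two forms above, the snippet below computes the scores once with explicit per-pair sums and once with a single matrix multiplication, then checks that they agree. The Q and K values are random placeholders and d_k = 3 is an assumed size; neither comes from this walkthrough's actual matrices.

```python
import numpy as np

# Hypothetical example values: 4 tokens, d_k = 3 (not the tutorial's real Q and K).
rng = np.random.default_rng(0)
n_tokens, d_k = 4, 3
Q = rng.normal(size=(n_tokens, d_k))
K = rng.normal(size=(n_tokens, d_k))

# Pairwise form: score(i, j) = sum over d of Q_i[d] * K_j[d]
scores_loop = np.zeros((n_tokens, n_tokens))
for i in range(n_tokens):
    for j in range(n_tokens):
        scores_loop[i, j] = sum(Q[i, d] * K[j, d] for d in range(d_k))

# Matrix form: every pairwise score in one multiplication
scores = Q @ K.T

print(scores.shape)                      # (4, 4)
print(np.allclose(scores, scores_loop))  # True
```

The loop makes the definition explicit, while the matrix form is what one would normally use in practice, since it hands the whole computation to an optimized routine.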

Why dot product works: If two vectors point in similar directions, their dot product is positive, and it grows with the vectors' lengths. If they're close to orthogonal (unrelated), it's near zero. If they point in opposite directions, it's negative. The learned W_Q and W_K matrices are trained so that relevant Query-Key pairs produce high dot products.
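The three cases can be checked on small hand-picked vectors. This is only an illustration: the 2-dimensional vectors below are made up for the example and are not Query or Key vectors from the model.

```python
import numpy as np

# Hypothetical 2-D vectors chosen to illustrate each case.
q = np.array([1.0, 2.0])
k_similar = np.array([2.0, 4.0])      # same direction as q
k_orthogonal = np.array([-2.0, 1.0])  # perpendicular to q
k_opposite = np.array([-1.0, -2.0])   # opposite direction to q

dot_similar = float(np.dot(q, k_similar))
dot_orthogonal = float(np.dot(q, k_orthogonal))
dot_opposite = float(np.dot(q, k_opposite))

print(dot_similar)     # 10.0
print(dot_orthogonal)  # 0.0
print(dot_opposite)    # -5.0
```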

Below are the scores for our example, already divided by √d_k (we'll explain scaling next).

Query Token	Key: The	Key: cat	Key: sat	Key: down
The	0.226	0.827	0.029	0.630
cat	0.413	0.820	0.094	0.587
sat	0.847	0.349	-0.078	0.955
down	-0.070	0.648	0.056	0.200
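To read the table programmatically, the sketch below copies its values into an array and reports, for each Query token, the Key with the highest score. The variable names are illustrative; only the numbers come from the table.

```python
import numpy as np

tokens = ["The", "cat", "sat", "down"]

# Rows are Query tokens, columns are Key tokens (values copied from the table).
scaled_scores = np.array([
    [0.226, 0.827, 0.029, 0.630],
    [0.413, 0.820, 0.094, 0.587],
    [0.847, 0.349, -0.078, 0.955],
    [-0.070, 0.648, 0.056, 0.200],
])

# For each Query row, find the column (Key) with the largest score.
best_keys = [tokens[int(j)] for j in scaled_scores.argmax(axis=1)]

for query, key in zip(tokens, best_keys):
    print(f"Query {query!r} matches Key {key!r} best")
```

For these values, the Queries for "The", "cat", and "down" all score highest against the Key for "cat", while the Query for "sat" scores highest against the Key for "down".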

Figure: Score Matrix: How Well Each Query Matches Each Key. Panels: Dot Product Scoring (Q · K^T); Raw Attention Scores (4×4); Scaled Scores = Q·K^T / √d_k.