Softmax & Attention Output

From Scores to Context-Rich Representations

Difficulty
Intermediate
Duration
12-15 min
Prerequisites
Query, Key, Value
Step
1 / 7

Dot Product Scoring: Q · K^T

Now that we have Query and Key matrices, how do we compute attention scores? The answer is the dot product — the simplest and most efficient measure of similarity between two vectors.

For each pair of tokens (i, j), the raw attention score is:

score(i, j) = Q_i · K_j = Sum(Q_i[d] × K_j[d]) for all dimensions d

In matrix form, we compute all scores at once:

Scores = Q × K^T

This is a single matrix multiplication that, for our four-token sentence, produces a 4×4 matrix — one score for every pair of tokens. The score at position (i, j) tells us how much token i's Query matches token j's Key.
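As a quick sketch (using random Q and K with 4 tokens and an assumed d_k of 8, not the tutorial's learned values), the per-pair sum and the single matrix multiplication give identical scores:

```python
import numpy as np

# Hypothetical example: 4 tokens, d_k = 8; values are random placeholders.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))

# Per-pair score: sum over dimensions d of Q_i[d] * K_j[d].
i, j = 2, 1
score_ij = sum(Q[i, d] * K[j, d] for d in range(Q.shape[1]))

# All scores at once: one matrix multiplication, shape (4, 4).
scores = Q @ K.T

# The (i, j) entry of the matrix product equals the per-pair dot product.
assert np.isclose(scores[i, j], score_ij)
print(scores.shape)  # (4, 4)
```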

Why dot product works: If two vectors point in similar directions, their dot product is large and positive. If they're orthogonal (unrelated), it's near zero. If they point in opposite directions, it's negative. The learned W_Q and W_K matrices are trained so that relevant Query-Key pairs produce high dot products.
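The three cases above can be checked directly with toy 2-D vectors (made-up values, purely illustrative):

```python
import numpy as np

a = np.array([1.0, 2.0])
similar = np.array([2.0, 4.0])      # same direction as a
orthogonal = np.array([-2.0, 1.0])  # perpendicular to a
opposite = np.array([-1.0, -2.0])   # opposite direction to a

print(np.dot(a, similar))     # large and positive: 10.0
print(np.dot(a, orthogonal))  # unrelated: 0.0
print(np.dot(a, opposite))    # negative: -5.0
```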

Below are the scores (already divided by √d_k — we'll explain that scaling in the next step).

Scaled Attention Scores (4×4)

Scaled Scores = Q·K^T / √d_k

[  0.23   0.83   0.03   0.63 ]
[  0.41   0.82   0.09   0.59 ]
[  0.85   0.35  -0.08   0.96 ]
[ -0.07   0.65   0.06   0.20 ]
(4 × 4)
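A minimal sketch of this scaled scoring step (random Q and K with d_k = 8 assumed; the actual numbers above come from the tutorial's learned projections, which we don't have here):

```python
import numpy as np

# Assumed shapes: 4 tokens, d_k = 8 (placeholder values, not the tutorial's).
rng = np.random.default_rng(1)
d_k = 8
Q = rng.standard_normal((4, d_k))
K = rng.standard_normal((4, d_k))

# One scaled score per (query, key) pair: Q @ K^T divided by sqrt(d_k).
scaled_scores = Q @ K.T / np.sqrt(d_k)
print(scaled_scores.shape)  # (4, 4)
```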

Score Matrix: How Well Each Query Matches Each Key

Query Token | Key: The | Key: cat | Key: sat | Key: down
The         |   0.226  |   0.827  |   0.029  |   0.630
cat         |   0.413  |   0.820  |   0.094  |   0.587
sat         |   0.847  |   0.349  |  -0.078  |   0.955
down        |  -0.070  |   0.648  |   0.056  |   0.200