Softmax & Attention Output

From Scores to Context-Rich Representations

Difficulty: Intermediate
Duration: 12-15 min
Prerequisites: Query, Key, Value
Step: 1/7

Dot Product Scoring: Q · K^T

Now that we have Query and Key matrices, how do we compute attention scores? The answer is the dot product, a simple and efficient measure of similarity between two vectors.

For each pair of tokens (i, j), the raw attention score is:

score(i, j) = Q_i · K_j = Σ_d Q_i[d] × K_j[d]

In matrix form, we compute all scores at once:

Scores = Q × K^T

This is a single matrix multiplication that produces a 4×4 matrix — one score for every pair of tokens. The score at position (i, j) tells us how much token i's Query matches token j's Key.

Why dot product works: If two vectors of comparable length point in similar directions, their dot product is large and positive. If they're orthogonal (unrelated), it's zero. If they point in opposite directions, it's negative. The learned W_Q and W_K matrices are trained so that relevant Query-Key pairs produce high dot products.
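
As a small illustration of the three cases just described, the sketch below computes dot products for hand-picked 2-D vectors. The vectors and the helper name `dot` are made up for this example; they are not taken from the lesson's Q and K matrices.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

query = [1.0, 2.0]          # illustrative vector, not from the lesson data

similar = [2.0, 3.0]        # points in roughly the same direction
orthogonal = [-2.0, 1.0]    # perpendicular to the query
opposite = [-1.0, -2.0]     # points the opposite way

print(dot(query, similar))     # 8.0  (large and positive)
print(dot(query, orthogonal))  # 0.0  (unrelated)
print(dot(query, opposite))    # -5.0 (negative)
```

Note that the result depends on vector length as well as direction, which is one reason the scores are rescaled before softmax.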

Below are the scores for our example, already divided by √d_k (the next step explains the scaling).

Attention Scores (4×4)

Scaled Scores = Q·K^T / √d_k

Score Matrix: How Well Each Query Matches Each Key

| Query Token | Key: The | Key: cat | Key: sat | Key: down |
|-------------|----------|----------|----------|-----------|
| The         | 0.226    | 0.827    | 0.029    | 0.630     |
| cat         | 0.413    | 0.820    | 0.094    | 0.587     |
| sat         | 0.847    | 0.349    | -0.078   | 0.955     |
| down        | -0.070   | 0.648    | 0.056    | 0.200     |

Attention Scoring & Softmax — Lesson Content

Trace the complete attention formula: dot-product scoring, scaling by sqrt(d_k), softmax normalization, and weighted value combination.

This lesson walks through the mathematical core of scaled dot-product attention. Starting from Query and Key dot products, you'll understand why scaling by √d_k prevents gradient saturation, how softmax creates a probability distribution over tokens, and how the final weighted combination of Values produces context-enriched representations. With concrete numbers from "The cat sat down," you'll see each step of the attention formula and learn to read attention heatmaps like a pro.

Learning Objectives

  • Compute dot-product attention scores from Q and K
  • Explain why scaling by √d_k is necessary
  • Apply softmax to convert scores to attention weights
  • Read and interpret attention heatmaps
  • Compute the attention output as a weighted sum of Values
  • Write the complete attention formula from memory

Step 1: Dot Product Scoring: Q · K^T

Scores = Q × K^T

scores[i][j] = Σ_d Q[i][d] × K[j][d]

Shape: (4 tokens × 8 dims) × (8 dims × 4 tokens) = 4 × 4
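
The matrix form can be sketched in a few lines of plain Python. This is a minimal illustration, not the lesson's own implementation: the helpers `matmul` and `transpose` are written here for the example, and Q and K are filled with random placeholder values because the lesson's actual matrices are not reproduced in the text. Only the shapes (4 tokens, 8 dimensions) come from the lesson.

```python
import random

def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p); both are lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

rng = random.Random(0)            # fixed seed so the run is repeatable
n_tokens, d_k = 4, 8              # shapes from the lesson

# Placeholder values: stand-ins for the lesson's Q and K matrices.
Q = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(n_tokens)]
K = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(n_tokens)]

scores = matmul(Q, transpose(K))  # (4 x 8) x (8 x 4) -> 4 x 4

# Entry (i, j) is the dot product of query i with key j.
i, j = 2, 1
pairwise = sum(q * k for q, k in zip(Q[i], K[j]))
print(abs(scores[i][j] - pairwise) < 1e-12)   # True
print(len(scores), "x", len(scores[0]))       # 4 x 4
```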

Step 2: Why Scale by √d_k?

Before applying softmax, we divide all scores by **√d_k** (the square root of the key dimension). For our 8-dimensional keys, that's √8 ≈ 2.83.

**Why is this necessary?** When Query and Key vectors have dimension d_k, their dot product is the sum of d_k terms. If the entries are independent with zero mean and unit variance, the variance of the dot product equals d_k, so scores grow in magnitude as d_k grows. Taking about two standard deviations (2√d) as the typical range:

- d = 8: dot product variance ≈ 8, scores range roughly [-5.7, 5.7]
- d = 64: dot product variance ≈ 64, scores range roughly [-16, 16]
- d = 512: dot product variance ≈ 512, scores range roughly [-45, 45]

Large scores cause softmax to produce **extremely peaked distributions**: one token gets ~1.0 attention and all others get ~0.0. This stalls gradient flow during training because softmax gradients are nearly zero in the saturated region.

Dividing by √d_k normalizes the variance to approximately 1 regardless of dimension, keeping scores in a range where softmax produces **useful, non-degenerate gradients**. This seemingly minor detail matters in practice: without scaling, training can become slow or unstable for large d_k.

Scaled Score = (Q · K^T) / √d_k

For d_k = 8: √d_k ≈ 2.83

Var(Q_i · K_j) ≈ d_k  (for independent, zero-mean, unit-variance entries)
Var(Q_i · K_j / √d_k) ≈ 1  (normalized)
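
The variance claim can be checked empirically. The simulation below is a rough sketch under stated assumptions: vector entries are drawn independently from a standard normal distribution (real learned Q and K need not behave this way), and the sample size of 2000 and the helper names are arbitrary choices for this example.

```python
import math
import random

def sample_dots(d, n_samples, rng):
    """Dot products of random vector pairs with i.i.d. N(0, 1) entries."""
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0, 1) for _ in range(d)]
        k = [rng.gauss(0, 1) for _ in range(d)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    return dots

def variance(xs):
    """Population variance of a list of numbers."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

rng = random.Random(42)   # fixed seed for repeatability
results = {}
for d in (8, 64, 512):
    dots = sample_dots(d, 2000, rng)
    raw = variance(dots)                                   # expected near d
    scaled = variance([x / math.sqrt(d) for x in dots])    # expected near 1
    results[d] = (raw, scaled)
    print(f"d_k={d:4d}  Var(q.k) ~ {raw:7.1f}   Var(q.k / sqrt(d_k)) ~ {scaled:4.2f}")
```

The unscaled variance should land close to d_k in each case, while the scaled variance should stay close to 1 regardless of dimension.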

Step 3: Softmax Normalization

After scaling, we apply **softmax** to convert the scaled scores into a probability distribution: each row sums to 1.0 and all values are between 0 and 1.

**softmax(x_i) = e^(x_i) / Σ_j e^(x_j)**

Softmax does two things:

1. **Exponentiation** (e^x) makes all values positive and amplifies differences: larger scores become much larger relative to smaller ones
2. **Normalization** (divide by sum) ensures each row sums to 1, creating a valid probability distribution

This means the attention weights for each query token tell us the **proportion of attention** allocated to each key token. A token might allocate 40% attention to one word, 30% to another, and 15% each to the remaining two.

Applied to the scaled scores, softmax turns each row into attention weights. Within a row the ranking of scores is preserved: the highest score receives the largest weight and lower scores receive smaller ones. How sharp the result is depends on the gaps between scores; with scores as close together as in our example, the weights stay fairly evenly spread.

softmax(score_i) = exp(score_i) / Σ_j exp(score_j)

Properties:
  • All outputs ∈ (0, 1)
  • Each row sums to 1.0
  • Monotonic: larger inputs → larger outputs
  • Differentiable everywhere
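
To make the formula concrete, the sketch below applies softmax row by row to the scaled scores from the lesson's score matrix. The `softmax` helper is written for this example; it subtracts the row maximum before exponentiating, a common numerical-stability trick that leaves the result unchanged. The factor of 20 in the last step is an arbitrary choice to mimic large, unscaled scores.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["The", "cat", "sat", "down"]
# Scaled scores from the lesson's score matrix (rows = queries, columns = keys).
scaled_scores = [
    [0.226, 0.827, 0.029, 0.630],
    [0.413, 0.820, 0.094, 0.587],
    [0.847, 0.349, -0.078, 0.955],
    [-0.070, 0.648, 0.056, 0.200],
]

weights = [softmax(row) for row in scaled_scores]
for tok, row in zip(tokens, weights):
    print(f"{tok:>4}: " + "  ".join(f"{w:.3f}" for w in row))

# Same first row multiplied by 20 (arbitrary) to imitate large scores.
peaked = softmax([20 * s for s in scaled_scores[0]])
print("peaked:", [round(w, 3) for w in peaked])
```

For the query "The", the weights come out at roughly [0.19, 0.35, 0.16, 0.29], a fairly even spread. After multiplying the same scores by 20, nearly all of the weight (about 0.98) lands on "cat", which is the saturation that scaling by √d_k is meant to avoid.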

Step 4: Reading the Attention Heatmap

The attention heatmap is a standard way to inspect what an attention layer is doing. Here's how to read it:

  • **Rows** = Query tokens (the token doing the "looking")
  • **Columns** = Key tokens (the token being "looked at")
  • **Color intensity** = Attention weight (darker = more attention)

**What to look for:**

1. **Diagonal dominance**: If the diagonal is strong, tokens primarily attend to themselves. This means self-reference is important (the token's own features matter most).
2. **Off-diagonal patterns**: Strong off-diagonal values can reveal linguistic relationships. A verb attending strongly to a noun may suggest it found its subject.
3. **Row uniformity**: If a row's weights are nearly equal (~0.25 each), that token attends broadly to all tokens; it might be gathering general context.
4. **Column patterns**: If a column is consistently dark, that token is attended to by many other tokens; it's a "hub" of information.

Try to identify these patterns in the attention weights for our "The cat sat down" example.
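
The four patterns can also be checked numerically. The sketch below is one possible way to do so, not part of the original lesson: it recomputes attention weights from the lesson's scaled score matrix, prints a rough text heatmap, and derives simple summary statistics. The specific measures (mean diagonal weight, mean column weight, normalized entropy) are choices made for this example.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["The", "cat", "sat", "down"]
scaled_scores = [            # from the lesson's score matrix
    [0.226, 0.827, 0.029, 0.630],
    [0.413, 0.820, 0.094, 0.587],
    [0.847, 0.349, -0.078, 0.955],
    [-0.070, 0.648, 0.056, 0.200],
]
weights = [softmax(row) for row in scaled_scores]
n = len(tokens)

# Text heatmap: rows are queries, columns are keys, more '#' = more attention.
for tok, row in zip(tokens, weights):
    cells = "  ".join(f"{w:.2f} {'#' * round(w * 10):<5}" for w in row)
    print(f"{tok:>4} | {cells}")

# 1. Diagonal dominance: mean self-attention vs. mean attention to other tokens.
diag_mean = sum(weights[i][i] for i in range(n)) / n
off_diag_mean = (n - diag_mean * n) / (n * (n - 1))   # each row sums to 1

# 3. Row uniformity: entropy divided by log(n); 1.0 means perfectly uniform.
def normalized_entropy(row):
    return -sum(w * math.log(w) for w in row) / math.log(len(row))

uniformity = [normalized_entropy(row) for row in weights]

# 4. Column patterns: average weight each key receives; the largest is the "hub".
column_means = [sum(weights[i][j] for i in range(n)) / n for j in range(n)]
hub = tokens[max(range(n), key=lambda j: column_means[j])]

print("diagonal mean:", round(diag_mean, 3), "off-diagonal mean:", round(off_diag_mean, 3))
print("row uniformity:", [round(u, 3) for u in uniformity])
print("hub token:", hub)
```

On these scores the diagonal is not dominant, every row is close to uniform, and "cat" receives the most attention on average, so it plays the hub role in this example.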

Step 5: Weighted Value Combination

The final step of attention: multiply the attention weights by the Value matrix to produce the output.

**Output = Attention_Weights × V**

For each query token, this computes a **weighted average of all Value vectors**, where the weights are the attention weights produced by softmax. The output for token i is:

output_i = w_i0 × V_0 + w_i1 × V_1 + w_i2 × V_2 + w_i3 × V_3

where w_ij are the attention weights from token i to token j.

For example, if "sat" has attention weights [0.15, 0.40, 0.20, 0.25] for [The, cat, sat, down], its output is:

15% of "The"'s Value + 40% of "cat"'s Value + 20% of "sat"'s Value + 25% of "down"'s Value

The result is a **context-enriched representation**: "sat" now contains information gathered from every token in the sentence (itself included), weighted by relevance. This is how each token goes from knowing only about itself to reflecting its role in the full sentence.

Output = Attention_Weights × V

output[i] = Σ_j attention[i][j] × V[j]

Shape: (4×4) × (4×8) = 4×8
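
The weighted sum for a single token can be worked through by hand. In the sketch below, the weights for "sat" are the example weights from the text, while the Value vectors are hypothetical 2-dimensional placeholders chosen so the arithmetic is easy to follow (the lesson's Values have 8 dimensions and are not listed).

```python
tokens = ["The", "cat", "sat", "down"]
sat_weights = [0.15, 0.40, 0.20, 0.25]   # example weights from the text

# Hypothetical Value vectors, one per token (2 dims instead of the lesson's 8).
V = [
    [1.0, 0.0],    # The
    [0.0, 2.0],    # cat
    [4.0, 4.0],    # sat
    [2.0, -2.0],   # down
]

def weighted_sum(weights, values):
    """output[d] = sum over tokens j of weights[j] * values[j][d]."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

sat_output = weighted_sum(sat_weights, V)
print([round(x, 2) for x in sat_output])   # [1.45, 1.1]
```

The first component is 0.15×1 + 0.40×0 + 0.20×4 + 0.25×2 = 1.45, and the second is 0.15×0 + 0.40×2 + 0.20×4 + 0.25×(-2) = 1.10.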

Step 6: The Complete Attention Formula

We can now write the complete scaled dot-product attention in one compact formula:

**Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V**

Let's trace through the full pipeline one more time:

1. **Project inputs:** Q = XW_Q, K = XW_K, V = XW_V
2. **Compute scores:** Q × K^T (dot product between all query-key pairs)
3. **Scale:** Divide by √d_k to prevent softmax saturation
4. **Normalize:** Apply softmax to get weights that sum to 1
5. **Combine:** Multiply weights by V to get context-enriched output

The entire operation is differentiable, meaning gradients flow from the output back through the attention weights to the Q, K, V projections and ultimately to the input embeddings. This is how the network learns to attend to the right things.

In matrix form, once Q, K, and V have been projected, the whole computation is just **two matrix multiplications, a scaling, and a softmax** (the projections add three more matrix multiplications), which maps efficiently onto GPU hardware. This simplicity and parallelizability is a major reason transformers scaled so well.

Attention(Q, K, V) = softmax(Q K^T / √d_k) V

Where:
  Q = X W_Q    (queries)
  K = X W_K    (keys)
  V = X W_V    (values)
  d_k = dimension of keys
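
Putting the five steps together, a minimal pure-Python sketch of the formula might look like the following. It is an illustration under stated assumptions rather than a reference implementation: the input X and the weight matrices W_Q, W_K, W_V are random placeholders (trained values are not given in the lesson), the model dimension of 8 is assumed to match d_k, and the helper names are invented for this example. Production code would use a tensor library instead of nested lists.

```python
import math
import random

def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p); both are lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    m = max(row)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, returning the output and the weights."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))                                # step 2
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]  # step 3
    weights = [softmax(row) for row in scaled]                      # step 4
    return matmul(weights, V), weights                              # step 5

rng = random.Random(0)

def random_matrix(rows, cols):
    """Placeholder values standing in for embeddings or trained weights."""
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

n_tokens, d_model, d_k = 4, 8, 8      # d_model = 8 is an assumption
X = random_matrix(n_tokens, d_model)
W_Q, W_K, W_V = (random_matrix(d_model, d_k) for _ in range(3))

Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)            # step 1
output, weights = attention(Q, K, V)
print(len(output), "x", len(output[0]))   # 4 x 8
```

Because every row of weights sums to 1, each output row is a weighted average of the rows of V, so every output component lies between the smallest and largest value in the corresponding column of V.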

Step 7: Test Your Understanding

You've traced the complete attention formula from dot products to context-enriched outputs. Let's test your understanding!

Prerequisites

  • Query, Key, Value projections
  • Matrix multiplication
  • Basic probability (distributions sum to 1)

Key Concepts

  • Dot Product Scoring
  • Scaled Attention
  • Softmax Normalization
  • Attention Heatmap
  • Weighted Value Combination
  • Quadratic Complexity