Feed-Forward Networks
Adding Non-Linearity to Attention
Difficulty: Intermediate
Duration: 8-10 min
Prerequisites: Multi-Head Attention
Step 1/7
Why Attention Isn't Enough
Self-attention is powerful, but it has a critical limitation: once the attention weights are computed, it is a purely linear operation on the Value vectors.
Attention computes a weighted sum of Value vectors:
output_i = sum_j(weight_ij * V_j)
No matter how sophisticated the attention weights are, the output is always a linear combination of the Value vectors. This means attention alone cannot learn non-linear functions -- it can only mix information, not transform it.
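To see the linearity concretely, here is a minimal NumPy sketch (the sizes and names are illustrative, not from any particular model). With the attention weights held fixed, scaling or adding Value matrices scales or adds the outputs in exactly the same way:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                                    # 4 tokens, 8-dim Value vectors
weights = rng.random((n, n))
weights /= weights.sum(axis=1, keepdims=True)  # rows sum to 1, like a softmax output

def attend(V):
    # output_i = sum_j(weight_ij * V_j) is just a matrix product
    return weights @ V

V1 = rng.standard_normal((n, d))
V2 = rng.standard_normal((n, d))
a, b = 2.0, -0.5

# Linearity check: attend(a*V1 + b*V2) == a*attend(V1) + b*attend(V2)
print(np.allclose(attend(a * V1 + b * V2),
                  a * attend(V1) + b * attend(V2)))  # True
```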
Consider what this means in practice:
- Attention can say "blend 40% of 'cat' with 30% of 'sat' and 30% of 'down'"
- But it cannot compute "if 'cat' is a noun AND 'sat' is past tense, THEN mark this as a completed action"
That kind of conditional, non-linear reasoning requires a feed-forward network (FFN) after attention. Together, attention + FFN form a complete Transformer block:
- Attention gathers relevant context from all tokens
- FFN transforms each token's representation using that context (a minimal sketch follows this list)
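Concretely, the position-wise FFN from the original Transformer computes ReLU(x W_1 + b_1) W_2 + b_2, using the parameters W_1, b_1, W_2, b_2 listed in the table below. Here is a minimal NumPy sketch; the dimensions and random initialization are illustrative assumptions, not prescribed values:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def ffn(x, W1, b1, W2, b2):
    # Position-wise FFN: ReLU(x @ W1 + b1) @ W2 + b2.
    # x has shape (seq_len, d_model); the same weights are applied to
    # every row (token) independently, so no information moves
    # between positions.
    return relu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 4      # d_ff is typically about 4x d_model
W1 = 0.1 * rng.standard_normal((d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = 0.1 * rng.standard_normal((d_ff, d_model))
b2 = np.zeros(d_model)

x = rng.standard_normal((seq_len, d_model))
print(ffn(x, W1, b1, W2, b2).shape)    # (4, 8): same shape, transformed per token
```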
Attention vs Feed-Forward Network
| Property | Self-Attention | Feed-Forward Network |
|---|---|---|
| Operation type | Linear (weighted sum) | Non-linear (ReLU activation) |
| Scope | Cross-token (mixes information between tokens) | Per-token (processes each token independently) |
| Purpose | Gather context: "what is relevant?" | Transform: "what to do with the context?" |
| Parameters | W_Q, W_K, W_V, W_O | W_1, b_1, W_2, b_2 |
| Analogy | Reading relevant paragraphs | Reasoning about what you read |
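The difference in scope is easy to check numerically. In this small sketch (weights and shapes are again illustrative), perturbing a single token changes every row of the attention output, but only that token's row of the FFN output:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8
weights = rng.random((n, n))
weights /= weights.sum(axis=1, keepdims=True)
W1 = 0.1 * rng.standard_normal((d, 4 * d))
W2 = 0.1 * rng.standard_normal((4 * d, d))

def attend(x):
    return weights @ x                       # cross-token: mixes rows

def ffn(x):
    return np.maximum(0.0, x @ W1) @ W2      # per-token: rows stay independent

x = rng.standard_normal((n, d))
x2 = x.copy()
x2[0] += 1.0                                 # perturb only token 0

# Attention spreads the perturbation to every position...
print(np.abs(attend(x2) - attend(x)).max(axis=1))   # all rows change
# ...while the FFN changes only the perturbed token's row.
print(np.abs(ffn(x2) - ffn(x)).max(axis=1))         # only row 0 is nonzero
```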
Why Linear Attention Needs Non-Linear FFN
| Limitation | Example | Why FFN Fixes It |
|---|---|---|
| No non-linearity | Cannot learn XOR-like patterns | ReLU activation enables non-linear decision boundaries |
| No per-token transformation | Cannot transform each position's features independently | FFN applies the same transformation to every token separately |
| Limited expressiveness | Linear combinations cannot approximate arbitrary functions | A sufficiently wide two-layer FFN is a universal function approximator |
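To make the first row of this table concrete, here is a hand-built two-layer ReLU network that computes XOR; the weights are chosen by hand purely for illustration. No single linear layer can represent XOR, but two ReLU hidden units suffice:

```python
import numpy as np

# Hand-constructed two-layer ReLU network that computes XOR.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])        # 2 inputs -> 2 hidden units
b1 = np.array([0.0, -1.0])
W2 = np.array([[1.0], [-2.0]])     # 2 hidden units -> 1 output
b2 = np.array([0.0])

def ffn(x):
    h = np.maximum(0.0, x @ W1 + b1)  # ReLU(x W1 + b1)
    return h @ W2 + b2                # linear readout

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print(ffn(X).ravel())  # [0. 1. 1. 0.], i.e. XOR
```

With more hidden units and learned (rather than hand-picked) weights, the same two-layer shape can approximate far more complex functions, which is the universal-approximation point in the last row.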