Feed-Forward Networks

Adding Non-Linearity to Attention

Difficulty: Intermediate
Duration: 8-10 min
Prerequisites: Multi-Head Attention
Step: 1 of 7

Why Attention Isn't Enough

Self-attention is powerful, but it has a critical limitation: once the attention weights are computed, it is a purely linear operation on the Value vectors.

Attention computes a weighted sum of Value vectors:

output_i = sum_j(weight_ij * V_j)

No matter how sophisticated the attention weights are, the output is always a linear combination of the inputs. This means attention alone cannot learn non-linear functions -- it can only mix information, not transform it.
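A minimal NumPy sketch makes this concrete (the shapes, weights, and values here are purely illustrative): because the output is just a weighted sum of the Value vectors, scaling the Values scales every output by exactly the same factor.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8

V = rng.standard_normal((seq_len, d_model))      # one Value vector per token
weights = rng.random((seq_len, seq_len))         # attention weights
weights /= weights.sum(axis=1, keepdims=True)    # each row sums to 1 (softmax-like)

output = weights @ V                             # output_i = sum_j(weight_ij * V_j)

# Linearity with respect to the Values: scaling V by 3 scales every output by 3.
assert np.allclose(weights @ (3 * V), 3 * output)
```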

Consider what this means in practice:

  • Attention can say "blend 40% of 'cat' with 30% of 'sat' and 30% of 'down'"
  • But it cannot compute "if 'cat' is a noun AND 'sat' is past tense, THEN mark this as a completed action"

That kind of conditional, non-linear reasoning requires a feed-forward network (FFN) after attention. Together, attention + FFN form the core of a Transformer block:

  • Attention gathers relevant context from all tokens
  • FFN transforms each token's representation using that context
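As a rough sketch of how these two steps compose, assuming the standard two-layer form FFN(x) = ReLU(x * W_1 + b_1) * W_2 + b_2 (the dimensions and the random stand-in for the attention output below are illustrative, not from any real model):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise FFN: the same two-layer transformation is applied to every token.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2   # ReLU sits between the two linear layers

# Illustrative sizes; real models use larger widths, e.g. d_model=512, d_ff=2048.
seq_len, d_model, d_ff = 4, 8, 32
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)

attn_output = rng.standard_normal((seq_len, d_model))   # stand-in for the attention step's output
block_output = feed_forward(attn_output, W1, b1, W2, b2)
print(block_output.shape)   # (4, 8): same shape in and out, each row transformed on its own
```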

Attention vs Feed-Forward Network

Property        | Self-Attention                                  | Feed-Forward Network
Operation type  | Linear (weighted sum)                           | Non-linear (ReLU activation)
Scope           | Cross-token (mixes information between tokens)  | Per-token (processes each token independently)
Purpose         | Gather context: "what is relevant?"             | Transform: "what to do with the context?"
Parameters      | W_Q, W_K, W_V, W_O                              | W_1, b_1, W_2, b_2
Analogy         | Reading relevant paragraphs                     | Reasoning about what you read
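The "Scope" row can be checked directly: change one token's vector and see which outputs move. A small sketch under the same illustrative assumptions as above (uniform attention weights and random FFN weights, chosen only for demonstration):

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, d_ff = 4, 8, 32

attn_weights = np.full((seq_len, seq_len), 1.0 / seq_len)   # uniform attention, for simplicity
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = rng.standard_normal((seq_len, d_model))
x_changed = x.copy()
x_changed[2] += 1.0   # perturb only token 2

# Cross-token: changing one token moves the attention output at every position.
print(np.isclose(attn_weights @ x, attn_weights @ x_changed).all(axis=1))   # [False False False False]
# Per-token: the FFN output changes only at position 2.
print(np.isclose(ffn(x), ffn(x_changed)).all(axis=1))                       # [ True  True False  True]
```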

Why Linear Attention Needs Non-Linear FFN

Limitation                   | Example                                                      | Why FFN Fixes It
No non-linearity             | Cannot learn XOR-like patterns                               | ReLU activation enables non-linear decision boundaries
No per-token transformation  | Cannot independently process each position                   | FFN applies the same transformation to each token separately
Limited expressiveness       | Linear combinations cannot approximate arbitrary functions   | A two-layer FFN with enough hidden units is a universal function approximator
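The XOR row is worth seeing concretely. Below is a tiny two-layer ReLU network with hand-picked weights (chosen by hand for illustration, not learned) that computes XOR, something no weighted sum of the inputs can produce on its own:

```python
import numpy as np

# Hand-picked weights for a two-layer ReLU FFN that computes XOR.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])      # 2 inputs -> 2 hidden units
b1 = np.array([0.0, -1.0])
W2 = np.array([[1.0],
               [-2.0]])          # 2 hidden units -> 1 output
b2 = np.array([0.0])

def ffn(x):
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2   # ReLU provides the non-linearity

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print(ffn(inputs).ravel())   # [0. 1. 1. 0.] -- XOR, unreachable by any purely linear combination
```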