Feed-Forward Networks

Adding Non-Linearity to Attention

Difficulty: Intermediate
Duration: 8-10 min
Prerequisites: Multi-Head Attention
Step: 1 of 7

Why Attention Isn't Enough

Self-attention is powerful, but it has a critical limitation: once the attention weights are computed, it is a purely linear operation on the Value vectors.

Attention computes a weighted sum of Value vectors:

output_i = sum_j(weight_ij * V_j)

No matter how sophisticated the attention weights are, the output is always a linear combination of the inputs. This means attention alone cannot learn non-linear functions -- it can only mix information, not transform it.
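A minimal NumPy sketch makes this concrete (the shapes, weights, and values here are purely illustrative): because the output is just a weighted sum of the Value vectors, scaling the Values scales every output by exactly the same factor.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8

V = rng.standard_normal((seq_len, d_model))      # one Value vector per token
weights = rng.random((seq_len, seq_len))         # attention weights
weights /= weights.sum(axis=1, keepdims=True)    # each row sums to 1 (softmax-like)

output = weights @ V                             # output_i = sum_j(weight_ij * V_j)

# Linearity with respect to the Values: scaling V by 3 scales every output by 3.
assert np.allclose(weights @ (3 * V), 3 * output)
```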

Consider what this means in practice:

  • Attention can say "blend 40% of 'cat' with 30% of 'sat' and 30% of 'down'"
  • But it cannot compute "if 'cat' is a noun AND 'sat' is past tense, THEN mark this as a completed action"

That kind of conditional, non-linear reasoning requires a feed-forward network (FFN) after attention. Together, attention + FFN form the core of a Transformer block:

  • Attention gathers relevant context from all tokens
  • FFN transforms each token's representation using that context
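As a rough sketch of how these two steps compose, assuming the standard two-layer form FFN(x) = ReLU(x * W_1 + b_1) * W_2 + b_2 (the dimensions and the random stand-in for the attention output below are illustrative, not from any real model):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise FFN: the same two-layer transformation is applied to every token.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2   # ReLU sits between the two linear layers

# Illustrative sizes; real models use larger widths, e.g. d_model=512, d_ff=2048.
seq_len, d_model, d_ff = 4, 8, 32
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)

attn_output = rng.standard_normal((seq_len, d_model))   # stand-in for the attention step's output
block_output = feed_forward(attn_output, W1, b1, W2, b2)
print(block_output.shape)   # (4, 8): same shape in and out, each row transformed on its own
```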

Attention vs Feed-Forward Network

Property        | Self-Attention                                  | Feed-Forward Network
Operation type  | Linear (weighted sum)                           | Non-linear (ReLU activation)
Scope           | Cross-token (mixes information between tokens)  | Per-token (processes each token independently)
Purpose         | Gather context: "what is relevant?"             | Transform: "what to do with the context?"
Parameters      | W_Q, W_K, W_V, W_O                              | W_1, b_1, W_2, b_2
Analogy         | Reading relevant paragraphs                     | Reasoning about what you read
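The "Scope" row can be checked directly: change one token's vector and see which outputs move. A small sketch under the same illustrative assumptions as above (uniform attention weights and random FFN weights, chosen only for demonstration):

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, d_ff = 4, 8, 32

attn_weights = np.full((seq_len, seq_len), 1.0 / seq_len)   # uniform attention, for simplicity
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = rng.standard_normal((seq_len, d_model))
x_changed = x.copy()
x_changed[2] += 1.0   # perturb only token 2

# Cross-token: changing one token moves the attention output at every position.
print(np.isclose(attn_weights @ x, attn_weights @ x_changed).all(axis=1))   # [False False False False]
# Per-token: the FFN output changes only at position 2.
print(np.isclose(ffn(x), ffn(x_changed)).all(axis=1))                       # [ True  True False  True]
```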

Why Linear Attention Needs Non-Linear FFN

Limitation                   | Example                                                      | Why FFN Fixes It
No non-linearity             | Cannot learn XOR-like patterns                               | ReLU activation enables non-linear decision boundaries
No per-token transformation  | Cannot independently process each position                   | FFN applies the same transformation to each token separately
Limited expressiveness       | Linear combinations cannot approximate arbitrary functions   | A two-layer FFN with enough hidden units is a universal function approximator
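The XOR row is worth seeing concretely. Below is a tiny two-layer ReLU network with hand-picked weights (chosen by hand for illustration, not learned) that computes XOR, something no weighted sum of the inputs can produce on its own:

```python
import numpy as np

# Hand-picked weights for a two-layer ReLU FFN that computes XOR.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])      # 2 inputs -> 2 hidden units
b1 = np.array([0.0, -1.0])
W2 = np.array([[1.0],
               [-2.0]])          # 2 hidden units -> 1 output
b2 = np.array([0.0])

def ffn(x):
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2   # ReLU provides the non-linearity

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print(ffn(inputs).ravel())   # [0. 1. 1. 0.] -- XOR, unreachable by any purely linear combination
```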