The Decoder: GPT

Autoregressive Text Generation

Difficulty: Intermediate
Duration: 12-15 min
Prerequisites: Transformer Block
Step: 1/7

The Decoder: Causal (Left-to-Right) Attention

While the encoder reads and understands text bidirectionally, the decoder is designed for generation -- producing text one token at a time, left to right.
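To make "one token at a time" concrete, here is a minimal greedy-decoding sketch in PyTorch. The model below is a random stand-in (any causal language model that returns next-token logits would slot in), and names like `generate_greedy` are illustrative rather than from a particular library.

```python
import torch

def generate_greedy(model, input_ids, max_new_tokens=10):
    """Append one token at a time; the model only ever sees tokens already produced."""
    for _ in range(max_new_tokens):
        logits = model(input_ids)                                 # (batch, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # most likely next token
        input_ids = torch.cat([input_ids, next_id], dim=-1)       # grow the sequence left to right
    return input_ids

# Stand-in "model": random logits, just to show the loop's shape contract.
vocab_size = 50
toy_model = lambda ids: torch.randn(ids.shape[0], ids.shape[1], vocab_size)

prompt = torch.tensor([[3, 17, 8]])                # three already-known token ids
print(generate_greedy(toy_model, prompt).shape)    # torch.Size([1, 13]) -- prompt plus 10 new tokens
```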

The key constraint is causal attention (also called masked self-attention): each token can only attend to tokens at or before its position. Token 2 ("sat") can see tokens 0 ("The"), 1 ("cat"), and 2 ("sat"), but NOT token 3 ("down").

Why this restriction? Because during generation, future tokens don't exist yet. When the model generates the third word, it has only produced words 1 and 2. Allowing it to "peek" at future tokens during training would be cheating -- the model must learn to predict each token based only on what came before.
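In code, the mask is usually applied by setting the scores for future positions to negative infinity before the softmax, so their attention weights come out exactly zero. A minimal PyTorch sketch (the scores here are random and purely illustrative):

```python
import torch
import torch.nn.functional as F

seq_len = 4
scores = torch.randn(seq_len, seq_len)                    # raw query-key attention scores

# Block positions above the diagonal: those are "future" tokens.
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(causal_mask, float("-inf"))   # -inf becomes exactly 0 after softmax

weights = F.softmax(scores, dim=-1)                       # each row still sums to 1
print(weights)                                            # lower-triangular, like the matrix below
```

Because the surviving entries in each row are renormalized by the softmax, every row still sums to 1, which is why the first row of the matrix below puts all of its attention (1.00) on the first token.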

GPT (Generative Pre-trained Transformer) by OpenAI took the decoder half of the Transformer and showed that scaling it up with massive data produces remarkably capable language models -- from GPT-1's 117M parameters to GPT-4's estimated 1.8 trillion.

Compare the causal attention pattern below with the bidirectional pattern you saw in the BERT lesson -- notice how the upper triangle is now masked out.

Causal Attention: Each Token Sees Only Itself and Previous Tokens

| Query \ Key | The  | cat  | sat  | down |
|-------------|------|------|------|------|
| The         | 1.00 | 0.00 | 0.00 | 0.00 |
| cat         | 0.40 | 0.60 | 0.00 | 0.00 |
| sat         | 0.50 | 0.30 | 0.20 | 0.00 |
| down        | 0.18 | 0.37 | 0.21 | 0.24 |

Each cell shows how much attention the query token (row) pays to the key token (column). Higher values indicate stronger attention; the zeroed upper triangle is the causal mask.

Encoder vs Decoder: Key Differences

| Property            | Encoder (BERT)                      | Decoder (GPT)                |
|---------------------|-------------------------------------|------------------------------|
| Attention direction | Bidirectional (all tokens)          | Causal (left-to-right only)  |
| Token i can see     | Tokens 0, 1, ..., n                 | Tokens 0, 1, ..., i          |
| Pre-training task   | Masked Language Modeling            | Next Token Prediction        |
| Primary use         | Understanding (classification, NER) | Generation (text, code, chat)|
| Key models          | BERT, RoBERTa, DeBERTa              | GPT-1/2/3/4, LLaMA, Claude   |
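The "Next Token Prediction" row in the table amounts to a shifted cross-entropy loss: the model's prediction at position i is scored against the actual token at position i+1. A minimal sketch of that objective (PyTorch assumed; the tensors here are random placeholders, not real training data):

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 100, 6
token_ids = torch.randint(0, vocab_size, (1, seq_len))   # a training sequence of token ids
logits = torch.randn(1, seq_len, vocab_size)             # what a decoder would output

# Shift by one: the prediction at position i is compared with the token at position i+1.
pred_logits = logits[:, :-1, :]                          # positions 0 .. n-2
targets = token_ids[:, 1:]                               # positions 1 .. n-1

loss = F.cross_entropy(pred_logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())                                       # the quantity GPT-style pre-training minimizes
```

The causal mask is what makes this objective honest: because no position can attend to tokens after it, the model cannot simply copy the answer it is being asked to predict.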