The Decoder: GPT
Autoregressive Text Generation
The Decoder: Causal (Left-to-Right) Attention
While the encoder reads and understands text bidirectionally, the decoder is designed for generation -- producing text one token at a time, left to right.
The key constraint is causal attention (also called masked self-attention): each token can only attend to tokens at or before its position. Token 2 ("sat") can see tokens 0 ("The"), 1 ("cat"), and 2 ("sat"), but NOT token 3 ("down").
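To make the pattern concrete, here is a minimal sketch (assuming PyTorch, which the lesson itself does not use) that builds the lower-triangular causal mask for the four-token example and prints which tokens each position may attend to:

```python
import torch

tokens = ["The", "cat", "sat", "down"]
seq_len = len(tokens)

# Lower-triangular matrix: entry [i, j] is True if token i may attend to token j.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

for i, tok in enumerate(tokens):
    visible = [tokens[j] for j in range(seq_len) if causal_mask[i, j]]
    print(f"token {i} ({tok!r}) attends to: {visible}")

# token 2 ('sat') attends to: ['The', 'cat', 'sat'] -- never 'down'
```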
Why this restriction? Because during generation, future tokens don't exist yet: when the model is producing token 2 ("sat"), only tokens 0 and 1 are available to it. Allowing it to "peek" at future tokens during training would be cheating -- the model must learn to predict each token based only on what came before.
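Inside the attention computation, this constraint is typically enforced by setting the future (masked) positions of the score matrix to negative infinity before the softmax, so their attention weights come out as exactly zero. A rough single-head sketch, again assuming PyTorch:

```python
import torch
import torch.nn.functional as F

def causal_self_attention(q, k, v):
    """Scaled dot-product attention with a causal mask (illustrative sketch).

    q, k, v: tensors of shape (seq_len, d_k) for a single head and sequence.
    """
    seq_len, d_k = q.shape
    scores = q @ k.T / d_k ** 0.5                      # (seq_len, seq_len)
    mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
    scores = scores.masked_fill(~mask, float("-inf"))  # hide future positions
    weights = F.softmax(scores, dim=-1)                # future weights become 0
    return weights @ v

# Toy inputs for a 4-token sequence with d_k = 8.
q = k = v = torch.randn(4, 8)
out = causal_self_attention(q, k, v)
print(out.shape)  # torch.Size([4, 8])
```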
GPT (Generative Pre-trained Transformer) from OpenAI took the decoder half of the original Transformer and showed that scaling it up on massive amounts of text produces remarkably capable language models -- from GPT-1's 117M parameters to GPT-4's estimated 1.8 trillion.
Compare the causal attention pattern below with the bidirectional pattern you saw in the BERT lesson -- notice how the upper triangle is now masked out.
Causal Attention: Each Token Only Sees Previous Tokens
Encoder vs Decoder: Key Differences
| Property | Encoder (BERT) | Decoder (GPT) |
|---|---|---|
| Attention direction | Bidirectional (all tokens) | Causal (left-to-right only) |
| Token i can see | Tokens 0, 1, ..., n | Tokens 0, 1, ..., i |
| Pre-training task | Masked Language Modeling | Next Token Prediction |
| Primary use | Understanding (classification, NER) | Generation (text, code, chat) |
| Key models | BERT, RoBERTa, DeBERTa | GPT-1/2/3/4, LLaMA, Claude |
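One way to feel the difference is through the two pre-training tasks. The sketch below uses the Hugging Face transformers pipelines with bert-base-uncased and gpt2 as illustrative model choices (any MLM encoder and causal decoder would behave similarly): BERT fills in a masked token using context from both sides, while GPT-2 continues a prompt one token at a time, left to right.

```python
from transformers import pipeline

# Encoder (BERT): masked language modeling -- fill in a blanked-out token.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("The cat sat [MASK] the mat.")[0]["token_str"])

# Decoder (GPT-2): next-token prediction -- continue the prompt left to right.
generate = pipeline("text-generation", model="gpt2")
print(generate("The cat sat", max_new_tokens=5)[0]["generated_text"])
```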