The Decoder: GPT

Autoregressive Text Generation

Difficulty: Intermediate
Duration: 12-15 min
Prerequisites: Transformer Block
Step: 1/7

The Decoder: Causal (Left-to-Right) Attention

While the encoder reads and understands text bidirectionally, the decoder is designed for generation -- producing text one token at a time, left to right.
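To make "one token at a time" concrete, here is a minimal greedy-decoding sketch in PyTorch. The model below is a random stand-in (any causal language model that returns next-token logits would slot in), and names like `generate_greedy` are illustrative rather than from a particular library.

```python
import torch

def generate_greedy(model, input_ids, max_new_tokens=10):
    """Append one token at a time; the model only ever sees tokens already produced."""
    for _ in range(max_new_tokens):
        logits = model(input_ids)                                 # (batch, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # most likely next token
        input_ids = torch.cat([input_ids, next_id], dim=-1)       # grow the sequence left to right
    return input_ids

# Stand-in "model": random logits, just to show the loop's shape contract.
vocab_size = 50
toy_model = lambda ids: torch.randn(ids.shape[0], ids.shape[1], vocab_size)

prompt = torch.tensor([[3, 17, 8]])                # three already-known token ids
print(generate_greedy(toy_model, prompt).shape)    # torch.Size([1, 13]) -- prompt plus 10 new tokens
```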

The key constraint is causal attention (also called masked self-attention): each token can only attend to tokens at or before its position. Token 2 ("sat") can see tokens 0 ("The"), 1 ("cat"), and 2 ("sat"), but NOT token 3 ("down").

Why this restriction? Because during generation, future tokens don't exist yet. When the model generates the third word, it has only produced words 1 and 2. Allowing it to "peek" at future tokens during training would be cheating -- the model must learn to predict each token based only on what came before.
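In code, the mask is usually applied by setting the scores for future positions to negative infinity before the softmax, so their attention weights come out exactly zero. A minimal PyTorch sketch (the scores here are random and purely illustrative):

```python
import torch
import torch.nn.functional as F

seq_len = 4
scores = torch.randn(seq_len, seq_len)                    # raw query-key attention scores

# Block positions above the diagonal: those are "future" tokens.
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(causal_mask, float("-inf"))   # -inf becomes exactly 0 after softmax

weights = F.softmax(scores, dim=-1)                       # each row still sums to 1
print(weights)                                            # lower-triangular, like the matrix below
```

Because the surviving entries in each row are renormalized by the softmax, every row still sums to 1, which is why the first row of the matrix below puts all of its attention (1.00) on the first token.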

GPT (Generative Pre-trained Transformer) by OpenAI took the decoder half of the Transformer and showed that scaling it up with massive data produces remarkably capable language models -- from GPT-1's 117M parameters to GPT-4's estimated 1.8 trillion.

Compare the causal attention pattern below with the bidirectional pattern you saw in the BERT lesson -- notice how the upper triangle is now masked out.

Causal Attention: Each Token Sees Only Itself and Previous Tokens

| Query \ Key | The  | cat  | sat  | down |
|-------------|------|------|------|------|
| The         | 1.00 | 0.00 | 0.00 | 0.00 |
| cat         | 0.40 | 0.60 | 0.00 | 0.00 |
| sat         | 0.50 | 0.30 | 0.20 | 0.00 |
| down        | 0.18 | 0.37 | 0.21 | 0.24 |

Each cell shows how much attention the query token (row) pays to the key token (column). Higher values indicate stronger attention; the zeroed upper triangle is the causal mask.

Encoder vs Decoder: Key Differences

| Property            | Encoder (BERT)                      | Decoder (GPT)                |
|---------------------|-------------------------------------|------------------------------|
| Attention direction | Bidirectional (all tokens)          | Causal (left-to-right only)  |
| Token i can see     | Tokens 0, 1, ..., n                 | Tokens 0, 1, ..., i          |
| Pre-training task   | Masked Language Modeling            | Next Token Prediction        |
| Primary use         | Understanding (classification, NER) | Generation (text, code, chat)|
| Key models          | BERT, RoBERTa, DeBERTa              | GPT-1/2/3/4, LLaMA, Claude   |
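The "Next Token Prediction" row in the table amounts to a shifted cross-entropy loss: the model's prediction at position i is scored against the actual token at position i+1. A minimal sketch of that objective (PyTorch assumed; the tensors here are random placeholders, not real training data):

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 100, 6
token_ids = torch.randint(0, vocab_size, (1, seq_len))   # a training sequence of token ids
logits = torch.randn(1, seq_len, vocab_size)             # what a decoder would output

# Shift by one: the prediction at position i is compared with the token at position i+1.
pred_logits = logits[:, :-1, :]                          # positions 0 .. n-2
targets = token_ids[:, 1:]                               # positions 1 .. n-1

loss = F.cross_entropy(pred_logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())                                       # the quantity GPT-style pre-training minimizes
```

The causal mask is what makes this objective honest: because no position can attend to tokens after it, the model cannot simply copy the answer it is being asked to predict.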