The Encoder: BERT
Bidirectional Understanding
The Encoder: Bidirectional Attention
The original Transformer (Vaswani et al., 2017) had both an encoder and a decoder. The encoder is the half that reads and understands the input text.
The key property of the encoder is bidirectional attention: every token can attend to every other token, both left and right. When processing "The cat sat down":
- •"cat" attends to both "The" (left) and "sat" (right)
- •"sat" attends to both "cat" (left) and "down" (right)
This is different from a decoder, which can only attend to tokens to the left (more on that in the next lesson).
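To make the contrast concrete, here is a minimal sketch in plain PyTorch (toy dimensions and random values; nothing here comes from a real trained model) of the two attention patterns over our four-token sentence:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d = 4, 8                       # 4 tokens: "The cat sat down"
q = torch.randn(seq_len, d)             # toy query vectors
k = torch.randn(seq_len, d)             # toy key vectors

scores = q @ k.T / d ** 0.5             # raw attention scores, shape (4, 4)

# Encoder: no mask. Every token attends to every other token.
encoder_weights = F.softmax(scores, dim=-1)

# Decoder: causal mask. Block the upper triangle (future positions).
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
decoder_weights = F.softmax(scores.masked_fill(causal_mask, float("-inf")), dim=-1)

print(encoder_weights)   # all 16 entries nonzero: "cat" sees "sat" and "down"
print(decoder_weights)   # zeros above the diagonal: "cat" sees only "The" and itself
```

The only difference between the two is the mask applied before the softmax; the score computation itself is identical.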
BERT (Bidirectional Encoder Representations from Transformers, 2018) took the encoder half and showed that bidirectional pre-training produces powerful text representations that can be fine-tuned for dozens of NLP tasks.
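For a quick taste of what that pre-training buys you, the sketch below uses the Hugging Face transformers library (assuming it is installed, along with a backend such as PyTorch) to have bert-base-uncased fill in a masked word; the model draws on context from both sides of the blank:

```python
from transformers import pipeline

# fill-mask is BERT's pre-training task, served as an inference pipeline.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the blank from BOTH "The cat" (left) and "down." (right).
for prediction in unmasker("The cat [MASK] down."):
    print(f'{prediction["token_str"]!r}: {prediction["score"]:.3f}')
```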
The attention heatmap below shows the full bidirectional attention pattern for our example sentence -- notice that every token can attend to every other token. There is no masking; all connections are allowed.
Bidirectional Attention: Every Token Sees Every Token
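To reproduce a heatmap like this yourself, one approach (a sketch; the choice of layer 0, head 0 is arbitrary, and any layer/head pair shows the same full visibility) is to ask a pre-trained BERT to return its attention weights:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The cat sat down", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, shape (batch, heads, seq_len, seq_len)
attn = outputs.attentions[0][0, 0]      # layer 0, head 0
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
print(attn)                             # no zero entries: full bidirectional visibility
```

Every row of `attn` sums to 1 and no entry is masked out, which is exactly the all-connections pattern in the heatmap.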
Encoder vs Decoder Architecture
| Property | Encoder (BERT) | Decoder (GPT) |
|---|---|---|
| Attention direction | Bidirectional (left + right) | Unidirectional (left only) |
| Each token sees | All tokens in the sequence | Only previous tokens |
| Best for | Understanding text (classification, NER, QA) | Generating text (completion, chat) |
| Masking | No causal mask (full visibility) | Causal mask (upper triangle blocked) |
| Pre-training task | Masked Language Modeling (fill in blanks) | Next token prediction |
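The last row of the table deserves a concrete illustration. Here is a toy sketch of how each architecture turns the same sentence into training examples (the variable names are purely illustrative):

```python
tokens = ["The", "cat", "sat", "down"]

# Encoder / BERT (Masked Language Modeling): hide a token, predict it
# using context from BOTH sides of the blank.
mlm_input = ["The", "cat", "[MASK]", "down"]
mlm_target = "sat"

# Decoder / GPT (next-token prediction): predict each token from the
# tokens to its LEFT only.
lm_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

print(mlm_input, "->", mlm_target)
for context, target in lm_pairs:
    print(context, "->", target)     # (['The'], 'cat'), (['The', 'cat'], 'sat'), ...
```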