The Encoder: BERT
Bidirectional Understanding
The Encoder: Bidirectional Attention
The original Transformer (Vaswani et al., 2017) had both an encoder and a decoder. The encoder is the half that reads and understands the input text.
The key property of the encoder is bidirectional attention: every token can attend to every other token, both left and right. When processing "The cat sat down":
- •"cat" attends to both "The" (left) and "sat" (right)
- •"sat" attends to both "cat" (left) and "down" (right)
This is different from a decoder, which can only attend to tokens to the left (more on that in the next lesson).
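To make the contrast concrete, here is a minimal sketch in plain PyTorch (toy dimensions and random values; nothing here comes from a real trained model) of the two attention patterns over our four-token sentence:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d = 4, 8                       # 4 tokens: "The cat sat down"
q = torch.randn(seq_len, d)             # toy query vectors
k = torch.randn(seq_len, d)             # toy key vectors

scores = q @ k.T / d ** 0.5             # raw attention scores, shape (4, 4)

# Encoder: no mask. Every token attends to every other token.
encoder_weights = F.softmax(scores, dim=-1)

# Decoder: causal mask. Block the upper triangle (future positions).
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
decoder_weights = F.softmax(scores.masked_fill(causal_mask, float("-inf")), dim=-1)

print(encoder_weights)   # all 16 entries nonzero: "cat" sees "sat" and "down"
print(decoder_weights)   # zeros above the diagonal: "cat" sees only "The" and itself
```

The only difference between the two is the mask applied before the softmax; the score computation itself is identical.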
BERT (Bidirectional Encoder Representations from Transformers, 2018) took the encoder half and showed that bidirectional pre-training produces powerful text representations that can be fine-tuned for dozens of NLP tasks.
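For a quick taste of what that pre-training buys you, the sketch below uses the Hugging Face transformers library (assuming it is installed, along with a backend such as PyTorch) to have bert-base-uncased fill in a masked word; the model draws on context from both sides of the blank:

```python
from transformers import pipeline

# fill-mask is BERT's pre-training task, served as an inference pipeline.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the blank from BOTH "The cat" (left) and "down." (right).
for prediction in unmasker("The cat [MASK] down."):
    print(f'{prediction["token_str"]!r}: {prediction["score"]:.3f}')
```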
The attention heatmap below shows the full bidirectional attention pattern for our example sentence -- notice that every token can attend to every other token. There is no masking; all connections are allowed.
Bidirectional Attention: Every Token Sees Every Token
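To reproduce a heatmap like this yourself, one approach (a sketch; the choice of layer 0, head 0 is arbitrary, and any layer/head pair shows the same full visibility) is to ask a pre-trained BERT to return its attention weights:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The cat sat down", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, shape (batch, heads, seq_len, seq_len)
attn = outputs.attentions[0][0, 0]      # layer 0, head 0
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
print(attn)                             # no zero entries: full bidirectional visibility
```

Every row of `attn` sums to 1 and no entry is masked out, which is exactly the all-connections pattern in the heatmap.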
Encoder vs Decoder Architecture
| Property | Encoder (BERT) | Decoder (GPT) |
|---|---|---|
| Attention direction | Bidirectional (left + right) | Unidirectional (left only) |
| Each token sees | All tokens in the sequence | Only previous tokens |
| Best for | Understanding text (classification, NER, QA) | Generating text (completion, chat) |
| Masking | No causal mask (full visibility) | Causal mask (upper triangle blocked) |
| Pre-training task | Masked Language Modeling (fill in blanks) | Next token prediction |
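The last row of the table deserves a concrete illustration. Here is a toy sketch of how each architecture turns the same sentence into training examples (the variable names are purely illustrative):

```python
tokens = ["The", "cat", "sat", "down"]

# Encoder / BERT (Masked Language Modeling): hide a token, predict it
# using context from BOTH sides of the blank.
mlm_input = ["The", "cat", "[MASK]", "down"]
mlm_target = "sat"

# Decoder / GPT (next-token prediction): predict each token from the
# tokens to its LEFT only.
lm_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

print(mlm_input, "->", mlm_target)
for context, target in lm_pairs:
    print(context, "->", target)     # (['The'], 'cat'), (['The', 'cat'], 'sat'), ...
```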