The Encoder: BERT

Bidirectional Understanding

Difficulty: Intermediate · Duration: 12-15 min · Prerequisites: Transformer Block

The Encoder: Bidirectional Attention

The original Transformer (2017) had both an encoder and a decoder. The encoder is the half that reads and understands input text.

The key property of the encoder is bidirectional attention: every token can attend to every other token, both left and right. When processing "The cat sat down":

  • "cat" attends to both "The" (left) and "sat" (right)
  • "sat" attends to both "cat" (left) and "down" (right)

This is different from a decoder, which can only attend to tokens to the left (more on that in the next lesson).
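
To make this concrete, here is a minimal numpy sketch of one bidirectional attention head. Everything in it is illustrative: the embeddings and projection weights are random toy values, and the value projection is omitted to keep the focus on the attention weights themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                        # "The cat sat down" -> 4 tokens

X = rng.standard_normal((seq_len, d_model))    # toy token embeddings
W_q = rng.standard_normal((d_model, d_model))  # toy query projection
W_k = rng.standard_normal((d_model, d_model))  # toy key projection

Q, K = X @ W_q, X @ W_k
scores = Q @ K.T / np.sqrt(d_model)            # (4, 4): every token scored against every token

# No mask is applied: this is exactly what makes the encoder bidirectional.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax; each row sums to 1

print(weights.round(2))                        # a full 4x4 attention matrix, like the table below
```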

BERT (Bidirectional Encoder Representations from Transformers, 2018) took the encoder half and showed that bidirectional pre-training produces powerful text representations that can be fine-tuned for dozens of NLP tasks.
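
You can see BERT's masked-word objective in action with a quick sketch. This uses the Hugging Face transformers library, which is an assumption on my part (the lesson doesn't prescribe a toolkit); the exact predictions and scores depend on the model checkpoint.

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT fills in the blank using context from BOTH sides of [MASK]:
# "The cat" on the left and "down." on the right.
for pred in fill_mask("The cat [MASK] down."):
    print(f"{pred['token_str']:>10}  {pred['score']:.3f}")
```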

The table of attention weights below shows the full bidirectional pattern: every token attends to every other token. There is no masking; all connections are allowed.

Bidirectional Attention: Every Token Sees Every Token

Query \ Key |  The  |  cat  |  sat  | down
The         |  0.19 |  0.35 |  0.16 |  0.29
cat         |  0.23 |  0.34 |  0.16 |  0.27
sat         |  0.32 |  0.19 |  0.13 |  0.36
down        |  0.18 |  0.37 |  0.21 |  0.24

Each cell shows how much attention the query token (row) pays to the key token (column). Higher values indicate stronger attention.

Encoder vs Decoder Architecture

Property            | Encoder (BERT)                                | Decoder (GPT)
Attention direction | Bidirectional (left + right)                  | Unidirectional (left only)
Each token sees     | All tokens in the sequence                    | Only previous tokens
Best for            | Understanding text (classification, NER, QA) | Generating text (completion, chat)
Masking             | No attention mask (full visibility)           | Causal mask (upper triangle blocked)
Pre-training task   | Masked Language Modeling (fill in blanks)     | Next token prediction
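
The masking row is the crux of the difference, and it is easy to show in code. Below is a small numpy sketch using toy, uniform scores: the encoder's softmax runs over the full row, while the decoder first blocks the upper triangle with -inf.

```python
import numpy as np

seq_len = 4
scores = np.zeros((seq_len, seq_len))   # toy, uniform attention scores

# Causal mask: True above the diagonal (the "future" positions).
causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)
decoder_scores = np.where(causal_mask, -np.inf, scores)

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

print(softmax(scores))          # encoder: every row spreads over all 4 tokens
print(softmax(decoder_scores))  # decoder: row i only covers tokens 0..i
```

The -inf entries become zeros after the softmax, so each decoder row renormalizes over only the positions to the left, which is precisely the "causal mask (upper triangle blocked)" behavior from the table.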