
Cross-Attention & Encoder-Decoder

Bridging Understanding and Generation

Difficulty: Intermediate
Duration: 12-15 min
Prerequisites: Encoder, Decoder

The Encoder-Decoder Architecture

The original Transformer (Vaswani et al., 2017) used an encoder and a decoder working together. This architecture is designed for sequence-to-sequence tasks: take an input sequence and produce a different output sequence.

Encoder: Reads the input with bidirectional attention. Produces rich representations of the source text. ("Understand the input.")

Decoder: Generates the output autoregressively. Uses causal self-attention within the output sequence. ("Produce the output.")

Cross-attention: The bridge between encoder and decoder. The decoder attends to the encoder's output to access source information. ("What in the input is relevant to what I'm generating now?")
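
To make that asymmetry concrete, here is a minimal single-head cross-attention sketch, assuming PyTorch. The `CrossAttention` class and its names are illustrative, not from the original paper; the essential point is only that queries are projected from decoder states while keys and values are projected from the encoder output.

```python
import math
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention: decoder queries attend to encoder output."""

    def __init__(self, d_model):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)  # decoder states -> queries
        self.w_k = nn.Linear(d_model, d_model)  # encoder output -> keys
        self.w_v = nn.Linear(d_model, d_model)  # encoder output -> values

    def forward(self, decoder_states, encoder_output):
        # decoder_states: (batch, tgt_len, d_model) -- "what am I generating?"
        # encoder_output: (batch, src_len, d_model) -- "what did the input say?"
        q = self.w_q(decoder_states)
        k = self.w_k(encoder_output)
        v = self.w_v(encoder_output)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (batch, tgt_len, src_len)
        weights = scores.softmax(dim=-1)  # each target position's distribution over source positions
        return weights @ v                # (batch, tgt_len, d_model)
```

Note that there is no mask here: every target position is free to look at every source position.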

Each decoder block has three sublayers (a minimal sketch follows the list):

  1. Causal self-attention: Attend to previously generated output tokens
  2. Cross-attention: Attend to the encoder's output (the source sequence)
  3. Feed-forward network: Transform each token independently
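
The sketch below wires these three sublayers together, again assuming PyTorch. The post-norm residual wiring shown is one common choice (the original paper used it; many modern implementations use pre-norm instead), and the class and argument names are illustrative.

```python
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One decoder block: causal self-attention -> cross-attention -> feed-forward,
    each sublayer wrapped in a residual connection plus layer norm."""

    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, encoder_output, causal_mask):
        # 1. Causal self-attention over previously generated target tokens.
        out, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + out)
        # 2. Cross-attention: queries from x, keys/values from the encoder.
        out, _ = self.cross_attn(x, encoder_output, encoder_output)
        x = self.norm2(x + out)
        # 3. Position-wise feed-forward network, applied to each token independently.
        return self.norm3(x + self.ffn(x))
```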

This architecture is ideal for tasks where the input and output are different sequences: translation ("The cat sat" -> "Le chat s'est assis"), summarization, or generative question answering.

Components of the Encoder-Decoder Transformer

| Component | Attention Type | Query From | Key/Value From | Purpose |
|---|---|---|---|---|
| Encoder self-attention | Bidirectional | Source tokens | Source tokens | Understand the input |
| Decoder self-attention | Causal (masked) | Target tokens | Target tokens | Model output dependencies |
| Cross-attention | Full (no mask) | Target tokens | Source tokens (encoder output) | Connect output to input |
| Feed-forward (encoder) | N/A | N/A | N/A | Transform encoder representations |
| Feed-forward (decoder) | N/A | N/A | N/A | Transform decoder representations |
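The masking column of the table is easy to demonstrate. In the sketch below (PyTorch assumed), `True` marks key positions a query may not attend to; the decoder's causal mask hides the future, while encoder self-attention and cross-attention pass no mask at all.

```python
import torch

tgt_len = 4
# Causal mask for decoder self-attention: position i may only see positions <= i.
causal_mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)
print(causal_mask)
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
# Encoder self-attention and cross-attention use no mask:
# every query position may attend to every key position.
```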

Sequence-to-Sequence Tasks

| Task | Input (Encoder) | Output (Decoder) | Cross-Attention Role |
|---|---|---|---|
| Translation | "The cat sat" | "Le chat s'est assis" | Align source words to target words |
| Summarization | Long article | Short summary | Select important parts to include |
| Question Answering | Question + context | Generated answer | Find relevant context for the answer |
| Speech-to-text | Audio features | Transcript text | Align audio frames to text tokens |
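
As a closing illustration, here is a toy forward pass through PyTorch's built-in encoder-decoder `nn.Transformer`. The shapes and hyperparameters are illustrative only; in a real translation model, `src` and `tgt` would come from token embeddings plus positional information rather than random tensors.

```python
import torch
import torch.nn as nn

d_model, n_heads = 32, 4
model = nn.Transformer(
    d_model=d_model, nhead=n_heads,
    num_encoder_layers=2, num_decoder_layers=2,
    batch_first=True,
)

src = torch.randn(1, 5, d_model)  # e.g. an embedded 5-token source sentence
tgt = torch.randn(1, 6, d_model)  # embedded target prefix generated so far
tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))

out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([1, 6, 32]) -- one vector per target position
```

Internally, each decoder layer runs exactly the three sublayers listed earlier: masked self-attention over `tgt`, cross-attention into the encoder's output for `src`, then a feed-forward network.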