Residual Connections & Layer Norm
Enabling Deep Transformer Stacks
The Depth Problem: Why Deep Networks Are Hard to Train
Modern Transformers are deep -- BERT Base has 12 layers, GPT-3 has 96. But stacking many layers creates serious training challenges:
1. Vanishing Gradients: During backpropagation, the chain rule multiplies the gradient by a factor from each layer. With 96 layers, gradients can shrink exponentially, becoming effectively zero -- early layers stop learning (see the back-of-envelope sketch after this list).
2. Exploding Gradients: Conversely, gradients can grow exponentially, causing numerical overflow and unstable training.
3. Degradation Problem: Surprisingly, adding more layers can make accuracy worse, even on training data. In the experiments that motivated ResNets, a plain 56-layer CNN performed worse than a 20-layer one -- not because of overfitting, but because deeper networks are harder to optimize.
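A quick back-of-envelope sketch makes the scale of the problem concrete. The per-layer factors here (0.8 and 1.2) are illustrative assumptions, not measurements from any real model:

```python
# Hypothetical per-layer gradient scaling factors, chosen for illustration.
# If each layer shrinks the gradient by ~0.8x, depth compounds the damage:
for depth in (12, 24, 96):
    print(f"{depth} layers: gradient scale ~ {0.8 ** depth:.1e}")
# 12 layers: ~6.9e-02 -- weakened but usable
# 96 layers: ~4.9e-10 -- effectively zero; layer 1 never learns
# A per-layer factor of 1.2 instead gives 1.2 ** 96 ~ 4.0e+07: the exploding case.
```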
These problems plagued deep learning until 2015, when ResNets introduced residual connections (also called skip connections). The Transformer adopted this idea, combined with layer normalization, to make networks with 96+ layers trainable.
Without these two techniques, the Transformer architecture simply would not work at scale.
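Concretely, each sublayer in the original Transformer is wired as LayerNorm(x + Sublayer(x)): the input x skips around the sublayer and is added back before normalization. Below is a minimal PyTorch sketch of that wiring; the class name SublayerBlock and the dimensions are our own choices, not from any particular codebase:

```python
import torch
import torch.nn as nn

class SublayerBlock(nn.Module):
    """One Transformer sublayer in the original post-LN arrangement:
    output = LayerNorm(x + Dropout(Sublayer(x)))."""

    def __init__(self, d_model: int, sublayer: nn.Module, dropout: float = 0.1):
        super().__init__()
        self.sublayer = sublayer          # e.g. self-attention or feed-forward
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # "x +" is the residual shortcut: gradients flow straight through
        # the addition, so even a 96-layer stack keeps a clean path back
        # to layer 1. LayerNorm then rescales the sum to a stable range.
        return self.norm(x + self.dropout(self.sublayer(x)))

# Illustrative usage with a stand-in sublayer:
block = SublayerBlock(d_model=512, sublayer=nn.Linear(512, 512))
out = block(torch.randn(2, 10, 512))   # (batch, seq_len, d_model)
```

Because the shortcut is an identity path, each sublayer only has to learn a correction to its input rather than a full transformation -- which is exactly why the degradation problem goes away.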
Training Challenges in Deep Networks
| Problem | What Happens | Consequence | Solution |
|---|---|---|---|
| Vanishing gradients | Gradients shrink through many layers | Early layers stop learning | Residual connections create shortcut paths |
| Exploding gradients | Gradients grow through many layers | Training diverges (NaN losses) | Layer normalization stabilizes values |
| Degradation | Deeper != better, even on training data | Adding layers hurts performance | Residuals let layers learn "corrections" |
| Unstable activations | Values drift across layers | Different layers operate at different scales | Layer norm ensures consistent scale |
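The last row of the table is easy to demystify: layer normalization rescales each token's feature vector to zero mean and unit variance, then applies a learned scale and shift. A from-scratch sketch (variable names are ours):

```python
import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics are computed per token, over the feature dimension.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    # Normalize, then let gamma/beta restore whatever scale the model needs.
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta

# However far activations have drifted, the output comes back on one scale:
x = 1000.0 * torch.randn(2, 4, 8)                 # badly scaled activations
y = layer_norm(x, gamma=torch.ones(8), beta=torch.zeros(8))
print(y.mean(dim=-1))                             # ~0 for every token
print(y.std(dim=-1, unbiased=False))              # ~1 for every token
```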
Depth of Modern Transformers
| Model | Layers | Total Depth | Trainable Without Residuals? |
|---|---|---|---|
| Our example | 1 | 2 sublayers | Yes (trivially) |
| BERT Base | 12 | 24 sublayers | No -- gradients would vanish before reaching layer 1 |
| GPT-2 | 12 | 24 sublayers | No -- training would diverge |
| GPT-3 | 96 | 192 sublayers | Absolutely not -- impossible without residuals |
| GPT-4 (estimated) | ~120 | ~240 sublayers | Requires residuals + normalization + careful init |
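To make the last column less hand-wavy, here is an illustrative (and entirely assumed) PyTorch experiment: measure the gradient norm that actually reaches layer 1 of a 96-sublayer stack, with and without the residual shortcut. Exact numbers depend on initialization, but the gap is dramatic:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One toy sublayer; residual=True adds the x + f(x) shortcut."""
    def __init__(self, dim: int, residual: bool):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
        self.residual = residual

    def forward(self, x):
        return x + self.f(x) if self.residual else self.f(x)

def grad_norm_at_layer_1(depth: int, residual: bool) -> float:
    torch.manual_seed(0)
    net = nn.Sequential(*[Block(64, residual) for _ in range(depth)])
    net(torch.randn(8, 64)).pow(2).mean().backward()
    return net[0].f[0].weight.grad.norm().item()  # first Linear's gradient

for residual in (False, True):
    print(f"residual={residual}: grad at layer 1 = "
          f"{grad_norm_at_layer_1(96, residual):.2e}")
```

With residual=False the gradient at layer 1 is vanishingly small; with the shortcut it stays at a usable magnitude -- which is the whole trick.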