# RLHF: Learning from Human Feedback

## The Alignment Problem
After pre-training and instruction tuning, an LLM can follow instructions. But following instructions is not the same as being helpful, harmless, and honest.
Consider these problems:

- **Helpfulness without safety:** Ask for "a convincing phishing email" and an instruction-tuned model will helpfully comply, because it was trained to follow instructions.
- **Sycophancy:** Models trained on human-written responses learn to agree with the user, even when the user is wrong. "Is 2+2=5?" gets "That's an interesting perspective..." rather than a correction.
- **Verbosity:** Models learn that longer responses get higher ratings in training data, so they pad answers unnecessarily.
- **Hallucination:** The model confidently states false information because it was rewarded for fluent, confident-sounding text.
The core issue is that the training objective (predict the next token, follow instructions) doesn't capture what we actually want: responses that are helpful, truthful, safe, and appropriately concise.
RLHF (Reinforcement Learning from Human Feedback) addresses this by training the model to optimize for human preferences rather than just next-token prediction. It adds a layer of human judgment on top of language modeling.
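Concretely, the standard RLHF objective (as formulated in, e.g., InstructGPT) trains the policy $\pi_\theta$ to maximize a learned reward model $r_\phi$ while a KL penalty keeps it close to the instruction-tuned reference model $\pi_{\mathrm{ref}}$; the coefficient $\beta$ trades reward against drift:

$$
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\big[\, r_\phi(x, y) \,\big]
\;-\;
\beta\, \mathbb{D}_{\mathrm{KL}}\!\big[\, \pi_\theta(\cdot \mid x) \;\|\; \pi_{\mathrm{ref}}(\cdot \mid x) \,\big]
$$

Without the KL term, the policy can over-optimize the reward model and drift into degenerate, high-scoring text (reward hacking); the penalty anchors it to the fluent language behavior the model already has.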
## The Alignment Gap
| Problem | What the Model Does | What We Want | Root Cause |
|---|---|---|---|
| Harmful compliance | Follows dangerous instructions | Refuses harmful requests | Trained to follow all instructions |
| Sycophancy | Agrees with incorrect claims | Politely corrects errors | Optimized for user satisfaction |
| Hallucination | Invents plausible-sounding facts | Admits uncertainty | Rewarded for confident text |
| Verbosity | Pads responses unnecessarily | Concise, focused answers | Longer = higher training signal |
| Bias amplification | Reflects and amplifies biases | Fair, balanced responses | Trained on biased internet data |
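Each failure mode above shows up directly in preference data: labelers see two candidate responses to the same prompt and mark which one is better. A single record might look like the sketch below; the field names are assumptions following the common "chosen"/"rejected" convention (used, for example, by Anthropic's hh-rlhf dataset), and real datasets typically store full conversations rather than single turns.

```python
# A hypothetical preference record for the sycophancy example above.
# "chosen"/"rejected" field names follow a common convention; they are
# not a fixed standard across datasets.
preference_example = {
    "prompt": "Is 2+2=5?",
    "chosen": "No, 2+2=4. Is there a related question I can help with?",
    "rejected": "That's an interesting perspective! There are many ways "
                "to think about numbers...",
}
```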
## The Three Stages of LLM Training
| Training Stage | Objective | What It Teaches |
|---|---|---|
| Pre-training | Predict next token | Language understanding and knowledge |
| SFT / Instruction tuning | Match human-written responses | Follow instructions, format outputs |
| RLHF | Maximize human preference scores | Be helpful, honest, harmless |
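To make the contrast between the stages concrete, here is a minimal PyTorch sketch of the training signal at each one. Everything here is an illustrative assumption rather than any library's API: the function names, tensor shapes, and the `beta` default are invented, the reward-model loss is the standard Bradley-Terry pairwise form, and the policy objective is the KL-regularized reward usually optimized with PPO.

```python
import torch.nn.functional as F

def pretrain_or_sft_loss(logits, target_ids):
    """Stages 1-2: cross-entropy next-token prediction.
    SFT uses the same loss, just computed on curated
    instruction-response pairs (often masking prompt tokens).
    logits: (batch, seq_len, vocab); target_ids: (batch, seq_len)."""
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predictions for positions 1..T-1
        target_ids[:, 1:].reshape(-1),                # the tokens that actually follow
    )

def reward_model_loss(reward_chosen, reward_rejected):
    """Stage 3a: Bradley-Terry pairwise loss for the reward model.
    Pushes the score of the human-preferred response above the
    rejected one. Both inputs: (batch,) scalar scores."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

def rlhf_policy_objective(rewards, logprobs_policy, logprobs_ref, beta=0.1):
    """Stage 3b: KL-regularized reward, maximized with RL (e.g., PPO).
    `beta` controls how far the policy may drift from the SFT model.
    rewards: (batch,); logprobs: (batch, seq_len) for sampled tokens."""
    kl = logprobs_policy - logprobs_ref               # per-token KL estimate
    return (rewards - beta * kl.sum(dim=-1)).mean()   # RL *maximizes* this
```

In practice, stage 3b is wrapped in PPO machinery (clipping, value baselines, advantage estimation); the sketch shows only the quantity being optimized.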