Sampling Strategies
Temperature, Top-k, Top-p
Why Not Always Pick the Most Likely Token?
When a language model predicts the next token, it outputs a probability distribution over the entire vocabulary. The simplest approach is greedy decoding — always pick the token with the highest probability.
But greedy decoding has serious problems:
1. Boring, repetitive text. Greedy decoding tends to produce generic, safe outputs. "The weather is nice. The weather is nice. The weather is nice..." The most probable next token is often the most predictable one.
2. No diversity. Ask the model to write a story ten times with greedy decoding and you get the exact same story every time. There's no randomness.
3. Suboptimal sequences. The locally most-probable token at each step doesn't always lead to the globally best sequence. After "The president of the", greedy picks "United" every single time, even when a less obvious continuation would produce a better overall output.
Sampling introduces controlled randomness: instead of always picking the top token, we sample from the distribution, occasionally choosing less likely but still reasonable tokens. The key question is how much randomness to introduce.
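Temperature is the standard knob for "how much randomness": divide the logits by a temperature before the softmax, then sample from the resulting distribution. A minimal stdlib-only sketch (function name and signature are illustrative, not from any particular library):

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Sample a token index from raw logits scaled by temperature.

    temperature < 1 sharpens the distribution (closer to greedy);
    temperature > 1 flattens it (more random).
    """
    rng = rng or random.Random()
    scaled = [l / temperature for l in logits]
    # Softmax with max-subtraction for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one index from the categorical distribution.
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point rounding
```

As temperature approaches 0, this converges to greedy decoding; at high temperatures it approaches uniform sampling over the vocabulary.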
Greedy Decoding: Always Picks the Top Token
| Token | Probability | Greedy picks? |
|---|---|---|
| the | 0.4245 | Yes (always) |
| cat | 0.2575 | Never |
| sat | 0.1279 | Never |
| on | 0.0776 | Never |
| mat | 0.0470 | Never |
| dog | 0.0348 | Never |
| ran | 0.0211 | Never |
| big | 0.0095 | Never |
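The "Yes (always) / Never" column is just an argmax over the distribution. A sketch using the table's values (the probabilities are illustrative, as above):

```python
def greedy_pick(probs):
    """Greedy decoding: always return the highest-probability token."""
    return max(probs, key=probs.get)

# The next-token distribution from the table above.
next_token_probs = {
    "the": 0.4245, "cat": 0.2575, "sat": 0.1279, "on": 0.0776,
    "mat": 0.0470, "dog": 0.0348, "ran": 0.0211, "big": 0.0095,
}
```

No matter how many times you call `greedy_pick(next_token_probs)`, it returns "the" — the other seven tokens, 57.55% of the probability mass combined, are never chosen.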
Problems with Greedy Decoding
| Problem | Example | Impact |
|---|---|---|
| Repetition | "The cat sat. The cat sat. The cat sat." | Outputs loop on high-probability patterns |
| No creativity | Same output every run, no variation | Useless for creative writing, brainstorming |
| Local optima | Picking "United" after "president of the" every time | Misses globally better sequences |
| Degeneration | Long outputs devolve into repeated phrases | Quality degrades with length |
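The section title also names top-k and top-p (nucleus) sampling, the two common ways to bound that randomness before sampling. A dictionary-based sketch of the filtering step (a real implementation would operate on logit tensors, but the logic is the same):

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens, renormalized to sum to 1."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def top_p_filter(probs, p):
    """Nucleus sampling: keep the smallest set of tokens whose
    cumulative probability reaches p, renormalized to sum to 1."""
    kept, cumulative = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {tok: q / total for tok, q in kept.items()}
```

Applied to the distribution in the table above, `top_k_filter(..., k=3)` keeps only "the", "cat", and "sat", while `top_p_filter(..., p=0.9)` keeps the top five tokens (cumulative mass 0.9345); sampling then proceeds over the renormalized survivors, so the long tail of unlikely tokens can never be picked.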