Sampling Strategies
Temperature, Top-k, Top-p
Why Not Always Pick the Most Likely Token?
When a language model predicts the next token, it outputs a probability distribution over the entire vocabulary. The simplest approach is greedy decoding — always pick the token with the highest probability.
But greedy decoding has serious problems:
1. Boring, repetitive text. Greedy decoding tends to produce generic, safe outputs. "The weather is nice. The weather is nice. The weather is nice..." The most probable next token is often the most predictable one.
2. No diversity. Ask the model to write a story ten times with greedy decoding and you get the exact same story every time. There's no randomness.
3. Suboptimal sequences. The locally most-probable token at each step doesn't always lead to the globally best sequence. After "The president of the", greedy picks "United" every single time, even when a less obvious continuation would produce a better overall output.
Sampling introduces controlled randomness: instead of always picking the top token, we sample from the distribution, occasionally choosing less likely but still reasonable tokens. The key question is how much randomness to introduce.
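Temperature is the standard knob for "how much randomness": divide the logits by a temperature before the softmax, then sample from the resulting distribution. A minimal stdlib-only sketch (function name and signature are illustrative, not from any particular library):

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Sample a token index from raw logits scaled by temperature.

    temperature < 1 sharpens the distribution (closer to greedy);
    temperature > 1 flattens it (more random).
    """
    rng = rng or random.Random()
    scaled = [l / temperature for l in logits]
    # Softmax with max-subtraction for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one index from the categorical distribution.
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point rounding
```

As temperature approaches 0, this converges to greedy decoding; at high temperatures it approaches uniform sampling over the vocabulary.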
Greedy Decoding: Always Picks the Top Token
| Token | Probability | Greedy picks? |
|---|---|---|
| the | 0.4245 | Yes (always) |
| cat | 0.2575 | Never |
| sat | 0.1279 | Never |
| on | 0.0776 | Never |
| mat | 0.0470 | Never |
| dog | 0.0348 | Never |
| ran | 0.0211 | Never |
| big | 0.0095 | Never |
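The "Yes (always) / Never" column is just an argmax over the distribution. A sketch using the table's values (the probabilities are illustrative, as above):

```python
def greedy_pick(probs):
    """Greedy decoding: always return the highest-probability token."""
    return max(probs, key=probs.get)

# The next-token distribution from the table above.
next_token_probs = {
    "the": 0.4245, "cat": 0.2575, "sat": 0.1279, "on": 0.0776,
    "mat": 0.0470, "dog": 0.0348, "ran": 0.0211, "big": 0.0095,
}
```

No matter how many times you call `greedy_pick(next_token_probs)`, it returns "the" — the other seven tokens, 57.55% of the probability mass combined, are never chosen.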
Problems with Greedy Decoding
| Problem | Example | Impact |
|---|---|---|
| Repetition | "The cat sat. The cat sat. The cat sat." | Outputs loop on high-probability patterns |
| No creativity | Same output every run, no variation | Useless for creative writing, brainstorming |
| Local optima | Picking "United" after "president of the" every time | Misses globally better sequences |
| Degeneration | Long outputs devolve into repeated phrases | Quality degrades with length |
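The section title also names top-k and top-p (nucleus) sampling, the two common ways to bound that randomness before sampling. A dictionary-based sketch of the filtering step (a real implementation would operate on logit tensors, but the logic is the same):

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens, renormalized to sum to 1."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def top_p_filter(probs, p):
    """Nucleus sampling: keep the smallest set of tokens whose
    cumulative probability reaches p, renormalized to sum to 1."""
    kept, cumulative = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {tok: q / total for tok, q in kept.items()}
```

Applied to the distribution in the table above, `top_k_filter(..., k=3)` keeps only "the", "cat", and "sat", while `top_p_filter(..., p=0.9)` keeps the top five tokens (cumulative mass 0.9345); sampling then proceeds over the renormalized survivors, so the long tail of unlikely tokens can never be picked.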