Text Generation

Difficulty: Intermediate
Duration: 12-15 min
Prerequisites: Pre-training
Step: 1/7

Next Token Prediction

Text generation in LLMs is built on a single operation: predict the next token.

Given a sequence of tokens (the "context" or "prompt"), the model produces a probability distribution over the entire vocabulary — a number for every possible next token indicating how likely it is.
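
Pictured as plain data, that distribution is one probability per vocabulary entry; the tokens and numbers below are purely illustrative:

  # Illustrative only -- not real model output
  next_token_probs = {
      "mat":  0.45,
      "rug":  0.12,
      "sofa": 0.08,
      # ...one entry for each of the ~50,000 vocabulary tokens,
      # together summing to 1.0
  }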

The pipeline for each prediction:

  1. Tokenize the input text into token IDs: "The cat sat on the" → [464, 3797, 3332, 319, 262]
  2. Embed each token ID into a vector (lookup in the embedding table)
  3. Process through all transformer layers (self-attention + feed-forward, repeated N times)
  4. Project the final hidden state to vocabulary size: a vector of ~50,000 numbers (logits)
  5. Apply softmax to convert logits to probabilities that sum to 1.0
  6. Select the next token from this distribution
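
One way to see all six steps run end to end is with a small real model. Here is a minimal sketch using GPT-2 via the Hugging Face transformers library (GPT-2 is this example's assumption, chosen because the token IDs in step 1 match its tokenizer; note that its vocabulary is 50,257 entries and its hidden size 768, not the illustrative 4096 used in the table at the end of this page):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2")

  # Step 1: tokenize -> tensor([[464, 3797, 3332, 319, 262]])
  input_ids = tokenizer("The cat sat on the", return_tensors="pt").input_ids

  # Steps 2-4: embedding, the transformer layers, and the projection to
  # the vocabulary all happen inside one forward pass; logits has shape
  # (batch=1, positions=5, vocab=50257).
  with torch.no_grad():
      logits = model(input_ids).logits

  # Step 5: softmax over the LAST position's logits only
  probs = torch.softmax(logits[0, -1], dim=-1)

  # Step 6: select -- here greedily, i.e. the single most likely token
  next_id = torch.argmax(probs).item()
  print(tokenizer.decode(next_id))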

This final selection step is where decoding strategies come in — greedy, sampling, beam search, etc. The choice of strategy dramatically affects the quality and diversity of generated text.
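
A minimal sketch of how the first two strategies differ, assuming a logits vector from step 4 (select_next_token and its temperature parameter are illustrative names for this example, not a fixed API):

  import torch

  def select_next_token(logits, strategy="greedy", temperature=1.0):
      if strategy == "greedy":
          # Deterministic: always take the single most probable token.
          return torch.argmax(logits).item()
      # Sampling: rescale logits by temperature, then draw at random
      # from the resulting distribution; higher temperature flattens
      # the distribution and yields more diverse text.
      probs = torch.softmax(logits / temperature, dim=-1)
      return torch.multinomial(probs, num_samples=1).item()

Beam search, the third strategy named above, works differently: it keeps several candidate sequences alive in parallel and scores whole continuations rather than committing to one token at a time.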

The model doesn't "think" about what to say — it computes a mathematical function that maps input tokens to a probability distribution over next tokens. Yet this simple process produces remarkably coherent text.

Prompt: "The cat sat on the ___" → Predict Next Token

  Position:   0     1     2     3     4     5
  Token:      The   cat   sat   on    the   ???

The Next-Token Prediction Pipeline

  Pipeline Stage | Input                      | Output                          | Shape
  1. Tokenize    | "The cat sat on the"       | [464, 3797, 3332, 319, 262]     | 5 integers
  2. Embed       | Token IDs                  | Embedding vectors               | 5 x 4096
  3. Transform   | Embeddings                 | Contextualized representations  | 5 x 4096
  4. Project     | Last position hidden state | Raw logits over vocabulary      | 1 x 50,000
  5. Softmax     | Logits                     | Probability distribution        | 1 x 50,000 (sums to 1.0)
  6. Select      | Probabilities              | Chosen token: "mat"             | 1 integer
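
To make step 5 concrete, here is softmax computed by hand on a toy three-logit example (the input values are made up for illustration):

  import math

  def softmax(logits):
      # Subtract the max for numerical stability, exponentiate,
      # and normalize so the outputs sum to exactly 1.0.
      m = max(logits)
      exps = [math.exp(x - m) for x in logits]
      total = sum(exps)
      return [e / total for e in exps]

  print(softmax([2.0, 1.0, 0.1]))   # ~[0.66, 0.24, 0.10]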