Tokenization
Text to Numbers
Difficulty
Beginner
Duration
10-12 min
Prerequisites
What is an LLM
Step
1 / 7
Why Text Needs Numbers
Neural networks are mathematical machines — they perform matrix multiplications, additions, and nonlinear activations. They operate on numbers, not text. Before an LLM can process "Hello world," it must convert that text into a sequence of numbers.
This conversion process is called tokenization: splitting text into discrete units (tokens) and mapping each to an integer ID.
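At its core, a tokenizer is just a lookup table that works in both directions. Here is a minimal sketch of that idea; the tiny vocabulary is invented for illustration, and real tokenizers use far larger tables and smarter splitting rules:

```python
# Minimal sketch of tokenization: split text into pieces, map each to an ID.
# The vocabulary here is invented for illustration.
vocab = {"Hello": 0, "world": 1, "!": 2}
inverse_vocab = {i: tok for tok, i in vocab.items()}

def encode(text: str) -> list[int]:
    """Split on whitespace and look up each token's integer ID."""
    return [vocab[token] for token in text.split()]

def decode(ids: list[int]) -> str:
    """Map IDs back to tokens and rejoin them into text."""
    return " ".join(inverse_vocab[i] for i in ids)

ids = encode("Hello world")
print(ids)          # [0, 1]
print(decode(ids))  # "Hello world"
```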
Why not just use ASCII/Unicode codes? You could represent "H" as 72, "e" as 101, etc. But character-level processing has problems:
- Sequences become very long (a 500-word essay is roughly 2,500 characters)
- Individual characters carry little meaning ("q" alone tells you almost nothing)
- The model must learn to compose characters into words from scratch
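The first two problems are easy to see in code. This sketch maps each character to its Unicode code point with Python's built-in `ord()`:

```python
# Character-level tokenization: every character becomes its code point.
text = "Hello world"
char_ids = [ord(c) for c in text]
print(char_ids)       # [72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
print(len(char_ids))  # 11 -- one token per character, so sequences grow fast

# Round-tripping is trivial, but no single ID tells you anything about words.
print("".join(chr(i) for i in char_ids))  # "Hello world"
```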
Why not just use whole words? You could assign each word an ID: "Hello" = 1, "world" = 2. But:
- The vocabulary becomes enormous (English has 170,000+ words)
- Misspellings, new words, and rare terms get no representation
- Morphologically related words ("run," "running," "runner") share nothing
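The out-of-vocabulary problem is just as easy to demonstrate. In this sketch, a word-level tokenizer built from a fixed word list (invented for illustration) collapses every unseen word into a single unknown-word ID:

```python
# Word-level tokenization: one ID per whole word, plus an <unk> fallback.
# The word list is invented for illustration.
vocab = {"<unk>": 0, "Hello": 1, "world": 2, "run": 3}

def encode(text: str) -> list[int]:
    """Unknown words all collapse to the same <unk> ID."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]

print(encode("Hello world"))     # [1, 2]
print(encode("Helo world"))      # [0, 2] -- one typo loses all meaning
print(encode("running runner"))  # [0, 0] -- related to "run", yet unrepresented
```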
The solution is subword tokenization — a middle ground that we'll explore next.
Text Must Be Converted to Numbers
H, e, l, l, o → [72, 101, 108, 108, 111]
Three Tokenization Approaches
| Approach | Input: "Hello world" | Tokens | Vocabulary Size |
|---|---|---|---|
| Character-level | H, e, l, l, o, ·, w, o, r, l, d | 11 tokens | ~256 (ASCII) |
| Word-level | Hello, world | 2 tokens | 170,000+ (English) |
| Subword (BPE) | Hell, o, ·world | 3 tokens | ~32,000-100,000 |
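To make the table concrete, here is a sketch that produces the three token sequences for "Hello world." The subword split is hard-coded to mirror the table's illustrative BPE output, not computed by a real BPE tokenizer:

```python
text = "Hello world"

# Character-level: one token per character (including the space).
char_tokens = list(text)

# Word-level: one token per whitespace-separated word.
word_tokens = text.split()

# Subword: hard-coded to match the table's illustrative BPE split,
# where "·" marks a leading space attached to the token.
subword_tokens = ["Hell", "o", "·world"]

for name, tokens in [("char", char_tokens),
                     ("word", word_tokens),
                     ("subword", subword_tokens)]:
    print(f"{name:7s} {len(tokens):2d} tokens: {tokens}")
# char    11 tokens: ['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']
# word     2 tokens: ['Hello', 'world']
# subword  3 tokens: ['Hell', 'o', '·world']
```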