Tokenization

Text to Numbers

Difficulty: Beginner
Duration: 10-12 min
Prerequisites: What is an LLM
Step: 1/7

Why Text Needs Numbers

Neural networks are mathematical machines — they perform matrix multiplications, additions, and nonlinear activations. They operate on numbers, not text. Before an LLM can process "Hello world," it must convert that text into a sequence of numbers.

This conversion process is called tokenization: splitting text into discrete units (tokens) and mapping each to an integer ID.
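A minimal sketch of that mapping in Python, using a tiny hand-made vocabulary (real LLM vocabularies are learned from data and are far larger):

```python
# Hypothetical three-entry vocabulary; real tokenizers learn tens of
# thousands of entries from training data.
vocab = {"Hello": 0, "world": 1, "!": 2}

def tokenize(text: str) -> list[int]:
    # Split on whitespace and map each piece to its integer ID.
    return [vocab[piece] for piece in text.split()]

print(tokenize("Hello world"))  # [0, 1]
```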

Why not just use ASCII/Unicode codes? You could represent "H" as 72, "e" as 101, and so on. But character-level processing has problems, as the sketch after this list shows:

  • Sequences become very long (a 500-word essay = ~2,500 characters)
  • Individual characters carry little meaning ("q" alone tells you almost nothing)
  • The model must learn to compose characters into words from scratch
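Character-level encoding is easy to demonstrate with Python's built-in `ord`, which returns a character's Unicode code point:

```python
text = "Hello world"

# One token per character: simple and universal, but sequences grow long
# and each individual ID ("l" = 108) says little about meaning.
ids = [ord(ch) for ch in text]
print(ids)       # [72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
print(len(ids))  # 11 IDs for an 11-character string

# The mapping is lossless: code points decode straight back to text.
print("".join(chr(i) for i in ids))  # Hello world
```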

Why not just use whole words? You could assign each word an ID: "Hello" = 1, "world" = 2. But this has its own problems, made concrete in the sketch after this list:

  • The vocabulary becomes enormous (English has 170,000+ words)
  • Misspellings, new words, and rare terms get no representation
  • Morphologically related words ("run," "running," "runner") share nothing
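A toy word-level tokenizer makes these failures concrete (the vocabulary and the `<unk>` convention here are illustrative):

```python
# Hypothetical word vocabulary; ID 0 is reserved for unknown words.
vocab = {"<unk>": 0, "hello": 1, "world": 2, "run": 3}

def tokenize(text: str) -> list[int]:
    # Any word outside the vocabulary collapses to the <unk> ID.
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(tokenize("hello world"))  # [1, 2]
print(tokenize("helo wrld"))    # [0, 0] -- one typo and all meaning is lost
print(tokenize("run running"))  # [3, 0] -- "running" shares nothing with "run"
```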

The solution is subword tokenization — a middle ground that we'll explore next.
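To preview the idea, here is a greedy longest-match subword tokenizer over a tiny hand-picked vocabulary. Real subword tokenizers such as BPE learn their vocabulary from data; this sketch only shows how subwords let related words share pieces:

```python
def subword_tokenize(word: str, vocab: set[str]) -> list[str]:
    # Greedily take the longest vocabulary entry matching at position i,
    # falling back to single characters so no input is ever unrepresentable.
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest substring first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # character fallback
            i += 1
    return tokens

vocab = {"run", "ning", "ner", "hello"}  # hypothetical subword vocabulary
print(subword_tokenize("running", vocab))  # ['run', 'ning']
print(subword_tokenize("runner", vocab))   # ['run', 'ner']
print(subword_tokenize("runs", vocab))     # ['run', 's'] -- fallback at work
```

Note how "run," "running," and "runner" now share the "run" piece, fixing exactly the weakness of word-level tokenization noted above.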

Text Must Be Converted to Numbers

"Hello" → H (pos 0), e (pos 1), l (pos 2), l (pos 3), o (pos 4) → [72, 101, 108, 108, 111]

Three Tokenization Approaches

Approach         Input: "Hello world"             Tokens     Vocabulary Size
Character-level  H, e, l, l, o, ·, w, o, r, l, d  11 tokens  ~256 (ASCII)
Word-level       Hello, world                     2 tokens   170,000+ (English)
Subword (BPE)    Hell, o, ·world                  3 tokens   ~32,000-100,000
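A quick script reproduces the token counts from the table (the subword split is the table's hypothetical one, not the output of a trained BPE model):

```python
text = "Hello world"

rows = {
    "Character-level": list(text),
    "Word-level":      text.split(),
    "Subword (BPE)":   ["Hell", "o", " world"],  # assumed split from the table
}

for approach, tokens in rows.items():
    print(f"{approach:<16} {len(tokens):>2} tokens  {tokens}")
# Character-level  11 tokens  ['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']
# Word-level        2 tokens  ['Hello', 'world']
# Subword (BPE)     3 tokens  ['Hell', 'o', ' world']
```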