Tokenization
Text to Numbers
Difficulty
Beginner
Duration
10-12 min
Prerequisites
What is an LLM
Step
1 / 7
Why Text Needs Numbers
Neural networks are mathematical machines — they perform matrix multiplications, additions, and nonlinear activations. They operate on numbers, not text. Before an LLM can process "Hello world," it must convert that text into a sequence of numbers.
This conversion process is called tokenization: splitting text into discrete units (tokens) and mapping each to an integer ID.
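At its core, a tokenizer is just a lookup table that works in both directions. Here is a minimal sketch of that idea; the tiny vocabulary is invented for illustration, and real tokenizers use far larger tables and smarter splitting rules:

```python
# Minimal sketch of tokenization: split text into pieces, map each to an ID.
# The vocabulary here is invented for illustration.
vocab = {"Hello": 0, "world": 1, "!": 2}
inverse_vocab = {i: tok for tok, i in vocab.items()}

def encode(text: str) -> list[int]:
    """Split on whitespace and look up each token's integer ID."""
    return [vocab[token] for token in text.split()]

def decode(ids: list[int]) -> str:
    """Map IDs back to tokens and rejoin them into text."""
    return " ".join(inverse_vocab[i] for i in ids)

ids = encode("Hello world")
print(ids)          # [0, 1]
print(decode(ids))  # "Hello world"
```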
Why not just use ASCII/Unicode codes? You could represent "H" as 72, "e" as 101, etc. But character-level processing has problems:
- Sequences become very long (a 500-word essay is roughly 2,500 characters)
- Individual characters carry little meaning ("q" alone tells you almost nothing)
- The model must learn to compose characters into words from scratch
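The first two problems are easy to see in code. This sketch maps each character to its Unicode code point with Python's built-in `ord()`:

```python
# Character-level tokenization: every character becomes its code point.
text = "Hello world"
char_ids = [ord(c) for c in text]
print(char_ids)       # [72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
print(len(char_ids))  # 11 -- one token per character, so sequences grow fast

# Round-tripping is trivial, but no single ID tells you anything about words.
print("".join(chr(i) for i in char_ids))  # "Hello world"
```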
Why not just use whole words? You could assign each word an ID: "Hello" = 1, "world" = 2. But:
- The vocabulary becomes enormous (English has 170,000+ words)
- Misspellings, new words, and rare terms get no representation
- Morphologically related words ("run," "running," "runner") share nothing
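The out-of-vocabulary problem is just as easy to demonstrate. In this sketch, a word-level tokenizer built from a fixed word list (invented for illustration) collapses every unseen word into a single unknown-word ID:

```python
# Word-level tokenization: one ID per whole word, plus an <unk> fallback.
# The word list is invented for illustration.
vocab = {"<unk>": 0, "Hello": 1, "world": 2, "run": 3}

def encode(text: str) -> list[int]:
    """Unknown words all collapse to the same <unk> ID."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]

print(encode("Hello world"))     # [1, 2]
print(encode("Helo world"))      # [0, 2] -- one typo loses all meaning
print(encode("running runner"))  # [0, 0] -- related to "run", yet unrepresented
```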
The solution is subword tokenization — a middle ground that we'll explore next.
Text Must Be Converted to Numbers
H, e, l, l, o → [72, 101, 108, 108, 111]
Three Tokenization Approaches
| Approach | Input: "Hello world" | Tokens | Vocabulary Size |
|---|---|---|---|
| Character-level | H, e, l, l, o, ·, w, o, r, l, d | 11 tokens | ~256 (ASCII) |
| Word-level | Hello, world | 2 tokens | 170,000+ (English) |
| Subword (BPE) | Hell, o, ·world | 3 tokens | ~32,000-100,000 |
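To make the table concrete, here is a sketch that produces the three token sequences for "Hello world." The subword split is hard-coded to mirror the table's illustrative BPE output, not computed by a real BPE tokenizer:

```python
text = "Hello world"

# Character-level: one token per character (including the space).
char_tokens = list(text)

# Word-level: one token per whitespace-separated word.
word_tokens = text.split()

# Subword: hard-coded to match the table's illustrative BPE split,
# where "·" marks a leading space attached to the token.
subword_tokens = ["Hell", "o", "·world"]

for name, tokens in [("char", char_tokens),
                     ("word", word_tokens),
                     ("subword", subword_tokens)]:
    print(f"{name:7s} {len(tokens):2d} tokens: {tokens}")
# char    11 tokens: ['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']
# word     2 tokens: ['Hello', 'world']
# subword  3 tokens: ['Hell', 'o', '·world']
```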