Embeddings & Representation Learning
Turning Words into Meaningful Vectors
What You'll Discover
Learn how words become meaningful vectors in neural networks
From Sparse to Dense
See why one-hot encoding wastes dimensions and how dense embeddings fix it.
Meaning as Geometry
Watch similar words cluster together in embedding space.
The King-Queen Analogy
Perform the famous king - man + woman = queen vector arithmetic.
Pre-trained Embeddings
Learn how transfer learning with GloVe and Word2Vec accelerates NLP tasks.
Key Concepts
One-Hot Encoding
Sparse binary vectors where each word is equally different from all others
Dense Embeddings
Compact learned vectors where similar words have similar representations
Word2Vec
Learn embeddings by predicting context: 'you shall know a word by the company it keeps'
Cosine Similarity
Measure semantic similarity by comparing vector directions
Embedding Arithmetic
king - man + woman = queen: vector math captures meaning
Transfer Learning
Reuse embeddings trained on billions of words for your specific task
Continue Learning
Embeddings are the foundation for modern NLP — explore what comes next
The Problem with One-Hot Encoding
The simplest way to represent words as numbers is one-hot encoding: give each word a unique vector where exactly one position is 1 and all others are 0.
For our 8-word vocabulary, "king" = [1,0,0,0,0,0,0,0], "queen" = [0,1,0,0,0,0,0,0], and so on.
This has three fatal problems:
1. No relationships. The cosine similarity between ANY two distinct one-hot vectors is exactly 0. "King" is just as different from "queen" as it is from "cat." The encoding says nothing about meaning.
2. Massive dimensions. Real vocabularies have 50,000+ words. Each word becomes a 50,000-dimensional vector with a single 1 — that's 49,999 wasted zeros per word. Multiply by sequence length and batch size, and memory usage explodes.
3. No generalization. If the network learns that "king is powerful," it learns nothing about "queen" because their representations share zero information.
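The orthogonality problem is easy to verify directly. The sketch below builds one-hot vectors for the 8-word vocabulary from above (the word list and helper names are illustrative) and shows that cosine similarity is 0 for every distinct pair:

```python
import numpy as np

# Toy 8-word vocabulary; each word gets a unique index.
vocab = ["king", "queen", "man", "woman", "cat", "dog", "apple", "orange"]

def one_hot(word):
    """Return the one-hot vector for a word: a single 1, all else 0."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

def cosine(a, b):
    """Cosine similarity: cosine of the angle between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Any two distinct one-hot vectors are orthogonal, so similarity is 0:
print(cosine(one_hot("king"), one_hot("queen")))  # 0.0
print(cosine(one_hot("king"), one_hot("cat")))    # 0.0
```

The dot product of two distinct one-hot vectors is always 0 because their 1s never land in the same position, which is exactly why the encoding carries no notion of relatedness.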
What we want is a representation where similar words have similar vectors — where the geometry of the vector space reflects the meaning of the words.
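A minimal sketch of what "geometry reflects meaning" looks like, using hand-crafted toy vectors (not learned embeddings; the dimensions are loosely chosen to stand for royalty, gender, adulthood, and animal-ness purely for illustration):

```python
import numpy as np

# Hand-crafted toy embeddings, illustrative only (not learned):
# dimensions loosely encode [royalty, gender (male = +), adult, animal].
emb = {
    "king":  np.array([0.9,  0.5, 0.7, 0.0]),
    "queen": np.array([0.9, -0.5, 0.7, 0.0]),
    "man":   np.array([0.1,  0.5, 0.7, 0.0]),
    "woman": np.array([0.1, -0.5, 0.7, 0.0]),
    "cat":   np.array([0.0,  0.0, 0.3, 0.9]),
}

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Similar words now have similar vectors:
print(cosine(emb["king"], emb["queen"]))  # much higher than...
print(cosine(emb["king"], emb["cat"]))    # ...this

# king - man + woman lands near queen:
target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, emb[w]))
print(best)  # "queen"
```

With real embeddings like GloVe or Word2Vec the same cosine-similarity and vector-arithmetic operations apply, just in 100 to 300 dimensions learned from data rather than four dimensions picked by hand.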
One-Hot Encoding: Every Word is Equally Different
One-Hot Encoding Matrix (8 words × 8 dimensions)
One-Hot Encoding Problems
| Problem | One-Hot | Impact |
|---|---|---|
| Dimensionality | 8 words → 8 dimensions; 50K words → 50K dimensions | Wastes memory and compute |
| Sparsity | 49,999 of 50,000 entries are zero (50K vocab) | Most computation is multiplying by zero |
| No similarity | cos(king, queen) = 0 = cos(king, cat) | Network can't leverage word relationships |
| No generalization | Learning about "king" teaches nothing about "queen" | Needs more data to learn each word independently |
| Fixed vocabulary | New words need a whole new dimension | Can't easily add words after training |
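The dimensionality row is worth quantifying. A rough back-of-envelope comparison, assuming float32 storage, a 50K vocabulary, and 300-dimensional dense embeddings (a common pre-trained GloVe size):

```python
# Back-of-envelope memory per word (float32 = 4 bytes per value).
vocab_size = 50_000
one_hot_dim = vocab_size   # one-hot: one dimension per vocabulary word
embed_dim = 300            # dense: e.g. a common GloVe embedding size

bytes_per_word_one_hot = one_hot_dim * 4   # 200,000 bytes (~195 KB)
bytes_per_word_dense = embed_dim * 4       # 1,200 bytes (~1.2 KB)

# Dense vectors are over 100x smaller per word:
print(bytes_per_word_one_hot // bytes_per_word_dense)  # 166
```

And unlike one-hot vectors, every one of those 300 dimensions carries information instead of a near-guaranteed zero.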