
Embeddings & Representation Learning

Turning Words into Meaningful Vectors

Difficulty: Intermediate
Duration: 15-18 minutes
Prerequisites: Neural networks, Matrix operations

What You'll Discover

Learn how words become meaningful vectors in neural networks

From Sparse to Dense

See why one-hot encoding wastes dimensions and how dense embeddings fix it.

Meaning as Geometry

Watch similar words cluster together in embedding space.

The King-Queen Analogy

Perform the famous king - man + woman = queen vector arithmetic.

Pre-trained Embeddings

Learn how transfer learning with GloVe and Word2Vec accelerates NLP tasks.

Key Concepts

One-Hot Encoding

Sparse binary vectors where each word is equally different from all others

Dense Embeddings

Compact learned vectors where similar words have similar representations

Word2Vec

Learn embeddings by predicting context: 'you know a word by its company'

Cosine Similarity

Measure semantic similarity by comparing vector directions

Embedding Arithmetic

king - man + woman = queen: vector math captures meaning

Transfer Learning

Reuse embeddings trained on billions of words for your specific task


One-Hot Encoding: Every Word is Equally Different

One-Hot Encoding Matrix (8 words × 8 dimensions), one row per word:

1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00

One-Hot Encoding Problems

| Problem | One-Hot | Impact |
| --- | --- | --- |
| Dimensionality | 8 words → 8 dimensions; 50K words → 50K dimensions | Wastes memory and compute |
| Sparsity | Over 99.99% zeros in each vector at a 50K-word vocabulary | Most computation is multiplying by zero |
| No similarity | cos(king, queen) = 0 = cos(king, cat) | Network can't leverage word relationships |
| No generalization | Learning about "king" teaches nothing about "queen" | Needs more data to learn each word independently |
| Fixed vocabulary | New words need a whole new dimension | Can't easily add words after training |

Embeddings & Representation Learning — Lesson Content

Discover how neural networks learn to represent words and concepts as meaningful vectors — from one-hot encoding to the famous king-queen analogy.

One-hot encoding wastes dimensions and captures no meaning. Dense embeddings compress words into compact vectors where similar words are nearby and vector arithmetic captures semantic relationships. Using a vocabulary of 8 words including royalty and animals, you'll see how embeddings cluster related words, compute cosine similarity, perform the famous "king - man + woman = queen" arithmetic, and understand how embedding layers work inside neural networks. You'll also learn about pre-trained embeddings and the evolution from Word2Vec to modern contextual representations.

Learning Objectives

  • Explain the problems with one-hot encoding
  • Describe how dense embeddings encode semantic meaning
  • Understand the Word2Vec training intuition
  • Compute and interpret cosine similarity between embeddings
  • Perform and explain embedding arithmetic (king - man + woman = queen)
  • Compare pre-trained vs custom-trained embeddings

Step 1: The Problem with One-Hot Encoding

The simplest way to represent words as numbers is **one-hot encoding**: give each word a unique vector where exactly one position is 1 and all others are 0. For our 8-word vocabulary, "king" = [1,0,0,0,0,0,0,0], "queen" = [0,1,0,0,0,0,0,0], and so on.

This has **three fatal problems**:

**1. No relationships.** The cosine similarity between any two distinct one-hot vectors is exactly **0**. "King" is just as different from "queen" as it is from "cat." The encoding says nothing about meaning.

**2. Massive dimensions.** Real vocabularies have 50,000+ words. Each word becomes a 50,000-dimensional vector with a single 1, which leaves 49,999 wasted zeros per word. Multiply by sequence length and batch size, and memory usage explodes.

**3. No generalization.** If the network learns that "king is powerful," it learns nothing about "queen" because their representations share zero information.

What we want is a representation where similar words have similar vectors, so that the **geometry** of the vector space reflects the **meaning** of the words.
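The "no relationships" problem can be checked directly. The sketch below is a minimal pure-Python illustration, not part of the original lesson; the helper names `one_hot` and `cosine_similarity` are hypothetical, and the indices (king = 0, queen = 1, cat = 6) follow the index assignments used later in this lesson.

```python
import math

VOCAB_SIZE = 8  # toy vocabulary size used in this lesson


def one_hot(index, size=VOCAB_SIZE):
    """Return a vector with a single 1.0 at `index` and 0.0 elsewhere."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


king, queen, cat = one_hot(0), one_hot(1), one_hot(6)

print(cosine_similarity(king, queen))  # 0.0
print(cosine_similarity(king, cat))    # 0.0
print(cosine_similarity(king, king))   # 1.0
```

Every pair of different words scores 0, so the encoding cannot tell a related word from an unrelated one.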

Step 2: Dense Embeddings: Compact & Meaningful

**Dense embeddings** replace those huge sparse vectors with small, information-packed vectors where **every dimension carries meaning**.

Instead of a 50,000-dimensional one-hot vector, we represent each word as a compact vector, typically 50 to 300 dimensions. But these aren't hand-crafted: the values are **learned during training**, and the network discovers what each dimension should encode.

Compare "king" in both representations:

- **One-hot**: [1, 0, 0, 0, 0, 0, 0, 0] (8 dimensions, only 1 is non-zero)
- **Embedding**: [0.82, 0.65, -0.20, 0.75] (4 dimensions, all carry information)

The magic is that similar words end up with similar vectors:

- "king" [0.82, 0.65, -0.20, 0.75] and "queen" [0.78, 0.62, 0.55, 0.70] are close
- "cat" [-0.65, 0.40, 0.10, -0.55] is far from both

No one tells the network that kings and queens are related. It **discovers** this from seeing them in similar contexts during training. The dimensions might implicitly encode concepts like "royalty," "gender," or "animate/inanimate," though individual dimensions rarely map cleanly to human concepts.
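To make "close" and "far" concrete, here is a small sketch that measures straight-line (Euclidean) distance between the toy vectors quoted above. Euclidean distance is an illustrative choice here, and `euclidean_distance` is a hypothetical helper name.

```python
import math

# Toy 4D embeddings quoted in the lesson text.
embeddings = {
    "king":  [0.82, 0.65, -0.20, 0.75],
    "queen": [0.78, 0.62, 0.55, 0.70],
    "cat":   [-0.65, 0.40, 0.10, -0.55],
}


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


d_king_queen = euclidean_distance(embeddings["king"], embeddings["queen"])
d_king_cat = euclidean_distance(embeddings["king"], embeddings["cat"])

print(f"king-queen: {d_king_queen:.2f}")  # king-queen: 0.75
print(f"king-cat:   {d_king_cat:.2f}")    # king-cat:   2.00
```

With these toy values, "cat" sits more than twice as far from "king" as "queen" does.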

Step 3: Word2Vec: Learning from Context

How do we learn these embedding vectors? The most famous approach is **Word2Vec** (Mikolov et al., 2013), based on a simple but powerful idea:

> **"You shall know a word by the company it keeps"** (J.R. Firth, 1957)

Words that appear in similar contexts should have similar embeddings. If "king" and "queen" both appear near words like "throne," "crown," "royal," and "palace," they should have similar vectors.

Word2Vec uses two training strategies:

**Skip-gram:** Given a target word, predict its surrounding context words.

- Input: "cat" → Predict: "the", "sat", "on", "mat"
- The network learns embeddings that make these predictions accurate

**CBOW (Continuous Bag of Words):** Given context words, predict the target word.

- Input: "the", "___", "sat", "on" → Predict: "cat"
- The average of context embeddings should predict the missing word

Neither approach requires labeled data, because the training signal comes from the text itself. This **self-supervised learning** is why Word2Vec can be trained on billions of words from the internet.
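The sketch below shows how training examples for both strategies could be generated from raw text. It covers only the data-preparation step, not the Word2Vec model itself. The sentence "the cat sat on the mat", the window size of 2, and the function names are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for tokens within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """Return (context_words, target) examples, one per token."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples


sentence = "the cat sat on the mat".split()  # assumed example sentence

pairs = skipgram_pairs(sentence, window=2)
cat_contexts = [context for target, context in pairs if target == "cat"]
print(cat_contexts)  # ['the', 'sat', 'on']

cbow = cbow_examples(sentence, window=2)
print(cbow[1])  # (['the', 'sat', 'on'], 'cat')
```

A larger window pulls in more distant words as context. Which words count as context therefore depends on the window size chosen.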

Step 4: Semantic Similarity: Math on Meaning

With embeddings, we can measure how similar two words are using **cosine similarity**, the cosine of the angle between their vectors. It ranges from -1 (pointing in opposite directions) to 1 (pointing in the same direction).

In the lesson's 2D view of the embedding space, the words form clear clusters:

- **Royalty** (purple): king, queen, prince, princess cluster together
- **People** (blue): man and woman are near royalty but distinct
- **Animals** (green): cat and dog cluster far from humans

The cosine similarities confirm what the clusters show:

- **king ↔ queen: 0.84** (very similar, both royalty)
- **cat ↔ dog: 0.98** (similar, both animals)
- **king ↔ cat: -0.57** (very different, royalty vs animal)

This is the power of embeddings: **mathematical distance reflects semantic distance**. We can use vector math to answer questions about meaning.

Cosine similarity is defined as:
cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)

Where:
  A · B   = Σᵢ Aᵢ × Bᵢ     (dot product)
  ||A||   = √(Σᵢ Aᵢ²)       (vector magnitude)

Step 5: Embedding Arithmetic: king - man + woman = queen

The most famous demonstration of embedding quality is **vector arithmetic on words**. If embeddings truly encode meaning, then algebraic operations should produce semantically meaningful results.

The idea: "king" is to "man" as "queen" is to "woman." In vector space: **king - man + woman = ?**

Let's compute it step by step with our 4D embeddings:

- king: [0.82, 0.65, -0.20, 0.75]
- man: [0.15, 0.10, -0.60, 0.20]
- woman: [0.11, 0.07, 0.15, 0.15]

king - man = [0.67, 0.55, 0.40, 0.55]
(king - man) + woman = [0.78, 0.62, 0.55, 0.70]

Now find the nearest word to this result vector: **Nearest: "queen"** (similarity: 1.00)

The "king - man" operation extracts the concept of "royalty" by removing "maleness." Adding "woman" then gives us "female royalty": queen! This works because the embedding space organizes gender and royalty along consistent directions. In these toy vectors the match is exact. With embeddings trained on real text, the result lands only approximately nearest to "queen."
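The arithmetic above can be reproduced in a few lines. This sketch uses only the five toy vectors quoted in this lesson. The `analogy` helper is hypothetical, and excluding the three input words from the search is a common convention assumed here, not something the lesson specifies.

```python
import math

# Toy 4D embeddings quoted in the lesson text.
embeddings = {
    "king":  [0.82, 0.65, -0.20, 0.75],
    "queen": [0.78, 0.62, 0.55, 0.70],
    "man":   [0.15, 0.10, -0.60, 0.20],
    "woman": [0.11, 0.07, 0.15, 0.15],
    "cat":   [-0.65, 0.40, 0.10, -0.55],
}


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Find the word nearest to vectors[a] - vectors[b] + vectors[c].

    The three query words are excluded from the candidates (an assumption).
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    best = max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))
    return best, cosine_similarity(target, candidates[best])


word, score = analogy("king", "man", "woman", embeddings)
print(word, f"{score:.2f}")  # queen 1.00
```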

Step 6: Embedding Layers in Neural Networks

In practice, embeddings are implemented as a **lookup table**: a matrix where each row is a word's embedding vector. When you pass word index 0 ("king"), the layer simply retrieves row 0 from the matrix. No multiplication is needed, although the result is the same as multiplying the word's one-hot vector by the matrix.

The embedding matrix for our vocabulary:

- Row 0 ("king"): [0.82, 0.65, -0.20, 0.75]
- Row 1 ("queen"): [0.78, 0.62, 0.55, 0.70]
- Row 2 ("man"): [0.15, 0.10, -0.60, 0.20]
- ...and so on for all 8 words

During training, the embedding values are updated by backpropagation just like any other weights. The loss signal from the task (e.g., "predict the next word") flows backward through the network and adjusts the embedding vectors so that each word's representation becomes more useful for that task.

The flow is: word index → embedding lookup → hidden layers → output. The "embedding layer" is just a table lookup, but it matters a great deal because it determines how the network "sees" each word.

In PyTorch:
import torch
import torch.nn as nn

# Create embedding layer: 8 words, 4 dimensions each.
# The weights start out random and are learned during training.
embedding = nn.Embedding(num_embeddings=8, embedding_dim=4)

# Look up word index 0 ("king")
word_idx = torch.tensor([0])
king_vector = embedding(word_idx)
# → tensor of shape [1, 4]; after training, this row would hold
#   values such as [0.82, 0.65, -0.20, 0.75]

# Look up a sequence of word indices
sequence = torch.tensor([0, 6, 2])  # king, cat, man
vectors = embedding(sequence)
# → tensor of shape [3, 4]: one 4D vector per word
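To see why a lookup can stand in for a matrix multiplication, the following sketch compares the two in plain Python. It uses only the three rows listed in this lesson (a full table would have 8), and the function names are hypothetical.

```python
# Rows 0-2 are the vectors listed in the lesson; a full table would have 8 rows.
embedding_matrix = [
    [0.82, 0.65, -0.20, 0.75],  # row 0: king
    [0.78, 0.62, 0.55, 0.70],   # row 1: queen
    [0.15, 0.10, -0.60, 0.20],  # row 2: man
]


def lookup(index, matrix):
    """Embedding lookup: return the row for this word index."""
    return matrix[index]


def one_hot_matmul(index, matrix):
    """Multiply a one-hot row vector by the matrix, one column at a time."""
    num_rows = len(matrix)
    dim = len(matrix[0])
    one_hot = [1.0 if r == index else 0.0 for r in range(num_rows)]
    return [
        sum(one_hot[r] * matrix[r][c] for r in range(num_rows))
        for c in range(dim)
    ]


for idx in range(len(embedding_matrix)):
    assert lookup(idx, embedding_matrix) == one_hot_matmul(idx, embedding_matrix)

print(lookup(1, embedding_matrix))  # [0.78, 0.62, 0.55, 0.7]
```

The multiplication touches every row only to discard all but one. The lookup reaches the same vector by indexing the row directly.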

Step 7: Transfer Learning: Standing on Giants' Shoulders

Training embeddings from scratch requires **massive amounts of text**. Google's released Word2Vec vectors were trained on about 100 billion words from Google News. The largest released GloVe vectors were trained on 840 billion tokens from Common Crawl.

Fortunately, you don't have to train your own. **Pre-trained embeddings** capture general language knowledge that transfers to your specific task:

1. Download pre-trained embeddings (Word2Vec, GloVe, FastText)
2. Initialize your model's embedding layer with these vectors
3. Fine-tune on your specific task (or freeze them if your dataset is small)

This is **transfer learning** for NLP, the same concept as using pre-trained image features from ImageNet for a medical imaging task.

The evolution has continued beyond static embeddings:

- **ELMo (2018):** Context-dependent embeddings from a bidirectional LSTM
- **BERT (2018):** Transformer-based embeddings where "bank" gets different vectors in "river bank" vs. "bank account"
- **GPT / LLMs (2020+):** Contextual representations come from the whole model, not from a separately downloaded embedding file, although the first layer is still a learned token-embedding table

Modern language models have made static embeddings less common, but the core concepts (dense vectors, semantic similarity, geometric relationships) remain foundational to how all these systems work.
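Steps 1 and 2 of the recipe above can be sketched as follows. The code assumes the plain-text layout used by GloVe downloads (one word per line followed by its space-separated values). The in-memory "file" holds the lesson's toy vectors as a stand-in for real pre-trained values. The function names and the word "dragon" are hypothetical.

```python
import io
import random

# Stand-in for a pre-trained file. These are the lesson's toy vectors,
# not real GloVe values.
PRETRAINED_TEXT = """\
king 0.82 0.65 -0.20 0.75
queen 0.78 0.62 0.55 0.70
cat -0.65 0.40 0.10 -0.55
"""


def load_vectors(handle):
    """Parse lines of the form 'word v1 v2 ... vd' into a dict."""
    vectors = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """One row per vocabulary word.

    Words found in `pretrained` reuse their vector; missing words get
    small random values (an assumed fallback strategy).
    """
    rng = random.Random(seed)
    matrix = []
    for word in vocab:
        if word in pretrained:
            matrix.append(list(pretrained[word]))
        else:
            matrix.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return matrix


pretrained = load_vectors(io.StringIO(PRETRAINED_TEXT))
vocab = ["king", "queen", "cat", "dragon"]  # "dragon" has no pre-trained vector
matrix = build_embedding_matrix(vocab, pretrained, dim=4)

print(matrix[0])       # [0.82, 0.65, -0.2, 0.75]
print(len(matrix[3]))  # 4
```

In PyTorch, a matrix like this could then initialize the layer with `nn.Embedding.from_pretrained(torch.tensor(matrix), freeze=True)`. Passing `freeze=False` instead allows the vectors to be fine-tuned.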

Step 8: Test Your Understanding

You've learned how embeddings transform sparse one-hot vectors into dense, meaningful representations that capture semantic relationships. Let's test your understanding!

Prerequisites

  • Neural network basics (forward pass)
  • Matrix operations
  • Basic understanding of NLP concepts
