How do you teach a computer to understand that "king" and "queen" are related? That "Paris" is to "France" as "Tokyo" is to "Japan"? That "good" and "great" mean similar things but "good" and "evil" don't?
The answer is Word2Vec — an algorithm that transforms words into numbers in a way that captures meaning. Published by Tomas Mikolov and his colleagues at Google in 2013, it changed the field of natural language processing forever.
The Problem: Computers Don't Understand Words
To a computer, "cat" is just three characters: c-a-t. It has no concept that cats are animals, that they're similar to dogs, or that they have nothing to do with cars.
The traditional approach — one-hot encoding — assigns each word a unique vector:
cat   = [1, 0, 0, 0, 0]
dog   = [0, 1, 0, 0, 0]
car   = [0, 0, 1, 0, 0]
king  = [0, 0, 0, 1, 0]
queen = [0, 0, 0, 0, 1]
This is terrible. Every word is equally different from every other word. The cosine similarity between "cat" and "dog" is zero — the same as between "cat" and "car." All semantic information is lost.
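To see this concretely, here is a minimal numpy sketch using the toy vocabulary above (the `cosine_similarity` helper is just the standard formula):

```python
import numpy as np

# The same toy one-hot encoding as above: one dimension per word.
vocab = ["cat", "dog", "car", "king", "queen"]
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: a.b / (|a||b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Every pair of distinct one-hot vectors is orthogonal, so similarity is 0:
print(cosine_similarity(one_hot["cat"], one_hot["dog"]))  # 0.0
print(cosine_similarity(one_hot["cat"], one_hot["car"]))  # 0.0
```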
What we want is a representation where:
- Similar words have similar vectors
- The distance between vectors reflects the distance in meaning
- Relationships between words are encoded as directions in the vector space
The Key Insight: You Know a Word by Its Company
Word2Vec is built on a beautifully simple idea from linguistics:
"You shall know a word by the company it keeps." — J.R. Firth, 1957
Words that appear in similar contexts have similar meanings. "Cat" and "dog" appear near words like "pet," "feed," "cute," and "fur." "King" and "queen" appear near "throne," "crown," "royal," and "palace."
Word2Vec learns vector representations by training a neural network to predict context — and the learned weights become the word vectors.
How Word2Vec Works: Two Approaches
Skip-gram: Predict Context from Word
Given a target word, predict the surrounding words.
For the sentence "the cat sat on the mat" with a window of 2:
| Target | Context (to predict) |
|---|---|
| the | cat, sat |
| cat | the, sat, on |
| sat | the, cat, on, the |
| on | cat, sat, the, mat |
| the | sat, on, mat |
| mat | on, the |
The neural network architecture is simple:
- Input: One-hot vector for the target word
- Hidden layer: The embedding (e.g., 300 dimensions) — no activation function
- Output: Probability distribution over the vocabulary
The hidden layer weights ARE the word vectors. Training adjusts them so that words appearing in similar contexts end up with similar vectors.
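Here is a small sketch of how those (target, context) training pairs are generated; the function name is illustrative, not from any library:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) pairs, as in the table above."""
    pairs = []
    for i, target in enumerate(tokens):
        # Context = up to `window` words on each side of the target.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the cat sat on the mat".split()
print(skipgram_pairs(sentence)[:4])
# [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat')]
```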
CBOW: Predict Word from Context
The reverse of skip-gram: given the surrounding words, predict the target word. For the same sentence with a window of 2:
| Context | Target (to predict) |
|---|---|
| [cat, sat] | the |
| [the, sat, on] | cat |
| [the, cat, on, the] | sat |
CBOW averages the context word vectors and predicts the missing word. It's faster than skip-gram but slightly less accurate for rare words.
In practice: Skip-gram works better for small datasets and rare words. CBOW works better for large datasets and frequent words.
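If you want to try both yourself, gensim exposes the choice as a single flag. A minimal sketch, assuming gensim is installed (a real corpus would be far larger than one repeated sentence):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences.
sentences = [["the", "cat", "sat", "on", "the", "mat"]] * 100

# sg=1 trains skip-gram; sg=0 (the default) trains CBOW.
skipgram = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)

print(skipgram.wv["cat"].shape)  # (100,)
```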
The Training Process
The actual training uses a clever optimization called negative sampling to avoid computing over the entire vocabulary (which could be 100,000+ words):
- Take a real word pair from the text (e.g., "cat" → "sat") — this is a positive example
- Randomly sample a handful of words (typically 5-20 for small datasets, 2-5 for large ones) that did NOT appear near "cat" (e.g., "cat" → "democracy") — these are negative examples
- Train the network to output 1 for the real pair and 0 for the negative pairs
This turns a massive multi-class classification problem into a simple binary classification, making training thousands of times faster.
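Here is a toy numpy sketch of a single negative-sampling update; the matrices, indices, and learning rate are illustrative, and real implementations such as gensim's are heavily optimized:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 50
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # target-word vectors
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))  # context-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(target, context, negatives, lr=0.025):
    """One binary-classification update: real pair -> 1, negatives -> 0."""
    v = W_in[target].copy()
    grad_v = np.zeros_like(v)
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = W_out[word]
        error = sigmoid(v @ u) - label  # prediction minus desired output
        grad_v += error * u
        W_out[word] -= lr * error * v
    W_in[target] -= lr * grad_v

# Real pair (target=3, context=17) plus 5 random negative words.
train_pair(3, 17, rng.integers(0, vocab_size, size=5))
```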
The Magic: Vector Arithmetic
Once trained, Word2Vec vectors exhibit remarkable algebraic properties:
king - man + woman = queen
vector("king") - vector("man") + vector("woman") ≈ vector("queen")
How does this work? The vector space encodes "gender" as a direction. The difference king - man extracts "royalty" by removing the "male" component. Adding "woman" then gives "female royalty" — queen.
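You can reproduce this with gensim's pre-trained vectors. A sketch, where "glove-wiki-gigaword-100" is a gensim-data model name and the first call downloads it:

```python
import gensim.downloader as api

# Small pre-trained vectors (downloaded on first use).
wv = api.load("glove-wiki-gigaword-100")

# king - man + woman ≈ ?
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# [('queen', ...)]
```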
More Analogies
| A is to B | as C is to ? | Result |
|---|---|---|
| king → queen | man → ? | woman |
| Paris → France | Tokyo → ? | Japan |
| walk → walked | swim → ? | swam |
| good → better | big → ? | bigger |
These relationships emerge naturally from the training process. Nobody told the network about gender, geography, or grammar — it discovered these concepts from patterns in text.
What Do the Dimensions Mean?
A Word2Vec vector might have 300 dimensions. Do individual dimensions have meaning?
Not exactly. Unlike hand-crafted features, the dimensions don't correspond to interpretable concepts like "is an animal" or "is male." The meaning is distributed across many dimensions.
However, directions in the space are meaningful. There's a "gender direction," a "plural direction," a "tense direction," and so on. These directions aren't aligned with individual axes — they're diagonal paths through the 300-dimensional space.
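One way to see this is to estimate such a direction yourself. A sketch, assuming `wv` is a trained gensim KeyedVectors object and using an illustrative set of word pairs:

```python
import numpy as np

# Word pairs that differ (mostly) in gender; the choice of pairs is illustrative.
pairs = [("woman", "man"), ("queen", "king"), ("she", "he"), ("her", "his")]

def gender_direction(wv, pairs):
    # Average the per-pair difference vectors into a single unit direction.
    d = np.mean([wv[a] - wv[b] for a, b in pairs], axis=0)
    return d / np.linalg.norm(d)

def projection(wv, word, d):
    # How far the word leans along the direction (positive = "female" side here).
    v = wv[word]
    return float(np.dot(v / np.linalg.norm(v), d))

# Typically: projection(wv, "aunt", d) > projection(wv, "uncle", d)
```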
Practical Considerations
Choosing Dimensions
- 50 dimensions: Okay for small vocabularies and simple tasks
- 100-200 dimensions: Good general purpose
- 300 dimensions: Standard for Word2Vec (what Google released)
- Beyond 300: Diminishing returns for most tasks
Window Size
- Small window (2-5): Captures syntactic similarity ("dog" ≈ "cat" — both nouns that follow "the")
- Large window (5-15): Captures topical/semantic similarity ("dog" ≈ "puppy" — words from the same topic)
Vocabulary Size
Word2Vec typically works with the top 50,000-200,000 most frequent words. Rare words get discarded or mapped to an "unknown" token.
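These knobs map directly onto gensim's trainer. A sketch wiring in the guidance above (the toy corpus is a placeholder):

```python
from gensim.models import Word2Vec

corpus = [["the", "cat", "sat", "on", "the", "mat"]] * 100  # placeholder corpus

model = Word2Vec(
    corpus,
    vector_size=300,         # embedding dimensions; the Word2Vec standard
    window=5,                # small = more syntactic, large = more topical
    min_count=5,             # drop words seen fewer than 5 times
    max_final_vocab=200000,  # cap the vocabulary at the most frequent 200K words
    sg=1,                    # skip-gram (sg=0 for CBOW)
    negative=5,              # negative samples per positive pair
)
```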
Pre-trained Word Vectors
You don't have to train Word2Vec yourself. Google, Stanford, and Facebook have released pre-trained vectors:
| Model | Dimensions | Vocabulary | Training Data |
|---|---|---|---|
| Google Word2Vec | 300 | 3M words | Google News (100B words) |
| GloVe (Stanford) | 50-300 | 400K-2.2M | Common Crawl, Wikipedia |
| FastText (Facebook) | 300 | 2M words | Common Crawl + Wikipedia |
Download them, load into your project, and you instantly have rich word representations trained on billions of words.
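With gensim, for example, loading a pre-trained model is one call. A sketch; the model name is a gensim-data identifier, and the Google News file is a multi-gigabyte download:

```python
import gensim.downloader as api

# Google's 300-dimensional News vectors (large download on first use).
wv = api.load("word2vec-google-news-300")

print(wv.most_similar("king", topn=3))
print(wv.similarity("cat", "dog"))  # well above 0, unlike one-hot encoding
```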
Word2Vec's Legacy
Word2Vec didn't just solve word representation — it launched a revolution:
Before Word2Vec (pre-2013): NLP models used sparse, hand-crafted features. Each task required domain expertise to design features.
After Word2Vec: Dense, learned representations became the standard. The idea of learning representations from data spread to every corner of machine learning.
The evolution continues:
- GloVe (2014): Combines Word2Vec's context approach with global co-occurrence statistics
- FastText (2016): Handles subwords, so it can represent words it's never seen
- ELMo (2018): Context-dependent embeddings — "bank" gets different vectors in "river bank" vs "bank account"
- BERT (2018): Deep bidirectional context — understands entire sentences
- GPT (2018-present): The embedding IS the model — language understanding and generation unified
Every modern language model — ChatGPT, Claude, Gemini — traces its lineage back to Word2Vec's insight: meaning can be encoded as geometry.
Related Articles
- LSTM Networks Explained — How recurrent networks process sequences of word vectors
- What is a Neural Network? — The foundation behind Word2Vec's training process
- What is Machine Learning? — Understand the broader field Word2Vec belongs to
- What are CNNs? — Another architecture that benefits from learned representations
See It In Action
Our interactive visualization lets you explore word embeddings in 2D space. See how similar words cluster together, compute cosine similarity between word pairs, and perform the famous king-queen arithmetic yourself.
Watching words organize themselves by meaning in a vector space is one of those moments where AI stops being abstract and becomes tangible.