How do you teach a computer to understand that "king" and "queen" are related? That "Paris" is to "France" as "Tokyo" is to "Japan"? That "good" and "great" mean similar things but "good" and "evil" don't?
The answer is Word2Vec — an algorithm that transforms words into numbers in a way that captures meaning. Published by Tomas Mikolov and his colleagues at Google in 2013, it changed the field of natural language processing forever.
The Problem: Computers Don't Understand Words
To a computer, "cat" is just three characters: c-a-t. It has no concept that cats are animals, that they're similar to dogs, or that they have nothing to do with cars.
The traditional approach — one-hot encoding — assigns each word a unique vector:
cat   = [1, 0, 0, 0, 0]
dog   = [0, 1, 0, 0, 0]
car   = [0, 0, 1, 0, 0]
king  = [0, 0, 0, 1, 0]
queen = [0, 0, 0, 0, 1]
This is terrible. Every word is equally different from every other word. The cosine similarity between "cat" and "dog" is zero — the same as between "cat" and "car." All semantic information is lost.
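To see this concretely, here is a minimal numpy sketch using the toy vocabulary above (the `cosine_similarity` helper is just the standard formula):

```python
import numpy as np

# The same toy one-hot encoding as above: one dimension per word.
vocab = ["cat", "dog", "car", "king", "queen"]
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: a.b / (|a||b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Every pair of distinct one-hot vectors is orthogonal, so similarity is 0:
print(cosine_similarity(one_hot["cat"], one_hot["dog"]))  # 0.0
print(cosine_similarity(one_hot["cat"], one_hot["car"]))  # 0.0
```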
What we want is a representation where:
- Similar words have similar vectors
- The distance between vectors reflects the distance in meaning
- Relationships between words are encoded as directions in the vector space
The Key Insight: You Know a Word by Its Company
Word2Vec is built on a beautifully simple idea from linguistics:
"You shall know a word by the company it keeps." — J.R. Firth, 1957
Words that appear in similar contexts have similar meanings. "Cat" and "dog" appear near words like "pet," "feed," "cute," and "fur." "King" and "queen" appear near "throne," "crown," "royal," and "palace."
Word2Vec learns vector representations by training a neural network to predict context — and the learned weights become the word vectors.
How Word2Vec Works: Two Approaches
Skip-gram: Predict Context from Word
Given a target word, predict the surrounding words.
For the sentence "the cat sat on the mat" with a window of 2:
| Target | Context (to predict) |
|---|---|
| the | cat, sat |
| cat | the, sat, on |
| sat | the, cat, on, the |
| on | cat, sat, the, mat |
| the | sat, on, mat |
| mat | on, the |
The neural network architecture is simple:
- Input: One-hot vector for the target word
- Hidden layer: The embedding (e.g., 300 dimensions) — no activation function
- Output: Probability distribution over the vocabulary
The hidden layer weights ARE the word vectors. Training adjusts them so that words appearing in similar contexts end up with similar vectors.
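Here is a small sketch of how those (target, context) training pairs are generated; the function name is illustrative, not from any library:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) pairs, as in the table above."""
    pairs = []
    for i, target in enumerate(tokens):
        # Context = up to `window` words on each side of the target.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the cat sat on the mat".split()
print(skipgram_pairs(sentence)[:4])
# [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat')]
```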
CBOW: Predict Word from Context
The reverse of skip-gram: given the surrounding words, predict the target word. For the same sentence with a window of 2:
| Context | Target (to predict) |
|---|---|
| [cat, sat] | the |
| [the, sat, on] | cat |
| [the, cat, on, the] | sat |
CBOW averages the context word vectors and predicts the missing word. It's faster than skip-gram but slightly less accurate for rare words.
In practice: Skip-gram works better for small datasets and rare words. CBOW works better for large datasets and frequent words.
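If you want to try both yourself, gensim exposes the choice as a single flag. A minimal sketch, assuming gensim is installed (a real corpus would be far larger than one repeated sentence):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences.
sentences = [["the", "cat", "sat", "on", "the", "mat"]] * 100

# sg=1 trains skip-gram; sg=0 (the default) trains CBOW.
skipgram = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)

print(skipgram.wv["cat"].shape)  # (100,)
```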
The Training Process
The actual training uses a clever optimization called negative sampling to avoid computing over the entire vocabulary (which could be 100,000+ words):
- Take a real word pair from the text (e.g., "cat" → "sat") — this is a positive example
- Randomly sample a handful of words (typically 5-20 for small datasets, 2-5 for large ones) that did NOT appear near "cat" (e.g., "cat" → "democracy") — these are negative examples
- Train the network to output 1 for the real pair and 0 for the negative pairs
This turns a massive multi-class classification problem into a simple binary classification, making training thousands of times faster.
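Here is a toy numpy sketch of a single negative-sampling update; the matrices, indices, and learning rate are illustrative, and real implementations such as gensim's are heavily optimized:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 50
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # target-word vectors
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))  # context-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(target, context, negatives, lr=0.025):
    """One binary-classification update: real pair -> 1, negatives -> 0."""
    v = W_in[target].copy()
    grad_v = np.zeros_like(v)
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = W_out[word]
        error = sigmoid(v @ u) - label  # prediction minus desired output
        grad_v += error * u
        W_out[word] -= lr * error * v
    W_in[target] -= lr * grad_v

# Real pair (target=3, context=17) plus 5 random negative words.
train_pair(3, 17, rng.integers(0, vocab_size, size=5))
```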
The Magic: Vector Arithmetic
Once trained, Word2Vec vectors exhibit remarkable algebraic properties:
king - man + woman = queen
vector("king") - vector("man") + vector("woman") ≈ vector("queen")
How does this work? The vector space encodes "gender" as a direction. The difference king - man extracts "royalty" by removing the "male" component. Adding "woman" then gives "female royalty" — queen.
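You can reproduce this with gensim's pre-trained vectors. A sketch, where "glove-wiki-gigaword-100" is a gensim-data model name and the first call downloads it:

```python
import gensim.downloader as api

# Small pre-trained vectors (downloaded on first use).
wv = api.load("glove-wiki-gigaword-100")

# king - man + woman ≈ ?
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# [('queen', ...)]
```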
More Analogies
| A is to B | as C is to ? | Result |
|---|---|---|
| king → queen | man → ? | woman |
| Paris → France | Tokyo → ? | Japan |
| walk → walked | swim → ? | swam |
| good → better | big → ? | bigger |
These relationships emerge naturally from the training process. Nobody told the network about gender, geography, or grammar — it discovered these concepts from patterns in text.
What Do the Dimensions Mean?
A Word2Vec vector might have 300 dimensions. Do individual dimensions have meaning?
Not exactly. Unlike hand-crafted features, the dimensions don't correspond to interpretable concepts like "is an animal" or "is male." The meaning is distributed across many dimensions.
However, directions in the space are meaningful. There's a "gender direction," a "plural direction," a "tense direction," and so on. These directions aren't aligned with individual axes — they're diagonal paths through the 300-dimensional space.
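One way to see this is to estimate such a direction yourself. A sketch, assuming `wv` is a trained gensim KeyedVectors object and using an illustrative set of word pairs:

```python
import numpy as np

# Word pairs that differ (mostly) in gender; the choice of pairs is illustrative.
pairs = [("woman", "man"), ("queen", "king"), ("she", "he"), ("her", "his")]

def gender_direction(wv, pairs):
    # Average the per-pair difference vectors into a single unit direction.
    d = np.mean([wv[a] - wv[b] for a, b in pairs], axis=0)
    return d / np.linalg.norm(d)

def projection(wv, word, d):
    # How far the word leans along the direction (positive = "female" side here).
    v = wv[word]
    return float(np.dot(v / np.linalg.norm(v), d))

# Typically: projection(wv, "aunt", d) > projection(wv, "uncle", d)
```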
Practical Considerations
Choosing Dimensions
- 50 dimensions: Okay for small vocabularies and simple tasks
- 100-200 dimensions: Good general purpose
- 300 dimensions: Standard for Word2Vec (what Google released)
- Beyond 300: Diminishing returns for most tasks
Window Size
- Small window (2-5): Captures syntactic similarity ("dog" ≈ "cat" — both nouns that follow "the")
- Large window (5-15): Captures topical/semantic similarity ("dog" ≈ "puppy" — words from the same topic)
Vocabulary Size
Word2Vec typically works with the top 50,000-200,000 most frequent words. Rare words get discarded or mapped to an "unknown" token.
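These knobs map directly onto gensim's trainer. A sketch wiring in the guidance above (the toy corpus is a placeholder):

```python
from gensim.models import Word2Vec

corpus = [["the", "cat", "sat", "on", "the", "mat"]] * 100  # placeholder corpus

model = Word2Vec(
    corpus,
    vector_size=300,         # embedding dimensions; the Word2Vec standard
    window=5,                # small = more syntactic, large = more topical
    min_count=5,             # drop words seen fewer than 5 times
    max_final_vocab=200000,  # cap the vocabulary at the most frequent 200K words
    sg=1,                    # skip-gram (sg=0 for CBOW)
    negative=5,              # negative samples per positive pair
)
```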
Pre-trained Word Vectors
You don't have to train Word2Vec yourself. Google, Stanford, and Facebook have released pre-trained vectors:
| Model | Dimensions | Vocabulary | Training Data |
|---|---|---|---|
| Google Word2Vec | 300 | 3M words | Google News (100B words) |
| GloVe (Stanford) | 50-300 | 400K-2.2M | Common Crawl, Wikipedia |
| FastText (Facebook) | 300 | 2M words | Common Crawl + Wikipedia |
Download them, load into your project, and you instantly have rich word representations trained on billions of words.
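With gensim, for example, loading a pre-trained model is one call. A sketch; the model name is a gensim-data identifier, and the Google News file is a multi-gigabyte download:

```python
import gensim.downloader as api

# Google's 300-dimensional News vectors (large download on first use).
wv = api.load("word2vec-google-news-300")

print(wv.most_similar("king", topn=3))
print(wv.similarity("cat", "dog"))  # well above 0, unlike one-hot encoding
```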
Word2Vec's Legacy
Word2Vec didn't just solve word representation — it launched a revolution:
Before Word2Vec (pre-2013): NLP models used sparse, hand-crafted features. Each task required domain expertise to design features.
After Word2Vec: Dense, learned representations became the standard. The idea of learning representations from data spread to every corner of machine learning.
The evolution continues:
- GloVe (2014): Combines Word2Vec's context approach with global co-occurrence statistics
- FastText (2016): Handles subwords, so it can represent words it's never seen
- ELMo (2018): Context-dependent embeddings — "bank" gets different vectors in "river bank" vs "bank account"
- BERT (2018): Deep bidirectional context — understands entire sentences
- GPT (2018-present): The embedding IS the model — language understanding and generation unified
Every modern language model — ChatGPT, Claude, Gemini — traces its lineage back to Word2Vec's insight: meaning can be encoded as geometry.
Related Articles
- LSTM Networks Explained — How recurrent networks process sequences of word vectors
- What is a Neural Network? — The foundation behind Word2Vec's training process
- What is Machine Learning? — Understand the broader field Word2Vec belongs to
- What are CNNs? — Another architecture that benefits from learned representations
See It In Action
Our interactive visualization lets you explore word embeddings in 2D space. See how similar words cluster together, compute cosine similarity between word pairs, and perform the famous king-queen arithmetic yourself.
Watching words organize themselves by meaning in a vector space is one of those moments where AI stops being abstract and becomes tangible.