
Word2Vec Explained: How Words Become Vectors

Word2Vec turns words into meaningful vectors where similar words are close together. Learn skip-gram, CBOW, and the famous king-queen analogy with interactive examples.

CS Visualizations · April 29, 2026 · 8 min read

Interactive Visualization

Embeddings & Representation Learning

See this concept in action with our step-by-step interactive visualization.

Try the Visualization

How do you teach a computer to understand that "king" and "queen" are related? That "Paris" is to "France" as "Tokyo" is to "Japan"? That "good" and "great" mean similar things but "good" and "evil" don't?

The answer is Word2Vec — an algorithm that transforms words into numbers in a way that captures meaning. Introduced by Tomas Mikolov and colleagues at Google in 2013, it changed the field of natural language processing forever.

The Problem: Computers Don't Understand Words

To a computer, "cat" is just three characters: c-a-t. It has no concept that cats are animals, that they're similar to dogs, or that they have nothing to do with cars.

The traditional approach — one-hot encoding — assigns each word a unique vector:

cat  = [1, 0, 0, 0, 0]
dog  = [0, 1, 0, 0, 0]
car  = [0, 0, 1, 0, 0]
king = [0, 0, 0, 1, 0]
queen= [0, 0, 0, 0, 1]

This is terrible. Every word is equally different from every other word. The cosine similarity between "cat" and "dog" is zero — the same as between "cat" and "car." All semantic information is lost.
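The failure is easy to verify numerically: every pair of distinct one-hot vectors is orthogonal, so cosine similarity is zero no matter which words you compare. A quick sketch using NumPy:

```python
import numpy as np

# One-hot vectors for the 5-word vocabulary from the table above.
vocab = ["cat", "dog", "car", "king", "queen"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(one_hot["cat"], one_hot["dog"]))  # 0.0
print(cosine(one_hot["cat"], one_hot["car"]))  # 0.0 -- same as cat vs dog
```

"Cat" is exactly as far from "dog" as it is from "car": the representation carries no meaning at all.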

What we want is a representation where:

  • Similar words have similar vectors
  • The distance between vectors reflects the distance in meaning
  • Relationships between words are encoded as directions in the vector space

The Key Insight: You Know a Word by Its Company

Word2Vec is built on a beautifully simple idea from linguistics:

"You shall know a word by the company it keeps." — J.R. Firth, 1957

Words that appear in similar contexts have similar meanings. "Cat" and "dog" appear near words like "pet," "feed," "cute," and "fur." "King" and "queen" appear near "throne," "crown," "royal," and "palace."

Word2Vec learns vector representations by training a neural network to predict context — and the learned weights become the word vectors.

[Figure] Words in embedding space: royalty (purple) clusters together, animals (green) cluster separately.

How Word2Vec Works: Two Approaches

Skip-gram: Predict Context from Word

Given a target word, predict the surrounding words.

For the sentence "the cat sat on the mat" with a window of 2:

Target   Context (to predict)
the      cat, sat
cat      the, sat, on
sat      the, cat, on, the
on       cat, sat, the, mat
the      sat, on, mat
mat      on, the
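Generating these (target, context) pairs is a simple sliding-window pass over the tokens. A minimal sketch (the function name is mine, not from any library):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs, one per context word."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:  # every neighbour within the window except the target itself
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the cat sat on the mat".split()
pairs = skipgram_pairs(sentence)
for target, context in pairs:
    print(target, "->", context)
```

Running this reproduces the table above: the first target "the" yields the pairs (the, cat) and (the, sat), and interior words like "sat" yield four pairs each.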

The neural network architecture is simple:

  1. Input: One-hot vector for the target word
  2. Hidden layer: The embedding (e.g., 300 dimensions) — no activation function
  3. Output: Probability distribution over the vocabulary

The hidden layer weights ARE the word vectors. Training adjusts them so that words appearing in similar contexts end up with similar vectors.
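The three steps above can be sketched with a toy forward pass. This is an illustrative reconstruction, not production code: the sizes are tiny and the weights random, but it shows why multiplying a one-hot input by the weight matrix is just a row lookup — the row IS the embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 7, 4          # toy sizes; real models use e.g. 300 dims

W_in = rng.normal(size=(vocab_size, embed_dim))   # rows of W_in ARE the word vectors
W_out = rng.normal(size=(embed_dim, vocab_size))  # output projection

def skipgram_forward(target_id):
    """One-hot input times W_in just selects row target_id."""
    h = W_in[target_id]                  # hidden layer: the embedding, no activation
    scores = h @ W_out                   # one score per vocabulary word
    exp = np.exp(scores - scores.max())  # softmax -> probabilities over contexts
    return exp / exp.sum()

probs = skipgram_forward(2)
print(probs.shape, probs.sum())  # (7,) and a sum of 1, up to float rounding
```

Training nudges `W_in` and `W_out` so that the probability mass lands on words that actually appeared in the target's context.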

CBOW: Predict Word from Context

The reverse of skip-gram — given surrounding words, predict the target word.

Context          Target (to predict)
[the, sat]       cat
[the, sat, on]   cat
[cat, on, the]   sat

CBOW averages the context word vectors and predicts the missing word. It's faster than skip-gram but slightly less accurate for rare words.

In practice: Skip-gram works better for small datasets and rare words. CBOW works better for large datasets and frequent words.
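The averaging step is the only structural difference from skip-gram. A hedged sketch with hand-rolled toy weights (random here, so the output distribution is arbitrary; training is what concentrates mass on the true target):

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = {w: i for i, w in enumerate("the cat sat on mat".split())}
W_in = rng.normal(size=(len(vocab), 4))   # context-side embeddings (toy: 4 dims)
W_out = rng.normal(size=(4, len(vocab)))

def cbow_forward(context_words):
    """Average the context embeddings, then score every vocabulary word."""
    h = np.mean([W_in[vocab[w]] for w in context_words], axis=0)
    scores = h @ W_out
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()               # probability of each candidate target

# With untrained weights this is noise; training pushes mass toward "cat".
probs = cbow_forward(["the", "sat"])
```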

The Training Process

The actual training uses a clever optimization called negative sampling to avoid computing over the entire vocabulary (which could be 100,000+ words):

  1. Take a real word pair from the text (e.g., "cat" → "sat") — this is a positive example
  2. Randomly sample 5-15 words that did NOT appear near "cat" (e.g., "cat" → "democracy") — these are negative examples
  3. Train the network to output 1 for the real pair and 0 for the negative pairs

This turns a massive multi-class classification problem into a simple binary classification, making training thousands of times faster.
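The binary objective is a sum of logistic losses: push the dot product of the real pair toward a high score, and each negative pair toward a low one. A minimal sketch of the loss (function and variable names are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_target, v_context, v_negatives):
    """Binary logistic loss: pull the real pair together, push negatives apart."""
    pos = np.log(sigmoid(v_target @ v_context))                  # want score -> 1
    neg = sum(np.log(sigmoid(-v_target @ v_n)) for v_n in v_negatives)  # -> 0
    return -(pos + neg)

rng = np.random.default_rng(0)
v_cat = rng.normal(size=8)                           # vector for "cat"
v_sat = rng.normal(size=8)                           # real context: "sat"
negatives = [rng.normal(size=8) for _ in range(5)]   # e.g. "democracy", ...
loss = negative_sampling_loss(v_cat, v_sat, negatives)
```

Each training step touches only 1 + k vectors (here k = 5) instead of the full 100,000-word softmax.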

The Magic: Vector Arithmetic

[Figure] king − man + woman ≈ queen: vector arithmetic captures semantic relationships.

Once trained, Word2Vec vectors exhibit remarkable algebraic properties:

king - man + woman = queen

vector("king") - vector("man") + vector("woman") ≈ vector("queen")

How does this work? The vector space encodes "gender" as a direction. The difference king - man extracts "royalty" by removing the "male" component. Adding "woman" then gives "female royalty" — queen.
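This decomposition can be made concrete with hand-picked 2-D vectors, where one axis stands for "royalty" and the other for "gender" (real Word2Vec learns such structure spread across hundreds of dimensions, not on clean axes):

```python
import numpy as np

# Toy 2-D embeddings chosen by hand so the arithmetic is visible.
vecs = {
    "king":  np.array([1.0,  1.0]),   # royalty + male
    "queen": np.array([1.0, -1.0]),   # royalty + female
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

result = vecs["king"] - vecs["man"] + vecs["woman"]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Nearest neighbour of the result, excluding the three query words.
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(result, vecs[w]))
print(best)  # queen
```

Excluding the query words from the nearest-neighbour search mirrors what the original evaluation code did — otherwise "king" itself is often the closest vector.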

More Analogies

A is to B        as C is to ?    Result
king → queen     man → ?         woman
Paris → France   Tokyo → ?       Japan
walk → walked    swim → ?        swam
good → better    big → ?         bigger

These relationships emerge naturally from the training process. Nobody told the network about gender, geography, or grammar — it discovered these concepts from patterns in text.

What Do the Dimensions Mean?

A Word2Vec vector might have 300 dimensions. Do individual dimensions have meaning?

Not exactly. Unlike hand-crafted features, the dimensions don't correspond to interpretable concepts like "is an animal" or "is male." The meaning is distributed across many dimensions.

However, directions in the space are meaningful. There's a "gender direction," a "plural direction," a "tense direction," and so on. These directions aren't aligned with individual axes — they're diagonal paths through the 300-dimensional space.
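One way to see this: estimate a direction by averaging known difference pairs, then project words onto it. The vectors below are hypothetical 3-D stand-ins, invented for illustration; with real 300-D embeddings the same recipe works.

```python
import numpy as np

# Hypothetical embeddings; real vectors have hundreds of entangled dimensions.
vecs = {
    "king":  np.array([0.9,  0.8, 0.1]),
    "queen": np.array([0.9, -0.7, 0.2]),
    "man":   np.array([0.1,  0.9, 0.3]),
    "woman": np.array([0.1, -0.8, 0.2]),
}

# Estimate a "gender direction" as the average male-minus-female difference.
gender = ((vecs["king"] - vecs["queen"]) + (vecs["man"] - vecs["woman"])) / 2
gender /= np.linalg.norm(gender)

# Projecting any word onto this direction reads off its gender component.
for word, v in vecs.items():
    print(f"{word:>6}: {v @ gender:+.2f}")
```

"King" and "man" project positively, "queen" and "woman" negatively — the direction, not any single coordinate, carries the concept.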

Practical Considerations

Choosing Dimensions

  • 50 dimensions: Okay for small vocabularies and simple tasks
  • 100-200 dimensions: Good general purpose
  • 300 dimensions: Standard for Word2Vec (what Google released)
  • Beyond 300: Diminishing returns for most tasks

Window Size

  • Small window (2-5): Captures syntactic similarity ("dog" ≈ "cat": nouns that behave the same way in a sentence)
  • Large window (5-15): Captures topical similarity ("dog" ≈ "puppy" ≈ "leash": words from the same subject area)

Vocabulary Size

Word2Vec typically works with the top 50,000-200,000 most frequent words. Rare words get discarded or mapped to an "unknown" token.
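Building such a capped vocabulary is a few lines with `collections.Counter`. A sketch (the `<unk>` token name and cutoff values are conventional choices, not fixed by the algorithm):

```python
from collections import Counter

def build_vocab(tokens, max_size=50_000, min_count=2):
    """Keep the most frequent words; everything else maps to <unk>."""
    counts = Counter(tokens)
    keep = [w for w, c in counts.most_common(max_size) if c >= min_count]
    vocab = {"<unk>": 0}
    for w in keep:
        vocab[w] = len(vocab)
    return vocab

tokens = ("the cat sat on the mat " * 5).split() + ["axolotl"]  # one rare word
vocab = build_vocab(tokens, max_size=10)
ids = [vocab.get(w, vocab["<unk>"]) for w in tokens]  # rare words -> id 0
```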

Pre-trained Word Vectors

You don't have to train Word2Vec yourself. Google, Stanford, and Facebook have released pre-trained vectors:

Model                 Dimensions   Vocabulary    Training Data
Google Word2Vec       300          3M words      Google News (100B words)
GloVe (Stanford)      50-300       400K-2.2M     Common Crawl, Wikipedia
FastText (Facebook)   300          2M words      Common Crawl + Wikipedia

Download them, load into your project, and you instantly have rich word representations trained on billions of words.
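The word2vec text format these releases use is simple: a header line with the vocabulary size and dimensionality, then one line per word. A minimal parser (shown here on a tiny in-memory sample rather than a real download; in practice you would more likely use gensim's `KeyedVectors.load_word2vec_format`):

```python
import io
import numpy as np

def load_word2vec_text(fileobj):
    """Parse word2vec text format: header 'vocab_size dims', then
    one 'word v1 v2 ...' line per word."""
    n, dims = map(int, fileobj.readline().split())
    vectors = {}
    for _ in range(n):
        parts = fileobj.readline().split()
        vectors[parts[0]] = np.array(parts[1:dims + 1], dtype=np.float32)
    return vectors

# Tiny in-memory stand-in for a real vector file.
sample = io.StringIO("2 3\nking 0.1 0.2 0.3\nqueen 0.1 0.2 0.4\n")
vectors = load_word2vec_text(sample)
print(vectors["king"])  # [0.1 0.2 0.3]
```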

Word2Vec's Legacy

Word2Vec didn't just solve word representation — it launched a revolution:

Before Word2Vec (pre-2013): NLP models used sparse, hand-crafted features. Each task required domain expertise to design features.

After Word2Vec: Dense, learned representations became the standard. The idea of learning representations from data spread to every corner of machine learning.

The evolution continues:

  • GloVe (2014): Combines Word2Vec's context approach with global co-occurrence statistics
  • FastText (2016): Handles subwords, so it can represent words it's never seen
  • ELMo (2018): Context-dependent embeddings — "bank" gets different vectors in "river bank" vs "bank account"
  • BERT (2018): Deep bidirectional context — understands entire sentences
  • GPT (2018-present): The embedding IS the model — language understanding and generation unified

Every modern language model — ChatGPT, Claude, Gemini — traces its lineage back to Word2Vec's insight: meaning can be encoded as geometry.


See It In Action

Our interactive visualization lets you explore word embeddings in 2D space. See how similar words cluster together, compute cosine similarity between word pairs, and perform the famous king-queen arithmetic yourself.

Watching words organize themselves by meaning in a vector space is one of those moments where AI stops being abstract and becomes tangible.
