Demystifying Word2vec: What It Really Learns and How

Overview

Word2vec remains one of the most influential algorithms in natural language processing. It converts words into dense vector representations that capture semantic relationships through simple vector arithmetic—famously enabling analogies like "king – man + woman ≈ queen." But for years, the inner workings of this deceptively simple model were understood only through empirical observation. A recent paper finally provides a rigorous, predictive theory: under realistic training conditions, word2vec’s learning process reduces to unweighted least-squares matrix factorization, and the final embeddings are exactly given by Principal Component Analysis (PCA). This tutorial walks through that result step by step, explaining what word2vec truly learns and why it behaves the way it does.

Source: bair.berkeley.edu
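
To make the analogy arithmetic concrete, here is a minimal sketch, assuming a hypothetical dictionary `vectors` that maps words to NumPy arrays (for instance, loaded from a pretrained model); the names and setup are illustrative, not taken from the paper.

```python
import numpy as np

def analogy(vectors, a, b, c):
    """Return the word whose vector is closest (by cosine similarity)
    to vec(b) - vec(a) + vec(c), excluding the three inputs."""
    query = vectors[b] - vectors[a] + vectors[c]
    query = query / np.linalg.norm(query)
    best, best_sim = None, -1.0
    for word, v in vectors.items():
        if word in (a, b, c):
            continue
        sim = float(query @ (v / np.linalg.norm(v)))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Hypothetical usage: analogy(vectors, "man", "king", "woman") -> "queen"
```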

Prerequisites

To get the most from this guide, you should be comfortable with:

  • Basic linear algebra – vectors, matrices, eigenvalues, singular value decomposition.
  • Gradient descent – how neural networks update weights to minimize a loss function.
  • Word embeddings – the general idea of mapping words to dense vectors (e.g., from an introductory NLP course).
  • Familiarity with matrix factorization (e.g., PCA, SVD) helps but isn’t strictly required.

Step-by-Step: What Word2vec Learns and How

1. Training Setup: The Minimal Language Model

Word2vec comes in two flavors: Skip-gram (predict context from target) and CBOW (predict target from context). Both train a shallow two-layer linear network with a contrastive objective (negative sampling or noise-contrastive estimation). The input is a one-hot vector representing a word, the hidden layer produces a dense embedding, and the output layer predicts context words. Despite the sigmoid in the loss, the network has no hidden-layer nonlinearity, so the score for each word pair is a plain dot product between embeddings.
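
To make the setup concrete, here is a minimal NumPy sketch of the skip-gram negative-sampling loss for a single (target, context) pair; the vocabulary size, embedding dimension, and number of negatives are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, num_negatives = 1000, 50, 5

# Input (word) and output (context) embedding tables.
W = rng.normal(scale=0.01, size=(vocab_size, dim))  # word vectors
C = rng.normal(scale=0.01, size=(vocab_size, dim))  # context vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(target, context, negatives):
    """Negative-sampling loss for one (target, context) pair.
    The pair score is a plain dot product: the only nonlinearity
    is the sigmoid inside the loss itself."""
    pos = -np.log(sigmoid(W[target] @ C[context]))
    neg = -np.sum(np.log(sigmoid(-W[target] @ C[negatives].T)))
    return pos + neg

negatives = rng.integers(0, vocab_size, size=num_negatives)
print(sgns_loss(target=3, context=17, negatives=negatives))
```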

2. The Key Insight: Small Initialization Changes Everything

The breakthrough theory assumes all embedding vectors are initialized randomly very close to the origin, effectively at zero. At the start of training, the embedding matrix is therefore essentially rank zero. Under mild approximations (e.g., expanding the loss around the origin and treating the updates as gradient flow), the learning dynamics become rank-incrementing: the model learns one “concept” at a time, with each concept corresponding to an orthogonal direction in the latent space. In other words, the embeddings expand from the origin to a one-dimensional subspace, then to a two-dimensional subspace, and so on, until the model’s capacity is saturated.
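
These dynamics are easy to reproduce on a toy problem. The sketch below is a simplification rather than the paper’s exact setup: it runs gradient descent on an unweighted factorization W Cᵀ ≈ M from a tiny initialization (W holds word vectors, C context vectors, and M is a synthetic rank-3 target standing in for the matrix defined in the next step) and prints how the effective rank of W Cᵀ grows one unit at a time while the loss falls in discrete drops.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5

# Synthetic target with three well-separated singular values.
U, _ = np.linalg.qr(rng.normal(size=(n, n)))
V, _ = np.linalg.qr(rng.normal(size=(n, n)))
M = U[:, :3] @ np.diag([10.0, 5.0, 2.0]) @ V[:, :3].T

# Tiny initialization: the embeddings start essentially at the origin.
W = 1e-4 * rng.normal(size=(n, d))
C = 1e-4 * rng.normal(size=(n, d))

lr = 0.01
for step in range(601):
    R = W @ C.T - M                            # least-squares residual
    W, C = W - lr * R @ C, C - lr * R.T @ W    # simultaneous gradient step
    if step % 50 == 0:
        s = np.linalg.svd(W @ C.T, compute_uv=False)
        rank = int(np.sum(s > 0.1))            # singular values switched on
        print(f"step {step:4d}  loss {np.sum(R**2):8.3f}  effective rank {rank}")
```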

3. Reduction to Matrix Factorization

The paper proves that, in this regime, the entire learning problem simplifies to an unweighted least-squares matrix factorization of the corpus's pointwise mutual information (PMI) matrix (or a shifted version of it). Specifically, the word-embedding matrix W and context-embedding matrix C are trained to satisfy W Cᵀ ≈ M, where M is related to the PMI matrix. Because the initialization is so small, gradient flow drives the system toward a low-rank factorization, and the solution converges to a truncated singular value decomposition (SVD) of M.
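
A minimal sketch of that target and its factorization on a toy corpus; the adjacent-pair windowing and the zero-clipping of empty cells are illustrative choices, not the paper’s exact preprocessing.

```python
import numpy as np

# Toy corpus with a context window of one word on each side (illustrative).
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

counts = np.zeros((len(vocab), len(vocab)))
for a, b in zip(corpus, corpus[1:]):           # adjacent-pair co-occurrences
    counts[idx[a], idx[b]] += 1
    counts[idx[b], idx[a]] += 1

# PMI(i, j) = log P(i, j) - log P(i)P(j); clip cells that never co-occur.
total = counts.sum()
p_joint = counts / total
p_word = counts.sum(axis=1) / total
with np.errstate(divide="ignore"):
    pmi = np.log(p_joint) - np.log(np.outer(p_word, p_word))
pmi[np.isneginf(pmi)] = 0.0

# Rank-d factorization W C^T ≈ M via truncated SVD.
d = 3
U, S, Vt = np.linalg.svd(pmi)
W = U[:, :d] * np.sqrt(S[:d])                  # word embeddings
C = Vt[:d].T * np.sqrt(S[:d])                  # context embeddings
print("reconstruction error:", np.linalg.norm(W @ C.T - pmi))
```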

4. Closed-Form Solution: PCA on Embeddings

A remarkable consequence is that the final learned embeddings are given exactly by PCA: the top principal directions of M, scaled by the square roots of the corresponding eigenvalues, are the word vectors. This explains the observed linear structure: the embedding space is organized along the directions of greatest variance in the co-occurrence statistics. The first principal component captures the dominant axis of variation (e.g., a broad syntactic or semantic contrast), and each subsequent component adds a new orthogonal direction.
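
In code, the closed form is just an eigendecomposition: keep the top-d eigenpairs of M and scale the eigenvectors by the square roots of the eigenvalues. A sketch using a synthetic symmetric positive semi-definite stand-in for M (the symmetric case, in which word and context embeddings coincide):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 12, 4

# Synthetic symmetric PSD stand-in for the (shifted) PMI matrix.
A = rng.normal(size=(n, n))
M = A @ A.T / n

# PCA: keep the top-d eigenpairs and scale by sqrt(eigenvalue).
eigvals, eigvecs = np.linalg.eigh(M)           # ascending eigenvalues
top = np.argsort(eigvals)[::-1][:d]
W = eigvecs[:, top] * np.sqrt(eigvals[top])    # embeddings = scaled components

# W W^T is the best rank-d approximation of M, which is what
# small-initialization training converges to in this symmetric case.
print("rank-d error:  ", np.linalg.norm(W @ W.T - M))
print("optimal error: ", np.sqrt(np.sum(np.sort(eigvals)[: n - d] ** 2)))
```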


5. Visualizing the Learning Steps

The paper includes striking visualizations: the loss decreases in discrete jumps, each corresponding to a new eigenvector being “activated.” Snapshots at three points in training show the embedding cloud starting near the origin, stretching into a one-dimensional line, then expanding into a plane, and so on. This stepwise behavior is characteristic of the rich, small-initialization regime (in contrast to the lazy dynamics described by neural tangent kernel theory) and explains why word2vec discovers interpretable linear directions: they are precisely the eigenvectors of the co-occurrence matrix.

Common Mistakes

  • Thinking initialization doesn’t matter: Many practitioners initialize word2vec with small random values and assume any value works. The theory shows that the scale of initialization critically influences whether the model learns in discrete steps or not. Large initialization can break the rank-incrementing dynamics.
  • Confusing the learned embeddings with the output layer: Word2vec has both input embeddings and output (context) embeddings. The paper focuses on the input embeddings (often called “word vectors”), but the output embeddings also participate in the factorization. Misunderstanding which matrix is being factorized leads to errors in interpretation.
  • Believing embeddings are truly semantic: While word2vec captures co-occurrence statistics, the linear relationships (like analogies) emerge from geometry, not explicit semantics. The theory clarifies that these directions are principal components of the PMI matrix, not necessarily cognitive concepts—though they correlate with them.
  • Ignoring the effect of negative sampling: The original word2vec uses negative sampling, which shifts the target matrix. The theory works under the assumption that the objective is a form of noise-contrastive estimation leading to the same factorization; changing the sampling parameters changes the matrix being factorized (see the shifted-PMI sketch after this list).
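
On that last point, the classic observation (Levy & Goldberg, 2014) is that skip-gram with k negative samples implicitly targets a shifted PMI matrix, so the number of negatives directly sets the shift. A minimal sketch with made-up PMI values:

```python
import numpy as np

def sgns_target(pmi, k):
    """Shifted-PMI target implicitly factorized by skip-gram
    with k negative samples: M[i, j] = PMI(i, j) - log k."""
    return pmi - np.log(k)

pmi = np.array([[1.2, 0.3],
                [0.3, 0.9]])                   # toy PMI values
print(sgns_target(pmi, k=5))                   # larger k shifts M downward
```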

Summary

Word2vec, when trained from very small initialization, learns word embeddings that are precisely the principal components of a shifted pointwise mutual information matrix. The learning happens in discrete, rank-incrementing steps, each adding a new orthogonal concept direction. This theory finally provides a quantitative, predictive explanation for word2vec’s success at analogies and linear representations. It bridges the gap between heuristic embedding algorithms and principled matrix factorization, offering a foundation for understanding more modern language models. By demystifying what word2vec learns, we gain a deeper appreciation of how even simple linear models can extract rich semantic structure from co-occurrence statistics.
