Word2Vec Implementation from Scratch
Table of Contents
- Introduction
- Data Preparation
- Creating Training Data
- Skip Gram with Negative Sampling
- CBOW with Negative Sampling
- Intrinsic Evaluation
Each section lives in its own notebook and links back here when needed. Let's dive in!
1. Introduction to Word2Vec
1.1 What Are Word Embeddings?
Word embeddings are a way to represent words as dense vectors in a continuous space, where semantically similar words are close together.
Unlike one-hot encoding (where each word is just a binary blip in a huge sparse space), embeddings capture meaning and relationships between words.
Key features:
- Dense vectors: Typically 50-300 dimensions (we'll use 16 for faster training)
- Learned from data: Capture semantic relationships automatically
- Contextual similarity: Words used in similar contexts have similar vectors
- Algebraic properties: Support meaningful vector operations (e.g., king - man + woman ≈ queen; see the sketch after this list)
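To make the "algebraic properties" point concrete, here is a tiny sketch of that vector arithmetic using made-up 3-dimensional vectors. The numbers are purely illustrative and are not part of the pipeline built later in this notebook; a real model learns these values.
import numpy as np
# Toy 3-dimensional "embeddings" for illustration only
toy_emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.9, 0.0]),
    "woman": np.array([0.5, 0.1, 0.0]),
    "queen": np.array([0.9, 0.0, 0.1]),
}
def toy_cosine(u, v):
    # Cosine similarity: normalized dot product
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
# king - man + woman should land closest to queen in a well-trained space
query = toy_emb["king"] - toy_emb["man"] + toy_emb["woman"]
print(max(toy_emb, key=lambda w: toy_cosine(query, toy_emb[w])))  # -> queen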
1.2 Word2Vec Architectures
Word2Vec (Mikolov et al., 2013) introduced two neural models for learning these embeddings: Skip-gram and CBOW. Let's break them down:
Skip-gram
- Objective: Predict context words from the target word
- Input: A single target word
- Output: Probability distribution over possible context words
- Training pairs: (target_word, context_word)
- Strengths:
- Performs well with small training datasets
- Better at representing rare words
- Can capture multiple meanings per word
CBOW (Continuous Bag of Words)
- Objective: Predict target word from context words
- Input: Multiple context words (averaged together)
- Output: Probability distribution over possible target words
- Training pairs: (context_words, target_word)
- Strengths:
- Trains faster than Skip-gram
- Slightly better accuracy for frequent words
- Smoother word representations
1.3 Why Build Word2Vec Ourselves?
While pre-trained embeddings like GloVe or Word2Vec are readily available, building Word2Vec from scratch offers several benefits:
- Understanding the mechanics: See exactly how embeddings arise from word co-occurrence patterns
- Customization: Tailor the architecture and hyperparameters to specific needs
- Domain adaptation: Train on domain-specific text for specialized embeddings
- Educational value: Gain insights into neural network training and optimization
In the following notebooks, we'll implement both Skip-gram and CBOW architectures and evaluate their performance on various intrinsic tasks. We'll use a smaller embedding dimension (16 instead of the typical 300) to speed up training while still capturing meaningful semantic relationships.
2. Preparing the Data
2.1 Loading and Preprocessing Text
The quality of word embeddings depends heavily on the quality and quantity of training data. For this implementation, we're using the Text8 dataset, which contains cleaned Wikipedia text.
The preprocessing steps are crucial:
- Tokenization: Split the text into words
- Case normalization: Convert all text to lowercase (already done in Text8)
- Special character handling: Remove or replace special characters
- Handling rare words: Decide how to treat infrequent terms
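Text8 is already lowercased with punctuation removed, so the loading code below simply reads the file and splits on whitespace. For raw text you would normally run a small cleaning pass first; the function below is a minimal illustrative sketch of such a pass and is not used anywhere else in this notebook.
import re
def simple_clean(raw_text):
    """Lowercase, replace anything that is not a letter with a space, then tokenize."""
    lowered = raw_text.lower()
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)  # punctuation and digits become spaces
    return letters_only.split()
print(simple_clean("Hello, World! 42 times."))  # ['hello', 'world', 'times']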
Let's examine our dataset and perform the necessary preprocessing:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
import re
import random
from tqdm.notebook import tqdm
# Download and load a text corpus (smaller dataset for faster processing)
!wget -nc http://mattmahoney.net/dc/text8.zip
!unzip -o text8.zip
# Load the data
with open('text8', 'r') as f:
text = f.read()[:5000000]
print(f"Total characters: {len(text)}")
print(f"First 500 characters: {text[:500]}")
File 'text8.zip' already there; not retrieving. Archive: text8.zip inflating: text8 Total characters: 5000000 First 500 characters: anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as a positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism as a political philoso
2.2 Building the Vocabulary
Natural language isn't fair: a few words get used a lot, and most words barely show up. This phenomenon is known as Zipf's Law: word frequency follows a power-law distribution.
You can see this in the frequency plots below: a small number of words (like "the", "and", "of") dominate, while the majority appear rarely.
This skewed distribution creates a few challenges:
- Common words flood the training process but don't add much semantic insight
- Rare words don't appear enough to learn meaningful embeddings
- A huge vocabulary makes training slower and memory-heavy
Our Strategy
To make training more efficient and meaningful, we'll:
- Remove very rare words: filter out those below a minimum frequency
- (Optionally) Subsample very frequent words to reduce their impact during training (a sketch of the standard heuristic follows below)
- Build a fast lookup table that maps each word to a unique index for efficient computation
With this cleaned-up vocabulary, we're ready to turn raw text into training-ready data!
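For reference, the optional subsampling step mentioned above is usually implemented with the heuristic from the original Word2Vec paper: each occurrence of word w is kept with probability roughly sqrt(t / f(w)), where f(w) is the word's relative frequency and t is a small threshold such as 1e-5. The sketch below shows that idea; we do not apply it in this notebook.
import numpy as np
from collections import Counter
def subsample_frequent(tokens, t=1e-5, seed=0):
    """Keep each occurrence of word w with probability min(1, sqrt(t / f(w)))."""
    rng = np.random.default_rng(seed)
    counts = Counter(tokens)
    total = len(tokens)
    keep_prob = {w: min(1.0, np.sqrt(t / (c / total))) for w, c in counts.items()}
    return [w for w in tokens if rng.random() < keep_prob[w]]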
# Tokenize the text
words = text.split()
print(f"Total words: {len(words)}")
print(f"First 20 words: {words[:20]}")
# Build vocabulary
word_counts = Counter(words)
print(f"Unique words: {len(word_counts)}")
# Plot word frequency distribution
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
word_freq = sorted(word_counts.values(), reverse=True)
plt.plot(word_freq[:100])
plt.title('Top 100 Word Frequencies')
plt.xlabel('Word Rank')
plt.ylabel('Frequency')
plt.grid(True)
plt.subplot(1, 2, 2)
plt.plot(word_freq)
plt.title('All Word Frequencies (Log Scale)')
plt.xlabel('Word Rank')
plt.ylabel('Frequency')
plt.yscale('log')
plt.xscale('log')
plt.grid(True)
plt.tight_layout()
plt.show()
Total words: 846987 First 20 words: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english'] Unique words: 47046
2.3 Filtering the Vocabulary
To keep things manageable, we've set a minimum frequency threshold of 5, meaning any word that appears fewer than 5 times gets filtered out.
This has a big impact: our vocabulary size drops significantly, but we still keep good coverage of the dataset.
Why This Matters
There's a tradeoff here:
- Higher threshold → smaller vocab = faster training, but more "unknown" words
- Lower threshold → better coverage, but slower training and noisy embeddings for rare words
In real-world applications, thresholds like 20-100 are common. But for our learning-focused setup, 5 hits a nice sweet spot.
Even with this filter, our vocabulary still covers 90%+ of the text, proof that a small set of words carries most of the meaning!
# Filter vocabulary based on word frequency
min_count = 5 # Minimum word frequency
vocabulary = {word: count for word, count in word_counts.items() if count >= min_count}
print(f"Vocabulary size after filtering: {len(vocabulary)}")
# Analyze the coverage of our vocabulary
total_word_count = len(words)
covered_word_count = sum(word_counts[word] for word in vocabulary)
coverage_percentage = (covered_word_count / total_word_count) * 100
print(f"Original vocabulary size: {len(word_counts)}")
print(f"Filtered vocabulary size: {len(vocabulary)}")
print(f"Words retained: {covered_word_count:,} out of {total_word_count:,} ({coverage_percentage:.2f}%)")
# Plot the effect of different min_count thresholds
thresholds = [1, 2, 5, 10, 20, 50, 100]
vocab_sizes = []
coverage_percentages = []
for threshold in thresholds:
filtered_vocab = {word: count for word, count in word_counts.items() if count >= threshold}
vocab_sizes.append(len(filtered_vocab))
covered = sum(word_counts[word] for word in filtered_vocab)
coverage_percentages.append((covered / total_word_count) * 100)
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(thresholds, vocab_sizes, 'o-')
plt.title('Vocabulary Size vs Min Count')
plt.xlabel('Minimum Count Threshold')
plt.ylabel('Vocabulary Size')
plt.grid(True)
plt.subplot(1, 2, 2)
plt.plot(thresholds, coverage_percentages, 'o-')
plt.title('Text Coverage vs Min Count')
plt.xlabel('Minimum Count Threshold')
plt.ylabel('Coverage (%)')
plt.grid(True)
plt.tight_layout()
plt.show()
Vocabulary size after filtering: 12515 Original vocabulary size: 47046 Filtered vocabulary size: 12515 Words retained: 791,689 out of 846,987 (93.47%)
2.4 Creating Word Mappings
After filtering, we've reduced the vocabulary from 47,046 words down to 12,515, a ~73% reduction!
Yet we've still retained 93.47% of the total words in the text (791,689 out of 846,987).
This shows just how concentrated language usage is: most of the meaning comes from a relatively small set of words.
To work with our Word2Vec model, we now need two key mappings:
word2idx
- Maps each word to a unique integer index
- Used to turn words into numbers for training
- Enables fast lookup during model computation
idx2word
- Maps each index back to the original word
- Useful for inspecting predictions and visualizations
- Helps convert model outputs into readable text
Why this matters:
- Vocabulary size determines the size of the model's output layer
- It directly affects training speed, memory usage, and overall efficiency
We'll use these mappings to generate training pairs for both the Skip-gram and CBOW architectures.
# Create word-to-index and index-to-word mappings
word2idx = {word: i for i, word in enumerate(vocabulary.keys())}
idx2word = {i: word for word, i in word2idx.items()}
vocab_size = len(word2idx)
# Filter the corpus to include only words in our vocabulary
filtered_words = [word for word in words if word in vocabulary]
print(f"Words retained after filtering: {len(filtered_words)} ({len(filtered_words)/len(words):.2%})")
Words retained after filtering: 791689 (93.47%)
3. Creating Training Examples
3.1 Understanding the Context Window
Word2Vec learns meaning from context, and the idea is simple:
Words that appear in similar surroundings tend to have similar meanings.
To capture this, we use a sliding context window to extract training pairs from our text.
How it works:
- For each word in the corpus (the target word)
- We look at a window of surrounding words (the context words)
- The window size controls how many words before and after are included
- For every target word, we pair it with each context word within that window
Example (Window Size = 2)
Text: "the cat sat on the mat"
Target word: "sat"
Context window: ["the", "cat", "on", "the"]
Training pairs:
("sat", "the") ("sat", "cat") ("sat", "on") ("sat", "the")
We'll generate these pairs for every word in our (filtered) corpus. This is the training data that powers both Skip-gram and CBOW models.
Up next: we'll implement the code that creates these training examples.
# Demonstrate the context window
def visualize_context_window(text, window_size=2):
words = text.split()
examples = []
for i in range(len(words)):
target = words[i]
# Calculate context range (window_size words on each side)
context_start = max(0, i - window_size)
context_end = min(len(words), i + window_size + 1)
# Get context words (excluding the target)
context = words[context_start:i] + words[i+1:context_end]
examples.append((target, context))
return examples
# Example text
sample_text = "the quick brown fox jumps over the lazy dog"
examples = visualize_context_window(sample_text, window_size=2)
# Create a visualization
plt.figure(figsize=(8, 10))
plt.title("Context Window Visualization (Window Size = 2)")
plt.axis('off')
for i, (target, context) in enumerate(examples):
# Display target word
plt.text(0.3, -i/len(examples), target, fontsize=10,
ha='center', bbox=dict(facecolor='lightblue', alpha=0.5, pad=5))
# Display context words
for j, word in enumerate(context):
x_offset = (j * 0.2)
plt.text(x_offset, -i/len(examples), word, fontsize=10,
ha='center', bbox=dict(facecolor='lightyellow', alpha=0.5, pad=5))
plt.tight_layout()
plt.show()
3.2 Generating Skip-gram Training Pairs
In the Skip-gram model, our goal is to predict context words from a given target word.
That means for every target word, we create multiple training pairs, one for each context word that appears within a sliding window.
How It Works:
- Slide a context window over the corpus
- For each word at the center (our target)...
- Grab the words around it within the window (these are the context words)
- Create training pairs in the form (target_word, context_word)
Example
If our window size is 2 and the center word is "sat", we generate:
("sat", "the") ("sat", "cat") ("sat", "on") ("sat", "the")
Pros & Tradeoffs
- Pro: Great for rare words and capturing subtle semantic differences
- Con: It produces lots of training examples, which can slow down training
Let's implement this logic in code and prepare the dataset for training our Skip-gram model.
def generate_skipgram_pairs(corpus, window_size, word2idx):
"""
Generate training pairs for the Skip-gram model.
Args:
corpus: List of tokenized words
window_size: Number of words to consider as context on each side
word2idx: Word to index mapping
Returns:
pairs: List of (target_idx, context_idx) pairs
"""
pairs = []
# Loop through each word in corpus
for i in tqdm(range(len(corpus)), desc="Generating Skip-gram pairs"):
target = word2idx[corpus[i]]
# Get context words within the window
context_start = max(0, i - window_size)
context_end = min(len(corpus), i + window_size + 1)
for j in range(context_start, context_end):
if i != j: # Skip the target word itself
if corpus[j] in word2idx: # Check if context word is in vocabulary
context = word2idx[corpus[j]]
pairs.append((target, context))
return np.array(pairs)
# Generate Skip-gram training pairs
window_size = 2 # Context window of 2 words on each side
skipgram_pairs = generate_skipgram_pairs(filtered_words, window_size, word2idx)
print(f"Generated {len(skipgram_pairs)} Skip-gram training pairs")
print(f"Sample pairs (word indices):")
for i in range(5):
target_idx, context_idx = skipgram_pairs[i]
print(f" Target: {idx2word[target_idx]} ({target_idx}) โ Context: {idx2word[context_idx]} ({context_idx})")
Generated 3166750 Skip-gram training pairs Sample pairs (word indices): Target: anarchism (0) → Context: originated (1) Target: anarchism (0) → Context: as (2) Target: originated (1) → Context: anarchism (0) Target: originated (1) → Context: as (2) Target: originated (1) → Context: a (3)
3.3 Generating CBOW Training Pairs
The CBOW (Continuous Bag of Words) model flips the Skip-gram idea on its head:
Instead of predicting context from a word, we predict the target word from its context.
This means we group the surrounding words together and try to guess the word in the middle.
How It Works:
- Slide a context window over the text
- For each position, treat the center word as the target
- Take the surrounding words as the context
- Create a training example like (context_words, target_word)
Example
With a window size of 2 and the center word "sat":
Context: ["the", "cat", "on", "the"] Target: "sat" Training pair: (["the", "cat", "on", "the"], "sat")
Pros & Tradeoffs
- Pro: Faster training than Skip-gram
- Pro: Performs well on frequent words
- Con: Might struggle with rare or ambiguous words
Let's now implement the CBOW training data generation logic in code.
def generate_cbow_pairs(corpus, window_size, word2idx):
"""
Generate training pairs for the CBOW model.
Args:
corpus: List of tokenized words
window_size: Number of words to consider as context on each side
word2idx: Word to index mapping
Returns:
X: Context word indices (each row contains 2*window_size indices)
y: Target word indices
"""
X = [] # Context words
y = [] # Target words
# Loop through each word in corpus
for i in tqdm(range(len(corpus)), desc="Generating CBOW pairs"):
if corpus[i] not in word2idx:
continue
target = word2idx[corpus[i]]
# Get context words within the window
context_start = max(0, i - window_size)
context_end = min(len(corpus), i + window_size + 1)
# Get context word indices
context_indices = []
for j in range(context_start, context_end):
if i != j and corpus[j] in word2idx: # Skip target & ensure word is in vocabulary
context_indices.append(word2idx[corpus[j]])
# Only use examples with sufficient context
if len(context_indices) == 2 * window_size: # Full context window
X.append(context_indices)
y.append(target)
return np.array(X), np.array(y)
# Generate CBOW training pairs
window_size = 2 # Context window of 2 words on each side
X_cbow, y_cbow = generate_cbow_pairs(filtered_words, window_size, word2idx)
print(f"Generated {len(X_cbow)} CBOW training examples")
print(f"Shape of X_cbow: {X_cbow.shape}")
print(f"Shape of y_cbow: {y_cbow.shape}")
# Show sample CBOW training examples
print("\nSample CBOW examples:")
for i in range(5):
context_indices = X_cbow[i]
target_idx = y_cbow[i]
context_words = [idx2word[idx] for idx in context_indices]
target_word = idx2word[target_idx]
print(f" Context: {context_words} โ Target: {target_word}")
Generated 791685 CBOW training examples Shape of X_cbow: (791685, 4) Shape of y_cbow: (791685,) Sample CBOW examples: Context: ['anarchism', 'originated', 'a', 'term'] → Target: as Context: ['originated', 'as', 'term', 'of'] → Target: a Context: ['as', 'a', 'of', 'abuse'] → Target: term Context: ['a', 'term', 'abuse', 'first'] → Target: of Context: ['term', 'of', 'first', 'used'] → Target: abuse
3.4 Comparing Skip-gram vs. CBOW
Before we move on, let's pause and compare the two training approaches side by side:
Aspect | Skip-gram | CBOW |
---|---|---|
Prediction Direction | Target → Context | Context → Target |
Input | One word | Multiple words |
Output | One context word at a time | One target word |
Training Examples | More (one per context word) | Fewer (one per target) |
Training Speed | Slower | Faster |
Good for Rare Words | Yes | Not as good |
Captures Word Senses | Better | Limited |
So Which One Should You Use?
Both models have their strengths:
- Skip-gram: Great for small datasets, rare words, and learning multiple meanings of words
- CBOW: Faster and often slightly better for frequent words and large corpora
In real-world use (e.g., the original Word2Vec paper), Skip-gram is usually preferred for its superior performance on semantic tasks, even if it takes a bit longer to train.
Up next: we'll define the neural architectures that turn these training pairs into powerful word embeddings!
4. Skip-gram Model Implementation
4.1 The Skip-gram Architecture with Negative Sampling
Now that we've generated our training pairs, it's time to move into modeling, starting with Skip-gram enhanced with negative sampling.
Quick Recap
Earlier, we generated Skip-gram training pairs in the vanilla style, where each (target, context) pair would be used to predict the full probability distribution over the entire vocabulary.
That's perfectly valid, but it's computationally expensive, especially with vocabularies of 10,000+ words.
So instead, we'll now switch to a more efficient and scalable approach:
Skip-gram with Negative Sampling
This requires rethinking how we generate training data and how we structure the model, but it'll make everything much faster and more practical, even on large datasets.
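Concretely, the vanilla formulation models the context probability with a softmax over the whole vocabulary, so every update has to evaluate the full denominator:
$$ P(context \mid target) = \frac{\exp(v_{context} \cdot v_{target})}{\sum_{w \in V} \exp(v_{w} \cdot v_{target})} $$
With roughly 12,500 words in our vocabulary, that sum over $V$ is exactly the cost negative sampling lets us avoid.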
What Is Negative Sampling?
Instead of trying to predict the entire vocabulary, we reframe the problem as a binary classification task:
- Is this word a true context word for the target? (a positive example)
- Or is it just a randomly sampled noise word? (a negative example)
How It Works:
- For each positive pair (target, context)
- Sample a few negative pairs: words that are not actual context
- Train the model to:
- Push positive pairs closer in embedding space
- Push negative pairs apart
Why Use It?
- Faster training: Only a few dot products per example instead of tens of thousands
- Better results: Especially for rare words and semantic similarity
- Scalable: Works well even for massive corpora
Training Objective
For each training pair, we maximize the following objective (its negation is the loss we minimize):
$$ \log(\sigma(v_{target} \cdot v_{context})) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}[\log(\sigma(-v_{target} \cdot v_{w_i}))] $$
Where:
- $\sigma$ = sigmoid function
- $v_{target}$ = embedding of the center (input) word
- $v_{context}$ = embedding of a true context word
- $v_{w_i}$ = embeddings of negative samples
- $k$ = number of negative samples
- $P_n(w)$ = noise distribution (usually unigram freq raised to the 3/4 power)
Note: Since our earlier training pairs were built for the vanilla Skip-gram model, we'll now generate a new version of the training dataset tailored for negative sampling. The implementation will be a bit trickier, but far more efficient.
Let's dive in.
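To make the formula concrete, here is a small numpy sketch that evaluates this objective (negated, as the loss we minimize) for one training example, using random vectors in place of learned embeddings. It is illustrative only; the Keras loss further below computes the same per-pair terms via sigmoid cross-entropy, averaged rather than summed.
import numpy as np
rng = np.random.default_rng(0)
dim, k = 16, 4                              # embedding size and number of negative samples
v_target = rng.normal(size=dim)             # stand-in for the center word's embedding
v_context = rng.normal(size=dim)            # stand-in for one true context word
v_negatives = rng.normal(size=(k, dim))     # stand-ins for k sampled noise words
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
# loss = -[ log sigma(t . c) + sum_i log sigma(-t . n_i) ]
positive_term = np.log(sigmoid(v_target @ v_context))
negative_term = np.sum(np.log(sigmoid(-v_negatives @ v_target)))
print(f"Negative-sampling loss for one example: {-(positive_term + negative_term):.4f}")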
4.2 Implementing the Skip-gram Model with Negative Sampling
Now that we understand the theory behind Skip-gram with negative sampling, let's implement it in TensorFlow/Keras!
Input Format
Our model takes in three components:
- Target word index (center word)
- Positive context word index
- Negative word indices (randomly sampled "noise" words)
Model Structure
We've built a custom Keras model with the following components:
- target_embedding: Learns vector representations for input (center) words
- context_embedding: Learns vector representations for context words (used for both positive and negative samples)
- Dot products are computed between target and context embeddings:
- Positive dot product for real context words
- Negative dot products for sampled non-context words
These are passed to a custom loss function that encourages the model to:
- Assign high similarity to real (target, context) pairs
- Assign low similarity to randomly sampled negatives
Loss Function
We're using a custom negative sampling loss, which:
- Applies sigmoid cross-entropy
- Labels the first value in each output vector as 1 (positive pair)
- Labels all others as 0 (negatives)
This simplifies our objective into a set of binary classification tasks per training example.
Preparing Training Data
Because negative sampling changes our training format, we regenerate the dataset using a helper function:
targets, contexts, negatives = prepare_training_data(skipgram_pairs, vocab_size, num_ns=4)
We then wrap it all in a tf.data.Dataset pipeline with:
- Batching
- Shuffling
- Prefetching for performance
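One simplification worth flagging: the helper used below draws negative indices uniformly at random over the vocabulary. The original paper instead samples them from the unigram distribution raised to the 3/4 power. A minimal sketch of that alternative, reusing word_counts, idx2word, and vocab_size from Section 2 (not wired into the model code that follows):
import numpy as np
# Unigram frequencies of the in-vocabulary words, in index order
freqs = np.array([word_counts[idx2word[i]] for i in range(vocab_size)], dtype=np.float64)
noise_dist = freqs ** 0.75
noise_dist /= noise_dist.sum()
rng = np.random.default_rng(42)
sampled_negatives = rng.choice(vocab_size, size=4, replace=True, p=noise_dist)
print(sampled_negatives)  # four word indices, biased toward (but not dominated by) frequent words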
Embedding Dimension
We're using a small embedding size of 16 dimensions to keep training fast during experimentation.
(In practice, 100-300 dimensions are more common.)
Model Compilation
We compile the model using:
- Adam optimizer for efficient gradient updates
- Our custom loss function tailored for negative sampling
Once compiled, we can start training!
Let's check out the model summary to see the architecture.
from tensorflow import keras
from keras import layers
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from tqdm.notebook import tqdm
import time
class Word2VecNegativeSampling(keras.Model):
def __init__(self, vocab_size, embedding_dim, num_ns=4):
super(Word2VecNegativeSampling, self).__init__()
self.target_embedding = layers.Embedding(
vocab_size,
embedding_dim,
name="w2v_embedding",
)
self.context_embedding = layers.Embedding(
vocab_size,
embedding_dim,
name="context_embedding"
)
self.num_ns = num_ns
def call(self, inputs):
target_inputs, context_inputs, negative_inputs = inputs
# Get embeddings
# Shape: (batch_size, 1, embedding_dim)
target_embedding = self.target_embedding(target_inputs)
# Shape: (batch_size, 1, embedding_dim)
context_embedding = self.context_embedding(context_inputs)
# Shape: (batch_size, num_ns, embedding_dim)
negative_embedding = self.context_embedding(negative_inputs)
# Remove the extra dimension from target and context
# Shape: (batch_size, embedding_dim)
target_embedding_flat = tf.squeeze(target_embedding, axis=1)
# Shape: (batch_size, embedding_dim)
context_embedding_flat = tf.squeeze(context_embedding, axis=1)
# Compute positive dot product (target ยท context)
# Shape: (batch_size, 1)
positive_dots = tf.reduce_sum(
tf.multiply(target_embedding_flat, context_embedding_flat),
axis=1, keepdims=True
)
# Reshape target for negative samples
# Shape: (batch_size, 1, embedding_dim)
target_for_negatives = tf.expand_dims(target_embedding_flat, axis=1)
# Compute negative dot products (target ยท negative samples)
# Broadcasting handles the shapes: (batch_size, 1, embedding_dim) * (batch_size, num_ns, embedding_dim)
# Result shape: (batch_size, num_ns)
negative_dots = tf.reduce_sum(
tf.multiply(target_for_negatives, negative_embedding),
axis=2
)
# Combine positive and negative dots
# Shape: (batch_size, 1 + num_ns)
dots = tf.concat([positive_dots, negative_dots], axis=1)
return dots
# Custom loss function to handle the combined logits
def w2v_negative_sampling_loss(y_true, y_pred):
"""
Word2Vec negative sampling loss.
y_pred contains dot products for positive (first column) and negative samples (remaining columns)
"""
# Create labels: 1 for positive sample, 0 for negative samples
batch_size = tf.shape(y_pred)[0]
# Shape: (batch_size, 1 + num_ns)
labels = tf.concat([
tf.ones((batch_size, 1), dtype=tf.float32),
tf.zeros((batch_size, tf.shape(y_pred)[1] - 1), dtype=tf.float32)
], axis=1)
# Apply sigmoid cross entropy
return tf.reduce_mean(
tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=y_pred)
)
# Function to prepare training data
def prepare_training_data(skipgram_pairs, vocab_size, num_ns=4, seed=42):
"""Generate properly formatted data for negative sampling"""
np.random.seed(seed)
# Extract target and context words
targets, contexts = skipgram_pairs[:, 0], skipgram_pairs[:, 1]
# Generate negative samples
negatives = []
for target, context in zip(targets, contexts):
negative_samples = []
while len(negative_samples) < num_ns:
neg = np.random.randint(0, vocab_size)
if neg != target and neg != context:
negative_samples.append(neg)
negatives.append(negative_samples)
# Convert to arrays and reshape
targets = np.array(targets, dtype=np.int32)[:, np.newaxis]
contexts = np.array(contexts, dtype=np.int32)[:, np.newaxis]
negatives = np.array(negatives, dtype=np.int32)
return targets, contexts, negatives
# Prepare the training data (assuming skipgram_pairs is already defined)
targets, contexts, negatives = prepare_training_data(
skipgram_pairs, vocab_size, num_ns=4
)
print(f"Prepared training data:")
print(f" Targets shape: {targets.shape}")
print(f" Contexts shape: {contexts.shape}")
print(f" Negatives shape: {negatives.shape}")
# Create TensorFlow dataset
dataset = tf.data.Dataset.from_tensor_slices(
((targets, contexts, negatives), np.zeros(len(targets)))
)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(512)
dataset = dataset.prefetch(tf.data.AUTOTUNE)
# Create and compile model
embedding_dim = 16
num_ns = 4
model = Word2VecNegativeSampling(vocab_size, embedding_dim, num_ns)
# Compile with loss function
model.compile(
optimizer=keras.optimizers.Adam(learning_rate=0.01),
loss=w2v_negative_sampling_loss
)
model.summary()
Prepared training data: Targets shape: (3166750, 1) Contexts shape: (3166750, 1) Negatives shape: (3166750, 4)
Model: "word2_vec_negative_sampling"
Layer (type) | Output Shape | Param #
w2v_embedding (Embedding) | ? | 0 (unbuilt)
context_embedding (Embedding) | ? | 0 (unbuilt)
Total params: 0 (0.00 B)
Trainable params: 0 (0.00 B)
Non-trainable params: 0 (0.00 B)
# Train model with early stopping
early_stopping = tf.keras.callbacks.EarlyStopping(
monitor='loss', patience=5, restore_best_weights=True
)
print("Beginning training...")
history = model.fit(
dataset,
steps_per_epoch=1,
epochs=1,
verbose=1,
callbacks=[early_stopping]
)
skipgram_initial_embeddings = model.target_embedding.get_weights()[0]
history = model.fit(
dataset,
epochs=20,
verbose=1,
callbacks=[early_stopping]
)
# Plot loss history
plt.figure(figsize=(10, 6))
plt.plot(history.history['loss'])
plt.title('Word2Vec Negative Sampling Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.grid(True)
plt.show()
Beginning training...
1/1 - 0s 389ms/step - loss: 0.6932
Epoch 1/20 6186/6186 - 14s 2ms/step - loss: 0.2820
Epoch 2/20 6186/6186 - 14s 2ms/step - loss: 0.2471
Epoch 3/20 6186/6186 - 15s 2ms/step - loss: 0.2424
Epoch 4/20 6186/6186 - 15s 2ms/step - loss: 0.2408
Epoch 5/20 6186/6186 - 13s 2ms/step - loss: 0.2402
Epoch 6/20 6186/6186 - 13s 2ms/step - loss: 0.2398
Epoch 7/20 6186/6186 - 13s 2ms/step - loss: 0.2396
Epoch 8/20 6186/6186 - 13s 2ms/step - loss: 0.2397
Epoch 9/20 6186/6186 - 13s 2ms/step - loss: 0.2394
Epoch 10/20 6186/6186 - 13s 2ms/step - loss: 0.2394
Epoch 11/20 6186/6186 - 14s 2ms/step - loss: 0.2395
Epoch 12/20 6186/6186 - 14s 2ms/step - loss: 0.2395
Epoch 13/20 6186/6186 - 14s 2ms/step - loss: 0.2398
Epoch 14/20 6186/6186 - 14s 2ms/step - loss: 0.2396
# Extract learned embeddings
skipgram_learned_embeddings = model.target_embedding.get_weights()[0]
print(f"Learned embeddings shape: {skipgram_learned_embeddings.shape}")
def find_similar_words(word, embeddings, word2idx, idx2word, top_n=5):
"""Find most similar words based on cosine similarity"""
if word not in word2idx:
return []
word_idx = word2idx[word]
word_vec = embeddings[word_idx]
# Normalize vectors for cosine similarity
norm_word = np.linalg.norm(word_vec)
if norm_word > 0:
word_vec = word_vec / norm_word
# Normalize all embeddings
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
norms = np.maximum(norms, 1e-10) # Avoid division by zero
normalized_embeddings = embeddings / norms
# Calculate similarities
similarities = np.dot(normalized_embeddings, word_vec)
# Get top n similar words
most_similar = []
for idx in similarities.argsort()[-top_n-1:][::-1]:
if idx != word_idx:
most_similar.append((idx2word[idx], similarities[idx]))
return most_similar
# Test some example words
test_words = ["king", "france", "computer", "dog"]
for word in test_words:
if word in word2idx:
similar_words = find_similar_words(word, skipgram_initial_embeddings, word2idx, idx2word)
print(f"\nWords similar to '{word} - initial embeddings':")
for similar_word, similarity in similar_words:
print(f" {similar_word}: {similarity:.4f}")
similar_words = find_similar_words(word, skipgram_learned_embeddings, word2idx, idx2word)
print(f"\nWords similar to '{word} - learned embeddings':")
for similar_word, similarity in similar_words:
print(f" {similar_word}: {similarity:.4f}")
print('---')
Learned embeddings shape: (12515, 16) Words similar to 'king - initial embeddings': semantic: 0.7762 atheistic: 0.7647 commentator: 0.7542 pot: 0.7325 advanced: 0.7275 Words similar to 'king - learned embeddings': bishop: 0.8909 john: 0.8590 res: 0.8484 edward: 0.8416 henry: 0.8413 --- Words similar to 'france - initial embeddings': estate: 0.8324 optional: 0.7671 carried: 0.7653 ufos: 0.7577 serious: 0.7473 Words similar to 'france - learned embeddings': serbia: 0.8250 baptiste: 0.8215 clifford: 0.8102 de: 0.8012 italy: 0.7880 --- Words similar to 'computer - initial embeddings': syrian: 0.8098 discusses: 0.8062 venues: 0.7890 distant: 0.7887 tiger: 0.7569 Words similar to 'computer - learned embeddings': internet: 0.8837 eleftherios: 0.8740 pages: 0.8211 convention: 0.8198 code: 0.8186 --- Words similar to 'dog - initial embeddings': frogs: 0.7953 impairments: 0.7868 dealing: 0.7792 hitherto: 0.7478 feed: 0.7392 Words similar to 'dog - learned embeddings': prairie: 0.8507 got: 0.8158 rapper: 0.8102 graduating: 0.7931 johnson: 0.7864 ---
4.3 Training the Skip-gram Model
We've now trained our Skip-gram model using negative sampling, which reframed word prediction as a series of binary classification problems.
Training Insights
Training was scheduled for 20 epochs but early stopping halted it after 14, with over 6,000 batches processed per epoch.
The model's loss decreased steadily from 0.6932 to around 0.24, which shows consistent learning over time.
Even though the loss might seem small and "accuracy" isn't tracked here, that's expected: this task isn't about exact predictions, but about learning a good embedding space.
Why? Because context can be fuzzy: there are usually many valid context words for any target. What matters is whether similar words end up with similar vectors.
What the Model Learns
Through backpropagation, the model shifts embeddings so that:
- Words that appear in similar contexts are pulled closer together
- Irrelevant or unrelated words are pushed apart
This way, even without explicit semantic labels, the model organizes the vocabulary in a way that reflects meaning.
4.4 Extracting and Interpreting Word Embeddings
After training, we extract the learned word vectors from the model's target_embedding layer.
Each word is now represented as a 16-dimensional vector (you'll typically see 100-300 dimensions in real applications, but we've chosen 16 for faster training).
What Do These Vectors Mean?
- Each word's vector encodes semantic and syntactic relationships
- Similar vectors imply similar meanings, measured via cosine similarity (see the formula after this list)
- These vectors also support semantic arithmetic like king - man + woman ≈ queen
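For reference, cosine similarity between two word vectors $u$ and $v$ is just the normalized dot product:
$$ \cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} $$
Values close to 1 mean the two vectors point in nearly the same direction; values near 0 mean they are close to orthogonal (unrelated).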
Before vs. After Training
Let's look at a few examples of nearest neighbors (by cosine similarity) before and after training:
"king"
- Before: semantic, atheistic, commentator
- After: bishop, john, edward, henry
"france"
- Before: estate, ufos, optional
- After: serbia, italy, baptiste
"computer"
- Before: syrian, venues, tiger
- After: internet, code, pages
"dog"
- Before: frogs, impairments, feed
- After: prairie, rapper, johnson
As you can see, the post-training neighbors are much more semantically relevant, indicating that our model has learned useful representations!
These embeddings are now ready for visualization, clustering, or use in downstream NLP tasks.
Up next: let's evaluate them further with intrinsic tasks and maybe even visualize the embedding space!
5. CBOW Model Implementation
5.1 The CBOW Architecture (with Negative Sampling)
The Continuous Bag of Words (CBOW) model takes a different approach from Skip-gram:
It predicts a target word from its surrounding context.
Just like with Skip-gram, we'll enhance the CBOW model with negative sampling to make it efficient and scalable.
How CBOW Works (Recap)
- For each word in the corpus, we:
- Take the surrounding context words as input
- Use them to predict the center word (the target)
- Instead of treating the context individually, we average their embeddings into a single vector
Model Structure
We've implemented CBOW with the following architecture:
- word_embedding: Learns embeddings for context words
- target_embedding: Learns embeddings for target words
- For each example:
- Compute the average of all context embeddings
- Dot it with the target word embedding (positive sample)
- Dot it with several negative samples (noise words)
These dot products are passed into a custom loss to distinguish positive vs. negative examples.
Input Format
Each training instance includes:
- Context word indices (multiple, e.g., 4 words)
- Target word index (center word)
- Negative sample indices (randomly chosen non-target words)
Loss Function
We use the same custom binary cross-entropy loss as in Skip-gram:
- The first dot product is treated as a positive example (label = 1)
- The remaining dot products are negatives (label = 0)
- The loss encourages:
- High similarity with the correct target
- Low similarity with random words
Preparing Training Data
To build the training set, we slide a window across the corpus:
contexts, targets, negatives = prepare_cbow_training_data(
filtered_words, word2idx, context_size=2, num_ns=4
)
Each row of contexts contains 4 word indices (2 before, 2 after), and for each pair we generate 4 negative samples.
We package it all into a tf.data.Dataset pipeline for efficient batching and shuffling.
Embedding Dimension
As before, we're using an embedding size of 16 for quick training.
Larger dimensions (like 100-300) are often used in production.
Model Compilation
We compile the model using:
- Adam optimizer for stable and adaptive learning
- Our custom CBOW negative sampling loss
Once compiled, we're ready to train!
class CBOWNegativeSampling(keras.Model):
def __init__(self, vocab_size, embedding_dim, context_size=4, num_ns=4):
super(CBOWNegativeSampling, self).__init__()
self.word_embedding = layers.Embedding(
vocab_size,
embedding_dim,
name="cbow_embedding",
)
self.target_embedding = layers.Embedding(
vocab_size,
embedding_dim,
name="target_embedding"
)
self.context_size = context_size
self.num_ns = num_ns
def call(self, inputs):
context_inputs, target_inputs, negative_inputs = inputs
# Get embeddings
# Shape: (batch_size, context_size, embedding_dim)
context_embedding = self.word_embedding(context_inputs)
# Shape: (batch_size, 1, embedding_dim)
target_embedding = self.target_embedding(target_inputs)
# Shape: (batch_size, num_ns, embedding_dim)
negative_embedding = self.target_embedding(negative_inputs)
# Average the context embeddings
# Shape: (batch_size, embedding_dim)
context_embedding_avg = tf.reduce_mean(context_embedding, axis=1)
# Remove the extra dimension from target
# Shape: (batch_size, embedding_dim)
target_embedding_flat = tf.squeeze(target_embedding, axis=1)
# Compute positive dot product (context_avg ยท target)
# Shape: (batch_size, 1)
positive_dots = tf.reduce_sum(
tf.multiply(context_embedding_avg, target_embedding_flat),
axis=1, keepdims=True
)
# Reshape context_avg for negative samples
# Shape: (batch_size, 1, embedding_dim)
context_for_negatives = tf.expand_dims(context_embedding_avg, axis=1)
# Compute negative dot products (context_avg ยท negative samples)
# Broadcasting handles the shapes
# Result shape: (batch_size, num_ns)
negative_dots = tf.reduce_sum(
tf.multiply(context_for_negatives, negative_embedding),
axis=2
)
# Combine positive and negative dots
# Shape: (batch_size, 1 + num_ns)
dots = tf.concat([positive_dots, negative_dots], axis=1)
return dots
# Custom loss function (same as Word2Vec)
def cbow_negative_sampling_loss(y_true, y_pred):
"""
CBOW negative sampling loss.
y_pred contains dot products for positive (first column) and negative samples (remaining columns)
"""
# Create labels: 1 for positive sample, 0 for negative samples
batch_size = tf.shape(y_pred)[0]
# Shape: (batch_size, 1 + num_ns)
labels = tf.concat([
tf.ones((batch_size, 1), dtype=tf.float32),
tf.zeros((batch_size, tf.shape(y_pred)[1] - 1), dtype=tf.float32)
], axis=1)
# Apply sigmoid cross entropy
return tf.reduce_mean(
tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=y_pred)
)
# Function to prepare CBOW training data
def prepare_cbow_training_data(corpus, word2idx, context_size=2, num_ns=4, seed=42):
"""Generate training data for CBOW with negative sampling"""
np.random.seed(seed)
# Create context-target pairs
contexts = []
targets = []
# Loop through each word in the corpus
for i in tqdm(range(context_size, len(corpus) - context_size), desc="Generating CBOW pairs"):
if corpus[i] not in word2idx:
continue
# Get target word
target = word2idx[corpus[i]]
# Get context words (context_size words on each side)
context = []
for j in range(i - context_size, i + context_size + 1):
if j != i and corpus[j] in word2idx: # Skip target itself
context.append(word2idx[corpus[j]])
# Only use examples with full context
if len(context) == context_size * 2: # Full context window
contexts.append(context)
targets.append(target)
# Convert to arrays
contexts = np.array(contexts, dtype=np.int32)
targets = np.array(targets, dtype=np.int32)[:, np.newaxis]
# Generate negative samples
negatives = []
vocab_size = len(word2idx)
for target in targets:
negative_samples = []
while len(negative_samples) < num_ns:
neg = np.random.randint(0, vocab_size)
if neg != target[0]: # Don't include the target word
negative_samples.append(neg)
negatives.append(negative_samples)
negatives = np.array(negatives, dtype=np.int32)
return contexts, targets, negatives
# Generate CBOW training data
context_size = 2 # 2 words on each side = 4 total context words
num_ns = 4 # Number of negative samples per positive example
contexts, targets, negatives = prepare_cbow_training_data(
filtered_words, word2idx, context_size=context_size, num_ns=num_ns
)
print(f"Prepared CBOW training data:")
print(f" Contexts shape: {contexts.shape} (each row contains {context_size*2} context words)")
print(f" Targets shape: {targets.shape}")
print(f" Negatives shape: {negatives.shape}")
# Create TensorFlow dataset
dataset = tf.data.Dataset.from_tensor_slices(
((contexts, targets, negatives), np.zeros(len(targets)))
)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(512)
dataset = dataset.prefetch(tf.data.AUTOTUNE)
# Create and compile model
embedding_dim = 16
model = CBOWNegativeSampling(
vocab_size, embedding_dim, context_size=context_size*2, num_ns=num_ns
)
# Compile with loss function
model.compile(
optimizer=keras.optimizers.Adam(learning_rate=0.01),
loss=cbow_negative_sampling_loss
)
# Print model summary
model.summary()
Prepared CBOW training data: Contexts shape: (791685, 4) (each row contains 4 context words) Targets shape: (791685, 1) Negatives shape: (791685, 4)
Model: "cbow_negative_sampling"
Layer (type) | Output Shape | Param #
cbow_embedding (Embedding) | ? | 0 (unbuilt)
target_embedding (Embedding) | ? | 0 (unbuilt)
Total params: 0 (0.00 B)
Trainable params: 0 (0.00 B)
Non-trainable params: 0 (0.00 B)
# Add callbacks
early_stopping = keras.callbacks.EarlyStopping(
monitor='loss', patience=5, restore_best_weights=True
)
# Train the model
print("\nTraining CBOW model with negative sampling...")
history = model.fit(
dataset,
epochs=1,
steps_per_epoch=1,
verbose=1,
callbacks=[early_stopping]
)
cbow_initial_embeddings = model.word_embedding.get_weights()[0]
history = model.fit(
dataset,
epochs=20,
verbose=1,
callbacks=[early_stopping]
)
cbow_learned_embeddings = model.word_embedding.get_weights()[0]
# Plot training history
plt.figure(figsize=(10, 6))
plt.plot(history.history['loss'])
plt.title('CBOW with Negative Sampling - Training Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.grid(True)
plt.show()
Training CBOW model with negative sampling...
1/1 - 0s 390ms/step - loss: 0.6931
Epoch 1/20 1547/1547 - 4s 2ms/step - loss: 0.3169
Epoch 2/20 1547/1547 - 3s 2ms/step - loss: 0.2334
Epoch 3/20 1547/1547 - 3s 2ms/step - loss: 0.2063
Epoch 4/20 1547/1547 - 4s 2ms/step - loss: 0.1885
Epoch 5/20 1547/1547 - 3s 2ms/step - loss: 0.1769
Epoch 6/20 1547/1547 - 3s 2ms/step - loss: 0.1692
Epoch 7/20 1547/1547 - 3s 2ms/step - loss: 0.1634
Epoch 8/20 1547/1547 - 3s 2ms/step - loss: 0.1589
Epoch 9/20 1547/1547 - 3s 2ms/step - loss: 0.1554
Epoch 10/20 1547/1547 - 3s 2ms/step - loss: 0.1524
Epoch 11/20 1547/1547 - 4s 2ms/step - loss: 0.1499
Epoch 12/20 1547/1547 - 4s 2ms/step - loss: 0.1477
Epoch 13/20 1547/1547 - 4s 2ms/step - loss: 0.1457
Epoch 14/20 1547/1547 - 3s 2ms/step - loss: 0.1441
Epoch 15/20 1547/1547 - 3s 2ms/step - loss: 0.1425
Epoch 16/20 1547/1547 - 3s 2ms/step - loss: 0.1412
Epoch 17/20 1547/1547 - 4s 2ms/step - loss: 0.1400
Epoch 18/20 1547/1547 - 6s 4ms/step - loss: 0.1387
Epoch 19/20 1547/1547 - 4s 2ms/step - loss: 0.1376
Epoch 20/20 1547/1547 - 4s 2ms/step - loss: 0.1368
# Find similar words using trained embeddings
def find_similar_words(word, embeddings, word2idx, idx2word, top_n=5):
"""Find most similar words based on cosine similarity"""
if word not in word2idx:
return []
word_idx = word2idx[word]
word_vec = embeddings[word_idx]
# Normalize vectors for cosine similarity
norm_word = np.linalg.norm(word_vec)
if norm_word > 0:
word_vec = word_vec / norm_word
# Normalize all embeddings
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
norms = np.maximum(norms, 1e-10) # Avoid division by zero
normalized_embeddings = embeddings / norms
# Calculate similarities
similarities = np.dot(normalized_embeddings, word_vec)
# Get top n similar words
most_similar = []
for idx in similarities.argsort()[-top_n-1:][::-1]:
if idx != word_idx:
most_similar.append((idx2word[idx], similarities[idx]))
return most_similar
# Test some example words
test_words = ["king", "france", "computer", "dog"]
for word in test_words:
if word in word2idx:
similar_words = find_similar_words(word, cbow_initial_embeddings, word2idx, idx2word)
print(f"\nWords similar to '{word} - initial embeddings':")
for similar_word, similarity in similar_words:
print(f" {similar_word}: {similarity:.4f}")
similar_words = find_similar_words(word, cbow_learned_embeddings, word2idx, idx2word)
print(f"\nWords similar to '{word} - learned embeddings':")
for similar_word, similarity in similar_words:
print(f" {similar_word}: {similarity:.4f}")
print('---')
Words similar to 'king - initial embeddings': allosaurus: 0.8045 tendencies: 0.7713 eighty: 0.7701 drew: 0.7664 asteraceae: 0.7544 Words similar to 'king - learned embeddings': marched: 0.8228 father: 0.8155 benedict: 0.8062 son: 0.8026 oliver: 0.7886 --- Words similar to 'france - initial embeddings': kandahar: 0.7864 candidate: 0.7847 logically: 0.7753 educated: 0.7746 revisionism: 0.7686 Words similar to 'france - learned embeddings': merchants: 0.8654 youngest: 0.8219 salisbury: 0.7975 sailed: 0.7945 mauritania: 0.7905 --- Words similar to 'computer - initial embeddings': imports: 0.8938 supplying: 0.8409 transatlantic: 0.8304 moved: 0.8095 followed: 0.7946 Words similar to 'computer - learned embeddings': anime: 0.8789 inventors: 0.8317 ary: 0.8098 animation: 0.7912 portable: 0.7875 --- Words similar to 'dog - initial embeddings': critic: 0.8114 lift: 0.7804 near: 0.7781 postulated: 0.7679 immortal: 0.7678 Words similar to 'dog - learned embeddings': mario: 0.8316 lisa: 0.8145 dead: 0.7892 germans: 0.7813 mysteries: 0.7751 ---
5.2 Training the CBOW Model
We've now trained our CBOW model using negative sampling, where the model learns to predict the center word from its surrounding context.
Training Insights
CBOW trains significantly faster than Skip-gram, in our case nearly 3x faster:
- Just 1,547 batches per epoch, compared to over 6,000 in Skip-gram
- Training completed in under 80 seconds for all 20 epochs
The loss decreased from ~0.6931 (random guess) to ~0.1368, showing smooth and effective learning across the board.
CBOW is particularly efficient when training on large corpora and works well with frequent words.
What the Model Learns
Here, the model learns to average the embeddings of surrounding words and use that signal to adjust the target word's vector:
- If a group of context words tends to co-occur with a target, the target gets pulled closer to them
- Noise samples get pushed further away, improving discrimination
This results in embeddings that capture shared semantics across similar contexts.
5.3 Extracting and Interpreting Word Embeddings
As with Skip-gram, we extract the final vectors from the model's learned embeddings.
Each word is now represented as a 16-dimensional vector, trained to reflect the context in which that word appears.
What Do These Vectors Represent?
- Encoded semantic relationships derived from context
- Similar meanings = similar vectors
- Supports analogy-style reasoning (e.g., paris - france + italy ≈ rome)
Before vs. After Training
Here's a look at how well the CBOW model captured meaningful associations, measured via cosine similarity:
"king"
- Before: allosaurus, tendencies, eighty
- After: marched, father, benedict, son
"france"
- Before: kandahar, candidate, revisionism
- After: merchants, salisbury, mauritania
"computer"
- Before: imports, supplying, transatlantic
- After: anime, inventors, animation, portable
"dog"
- Before: critic, lift, postulated
- After: mario, lisa, dead, mysteries
You can clearly see how the model reshapes the embedding space: related words move closer, and noise disappears from the top neighbors. That said, these neighbors look less semantically coherent than the ones Skip-gram produced.
These embeddings are now ready to be explored further via visualization, clustering, or downstream NLP tasks.
Let's now compare Skip-gram and CBOW embeddings more directly!
6. Intrinsic Evaluation of Word Embeddings
6.1 What Is Intrinsic Evaluation?
Now that we've trained our Skip-gram and CBOW models, it's time to inspect what they've actually learned.
Intrinsic evaluation focuses on testing the embeddings themselves, rather than plugging them into downstream tasks like classification or translation.
What Are We Evaluating?
We'll look at whether our embeddings capture:
- Word Similarity: Do semantically related words have similar vectors?
- Word Analogies: Can the model solve analogies like "king - man + woman ≈ queen"?
- Semantic Clusters: Do related words group together in embedding space?
- Visual Patterns: Do we see interpretable structures when reducing to 2D?
These tests give us direct insight into how well our model has captured the structure of language.
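The code below implements the similarity test; for the analogy test, a small solver can be layered on top of the same embeddings. The function here is a sketch (not part of the original evaluation code), assuming numpy is imported as np and that word2idx, idx2word, and the trained embedding matrices from earlier are available:
def solve_analogy(a, b, c, embeddings, word2idx, idx2word, top_n=5):
    """Answer 'a is to b as c is to ?' by ranking words near vec(b) - vec(a) + vec(c)."""
    if any(w not in word2idx for w in (a, b, c)):
        return []
    query = embeddings[word2idx[b]] - embeddings[word2idx[a]] + embeddings[word2idx[c]]
    query = query / (np.linalg.norm(query) + 1e-8)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-8
    sims = (embeddings / norms) @ query
    exclude = {word2idx[w] for w in (a, b, c)}  # never return the query words themselves
    ranked = [i for i in sims.argsort()[::-1] if i not in exclude]
    return [(idx2word[i], float(sims[i])) for i in ranked[:top_n]]
# Example (results will be rough with 16-dimensional embeddings on a small corpus):
# solve_analogy("man", "king", "woman", skipgram_learned_embeddings, word2idx, idx2word)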
Why This Matters
Intrinsic evaluations help us understand:
- Whether embeddings reflect real semantic meaning
- How different architectures (Skip-gram vs. CBOW) perform
- What types of relationships the model is good at learning
- How much the model improved over its random initialization
It's like opening up the model's brain and asking:
"How well do you understand the relationships between words?"
Let's now implement functions for each of these evaluation methods and compare the results across models.
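For the visual check, one quick option is to project a handful of word vectors to 2D with a plain PCA via numpy's SVD. This is a sketch that only assumes the np and plt imports from earlier; t-SNE or UMAP usually give nicer clusters but require extra dependencies.
def plot_words_2d(words, embeddings, word2idx, title="Word embeddings (PCA projection)"):
    """Project the selected words' vectors onto their top two principal components and plot them."""
    present = [w for w in words if w in word2idx]
    vecs = np.array([embeddings[word2idx[w]] for w in present])
    centered = vecs - vecs.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)  # rows of vt = principal directions
    coords = centered @ vt[:2].T
    plt.figure(figsize=(8, 6))
    plt.scatter(coords[:, 0], coords[:, 1])
    for (x, y), w in zip(coords, present):
        plt.annotate(w, (x, y), fontsize=9)
    plt.title(title)
    plt.grid(True)
    plt.show()
# e.g. plot_words_2d(["king", "queen", "man", "woman", "france", "paris", "germany", "berlin"],
#                    skipgram_learned_embeddings, word2idx)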
def find_similar_words(word, embedding_matrix, word2idx, idx2word, top_n=10):
"""
Find the most similar words to a given word using cosine similarity.
Args:
word: Query word
embedding_matrix: Matrix of word embeddings
word2idx: Word to index mapping
idx2word: Index to word mapping
top_n: Number of similar words to return
Returns:
List of (word, similarity) tuples
"""
if word not in word2idx:
return []
word_idx = word2idx[word]
word_embedding = embedding_matrix[word_idx]
# Calculate cosine similarity
word_embedding = word_embedding / np.linalg.norm(word_embedding) # Normalize
similarities = np.dot(embedding_matrix, word_embedding) / (
np.linalg.norm(embedding_matrix, axis=1) + 1e-8) # Avoid division by zero
# Get top similar words (excluding the query word)
similar_indices = similarities.argsort()[-top_n-1:-1][::-1]
similar_words = [(idx2word[idx], similarities[idx]) for idx in similar_indices]
return similar_words
def evaluate_word_similarity(test_words, embedding_matrices, names, word2idx, idx2word):
"""
Evaluate and compare word similarities across different embedding matrices.
Args:
test_words: List of words to test
embedding_matrices: List of embedding matrices to compare
names: Names for each embedding matrix
word2idx, idx2word: Mappings between words and indices
"""
results = []
for word in test_words:
if word not in word2idx:
print(f"'{word}' not in vocabulary, skipping")
continue
word_results = {'word': word}
for embedding_matrix, name in zip(embedding_matrices, names):
similar_words = find_similar_words(word, embedding_matrix, word2idx, idx2word)
word_results[name] = ', '.join([f"{w} ({s:.2f})" for w, s in similar_words[:5]])
results.append(word_results)
return pd.DataFrame(results)
def create_similarity_html(similarity_results):
html = "<table style='width:100%; border-collapse:collapse;'>"
# Header row
html += "<tr style='background-color:#f2f2f2;'>"
for col in similarity_results.columns:
html += f"<th style='padding:8px; text-align:center; border:1px solid #ddd;'>{col}</th>"
html += "</tr>"
# Data rows
for _, row in similarity_results.iterrows():
html += "<tr>"
for i, col in enumerate(similarity_results.columns):
value = row[col]
if i > 0: # For similarity columns
items = value.split(', ')
value = "<br>".join(items)
html += f"<td style='padding:8px; border:1px solid #ddd; vertical-align:top;'>{value}</td>"
html += "</tr>"
html += "</table>"
return html
# Test words for similarity
test_words = [
# Common nouns
"king", "woman", "man", "queen", "child",
# Countries and cities
"france", "paris", "germany", "berlin", "london",
# Verbs
"walk", "run", "eat", "think", "play",
# Adjectives
"good", "bad", "beautiful", "strong", "happy",
# Time-related
"today", "yesterday", "year", "month", "week"
]
# Compare initial vs trained embeddings
similarity_results = evaluate_word_similarity(
test_words,
[skipgram_initial_embeddings, skipgram_learned_embeddings, cbow_initial_embeddings, cbow_learned_embeddings],
["Initial Random", "Skip-gram Trained", "CBOW Initial", "CBOW Trained"],
word2idx, idx2word
)
# Display the results
print("Word Similarity Comparison:")
# Display the custom HTML table
from IPython.display import HTML
display(HTML(create_similarity_html(similarity_results)))
'yesterday' not in vocabulary, skipping Word Similarity Comparison:
word | Initial Random | Skip-gram Trained | CBOW Initial | CBOW Trained |
---|---|---|---|---|
king | semantic (0.78) atheistic (0.76) commentator (0.75) pot (0.73) advanced (0.73) | bishop (0.89) john (0.86) res (0.85) edward (0.84) henry (0.84) | allosaurus (0.80) tendencies (0.77) eighty (0.77) drew (0.77) asteraceae (0.75) | marched (0.82) father (0.82) benedict (0.81) son (0.80) oliver (0.79) |
woman | hermes (0.89) shores (0.81) halogens (0.77) positively (0.77) antecedent (0.75) | surprised (0.82) love (0.82) eros (0.80) blocked (0.80) her (0.80) | qur (0.81) issues (0.80) voltage (0.76) sickness (0.76) rent (0.74) | ptolemy (0.84) eduard (0.80) love (0.80) me (0.79) eloquence (0.78) |
man | performing (0.84) navy (0.80) increase (0.78) mass (0.76) word (0.76) | hatred (0.84) coming (0.82) contribution (0.81) seriously (0.79) ill (0.79) | kj (0.79) posted (0.78) zur (0.76) brilliant (0.75) acknowledge (0.74) | get (0.89) battles (0.83) strict (0.82) walks (0.79) teacher (0.79) |
queen | grew (0.83) allegedly (0.82) mnd (0.82) subjects (0.79) outer (0.77) | captain (0.82) nicholas (0.82) camp (0.81) paris (0.80) young (0.79) | nontheism (0.76) colonial (0.76) limits (0.76) spaced (0.74) decimal (0.73) | ulysses (0.88) william (0.84) oliver (0.82) joseph (0.82) pierce (0.82) |
child | sd (0.78) uzala (0.74) surrendered (0.74) whole (0.74) discovery (0.73) | herself (0.85) taxes (0.82) torn (0.82) messages (0.80) warned (0.78) | real (0.83) school (0.82) sent (0.76) temporary (0.75) choir (0.74) | eschatological (0.83) invented (0.82) olympias (0.82) objections (0.81) ernst (0.79) |
france | estate (0.83) optional (0.77) carried (0.77) ufos (0.76) serious (0.75) | serbia (0.83) baptiste (0.82) clifford (0.81) de (0.80) italy (0.79) | kandahar (0.79) candidate (0.78) logically (0.78) educated (0.77) revisionism (0.77) | merchants (0.87) youngest (0.82) salisbury (0.80) sailed (0.79) mauritania (0.79) |
paris | handling (0.83) languages (0.77) share (0.76) vertigo (0.76) program (0.75) | illinois (0.85) founded (0.84) benedict (0.83) julius (0.83) rhode (0.83) | gravitation (0.84) defeats (0.81) collectively (0.79) to (0.79) defines (0.77) | illinois (0.87) sociologist (0.85) statue (0.82) century (0.82) chicago (0.82) |
germany | sclerosis (0.80) synthesized (0.76) tree (0.76) pearl (0.75) readily (0.74) | harbor (0.82) occupied (0.81) tent (0.80) local (0.79) navajo (0.79) | athenians (0.79) getting (0.79) hydraulic (0.78) bourgeois (0.77) ambition (0.77) | louisiana (0.83) virginia (0.83) border (0.82) crashes (0.82) marched (0.82) |
berlin | unpopular (0.80) memory (0.77) phylogenetic (0.76) certainly (0.73) procedure (0.73) | weimar (0.87) publication (0.81) das (0.79) norway (0.78) liberation (0.78) | smithsonian (0.78) launched (0.76) subdivision (0.74) councils (0.74) alternatives (0.74) | corinth (0.81) commemoration (0.81) rating (0.81) administration (0.81) tribune (0.80) |
london | formulas (0.83) tactic (0.76) coincide (0.75) derleth (0.75) remained (0.75) | august (0.90) shortly (0.86) chapter (0.83) ad (0.82) lightfoot (0.81) | brakes (0.82) ken (0.79) aerospace (0.77) meditation (0.77) executed (0.76) | thomson (0.89) maine (0.86) slams (0.82) gettysburg (0.81) tony (0.81) |
walk | quarter (0.81) moral (0.77) animator (0.77) create (0.75) couples (0.75) | probable (0.84) wishing (0.81) servants (0.80) men (0.79) temporal (0.79) | simpsons (0.80) sink (0.78) supreme (0.77) bligh (0.75) europeans (0.74) | abbey (0.91) federation (0.82) intellectuals (0.82) chaos (0.82) wiesenthal (0.81) |
run | author (0.92) politically (0.87) richardson (0.83) logan (0.81) sparsely (0.78) | greatly (0.91) summit (0.89) low (0.81) stores (0.79) aisles (0.78) | minds (0.78) winnie (0.75) motions (0.75) faire (0.74) metro (0.74) | goto (0.82) camera (0.81) car (0.81) game (0.80) mtv (0.80) |
eat | similarly (0.84) uruguay (0.82) musician (0.80) proudhon (0.79) youthful (0.75) | arise (0.84) coke (0.79) pleas (0.78) mind (0.77) mel (0.77) | rectus (0.79) lieutenant (0.76) exchanged (0.76) abbreviated (0.75) asceticism (0.75) | distilled (0.83) leaf (0.81) progresses (0.81) disbelief (0.80) spiritual (0.79) |
think | arches (0.83) joint (0.76) compilation (0.75) going (0.74) academy (0.73) | anyone (0.87) ghetto (0.85) shocked (0.82) prove (0.81) dispute (0.81) | definitive (0.84) leaving (0.81) opposite (0.81) survives (0.77) narrative (0.76) | parade (0.87) nothing (0.83) expression (0.82) stirner (0.82) lorenzo (0.80) |
play | orange (0.82) crowd (0.78) problematic (0.76) sentiments (0.76) negative (0.75) | hour (0.85) drivers (0.84) event (0.83) meant (0.80) shift (0.79) | flying (0.83) radio (0.80) addressed (0.78) faa (0.78) hydride (0.77) | trucks (0.89) singles (0.88) match (0.86) climb (0.79) step (0.78) |
good | collections (0.77) ecology (0.77) coloured (0.77) cartoons (0.75) pdas (0.74) | effectively (0.88) without (0.86) act (0.83) use (0.81) termed (0.80) | recommend (0.77) joachim (0.76) niger (0.75) emigration (0.74) wrongly (0.73) | measure (0.77) eros (0.77) want (0.77) trouble (0.76) recognise (0.76) |
bad | mathbf (0.81) circa (0.78) urgell (0.77) blocking (0.76) bruno (0.75) | aware (0.88) architectures (0.83) faithful (0.83) consequences (0.82) farmer (0.82) | pigs (0.85) calculations (0.82) akm (0.80) papyrus (0.76) visiting (0.76) | dvd (0.81) voltage (0.79) animator (0.78) keen (0.77) desire (0.75) |
beautiful | returned (0.79) thousand (0.78) photon (0.74) scientists (0.73) beta (0.73) | devon (0.85) fine (0.85) turned (0.85) beneath (0.83) glands (0.82) | anatoly (0.81) charleston (0.78) shores (0.75) consisting (0.73) sketch (0.73) | stretching (0.83) unnecessary (0.81) resolve (0.78) crushed (0.77) active (0.77) |
strong | threatening (0.85) therein (0.80) virulent (0.79) investors (0.76) exceptionally (0.76) | case (0.86) nuclear (0.86) fisherman (0.83) power (0.83) seabed (0.82) | naval (0.84) princeps (0.82) object (0.78) dramatic (0.78) cow (0.77) | binding (0.87) agreements (0.86) aikidoka (0.83) hearing (0.82) autistic (0.80) |
happy | mouch (0.79) derived (0.78) eumc (0.77) album (0.76) euro (0.76) | burnside (0.87) missouri (0.83) hitler (0.82) permitted (0.82) colorado (0.82) | privileges (0.85) bi (0.81) satisfies (0.79) balloon (0.79) historians (0.77) | sanctioned (0.80) resumed (0.80) zurich (0.79) rape (0.79) myself (0.77) |
today | marx (0.76) reverence (0.76) invisible (0.76) comics (0.74) commentary (0.72) | some (0.85) well (0.85) benzene (0.85) ie (0.84) thought (0.84) | recorded (0.88) timothy (0.83) maj (0.79) appreciate (0.76) automorphism (0.76) | sahrawi (0.82) loosely (0.79) austroasiatic (0.78) valid (0.78) inconclusive (0.75) |
year | building (0.80) great (0.79) electromagnetism (0.77) typography (0.74) director (0.74) | next (0.87) round (0.84) completed (0.83) per (0.82) september (0.82) | ladder (0.85) hypothetical (0.82) intuitive (0.81) revival (0.77) akshara (0.77) | days (0.84) propagation (0.80) mild (0.80) acquiring (0.80) minutes (0.79) |
month | motive (0.78) tells (0.77) abortion (0.72) expressions (0.72) fullback (0.71) | breach (0.86) hall (0.85) exceeded (0.83) wavelength (0.82) mining (0.80) | creating (0.87) gide (0.86) routledge (0.83) formerly (0.77) iii (0.76) | anatoly (0.83) brings (0.79) banquet (0.78) mali (0.76) hits (0.74) |
week | overall (0.83) beating (0.82) lucius (0.78) rare (0.76) waiting (0.75) | successes (0.80) celebrating (0.78) month (0.76) flood (0.76) headquartered (0.76) | convince (0.80) brooke (0.77) famous (0.76) invasions (0.76) pamphlet (0.76) | assessments (0.86) passion (0.84) fascination (0.84) marries (0.81) masters (0.80) |
6.2 Word Similarity Evaluation
The table above compares the top 5 most similar words (based on cosine similarity) for each of our test words across four embedding sets; a minimal sketch of the underlying nearest-neighbor lookup follows the list:
- Initial Random Embeddings: Before any training (baseline)
- Skip-gram Trained Embeddings
- CBOW Initial Embeddings
- CBOW Trained Embeddings
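For reference, the nearest-neighbor lookup behind this table boils down to ranking the vocabulary by cosine similarity. Here is a minimal sketch (a hypothetical most_similar helper, not the exact evaluate_word_similarity used above; it assumes the embedding matrices, word2idx, and idx2word from the earlier notebooks):
def most_similar(word, embedding_matrix, word2idx, idx2word, top_n=5):
    # Rank all vocabulary words by cosine similarity to the query word
    if word not in word2idx:
        return []
    vec = embedding_matrix[word2idx[word]]
    vec = vec / (np.linalg.norm(vec) + 1e-8)                 # normalize the query vector
    norms = np.linalg.norm(embedding_matrix, axis=1) + 1e-8  # row norms of all embeddings
    sims = embedding_matrix @ vec / norms                    # cosine similarity to every word
    ranked = sims.argsort()[::-1]                            # most similar first
    return [(idx2word[i], sims[i]) for i in ranked if idx2word[i] != word][:top_n]

# Example: most_similar("king", skipgram_learned_embeddings, word2idx, idx2word)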
Key Takeaways
1. Initial vs. Trained Embeddings
- The initial embeddings (both for Skip-gram and CBOW) show no meaningful semantic grouping: top matches are random or nonsensical (e.g., "woman" → "halogens", "queen" → "mnd").
- After training, both models clearly organize the space: "king" is now surrounded by names like "edward" and "henry", "france" by "serbia" and "mauritania", and so on.
2. Skip-gram vs. CBOW
- Skip-gram often captures sharper, more specific associations: "france" → "serbia" (0.83), "baptiste" (0.82), "italy" (0.79); "berlin" → "weimar" (0.87), "das" (0.79), "norway" (0.78), "liberation" (0.78).
- CBOW finds more broadly related concepts, with smoother clusters and more general co-occurrence patterns.
- In some cases, CBOW even returns slightly more abstract matches (e.g., "year" → "days", "minutes", "propagation").
Overall Evaluation
Both models have successfully transformed random vectors into semantically meaningful embeddings, even with limited training data. With a larger corpus and more epochs, we could expect noticeably better results. The Skip-gram model leans toward precise relationships, while CBOW captures softer semantic themes.
This confirms that even with a small embedding dimension and moderate training time, our models learn to encode real-world language patterns, simply by observing word co-occurrence.
Up next: let's push further with analogies and visualizations!
def word_analogy(word1, word2, word3, embedding_matrix, word2idx, idx2word, top_n=5):
"""
Solve word analogies like "king - man + woman = queen".
Args:
word1, word2, word3: Words in the analogy "word1 - word2 + word3 = ?"
embedding_matrix: Matrix of word embeddings
word2idx: Word to index mapping
idx2word: Index to word mapping
top_n: Number of results to return
Returns:
List of (word, similarity) tuples
"""
for word in [word1, word2, word3]:
if word not in word2idx:
print(f"Word '{word}' not in vocabulary")
return []
# Get word vectors
vec1 = embedding_matrix[word2idx[word1]]
vec2 = embedding_matrix[word2idx[word2]]
vec3 = embedding_matrix[word2idx[word3]]
# Calculate target vector: vec1 - vec2 + vec3
target_vec = vec1 - vec2 + vec3
target_vec = target_vec / np.linalg.norm(target_vec) # Normalize
# Calculate similarities
similarities = np.dot(embedding_matrix, target_vec) / (
np.linalg.norm(embedding_matrix, axis=1) + 1e-8)
# Get top results (excluding input words)
results = []
sorted_indices = similarities.argsort()[::-1]
for idx in sorted_indices:
word = idx2word[idx]
if word != word1 and word != word2 and word != word3:
results.append((word, similarities[idx]))
if len(results) >= top_n:
break
return results
def evaluate_word_analogies(analogies, embedding_matrices, names, word2idx, idx2word):
"""
Evaluate word analogies across different embedding matrices.
Args:
analogies: List of (word1, word2, word3, expected) tuples
embedding_matrices: List of embedding matrices to compare
names: Names for each embedding matrix
word2idx, idx2word: Mappings between words and indices
"""
results = []
for word1, word2, word3, expected in analogies:
analogy_str = f"{word1} - {word2} + {word3} = ?"
result = {'analogy': analogy_str, 'expected': expected}
for embedding_matrix, name in zip(embedding_matrices, names):
# Skip initial random embeddings for analogies (they won't be meaningful)
if 'Initial' in name:
result[name] = 'N/A'
continue
answers = word_analogy(word1, word2, word3, embedding_matrix, word2idx, idx2word)
if answers:
# Check if expected word is in top results
expected_in_top = any(word == expected for word, _ in answers)
top_answers = ', '.join([f"{w} ({s:.2f})" for w, s in answers[:3]])
result[name] = top_answers + (" ✓" if expected_in_top else "")
else:
result[name] = "Words not in vocabulary"
results.append(result)
return pd.DataFrame(results)
# Test analogies
analogies = [
# Capital cities
("france", "paris", "italy", "rome"),
("germany", "berlin", "england", "london"),
# Gender relationships
("king", "man", "woman", "queen"),
("man", "he", "she", "woman"),
# Verb tenses and comparatives
("walking", "walk", "run", "running"),
("good", "better", "bad", "worse"),
# Semantic relationships
("small", "smaller", "big", "bigger"),
("eat", "food", "drink", "water"),
# Comparative and superlative
("good", "best", "bad", "worst"),
("big", "biggest", "small", "smallest")
]
# Evaluate analogies on trained embeddings
analogy_results = evaluate_word_analogies(
analogies,
[skipgram_learned_embeddings, cbow_learned_embeddings],
["Skip-gram", "CBOW"],
word2idx, idx2word
)
# Display the results
print("Word Analogy Evaluation:")
display(analogy_results)
Word Analogy Evaluation:
| | analogy | expected | Skip-gram | CBOW |
|---|---|---|---|---|
0 | france - paris + italy = ? | rome | emirates (0.80), expelled (0.77), farmers (0.76) | rescued (0.85), ruler (0.83), motto (0.83) |
1 | germany - berlin + england = ? | london | constantinople (0.81), ornaments (0.80), until... | tent (0.81), then (0.80), nile (0.79) |
2 | king - man + woman = ? | queen | henry (0.83), connecticut (0.82), signer (0.80) | rider (0.84), satrap (0.83), martyr (0.79) |
3 | man - he + she = ? | woman | aphrodite (0.86), loved (0.86), adonis (0.84) | get (0.82), sura (0.77), going (0.76) |
4 | walking - walk + run = ? | running | indicator (0.74), slopes (0.70), photos (0.69) | autobiographical (0.69), filmed (0.66), kalash... |
5 | good - better + bad = ? | worse | architectures (0.92), maia (0.86), addicted (0... | obsession (0.78), gibson (0.77), chair (0.77) |
6 | small - smaller + big = ? | bigger | fourth (0.79), couple (0.79), fly (0.78) | stewart (0.82), leopold (0.80), clarke (0.78) |
7 | eat - food + drink = ? | water | drinking (0.82), coke (0.80), lead (0.78) | gothic (0.89), blockers (0.83), composers (0.82) |
8 | good - best + bad = ? | worst | tired (0.85), admiration (0.84), entity (0.81) | cease (0.84), glorious (0.84), altruist (0.83) |
9 | big - biggest + small = ? | smallest | long (0.86), photon (0.80), aminoacids (0.79) | recounts (0.82), christie (0.78), granite (0.71) |
6.3 Word Analogy Evaluation
Word analogy tasks go beyond simple similarity: they test whether embeddings capture relational structure in a way that allows consistent vector arithmetic.
The classic example is "king - man + woman = ?", where the ideal answer is "queen".
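With the word_analogy helper defined above, this exact analogy can be run against the trained Skip-gram embeddings (an illustrative call; the neighbors it returns depend on the training run):
# Illustrative sanity check of the classic analogy on the trained Skip-gram embeddings
for word, score in word_analogy("king", "man", "woman", skipgram_learned_embeddings, word2idx, idx2word):
    print(f"{word}: {score:.2f}")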
What Are We Testing?
These evaluations check if:
- Semantic relationships are encoded as vector differences
- The same "offset" (e.g., gender, country-capital) applies across different word pairs (a quick check of this idea appears right after the list)
- The model can retrieve the correct word via nearest-neighbor search in the embedding space
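The shared-offset idea can also be probed directly by comparing two relation vectors. Below is a minimal sketch (a hypothetical relation_cosine helper, assuming the trained embedding matrices and word2idx from above):
def relation_cosine(pair_a, pair_b, embedding_matrix, word2idx):
    # Cosine similarity between the offset vectors of two word pairs,
    # e.g. ("paris", "france") vs ("berlin", "germany")
    (a1, a2), (b1, b2) = pair_a, pair_b
    if any(w not in word2idx for w in (a1, a2, b1, b2)):
        return None
    off_a = embedding_matrix[word2idx[a1]] - embedding_matrix[word2idx[a2]]
    off_b = embedding_matrix[word2idx[b1]] - embedding_matrix[word2idx[b2]]
    return float(np.dot(off_a, off_b) / (np.linalg.norm(off_a) * np.linalg.norm(off_b) + 1e-8))

# If both pairs encode the same capital-country relation, this should be clearly positive:
# relation_cosine(("paris", "france"), ("berlin", "germany"), skipgram_learned_embeddings, word2idx)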
Key Observations from Our Results
1. Models Learn Some Relational Structure
- Both Skip-gram and CBOW capture certain relationships, though the expected word does not always appear at the top.
- Examples like "eat - food + drink" returning "drinking" (Skip-gram) suggest partial relational understanding.
- Analogies involving verb tenses or comparatives are harder, but the models sometimes retrieve related concepts (e.g., "good - best + bad" → "tired" or "cease").
2. Skip-gram vs. CBOW
- Skip-gram tends to perform better on analogies, especially for precise relationships like gender or verb forms.
- CBOW retrieves more abstract or loosely related terms, occasionally hitting the right direction but missing specificity.
- Neither model solves all analogies perfectly, which is expected given the data size and the low embedding dimensionality (16d).
3. Limitations to Keep in Mind
- The vocabulary is limited, so many ideal targets (e.g., "rome", "queen", "running") may be missing or underrepresented.
- Embedding quality improves with larger corpora, longer training, and higher dimensions.
- Still, these results give us a powerful diagnostic tool to evaluate what the model is capturing.
Conclusion
While we're not seeing state-of-the-art analogy solving here, this evaluation demonstrates that even small Word2Vec models can encode meaningful relationships:
- Some analogies are partially captured
- Embeddings reflect directional meaning shifts
- We now have a way to benchmark embedding quality intrinsically (see the accuracy sketch below)
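One way to compress this intrinsic benchmark into a single number is a top-N analogy accuracy over the test set. Here is a sketch (a hypothetical analogy_accuracy helper that reuses the analogies list and the word_analogy function defined above):
def analogy_accuracy(analogies, embedding_matrix, word2idx, idx2word, top_n=5):
    # Fraction of analogies whose expected word appears among the top_n predictions
    hits, attempted = 0, 0
    for w1, w2, w3, expected in analogies:
        answers = word_analogy(w1, w2, w3, embedding_matrix, word2idx, idx2word, top_n=top_n)
        if not answers:
            continue  # skip analogies with out-of-vocabulary words
        attempted += 1
        if any(word == expected for word, _ in answers):
            hits += 1
    return hits / attempted if attempted else 0.0

# Example comparison of the two trained models on the same analogy set:
# print("Skip-gram:", analogy_accuracy(analogies, skipgram_learned_embeddings, word2idx, idx2word))
# print("CBOW:", analogy_accuracy(analogies, cbow_learned_embeddings, word2idx, idx2word))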
As a next step, we can visualize the embedding space to get an intuitive sense of the structure that these analogies hint at.
def plot_semantic_clusters(words_by_category, embedding_matrix, word2idx, title):
"""
Plot words from different semantic categories in 2D space.
Args:
words_by_category: Dictionary mapping categories to lists of words
embedding_matrix: Word embedding matrix
word2idx: Word to index mapping
title: Plot title
"""
from sklearn.manifold import TSNE
import matplotlib.colors as mcolors
# Colors for different categories
colors = list(mcolors.TABLEAU_COLORS)
# Collect embeddings and labels for words that are in vocabulary
embeddings = []
labels = []
categories = []
for category, words in words_by_category.items():
for word in words:
if word in word2idx:
embeddings.append(embedding_matrix[word2idx[word]])
labels.append(word)
categories.append(category)
if not embeddings:
print("No words found in vocabulary")
return
# Convert to array
embeddings = np.array(embeddings)
# Use t-SNE to reduce to 2 dimensions
tsne = TSNE(n_components=2, random_state=42, perplexity=min(30, len(embeddings)-1))
reduced_embeddings = tsne.fit_transform(embeddings)
# Plot
plt.figure(figsize=(12, 8))
category_handles = []
# Get unique categories and sort them
unique_categories = sorted(set(categories))
for i, category in enumerate(unique_categories):
color = colors[i % len(colors)]
indices = [j for j, cat in enumerate(categories) if cat == category]
scatter = plt.scatter(
reduced_embeddings[indices, 0],
reduced_embeddings[indices, 1],
label=category,
color=color,
alpha=0.7
)
category_handles.append(scatter)
# Add labels for each point
for j in indices:
plt.annotate(
labels[j],
(reduced_embeddings[j, 0], reduced_embeddings[j, 1]),
xytext=(5, 2),
textcoords='offset points',
fontsize=9
)
plt.title(title)
plt.legend(handles=category_handles, loc='best')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()
# Define semantic categories for testing
semantic_categories = {
'Countries': ['france', 'germany', 'italy', 'spain', 'england', 'russia', 'china', 'japan', 'canada', 'australia'],
'Cities': ['paris', 'berlin', 'rome', 'madrid', 'london', 'moscow', 'beijing', 'tokyo', 'toronto', 'sydney'],
'Numbers': ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten'],
'Family': ['father', 'mother', 'son', 'daughter', 'brother', 'sister', 'uncle', 'aunt', 'grandfather', 'grandmother'],
'Colors': ['red', 'blue', 'green', 'yellow', 'black', 'white', 'purple', 'orange', 'brown', 'pink'],
'Transportation': ['car', 'bus', 'train', 'plane', 'boat', 'bicycle', 'truck', 'motorcycle', 'ship', 'helicopter']
}
# Plot for initial vs trained embeddings
print("Initial Random Embeddings Clustering:")
plot_semantic_clusters(semantic_categories, skipgram_initial_embeddings, word2idx, "Initial Random Embeddings - Semantic Categories")
print("\nSkip-gram Trained Embeddings Clustering:")
plot_semantic_clusters(semantic_categories, skipgram_learned_embeddings, word2idx, "Skip-gram Trained Embeddings - Semantic Categories")
print("\nCBOW Trained Embeddings Clustering:")
plot_semantic_clusters(semantic_categories, cbow_learned_embeddings, word2idx, "CBOW Trained Embeddings - Semantic Categories")
Initial Random Embeddings Clustering:
Skip-gram Trained Embeddings Clustering:
CBOW Trained Embeddings Clustering:
6.4 Semantic Clustering Analysis
The plots above visualize how word embeddings evolve before and after training, using t-SNE to project them into 2D space. Each word belongs to a predefined semantic category (e.g., cities, family, colors), and colors indicate groupings.
What We're Looking For:
- Category Clustering: Do words from the same group stay close together?
- Semantic Proximity: Are related categories near each other in the space?
- Transformation Through Training: Does the embedding space become more meaningful? (A small numeric check follows this list.)
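As a numeric complement to the plots, we can compare the average cosine similarity within categories against the average across categories; trained embeddings should show a clear gap between the two. A minimal sketch (a hypothetical category_cohesion helper, assuming semantic_categories, word2idx, and the embedding matrices defined above):
def category_cohesion(words_by_category, embedding_matrix, word2idx):
    # Average within-category vs. across-category cosine similarity
    cats = {}
    for cat, words in words_by_category.items():
        vecs = [embedding_matrix[word2idx[w]] for w in words if w in word2idx]
        if len(vecs) > 1:  # need at least two in-vocabulary words per category
            vecs = np.array(vecs)
            cats[cat] = vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-8)
    if len(cats) < 2:
        return None  # not enough categories to compare
    intra, inter = [], []
    names = list(cats)
    for i, a in enumerate(names):
        sims = cats[a] @ cats[a].T
        intra.append((sims.sum() - np.trace(sims)) / (sims.size - len(sims)))  # off-diagonal mean
        for b in names[i + 1:]:
            inter.append((cats[a] @ cats[b].T).mean())
    return float(np.mean(intra)), float(np.mean(inter))

# Example: category_cohesion(semantic_categories, skipgram_learned_embeddings, word2idx)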
Initial Random Embeddings
- Words are randomly scattered, with no visible structure
- No evidence of clustering by meaning or category
- Semantic categories are completely blended and unorganized
Skip-gram Trained Embeddings
- Clear clustering by semantic category (e.g., countries, numbers, colors, transportation)
- Related categories (e.g., cities and countries) are positioned near each other
- Numbers form a tight group: "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"
- Shows sharper boundaries between categories, reflecting Skip-gram's tendency toward more precise local relationships
CBOW Trained Embeddings
- Strong semantic grouping, though with slightly more blended transitions between clusters
- Categories like colors, family, and transportation still stand out
- Cities and countries are often close, but with softer separation than Skip-gram
- This reflects CBOW's nature: averaging over the context window produces smoother, more globally coherent embeddings
Takeaway
These plots visually confirm that:
- Training drastically transforms the embedding space from random noise to meaningful structure
- Both models learn semantics from context alone
- Proximity now encodes meaning, and category membership becomes visually detectable
This clustering is a strong sign that our Word2Vec models have successfully learned to represent semantic relationships in vector space.