Word2Vec Implementation from Scratch
Table of Contents
- Introduction
- Data Preparation
- Creating Training Data
- Skip Gram with Negative Sampling
- CBOW with Negative Sampling
- Intrinsic Evaluation
Each section lives in its own notebook and links back here when needed. Let's dive in!
1. Introduction to Word2Vec
1.1 What Are Word Embeddings?
Word embeddings are a way to represent words as dense vectors in a continuous space, where semantically similar words are close together.
Unlike one-hot encoding (where each word is just a binary blip in a huge sparse space), embeddings capture meaning and relationships between words.
Key features:
- Dense vectors: Typically 50-300 dimensions (we'll use 16 for faster training)
- Learned from data: Capture semantic relationships automatically
- Contextual similarity: Words used in similar contexts have similar vectors
- Algebraic properties: Support meaningful vector operations (e.g., king - man + woman ≈ queen; see the sketch after this list)
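To make the "algebraic properties" point concrete, here is a tiny sketch of that vector arithmetic using made-up 3-dimensional vectors. The numbers are purely illustrative and are not part of the pipeline built later in this notebook; a real model learns these values.
import numpy as np
# Toy 3-dimensional "embeddings" for illustration only
toy_emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.9, 0.0]),
    "woman": np.array([0.5, 0.1, 0.0]),
    "queen": np.array([0.9, 0.0, 0.1]),
}
def toy_cosine(u, v):
    # Cosine similarity: normalized dot product
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
# king - man + woman should land closest to queen in a well-trained space
query = toy_emb["king"] - toy_emb["man"] + toy_emb["woman"]
print(max(toy_emb, key=lambda w: toy_cosine(query, toy_emb[w])))  # -> queen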
1.2 Word2Vec Architectures
Word2Vec (Mikolov et al., 2013) introduced two neural models for learning these embeddings: Skip-gram and CBOW. Let's break them down:
Skip-gram
- Objective: Predict context words from the target word
- Input: A single target word
- Output: Probability distribution over possible context words
- Training pairs: (target_word, context_word)
- Strengths:
- Performs well with small training datasets
- Better at representing rare words
- Can capture multiple meanings per word
CBOW (Continuous Bag of Words)
- Objective: Predict target word from context words
- Input: Multiple context words (averaged together)
- Output: Probability distribution over possible target words
- Training pairs: (context_words, target_word)
- Strengths:
- Trains faster than Skip-gram
- Slightly better accuracy for frequent words
- Smoother word representations
1.3 Why Build Word2Vec Ourselves?
While pre-trained embeddings like GloVe or Word2Vec are readily available, building Word2Vec from scratch offers several benefits:
- Understanding the mechanics: See exactly how embeddings arise from word co-occurrence patterns
- Customization: Tailor the architecture and hyperparameters to specific needs
- Domain adaptation: Train on domain-specific text for specialized embeddings
- Educational value: Gain insights into neural network training and optimization
In the following notebooks, we'll implement both Skip-gram and CBOW architectures and evaluate their performance on various intrinsic tasks. We'll use a smaller embedding dimension (16 instead of the typical 300) to speed up training while still capturing meaningful semantic relationships.
2. Preparing the Data
2.1 Loading and Preprocessing Text
The quality of word embeddings depends heavily on the quality and quantity of training data. For this implementation, we're using the Text8 dataset, which contains cleaned Wikipedia text.
The preprocessing steps are crucial:
- Tokenization: Split the text into words
- Case normalization: Convert all text to lowercase (already done in Text8)
- Special character handling: Remove or replace special characters
- Handling rare words: Decide how to treat infrequent terms
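Text8 is already lowercased with punctuation removed, so the loading code below simply reads the file and splits on whitespace. For raw text you would normally run a small cleaning pass first; the function below is a minimal illustrative sketch of such a pass and is not used anywhere else in this notebook.
import re
def simple_clean(raw_text):
    """Lowercase, replace anything that is not a letter with a space, then tokenize."""
    lowered = raw_text.lower()
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)  # punctuation and digits become spaces
    return letters_only.split()
print(simple_clean("Hello, World! 42 times."))  # ['hello', 'world', 'times']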
Let's examine our dataset and perform the necessary preprocessing:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
import re
import random
from tqdm.notebook import tqdm
# Download and load a text corpus (smaller dataset for faster processing)
!wget -nc http://mattmahoney.net/dc/text8.zip
!unzip -o text8.zip
# Load the data
with open('text8', 'r') as f:
text = f.read()[:5000000]
print(f"Total characters: {len(text)}")
print(f"First 500 characters: {text[:500]}")
File 'text8.zip' already there; not retrieving. Archive: text8.zip inflating: text8 Total characters: 5000000 First 500 characters: anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as a positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism as a political philoso
2.2 Building the Vocabulary
Natural language isn't fair: a few words get used a lot, and most words barely show up. This phenomenon is known as Zipf's Law: word frequency follows a power-law distribution.
You can see this in the frequency plots below: a small number of words (like "the", "and", "of") dominate, while the majority appear rarely.
This skewed distribution creates a few challenges:
- Common words flood the training process but don't add much semantic insight
- Rare words don't appear enough to learn meaningful embeddings
- A huge vocabulary makes training slower and memory-heavy
Our Strategy
To make training more efficient and meaningful, we'll:
- Remove very rare words: filter out those below a minimum frequency
- (Optionally) Subsample very frequent words to reduce their impact during training (a sketch of the standard heuristic follows below)
- Build a fast lookup table that maps each word to a unique index for efficient computation
With this cleaned-up vocabulary, we're ready to turn raw text into training-ready data!
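For reference, the optional subsampling step mentioned above is usually implemented with the heuristic from the original Word2Vec paper: each occurrence of word w is kept with probability roughly sqrt(t / f(w)), where f(w) is the word's relative frequency and t is a small threshold such as 1e-5. The sketch below shows that idea; we do not apply it in this notebook.
import numpy as np
from collections import Counter
def subsample_frequent(tokens, t=1e-5, seed=0):
    """Keep each occurrence of word w with probability min(1, sqrt(t / f(w)))."""
    rng = np.random.default_rng(seed)
    counts = Counter(tokens)
    total = len(tokens)
    keep_prob = {w: min(1.0, np.sqrt(t / (c / total))) for w, c in counts.items()}
    return [w for w in tokens if rng.random() < keep_prob[w]]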
# Tokenize the text
words = text.split()
print(f"Total words: {len(words)}")
print(f"First 20 words: {words[:20]}")
# Build vocabulary
word_counts = Counter(words)
print(f"Unique words: {len(word_counts)}")
# Plot word frequency distribution
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
word_freq = sorted(word_counts.values(), reverse=True)
plt.plot(word_freq[:100])
plt.title('Top 100 Word Frequencies')
plt.xlabel('Word Rank')
plt.ylabel('Frequency')
plt.grid(True)
plt.subplot(1, 2, 2)
plt.plot(word_freq)
plt.title('All Word Frequencies (Log Scale)')
plt.xlabel('Word Rank')
plt.ylabel('Frequency')
plt.yscale('log')
plt.xscale('log')
plt.grid(True)
plt.tight_layout()
plt.show()
Total words: 846987 First 20 words: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english'] Unique words: 47046
2.3 Filtering the Vocabulary
To keep things manageable, we've set a minimum frequency threshold of 5, meaning any word that appears fewer than 5 times gets filtered out.
This has a big impact: our vocabulary size drops significantly, but we still keep good coverage of the dataset.
Why This Matters
There's a tradeoff here:
- Higher threshold → smaller vocab = faster training, but more "unknown" words
- Lower threshold → better coverage, but slower training and noisy embeddings for rare words
In real-world applications, thresholds like 20-100 are common. But for our learning-focused setup, 5 hits a nice sweet spot.
Even with this filter, our vocabulary still covers 90%+ of the text, proof that a small set of words carries most of the meaning!
# Filter vocabulary based on word frequency
min_count = 5 # Minimum word frequency
vocabulary = {word: count for word, count in word_counts.items() if count >= min_count}
print(f"Vocabulary size after filtering: {len(vocabulary)}")
# Analyze the coverage of our vocabulary
total_word_count = len(words)
covered_word_count = sum(word_counts[word] for word in vocabulary)
coverage_percentage = (covered_word_count / total_word_count) * 100
print(f"Original vocabulary size: {len(word_counts)}")
print(f"Filtered vocabulary size: {len(vocabulary)}")
print(f"Words retained: {covered_word_count:,} out of {total_word_count:,} ({coverage_percentage:.2f}%)")
# Plot the effect of different min_count thresholds
thresholds = [1, 2, 5, 10, 20, 50, 100]
vocab_sizes = []
coverage_percentages = []
for threshold in thresholds:
filtered_vocab = {word: count for word, count in word_counts.items() if count >= threshold}
vocab_sizes.append(len(filtered_vocab))
covered = sum(word_counts[word] for word in filtered_vocab)
coverage_percentages.append((covered / total_word_count) * 100)
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(thresholds, vocab_sizes, 'o-')
plt.title('Vocabulary Size vs Min Count')
plt.xlabel('Minimum Count Threshold')
plt.ylabel('Vocabulary Size')
plt.grid(True)
plt.subplot(1, 2, 2)
plt.plot(thresholds, coverage_percentages, 'o-')
plt.title('Text Coverage vs Min Count')
plt.xlabel('Minimum Count Threshold')
plt.ylabel('Coverage (%)')
plt.grid(True)
plt.tight_layout()
plt.show()
Vocabulary size after filtering: 12515 Original vocabulary size: 47046 Filtered vocabulary size: 12515 Words retained: 791,689 out of 846,987 (93.47%)
2.4 Creating Word Mappings
After filtering, we've reduced the vocabulary from 47,046 words down to 12,515, a ~73% reduction!
Yet we've still retained 93.47% of the total words in the text (791,689 out of 846,987).
This shows just how concentrated language usage is: most of the meaning comes from a relatively small set of words.
To work with our Word2Vec model, we now need two key mappings:
word2idx
- Maps each word to a unique integer index
- Used to turn words into numbers for training
- Enables fast lookup during model computation
idx2word
- Maps each index back to the original word
- Useful for inspecting predictions and visualizations
- Helps convert model outputs into readable text
Why this matters:
- Vocabulary size determines the size of the model's output layer
- It directly affects training speed, memory usage, and overall efficiency
We'll use these mappings to generate training pairs for both the Skip-gram and CBOW architectures.
# Create word-to-index and index-to-word mappings
word2idx = {word: i for i, word in enumerate(vocabulary.keys())}
idx2word = {i: word for word, i in word2idx.items()}
vocab_size = len(word2idx)
# Filter the corpus to include only words in our vocabulary
filtered_words = [word for word in words if word in vocabulary]
print(f"Words retained after filtering: {len(filtered_words)} ({len(filtered_words)/len(words):.2%})")
Words retained after filtering: 791689 (93.47%)
3. Creating Training Examples
3.1 Understanding the Context Window
Word2Vec learns meaning from context, and the idea is simple:
Words that appear in similar surroundings tend to have similar meanings.
To capture this, we use a sliding context window to extract training pairs from our text.
How it works:
- For each word in the corpus (the target word)
- We look at a window of surrounding words (the context words)
- The window size controls how many words before and after are included
- For every target word, we pair it with each context word within that window
Example (Window Size = 2)
Text: "the cat sat on the mat"
Target word: "sat"
Context window: ["the", "cat", "on", "the"]
Training pairs:
("sat", "the") ("sat", "cat") ("sat", "on") ("sat", "the")
We'll generate these pairs for every word in our (filtered) corpus. This is the training data that powers both Skip-gram and CBOW models.
Up next: we'll implement the code that creates these training examples.
# Demonstrate the context window
def visualize_context_window(text, window_size=2):
words = text.split()
examples = []
for i in range(len(words)):
target = words[i]
# Calculate context range (window_size words on each side)
context_start = max(0, i - window_size)
context_end = min(len(words), i + window_size + 1)
# Get context words (excluding the target)
context = words[context_start:i] + words[i+1:context_end]
examples.append((target, context))
return examples
# Example text
sample_text = "the quick brown fox jumps over the lazy dog"
examples = visualize_context_window(sample_text, window_size=2)
# Create a visualization
plt.figure(figsize=(8, 10))
plt.title("Context Window Visualization (Window Size = 2)")
plt.axis('off')
for i, (target, context) in enumerate(examples):
# Display target word
plt.text(0.3, -i/len(examples), target, fontsize=10,
ha='center', bbox=dict(facecolor='lightblue', alpha=0.5, pad=5))
# Display context words
for j, word in enumerate(context):
x_offset = (j * 0.2)
plt.text(x_offset, -i/len(examples), word, fontsize=10,
ha='center', bbox=dict(facecolor='lightyellow', alpha=0.5, pad=5))
plt.tight_layout()
plt.show()
3.2 Generating Skip-gram Training Pairs
In the Skip-gram model, our goal is to predict context words from a given target word.
That means for every target word, we create multiple training pairs, one for each context word that appears within a sliding window.
How It Works:
- Slide a context window over the corpus
- For each word at the center (our target)...
- Grab the words around it within the window (these are the context words)
- Create training pairs in the form (target_word, context_word)
Example
If our window size is 2 and the center word is "sat", we generate:
("sat", "the") ("sat", "cat") ("sat", "on") ("sat", "the")
Pros & Tradeoffs
- Pro: Great for rare words and capturing subtle semantic differences
- Con: It produces lots of training examples, which can slow down training
Let's implement this logic in code and prepare the dataset for training our Skip-gram model.
def generate_skipgram_pairs(corpus, window_size, word2idx):
"""
Generate training pairs for the Skip-gram model.
Args:
corpus: List of tokenized words
window_size: Number of words to consider as context on each side
word2idx: Word to index mapping
Returns:
pairs: List of (target_idx, context_idx) pairs
"""
pairs = []
# Loop through each word in corpus
for i in tqdm(range(len(corpus)), desc="Generating Skip-gram pairs"):
target = word2idx[corpus[i]]
# Get context words within the window
context_start = max(0, i - window_size)
context_end = min(len(corpus), i + window_size + 1)
for j in range(context_start, context_end):
if i != j: # Skip the target word itself
if corpus[j] in word2idx: # Check if context word is in vocabulary
context = word2idx[corpus[j]]
pairs.append((target, context))
return np.array(pairs)
# Generate Skip-gram training pairs
window_size = 2 # Context window of 2 words on each side
skipgram_pairs = generate_skipgram_pairs(filtered_words, window_size, word2idx)
print(f"Generated {len(skipgram_pairs)} Skip-gram training pairs")
print(f"Sample pairs (word indices):")
for i in range(5):
target_idx, context_idx = skipgram_pairs[i]
print(f" Target: {idx2word[target_idx]} ({target_idx}) โ Context: {idx2word[context_idx]} ({context_idx})")
Generated 3166750 Skip-gram training pairs Sample pairs (word indices): Target: anarchism (0) → Context: originated (1) Target: anarchism (0) → Context: as (2) Target: originated (1) → Context: anarchism (0) Target: originated (1) → Context: as (2) Target: originated (1) → Context: a (3)
3.3 Generating CBOW Training Pairs
The CBOW (Continuous Bag of Words) model flips the Skip-gram idea on its head:
Instead of predicting context from a word, we predict the target word from its context.
This means we group the surrounding words together and try to guess the word in the middle.
How It Works:
- Slide a context window over the text
- For each position, treat the center word as the target
- Take the surrounding words as the context
- Create a training example like (context_words, target_word)
Example
With a window size of 2 and the center word "sat":
Context: ["the", "cat", "on", "the"] Target: "sat" Training pair: (["the", "cat", "on", "the"], "sat")
Pros & Tradeoffs
- Pro: Faster training than Skip-gram
- Pro: Performs well on frequent words
- Con: Might struggle with rare or ambiguous words
Let's now implement the CBOW training data generation logic in code.
def generate_cbow_pairs(corpus, window_size, word2idx):
"""
Generate training pairs for the CBOW model.
Args:
corpus: List of tokenized words
window_size: Number of words to consider as context on each side
word2idx: Word to index mapping
Returns:
X: Context word indices (each row contains 2*window_size indices)
y: Target word indices
"""
X = [] # Context words
y = [] # Target words
# Loop through each word in corpus
for i in tqdm(range(len(corpus)), desc="Generating CBOW pairs"):
if corpus[i] not in word2idx:
continue
target = word2idx[corpus[i]]
# Get context words within the window
context_start = max(0, i - window_size)
context_end = min(len(corpus), i + window_size + 1)
# Get context word indices
context_indices = []
for j in range(context_start, context_end):
if i != j and corpus[j] in word2idx: # Skip target & ensure word is in vocabulary
context_indices.append(word2idx[corpus[j]])
# Only use examples with sufficient context
if len(context_indices) == 2 * window_size: # Full context window
X.append(context_indices)
y.append(target)
return np.array(X), np.array(y)
# Generate CBOW training pairs
window_size = 2 # Context window of 2 words on each side
X_cbow, y_cbow = generate_cbow_pairs(filtered_words, window_size, word2idx)
print(f"Generated {len(X_cbow)} CBOW training examples")
print(f"Shape of X_cbow: {X_cbow.shape}")
print(f"Shape of y_cbow: {y_cbow.shape}")
# Show sample CBOW training examples
print("\nSample CBOW examples:")
for i in range(5):
context_indices = X_cbow[i]
target_idx = y_cbow[i]
context_words = [idx2word[idx] for idx in context_indices]
target_word = idx2word[target_idx]
print(f" Context: {context_words} โ Target: {target_word}")
Generated 791685 CBOW training examples Shape of X_cbow: (791685, 4) Shape of y_cbow: (791685,) Sample CBOW examples: Context: ['anarchism', 'originated', 'a', 'term'] → Target: as Context: ['originated', 'as', 'term', 'of'] → Target: a Context: ['as', 'a', 'of', 'abuse'] → Target: term Context: ['a', 'term', 'abuse', 'first'] → Target: of Context: ['term', 'of', 'first', 'used'] → Target: abuse
3.4 Comparing Skip-gram vs. CBOW
Before we move on, let's pause and compare the two training approaches side by side:
Aspect | Skip-gram | CBOW |
---|---|---|
Prediction Direction | Target → Context | Context → Target |
Input | One word | Multiple words |
Output | One context word at a time | One target word |
Training Examples | More (one per context word) | Fewer (one per target) |
Training Speed | Slower | Faster |
Good for Rare Words | Yes | Not as good |
Captures Word Senses | Better | Limited |
So Which One Should You Use?
Both models have their strengths:
- Skip-gram: Great for small datasets, rare words, and learning multiple meanings of words
- CBOW: Faster and often slightly better for frequent words and large corpora
In real-world use (e.g., the original Word2Vec paper), Skip-gram is usually preferred for its superior performance on semantic tasks, even if it takes a bit longer to train.
Up next: we'll define the neural architectures that turn these training pairs into powerful word embeddings!
4. Skip-gram Model Implementation
4.1 The Skip-gram Architecture with Negative Sampling
Now that we've generated our training pairs, it's time to move into modeling, starting with Skip-gram enhanced with negative sampling.
Quick Recap
Earlier, we generated Skip-gram training pairs in the vanilla style, where each (target, context) pair would be used to predict the full probability distribution over the entire vocabulary.
That's perfectly valid, but it's computationally expensive, especially with vocabularies of 10,000+ words.
So instead, we'll now switch to a more efficient and scalable approach:
Skip-gram with Negative Sampling
This requires rethinking how we generate training data and how we structure the model, but it'll make everything much faster and more practical, even on large datasets.
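Concretely, the vanilla formulation models the context probability with a softmax over the whole vocabulary, so every update has to evaluate the full denominator:
$$ P(context \mid target) = \frac{\exp(v_{context} \cdot v_{target})}{\sum_{w \in V} \exp(v_{w} \cdot v_{target})} $$
With roughly 12,500 words in our vocabulary, that sum over $V$ is exactly the cost negative sampling lets us avoid.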
What Is Negative Sampling?
Instead of trying to predict the entire vocabulary, we reframe the problem as a binary classification task:
- Is this word a true context word for the target? (a positive example)
- Or is it just a randomly sampled noise word? (a negative example)
How It Works:
- For each positive pair (target, context)
- Sample a few negative pairs: words that are not actual context
- Train the model to:
- Push positive pairs closer in embedding space
- Push negative pairs apart
Why Use It?
- Faster training: Only a few dot products per example instead of tens of thousands
- Better results: Especially for rare words and semantic similarity
- Scalable: Works well even for massive corpora
Training Objective
For each training pair, we maximize the following objective (its negation is the loss we minimize):
$$ \log(\sigma(v_{target} \cdot v_{context})) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}[\log(\sigma(-v_{target} \cdot v_{w_i}))] $$
Where:
- $\sigma$ = sigmoid function
- $v_{target}$ = embedding of the center (input) word
- $v_{context}$ = embedding of a true context word
- $v_{w_i}$ = embeddings of negative samples
- $k$ = number of negative samples
- $P_n(w)$ = noise distribution (usually unigram freq raised to the 3/4 power)
Note: Since our earlier training pairs were built for the vanilla Skip-gram model, we'll now generate a new version of the training dataset tailored for negative sampling. The implementation will be a bit trickier, but far more efficient.
Let's dive in.
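To make the formula concrete, here is a small numpy sketch that evaluates this objective (negated, as the loss we minimize) for one training example, using random vectors in place of learned embeddings. It is illustrative only; the Keras loss further below computes the same per-pair terms via sigmoid cross-entropy, averaged rather than summed.
import numpy as np
rng = np.random.default_rng(0)
dim, k = 16, 4                              # embedding size and number of negative samples
v_target = rng.normal(size=dim)             # stand-in for the center word's embedding
v_context = rng.normal(size=dim)            # stand-in for one true context word
v_negatives = rng.normal(size=(k, dim))     # stand-ins for k sampled noise words
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
# loss = -[ log sigma(t . c) + sum_i log sigma(-t . n_i) ]
positive_term = np.log(sigmoid(v_target @ v_context))
negative_term = np.sum(np.log(sigmoid(-v_negatives @ v_target)))
print(f"Negative-sampling loss for one example: {-(positive_term + negative_term):.4f}")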
4.2 Implementing the Skip-gram Model with Negative Sampling
Now that we understand the theory behind Skip-gram with negative sampling, let's implement it in TensorFlow/Keras!
Input Format
Our model takes in three components:
- Target word index (center word)
- Positive context word index
- Negative word indices (randomly sampled "noise" words)
Model Structure
We've built a custom Keras model with the following components:
- target_embedding: Learns vector representations for input (center) words
- context_embedding: Learns vector representations for context words (used for both positive and negative samples)
- Dot products are computed between target and context embeddings:
- Positive dot product for real context words
- Negative dot products for sampled non-context words
These are passed to a custom loss function that encourages the model to:
- Assign high similarity to real (target, context) pairs
- Assign low similarity to randomly sampled negatives
Loss Function
We're using a custom negative sampling loss, which:
- Applies sigmoid cross-entropy
- Labels the first value in each output vector as 1 (positive pair)
- Labels all others as 0 (negatives)
This simplifies our objective into a set of binary classification tasks per training example.
Preparing Training Data
Because negative sampling changes our training format, we regenerate the dataset using a helper function:
targets, contexts, negatives = prepare_training_data(skipgram_pairs, vocab_size, num_ns=4)
We then wrap it all in a tf.data.Dataset pipeline with:
- Batching
- Shuffling
- Prefetching for performance
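One simplification worth flagging: the helper used below draws negative indices uniformly at random over the vocabulary. The original paper instead samples them from the unigram distribution raised to the 3/4 power. A minimal sketch of that alternative, reusing word_counts, idx2word, and vocab_size from Section 2 (not wired into the model code that follows):
import numpy as np
# Unigram frequencies of the in-vocabulary words, in index order
freqs = np.array([word_counts[idx2word[i]] for i in range(vocab_size)], dtype=np.float64)
noise_dist = freqs ** 0.75
noise_dist /= noise_dist.sum()
rng = np.random.default_rng(42)
sampled_negatives = rng.choice(vocab_size, size=4, replace=True, p=noise_dist)
print(sampled_negatives)  # four word indices, biased toward (but not dominated by) frequent words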
Embedding Dimension
We're using a small embedding size of 16 dimensions to keep training fast during experimentation.
(In practice, 100-300 dimensions are more common.)
Model Compilation
We compile the model using:
- Adam optimizer for efficient gradient updates
- Our custom loss function tailored for negative sampling
Once compiled, we can start training!
Let's check out the model summary to see the architecture.
from tensorflow import keras
from keras import layers
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from tqdm.notebook import tqdm
import time
class Word2VecNegativeSampling(keras.Model):
def __init__(self, vocab_size, embedding_dim, num_ns=4):
super(Word2VecNegativeSampling, self).__init__()
self.target_embedding = layers.Embedding(
vocab_size,
embedding_dim,
name="w2v_embedding",
)
self.context_embedding = layers.Embedding(
vocab_size,
embedding_dim,
name="context_embedding"
)
self.num_ns = num_ns
def call(self, inputs):
target_inputs, context_inputs, negative_inputs = inputs
# Get embeddings
# Shape: (batch_size, 1, embedding_dim)
target_embedding = self.target_embedding(target_inputs)
# Shape: (batch_size, 1, embedding_dim)
context_embedding = self.context_embedding(context_inputs)
# Shape: (batch_size, num_ns, embedding_dim)
negative_embedding = self.context_embedding(negative_inputs)
# Remove the extra dimension from target and context
# Shape: (batch_size, embedding_dim)
target_embedding_flat = tf.squeeze(target_embedding, axis=1)
# Shape: (batch_size, embedding_dim)
context_embedding_flat = tf.squeeze(context_embedding, axis=1)
# Compute positive dot product (target ยท context)
# Shape: (batch_size, 1)
positive_dots = tf.reduce_sum(
tf.multiply(target_embedding_flat, context_embedding_flat),
axis=1, keepdims=True
)
# Reshape target for negative samples
# Shape: (batch_size, 1, embedding_dim)
target_for_negatives = tf.expand_dims(target_embedding_flat, axis=1)
# Compute negative dot products (target ยท negative samples)
# Broadcasting handles the shapes: (batch_size, 1, embedding_dim) * (batch_size, num_ns, embedding_dim)
# Result shape: (batch_size, num_ns)
negative_dots = tf.reduce_sum(
tf.multiply(target_for_negatives, negative_embedding),
axis=2
)
# Combine positive and negative dots
# Shape: (batch_size, 1 + num_ns)
dots = tf.concat([positive_dots, negative_dots], axis=1)
return dots
# Custom loss function to handle the combined logits
def w2v_negative_sampling_loss(y_true, y_pred):
"""
Word2Vec negative sampling loss.
y_pred contains dot products for positive (first column) and negative samples (remaining columns)
"""
# Create labels: 1 for positive sample, 0 for negative samples
batch_size = tf.shape(y_pred)[0]
# Shape: (batch_size, 1 + num_ns)
labels = tf.concat([
tf.ones((batch_size, 1), dtype=tf.float32),
tf.zeros((batch_size, tf.shape(y_pred)[1] - 1), dtype=tf.float32)
], axis=1)
# Apply sigmoid cross entropy
return tf.reduce_mean(
tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=y_pred)
)
# Function to prepare training data
def prepare_training_data(skipgram_pairs, vocab_size, num_ns=4, seed=42):
"""Generate properly formatted data for negative sampling"""
np.random.seed(seed)
# Extract target and context words
targets, contexts = skipgram_pairs[:, 0], skipgram_pairs[:, 1]
# Generate negative samples
negatives = []
for target, context in zip(targets, contexts):
negative_samples = []
while len(negative_samples) < num_ns:
neg = np.random.randint(0, vocab_size)
if neg != target and neg != context:
negative_samples.append(neg)
negatives.append(negative_samples)
# Convert to arrays and reshape
targets = np.array(targets, dtype=np.int32)[:, np.newaxis]
contexts = np.array(contexts, dtype=np.int32)[:, np.newaxis]
negatives = np.array(negatives, dtype=np.int32)
return targets, contexts, negatives
# Prepare the training data (assuming skipgram_pairs is already defined)
targets, contexts, negatives = prepare_training_data(
skipgram_pairs, vocab_size, num_ns=4
)
print(f"Prepared training data:")
print(f" Targets shape: {targets.shape}")
print(f" Contexts shape: {contexts.shape}")
print(f" Negatives shape: {negatives.shape}")
# Create TensorFlow dataset
dataset = tf.data.Dataset.from_tensor_slices(
((targets, contexts, negatives), np.zeros(len(targets)))
)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(512)
dataset = dataset.prefetch(tf.data.AUTOTUNE)
# Create and compile model
embedding_dim = 16
num_ns = 4
model = Word2VecNegativeSampling(vocab_size, embedding_dim, num_ns)
# Compile with loss function
model.compile(
optimizer=keras.optimizers.Adam(learning_rate=0.01),
loss=w2v_negative_sampling_loss
)
model.summary()
Prepared training data: Targets shape: (3166750, 1) Contexts shape: (3166750, 1) Negatives shape: (3166750, 4)
Model: "word2_vec_negative_sampling"
Layer (type) | Output Shape | Param #
w2v_embedding (Embedding) | ? | 0 (unbuilt)
context_embedding (Embedding) | ? | 0 (unbuilt)
Total params: 0 (0.00 B)
Trainable params: 0 (0.00 B)
Non-trainable params: 0 (0.00 B)
# Train model with early stopping
early_stopping = tf.keras.callbacks.EarlyStopping(
monitor='loss', patience=5, restore_best_weights=True
)
print("Beginning training...")
history = model.fit(
dataset,
steps_per_epoch=1,
epochs=1,
verbose=1,
callbacks=[early_stopping]
)
skipgram_initial_embeddings = model.target_embedding.get_weights()[0]
history = model.fit(
dataset,
epochs=20,
verbose=1,
callbacks=[early_stopping]
)
# Plot loss history
plt.figure(figsize=(10, 6))
plt.plot(history.history['loss'])
plt.title('Word2Vec Negative Sampling Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.grid(True)
plt.show()
Beginning training...
1/1 - 0s 389ms/step - loss: 0.6932
Epoch 1/20 6186/6186 - 14s 2ms/step - loss: 0.2820
Epoch 2/20 6186/6186 - 14s 2ms/step - loss: 0.2471
Epoch 3/20 6186/6186 - 15s 2ms/step - loss: 0.2424
Epoch 4/20 6186/6186 - 15s 2ms/step - loss: 0.2408
Epoch 5/20 6186/6186 - 13s 2ms/step - loss: 0.2402
Epoch 6/20 6186/6186 - 13s 2ms/step - loss: 0.2398
Epoch 7/20 6186/6186 - 13s 2ms/step - loss: 0.2396
Epoch 8/20 6186/6186 - 13s 2ms/step - loss: 0.2397
Epoch 9/20 6186/6186 - 13s 2ms/step - loss: 0.2394
Epoch 10/20 6186/6186 - 13s 2ms/step - loss: 0.2394
Epoch 11/20 6186/6186 - 14s 2ms/step - loss: 0.2395
Epoch 12/20 6186/6186 - 14s 2ms/step - loss: 0.2395
Epoch 13/20 6186/6186 - 14s 2ms/step - loss: 0.2398
Epoch 14/20 6186/6186 - 14s 2ms/step - loss: 0.2396
# Extract learned embeddings
skipgram_learned_embeddings = model.target_embedding.get_weights()[0]
print(f"Learned embeddings shape: {skipgram_learned_embeddings.shape}")
def find_similar_words(word, embeddings, word2idx, idx2word, top_n=5):
"""Find most similar words based on cosine similarity"""
if word not in word2idx:
return []
word_idx = word2idx[word]
word_vec = embeddings[word_idx]
# Normalize vectors for cosine similarity
norm_word = np.linalg.norm(word_vec)
if norm_word > 0:
word_vec = word_vec / norm_word
# Normalize all embeddings
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
norms = np.maximum(norms, 1e-10) # Avoid division by zero
normalized_embeddings = embeddings / norms
# Calculate similarities
similarities = np.dot(normalized_embeddings, word_vec)
# Get top n similar words
most_similar = []
for idx in similarities.argsort()[-top_n-1:][::-1]:
if idx != word_idx:
most_similar.append((idx2word[idx], similarities[idx]))
return most_similar
# Test some example words
test_words = ["king", "france", "computer", "dog"]
for word in test_words:
if word in word2idx:
similar_words = find_similar_words(word, skipgram_initial_embeddings, word2idx, idx2word)
print(f"\nWords similar to '{word} - initial embeddings':")
for similar_word, similarity in similar_words:
print(f" {similar_word}: {similarity:.4f}")
similar_words = find_similar_words(word, skipgram_learned_embeddings, word2idx, idx2word)
print(f"\nWords similar to '{word} - learned embeddings':")
for similar_word, similarity in similar_words:
print(f" {similar_word}: {similarity:.4f}")
print('---')
Learned embeddings shape: (12515, 16) Words similar to 'king - initial embeddings': semantic: 0.7762 atheistic: 0.7647 commentator: 0.7542 pot: 0.7325 advanced: 0.7275 Words similar to 'king - learned embeddings': bishop: 0.8909 john: 0.8590 res: 0.8484 edward: 0.8416 henry: 0.8413 --- Words similar to 'france - initial embeddings': estate: 0.8324 optional: 0.7671 carried: 0.7653 ufos: 0.7577 serious: 0.7473 Words similar to 'france - learned embeddings': serbia: 0.8250 baptiste: 0.8215 clifford: 0.8102 de: 0.8012 italy: 0.7880 --- Words similar to 'computer - initial embeddings': syrian: 0.8098 discusses: 0.8062 venues: 0.7890 distant: 0.7887 tiger: 0.7569 Words similar to 'computer - learned embeddings': internet: 0.8837 eleftherios: 0.8740 pages: 0.8211 convention: 0.8198 code: 0.8186 --- Words similar to 'dog - initial embeddings': frogs: 0.7953 impairments: 0.7868 dealing: 0.7792 hitherto: 0.7478 feed: 0.7392 Words similar to 'dog - learned embeddings': prairie: 0.8507 got: 0.8158 rapper: 0.8102 graduating: 0.7931 johnson: 0.7864 ---
4.3 Training the Skip-gram Model
We've now trained our Skip-gram model using negative sampling, which reframed word prediction as a series of binary classification problems.
Training Insights
Training was scheduled for 20 epochs but early stopping halted it after 14, with over 6,000 batches processed per epoch.
The model's loss decreased steadily from 0.6932 to around 0.24, which shows consistent learning over time.
Even though the loss might seem small and "accuracy" isn't tracked here, that's expected: this task isn't about exact predictions, but about learning a good embedding space.
Why? Because context can be fuzzy: there are usually many valid context words for any target. What matters is whether similar words end up with similar vectors.
What the Model Learns
Through backpropagation, the model shifts embeddings so that:
- Words that appear in similar contexts are pulled closer together
- Irrelevant or unrelated words are pushed apart
This way, even without explicit semantic labels, the model organizes the vocabulary in a way that reflects meaning.
4.4 Extracting and Interpreting Word Embeddings
After training, we extract the learned word vectors from the model's target_embedding layer.
Each word is now represented as a 16-dimensional vector (you'll typically see 100-300 dimensions in real applications, but we've chosen 16 for faster training).
What Do These Vectors Mean?
- Each word's vector encodes semantic and syntactic relationships
- Similar vectors imply similar meanings, measured via cosine similarity (see the formula after this list)
- These vectors also support semantic arithmetic like king - man + woman ≈ queen
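For reference, cosine similarity between two word vectors $u$ and $v$ is just the normalized dot product:
$$ \cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} $$
Values close to 1 mean the two vectors point in nearly the same direction; values near 0 mean they are close to orthogonal (unrelated).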
Before vs. After Training
Let's look at a few examples of nearest neighbors (by cosine similarity) before and after training:
"king"
- Before: semantic, atheistic, commentator
- After: bishop, john, edward, henry
"france"
- Before: estate, ufos, optional
- After: serbia, italy, baptiste
"computer"
- Before: syrian, venues, tiger
- After: internet, code, pages
"dog"
- Before: frogs, impairments, feed
- After: prairie, rapper, johnson
As you can see, the post-training neighbors are much more semantically relevant, indicating that our model has learned useful representations!
These embeddings are now ready for visualization, clustering, or use in downstream NLP tasks.
Up next: let's evaluate them further with intrinsic tasks and maybe even visualize the embedding space!
5. CBOW Model Implementation
5.1 The CBOW Architecture (with Negative Sampling)
The Continuous Bag of Words (CBOW) model takes a different approach from Skip-gram:
It predicts a target word from its surrounding context.
Just like with Skip-gram, we'll enhance the CBOW model with negative sampling to make it efficient and scalable.
How CBOW Works (Recap)
- For each word in the corpus, we:
- Take the surrounding context words as input
- Use them to predict the center word (the target)
- Instead of treating the context individually, we average their embeddings into a single vector
Model Structure
We've implemented CBOW with the following architecture:
- word_embedding: Learns embeddings for context words
- target_embedding: Learns embeddings for target words
- For each example:
- Compute the average of all context embeddings
- Dot it with the target word embedding (positive sample)
- Dot it with several negative samples (noise words)
These dot products are passed into a custom loss to distinguish positive vs. negative examples.
Input Format
Each training instance includes:
- Context word indices (multiple, e.g., 4 words)
- Target word index (center word)
- Negative sample indices (randomly chosen non-target words)
Loss Function
We use the same custom binary cross-entropy loss as in Skip-gram:
- The first dot product is treated as a positive example (label = 1)
- The remaining dot products are negatives (label = 0)
- The loss encourages:
- High similarity with the correct target
- Low similarity with random words
Preparing Training Data
To build the training set, we slide a window across the corpus:
contexts, targets, negatives = prepare_cbow_training_data(
filtered_words, word2idx, context_size=2, num_ns=4
)
Each row of contexts contains 4 word indices (2 before, 2 after), and for each pair we generate 4 negative samples.
We package it all into a tf.data.Dataset pipeline for efficient batching and shuffling.
Embedding Dimension
As before, we're using an embedding size of 16 for quick training.
Larger dimensions (like 100-300) are often used in production.
Model Compilation
We compile the model using:
- Adam optimizer for stable and adaptive learning
- Our custom CBOW negative sampling loss
Once compiled, we're ready to train!
class CBOWNegativeSampling(keras.Model):
def __init__(self, vocab_size, embedding_dim, context_size=4, num_ns=4):
super(CBOWNegativeSampling, self).__init__()
self.word_embedding = layers.Embedding(
vocab_size,
embedding_dim,
name="cbow_embedding",
)
self.target_embedding = layers.Embedding(
vocab_size,
embedding_dim,
name="target_embedding"
)
self.context_size = context_size
self.num_ns = num_ns
def call(self, inputs):
context_inputs, target_inputs, negative_inputs = inputs
# Get embeddings
# Shape: (batch_size, context_size, embedding_dim)
context_embedding = self.word_embedding(context_inputs)
# Shape: (batch_size, 1, embedding_dim)
target_embedding = self.target_embedding(target_inputs)
# Shape: (batch_size, num_ns, embedding_dim)
negative_embedding = self.target_embedding(negative_inputs)
# Average the context embeddings
# Shape: (batch_size, embedding_dim)
context_embedding_avg = tf.reduce_mean(context_embedding, axis=1)
# Remove the extra dimension from target
# Shape: (batch_size, embedding_dim)
target_embedding_flat = tf.squeeze(target_embedding, axis=1)
# Compute positive dot product (context_avg ยท target)
# Shape: (batch_size, 1)
positive_dots = tf.reduce_sum(
tf.multiply(context_embedding_avg, target_embedding_flat),
axis=1, keepdims=True
)
# Reshape context_avg for negative samples
# Shape: (batch_size, 1, embedding_dim)
context_for_negatives = tf.expand_dims(context_embedding_avg, axis=1)
# Compute negative dot products (context_avg ยท negative samples)
# Broadcasting handles the shapes
# Result shape: (batch_size, num_ns)
negative_dots = tf.reduce_sum(
tf.multiply(context_for_negatives, negative_embedding),
axis=2
)
# Combine positive and negative dots
# Shape: (batch_size, 1 + num_ns)
dots = tf.concat([positive_dots, negative_dots], axis=1)
return dots
# Custom loss function (same as Word2Vec)
def cbow_negative_sampling_loss(y_true, y_pred):
"""
CBOW negative sampling loss.
y_pred contains dot products for positive (first column) and negative samples (remaining columns)
"""
# Create labels: 1 for positive sample, 0 for negative samples
batch_size = tf.shape(y_pred)[0]
# Shape: (batch_size, 1 + num_ns)
labels = tf.concat([
tf.ones((batch_size, 1), dtype=tf.float32),
tf.zeros((batch_size, tf.shape(y_pred)[1] - 1), dtype=tf.float32)
], axis=1)
# Apply sigmoid cross entropy
return tf.reduce_mean(
tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=y_pred)
)
# Function to prepare CBOW training data
def prepare_cbow_training_data(corpus, word2idx, context_size=2, num_ns=4, seed=42):
"""Generate training data for CBOW with negative sampling"""
np.random.seed(seed)
# Create context-target pairs
contexts = []
targets = []
# Loop through each word in the corpus
for i in tqdm(range(context_size, len(corpus) - context_size), desc="Generating CBOW pairs"):
if corpus[i] not in word2idx:
continue
# Get target word
target = word2idx[corpus[i]]
# Get context words (context_size words on each side)
context = []
for j in range(i - context_size, i + context_size + 1):
if j != i and corpus[j] in word2idx: # Skip target itself
context.append(word2idx[corpus[j]])
# Only use examples with full context
if len(context) == context_size * 2: # Full context window
contexts.append(context)
targets.append(target)
# Convert to arrays
contexts = np.array(contexts, dtype=np.int32)
targets = np.array(targets, dtype=np.int32)[:, np.newaxis]
# Generate negative samples
negatives = []
vocab_size = len(word2idx)
for target in targets:
negative_samples = []
while len(negative_samples) < num_ns:
neg = np.random.randint(0, vocab_size)
if neg != target[0]: # Don't include the target word
negative_samples.append(neg)
negatives.append(negative_samples)
negatives = np.array(negatives, dtype=np.int32)
return contexts, targets, negatives
# Generate CBOW training data
context_size = 2 # 2 words on each side = 4 total context words
num_ns = 4 # Number of negative samples per positive example
contexts, targets, negatives = prepare_cbow_training_data(
filtered_words, word2idx, context_size=context_size, num_ns=num_ns
)
print(f"Prepared CBOW training data:")
print(f" Contexts shape: {contexts.shape} (each row contains {context_size*2} context words)")
print(f" Targets shape: {targets.shape}")
print(f" Negatives shape: {negatives.shape}")
# Create TensorFlow dataset
dataset = tf.data.Dataset.from_tensor_slices(
((contexts, targets, negatives), np.zeros(len(targets)))
)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(512)
dataset = dataset.prefetch(tf.data.AUTOTUNE)
# Create and compile model
embedding_dim = 16
model = CBOWNegativeSampling(
vocab_size, embedding_dim, context_size=context_size*2, num_ns=num_ns
)
# Compile with loss function
model.compile(
optimizer=keras.optimizers.Adam(learning_rate=0.01),
loss=cbow_negative_sampling_loss
)
# Print model summary
model.summary()
Prepared CBOW training data: Contexts shape: (791685, 4) (each row contains 4 context words) Targets shape: (791685, 1) Negatives shape: (791685, 4)
Model: "cbow_negative_sampling"
Layer (type) | Output Shape | Param #
cbow_embedding (Embedding) | ? | 0 (unbuilt)
target_embedding (Embedding) | ? | 0 (unbuilt)
Total params: 0 (0.00 B)
Trainable params: 0 (0.00 B)
Non-trainable params: 0 (0.00 B)
# Add callbacks
early_stopping = keras.callbacks.EarlyStopping(
monitor='loss', patience=5, restore_best_weights=True
)
# Train the model
print("\nTraining CBOW model with negative sampling...")
history = model.fit(
dataset,
epochs=1,
steps_per_epoch=1,
verbose=1,
callbacks=[early_stopping]
)
cbow_initial_embeddings = model.word_embedding.get_weights()[0]
history = model.fit(
dataset,
epochs=20,
verbose=1,
callbacks=[early_stopping]
)
cbow_learned_embeddings = model.word_embedding.get_weights()[0]
# Plot training history
plt.figure(figsize=(10, 6))
plt.plot(history.history['loss'])
plt.title('CBOW with Negative Sampling - Training Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.grid(True)
plt.show()
Training CBOW model with negative sampling...
1/1 - 0s 390ms/step - loss: 0.6931
Epoch 1/20 1547/1547 - 4s 2ms/step - loss: 0.3169
Epoch 2/20 1547/1547 - 3s 2ms/step - loss: 0.2334
Epoch 3/20 1547/1547 - 3s 2ms/step - loss: 0.2063
Epoch 4/20 1547/1547 - 4s 2ms/step - loss: 0.1885
Epoch 5/20 1547/1547 - 3s 2ms/step - loss: 0.1769
Epoch 6/20 1547/1547 - 3s 2ms/step - loss: 0.1692
Epoch 7/20 1547/1547 - 3s 2ms/step - loss: 0.1634
Epoch 8/20 1547/1547 - 3s 2ms/step - loss: 0.1589
Epoch 9/20 1547/1547 - 3s 2ms/step - loss: 0.1554
Epoch 10/20 1547/1547 - 3s 2ms/step - loss: 0.1524
Epoch 11/20 1547/1547 - 4s 2ms/step - loss: 0.1499
Epoch 12/20 1547/1547 - 4s 2ms/step - loss: 0.1477
Epoch 13/20 1547/1547 - 4s 2ms/step - loss: 0.1457
Epoch 14/20 1547/1547 - 3s 2ms/step - loss: 0.1441
Epoch 15/20 1547/1547 - 3s 2ms/step - loss: 0.1425
Epoch 16/20 1547/1547 - 3s 2ms/step - loss: 0.1412
Epoch 17/20 1547/1547 - 4s 2ms/step - loss: 0.1400
Epoch 18/20 1547/1547 - 6s 4ms/step - loss: 0.1387
Epoch 19/20 1547/1547 - 4s 2ms/step - loss: 0.1376
Epoch 20/20 1547/1547 - 4s 2ms/step - loss: 0.1368
# Find similar words using trained embeddings
def find_similar_words(word, embeddings, word2idx, idx2word, top_n=5):
"""Find most similar words based on cosine similarity"""
if word not in word2idx:
return []
word_idx = word2idx[word]
word_vec = embeddings[word_idx]
# Normalize vectors for cosine similarity
norm_word = np.linalg.norm(word_vec)
if norm_word > 0:
word_vec = word_vec / norm_word
# Normalize all embeddings
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
norms = np.maximum(norms, 1e-10) # Avoid division by zero
normalized_embeddings = embeddings / norms
# Calculate similarities
similarities = np.dot(normalized_embeddings, word_vec)
# Get top n similar words
most_similar = []
for idx in similarities.argsort()[-top_n-1:][::-1]:
if idx != word_idx:
most_similar.append((idx2word[idx], similarities[idx]))
return most_similar
# Test some example words
test_words = ["king", "france", "computer", "dog"]
for word in test_words:
if word in word2idx:
similar_words = find_similar_words(word, cbow_initial_embeddings, word2idx, idx2word)
print(f"\nWords similar to '{word} - initial embeddings':")
for similar_word, similarity in similar_words:
print(f" {similar_word}: {similarity:.4f}")
similar_words = find_similar_words(word, cbow_learned_embeddings, word2idx, idx2word)
print(f"\nWords similar to '{word} - learned embeddings':")
for similar_word, similarity in similar_words:
print(f" {similar_word}: {similarity:.4f}")
print('---')
Words similar to 'king - initial embeddings': allosaurus: 0.8045 tendencies: 0.7713 eighty: 0.7701 drew: 0.7664 asteraceae: 0.7544 Words similar to 'king - learned embeddings': marched: 0.8228 father: 0.8155 benedict: 0.8062 son: 0.8026 oliver: 0.7886 --- Words similar to 'france - initial embeddings': kandahar: 0.7864 candidate: 0.7847 logically: 0.7753 educated: 0.7746 revisionism: 0.7686 Words similar to 'france - learned embeddings': merchants: 0.8654 youngest: 0.8219 salisbury: 0.7975 sailed: 0.7945 mauritania: 0.7905 --- Words similar to 'computer - initial embeddings': imports: 0.8938 supplying: 0.8409 transatlantic: 0.8304 moved: 0.8095 followed: 0.7946 Words similar to 'computer - learned embeddings': anime: 0.8789 inventors: 0.8317 ary: 0.8098 animation: 0.7912 portable: 0.7875 --- Words similar to 'dog - initial embeddings': critic: 0.8114 lift: 0.7804 near: 0.7781 postulated: 0.7679 immortal: 0.7678 Words similar to 'dog - learned embeddings': mario: 0.8316 lisa: 0.8145 dead: 0.7892 germans: 0.7813 mysteries: 0.7751 ---
5.2 Training the CBOW Model
We've now trained our CBOW model using negative sampling, where the model learns to predict the center word from its surrounding context.
Training Insights
CBOW trains significantly faster than Skip-gram, in our case nearly 3x faster:
- Just 1,547 batches per epoch, compared to over 6,000 in Skip-gram
- Training completed in under 80 seconds for all 20 epochs
The loss decreased from ~0.6931 (random guess) to ~0.1368, showing smooth and effective learning across the board.
CBOW is particularly efficient when training on large corpora and works well with frequent words.
What the Model Learns
Here, the model learns to average the embeddings of surrounding words and use that signal to adjust the target word's vector:
- If a group of context words tends to co-occur with a target, the target gets pulled closer to them
- Noise samples get pushed further away, improving discrimination
This results in embeddings that capture shared semantics across similar contexts.
5.3 Extracting and Interpreting Word Embeddings
As with Skip-gram, we extract the final vectors from the model's learned embeddings.
Each word is now represented as a 16-dimensional vector, trained to reflect the context in which that word appears.
What Do These Vectors Represent?
- Encoded semantic relationships derived from context
- Similar meanings = similar vectors
- Supports analogy-style reasoning (e.g., paris - france + italy ≈ rome)
Before vs. After Training
Here's a look at how well the CBOW model captured meaningful associations, measured via cosine similarity:
"king"
- Before: allosaurus, tendencies, eighty
- After: marched, father, benedict, son
"france"
- Before: kandahar, candidate, revisionism
- After: merchants, salisbury, mauritania
"computer"
- Before: imports, supplying, transatlantic
- After: anime, inventors, animation, portable
"dog"
- Before: critic, lift, postulated
- After: mario, lisa, dead, mysteries
You can clearly see how the model reshapes the embedding space: related words move closer, and noise disappears from the top neighbors. That said, these neighbors look less semantically coherent than the ones Skip-gram produced.
These embeddings are now ready to be explored further via visualization, clustering, or downstream NLP tasks.
Let's now compare Skip-gram and CBOW embeddings more directly!
6. Intrinsic Evaluation of Word Embeddings
6.1 What Is Intrinsic Evaluation?
Now that we've trained our Skip-gram and CBOW models, it's time to inspect what they've actually learned.
Intrinsic evaluation focuses on testing the embeddings themselves, rather than plugging them into downstream tasks like classification or translation.
What Are We Evaluating?
We'll look at whether our embeddings capture:
- Word Similarity: Do semantically related words have similar vectors?
- Word Analogies: Can the model solve analogies like "king - man + woman ≈ queen"?
- Semantic Clusters: Do related words group together in embedding space?
- Visual Patterns: Do we see interpretable structures when reducing to 2D?
These tests give us direct insight into how well our model has captured the structure of language.
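The code below implements the similarity test; for the analogy test, a small solver can be layered on top of the same embeddings. The function here is a sketch (not part of the original evaluation code), assuming numpy is imported as np and that word2idx, idx2word, and the trained embedding matrices from earlier are available:
def solve_analogy(a, b, c, embeddings, word2idx, idx2word, top_n=5):
    """Answer 'a is to b as c is to ?' by ranking words near vec(b) - vec(a) + vec(c)."""
    if any(w not in word2idx for w in (a, b, c)):
        return []
    query = embeddings[word2idx[b]] - embeddings[word2idx[a]] + embeddings[word2idx[c]]
    query = query / (np.linalg.norm(query) + 1e-8)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-8
    sims = (embeddings / norms) @ query
    exclude = {word2idx[w] for w in (a, b, c)}  # never return the query words themselves
    ranked = [i for i in sims.argsort()[::-1] if i not in exclude]
    return [(idx2word[i], float(sims[i])) for i in ranked[:top_n]]
# Example (results will be rough with 16-dimensional embeddings on a small corpus):
# solve_analogy("man", "king", "woman", skipgram_learned_embeddings, word2idx, idx2word)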
Why This Matters
Intrinsic evaluations help us understand:
- Whether embeddings reflect real semantic meaning
- How different architectures (Skip-gram vs. CBOW) perform
- What types of relationships the model is good at learning
- How much the model improved over its random initialization
It's like opening up the model's brain and asking:
"How well do you understand the relationships between words?"
Let's now implement functions for each of these evaluation methods and compare the results across models.
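For the visual check, one quick option is to project a handful of word vectors to 2D with a plain PCA via numpy's SVD. This is a sketch that only assumes the np and plt imports from earlier; t-SNE or UMAP usually give nicer clusters but require extra dependencies.
def plot_words_2d(words, embeddings, word2idx, title="Word embeddings (PCA projection)"):
    """Project the selected words' vectors onto their top two principal components and plot them."""
    present = [w for w in words if w in word2idx]
    vecs = np.array([embeddings[word2idx[w]] for w in present])
    centered = vecs - vecs.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)  # rows of vt = principal directions
    coords = centered @ vt[:2].T
    plt.figure(figsize=(8, 6))
    plt.scatter(coords[:, 0], coords[:, 1])
    for (x, y), w in zip(coords, present):
        plt.annotate(w, (x, y), fontsize=9)
    plt.title(title)
    plt.grid(True)
    plt.show()
# e.g. plot_words_2d(["king", "queen", "man", "woman", "france", "paris", "germany", "berlin"],
#                    skipgram_learned_embeddings, word2idx)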
def find_similar_words(word, embedding_matrix, word2idx, idx2word, top_n=10):
"""
Find the most similar words to a given word using cosine similarity.
Args:
word: Query word
embedding_matrix: Matrix of word embeddings
word2idx: Word to index mapping
idx2word: Index to word mapping
top_n: Number of similar words to return
Returns:
List of (word, similarity) tuples
"""
if word not in word2idx:
return []
word_idx = word2idx[word]
word_embedding = embedding_matrix[word_idx]
# Calculate cosine similarity
word_embedding = word_embedding / np.linalg.norm(word_embedding) # Normalize
similarities = np.dot(embedding_matrix, word_embedding) / (
np.linalg.norm(embedding_matrix, axis=1) + 1e-8) # Avoid division by zero
# Get top similar words (excluding the query word)
similar_indices = similarities.argsort()[-top_n-1:-1][::-1]
similar_words = [(idx2word[idx], similarities[idx]) for idx in similar_indices]
return similar_words
def evaluate_word_similarity(test_words, embedding_matrices, names, word2idx, idx2word):
"""
Evaluate and compare word similarities across different embedding matrices.
Args:
test_words: List of words to test
embedding_matrices: List of embedding matrices to compare
names: Names for each embedding matrix
word2idx, idx2word: Mappings between words and indices
"""
results = []
for word in test_words:
if word not in word2idx:
print(f"'{word}' not in vocabulary, skipping")
continue
word_results = {'word': word}
for embedding_matrix, name in zip(embedding_matrices, names):
similar_words = find_similar_words(word, embedding_matrix, word2idx, idx2word)
word_results[name] = ', '.join([f"{w} ({s:.2f})" for w, s in similar_words[:5]])
results.append(word_results)
return pd.DataFrame(results)
def create_similarity_html(similarity_results):
html = "<table style='width:100%; border-collapse:collapse;'>"
# Header row
html += "<tr style='background-color:#f2f2f2;'>"
for col in similarity_results.columns:
html += f"<th style='padding:8px; text-align:center; border:1px solid #ddd;'>{col}</th>"
html += "</tr>"
# Data rows
for _, row in similarity_results.iterrows():
html += "<tr>"
for i, col in enumerate(similarity_results.columns):
value = row[col]
if i > 0: # For similarity columns
items = value.split(', ')
value = "<br>".join(items)
html += f"<td style='padding:8px; border:1px solid #ddd; vertical-align:top;'>{value}</td>"
html += "</tr>"
html += "</table>"
return html
# Test words for similarity
test_words = [
# Common nouns
"king", "woman", "man", "queen", "child",
# Countries and cities
"france", "paris", "germany", "berlin", "london",
# Verbs
"walk", "run", "eat", "think", "play",
# Adjectives
"good", "bad", "beautiful", "strong", "happy",
# Time-related
"today", "yesterday", "year", "month", "week"
]
# Compare initial vs trained embeddings
similarity_results = evaluate_word_similarity(
test_words,
[skipgram_initial_embeddings, skipgram_learned_embeddings, cbow_initial_embeddings, cbow_learned_embeddings],
["Initial Random", "Skip-gram Trained", "CBOW Initial", "CBOW Trained"],
word2idx, idx2word
)
# Display the results
print("Word Similarity Comparison:")
# Display the custom HTML table
from IPython.display import HTML
display(HTML(create_similarity_html(similarity_results)))
'yesterday' not in vocabulary, skipping Word Similarity Comparison:
word | Initial Random | Skip-gram Trained | CBOW Initial | CBOW Trained |
---|---|---|---|---|
king | semantic (0.78) atheistic (0.76) commentator (0.75) pot (0.73) advanced (0.73) | bishop (0.89) john (0.86) res (0.85) edward (0.84) henry (0.84) | allosaurus (0.80) tendencies (0.77) eighty (0.77) drew (0.77) asteraceae (0.75) | marched (0.82) father (0.82) benedict (0.81) son (0.80) oliver (0.79) |
woman | hermes (0.89) shores (0.81) halogens (0.77) positively (0.77) antecedent (0.75) | surprised (0.82) love (0.82) eros (0.80) blocked (0.80) her (0.80) | qur (0.81) issues (0.80) voltage (0.76) sickness (0.76) rent (0.74) | ptolemy (0.84) eduard (0.80) love (0.80) me (0.79) eloquence (0.78) |
man | performing (0.84) navy (0.80) increase (0.78) mass (0.76) word (0.76) | hatred (0.84) coming (0.82) contribution (0.81) seriously (0.79) ill (0.79) | kj (0.79) posted (0.78) zur (0.76) brilliant (0.75) acknowledge (0.74) | get (0.89) battles (0.83) strict (0.82) walks (0.79) teacher (0.79) |
queen | grew (0.83) allegedly (0.82) mnd (0.82) subjects (0.79) outer (0.77) | captain (0.82) nicholas (0.82) camp (0.81) paris (0.80) young (0.79) | nontheism (0.76) colonial (0.76) limits (0.76) spaced (0.74) decimal (0.73) | ulysses (0.88) william (0.84) oliver (0.82) joseph (0.82) pierce (0.82) |
child | sd (0.78) uzala (0.74) surrendered (0.74) whole (0.74) discovery (0.73) | herself (0.85) taxes (0.82) torn (0.82) messages (0.80) warned (0.78) | real (0.83) school (0.82) sent (0.76) temporary (0.75) choir (0.74) | eschatological (0.83) invented (0.82) olympias (0.82) objections (0.81) ernst (0.79) |
france | estate (0.83) optional (0.77) carried (0.77) ufos (0.76) serious (0.75) | serbia (0.83) baptiste (0.82) clifford (0.81) de (0.80) italy (0.79) | kandahar (0.79) candidate (0.78) logically (0.78) educated (0.77) revisionism (0.77) | merchants (0.87) youngest (0.82) salisbury (0.80) sailed (0.79) mauritania (0.79) |
paris | handling (0.83) languages (0.77) share (0.76) vertigo (0.76) program (0.75) | illinois (0.85) founded (0.84) benedict (0.83) julius (0.83) rhode (0.83) | gravitation (0.84) defeats (0.81) collectively (0.79) to (0.79) defines (0.77) | illinois (0.87) sociologist (0.85) statue (0.82) century (0.82) chicago (0.82) |
germany | sclerosis (0.80) synthesized (0.76) tree (0.76) pearl (0.75) readily (0.74) | harbor (0.82) occupied (0.81) tent (0.80) local (0.79) navajo (0.79) | athenians (0.79) getting (0.79) hydraulic (0.78) bourgeois (0.77) ambition (0.77) | louisiana (0.83) virginia (0.83) border (0.82) crashes (0.82) marched (0.82) |
berlin | unpopular (0.80) memory (0.77) phylogenetic (0.76) certainly (0.73) procedure (0.73) | weimar (0.87) publication (0.81) das (0.79) norway (0.78) liberation (0.78) | smithsonian (0.78) launched (0.76) subdivision (0.74) councils (0.74) alternatives (0.74) | corinth (0.81) commemoration (0.81) rating (0.81) administration (0.81) tribune (0.80) |
london | formulas (0.83) tactic (0.76) coincide (0.75) derleth (0.75) remained (0.75) | august (0.90) shortly (0.86) chapter (0.83) ad (0.82) lightfoot (0.81) | brakes (0.82) ken (0.79) aerospace (0.77) meditation (0.77) executed (0.76) | thomson (0.89) maine (0.86) slams (0.82) gettysburg (0.81) tony (0.81) |
walk | quarter (0.81) moral (0.77) animator (0.77) create (0.75) couples (0.75) | probable (0.84) wishing (0.81) servants (0.80) men (0.79) temporal (0.79) | simpsons (0.80) sink (0.78) supreme (0.77) bligh (0.75) europeans (0.74) | abbey (0.91) federation (0.82) intellectuals (0.82) chaos (0.82) wiesenthal (0.81) |
run | author (0.92) politically (0.87) richardson (0.83) logan (0.81) sparsely (0.78) | greatly (0.91) summit (0.89) low (0.81) stores (0.79) aisles (0.78) | minds (0.78) winnie (0.75) motions (0.75) faire (0.74) metro (0.74) | goto (0.82) camera (0.81) car (0.81) game (0.80) mtv (0.80) |
eat | similarly (0.84) uruguay (0.82) musician (0.80) proudhon (0.79) youthful (0.75) | arise (0.84) coke (0.79) pleas (0.78) mind (0.77) mel (0.77) | rectus (0.79) lieutenant (0.76) exchanged (0.76) abbreviated (0.75) asceticism (0.75) | distilled (0.83) leaf (0.81) progresses (0.81) disbelief (0.80) spiritual (0.79) |
think | arches (0.83) joint (0.76) compilation (0.75) going (0.74) academy (0.73) | anyone (0.87) ghetto (0.85) shocked (0.82) prove (0.81) dispute (0.81) | definitive (0.84) leaving (0.81) opposite (0.81) survives (0.77) narrative (0.76) | parade (0.87) nothing (0.83) expression (0.82) stirner (0.82) lorenzo (0.80) |
play | orange (0.82) crowd (0.78) problematic (0.76) sentiments (0.76) negative (0.75) | hour (0.85) drivers (0.84) event (0.83) meant (0.80) shift (0.79) | flying (0.83) radio (0.80) addressed (0.78) faa (0.78) hydride (0.77) | trucks (0.89) singles (0.88) match (0.86) climb (0.79) step (0.78) |
good | collections (0.77) ecology (0.77) coloured (0.77) cartoons (0.75) pdas (0.74) | effectively (0.88) without (0.86) act (0.83) use (0.81) termed (0.80) | recommend (0.77) joachim (0.76) niger (0.75) emigration (0.74) wrongly (0.73) | measure (0.77) eros (0.77) want (0.77) trouble (0.76) recognise (0.76) |
bad | mathbf (0.81) circa (0.78) urgell (0.77) blocking (0.76) bruno (0.75) | aware (0.88) architectures (0.83) faithful (0.83) consequences (0.82) farmer (0.82) | pigs (0.85) calculations (0.82) akm (0.80) papyrus (0.76) visiting (0.76) | dvd (0.81) voltage (0.79) animator (0.78) keen (0.77) desire (0.75) |
beautiful | returned (0.79) thousand (0.78) photon (0.74) scientists (0.73) beta (0.73) | devon (0.85) fine (0.85) turned (0.85) beneath (0.83) glands (0.82) | anatoly (0.81) charleston (0.78) shores (0.75) consisting (0.73) sketch (0.73) | stretching (0.83) unnecessary (0.81) resolve (0.78) crushed (0.77) active (0.77) |
strong | threatening (0.85) therein (0.80) virulent (0.79) investors (0.76) exceptionally (0.76) | case (0.86) nuclear (0.86) fisherman (0.83) power (0.83) seabed (0.82) | naval (0.84) princeps (0.82) object (0.78) dramatic (0.78) cow (0.77) | binding (0.87) agreements (0.86) aikidoka (0.83) hearing (0.82) autistic (0.80) |
happy | mouch (0.79) derived (0.78) eumc (0.77) album (0.76) euro (0.76) | burnside (0.87) missouri (0.83) hitler (0.82) permitted (0.82) colorado (0.82) | privileges (0.85) bi (0.81) satisfies (0.79) balloon (0.79) historians (0.77) | sanctioned (0.80) resumed (0.80) zurich (0.79) rape (0.79) myself (0.77) |
today | marx (0.76) reverence (0.76) invisible (0.76) comics (0.74) commentary (0.72) | some (0.85) well (0.85) benzene (0.85) ie (0.84) thought (0.84) | recorded (0.88) timothy (0.83) maj (0.79) appreciate (0.76) automorphism (0.76) | sahrawi (0.82) loosely (0.79) austroasiatic (0.78) valid (0.78) inconclusive (0.75) |
year | building (0.80) great (0.79) electromagnetism (0.77) typography (0.74) director (0.74) | next (0.87) round (0.84) completed (0.83) per (0.82) september (0.82) | ladder (0.85) hypothetical (0.82) intuitive (0.81) revival (0.77) akshara (0.77) | days (0.84) propagation (0.80) mild (0.80) acquiring (0.80) minutes (0.79) |
month | motive (0.78) tells (0.77) abortion (0.72) expressions (0.72) fullback (0.71) | breach (0.86) hall (0.85) exceeded (0.83) wavelength (0.82) mining (0.80) | creating (0.87) gide (0.86) routledge (0.83) formerly (0.77) iii (0.76) | anatoly (0.83) brings (0.79) banquet (0.78) mali (0.76) hits (0.74) |
week | overall (0.83) beating (0.82) lucius (0.78) rare (0.76) waiting (0.75) | successes (0.80) celebrating (0.78) month (0.76) flood (0.76) headquartered (0.76) | convince (0.80) brooke (0.77) famous (0.76) invasions (0.76) pamphlet (0.76) | assessments (0.86) passion (0.84) fascination (0.84) marries (0.81) masters (0.80) |
6.2 Word Similarity Evaluation
The table above compares the top 5 most similar words (based on cosine similarity) for each of our test words across four embedding sets; a minimal sketch of the underlying nearest-neighbor lookup follows the list:
- Initial Random Embeddings: Before any training (baseline)
- Skip-gram Trained Embeddings
- CBOW Initial Embeddings
- CBOW Trained Embeddings
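For reference, the nearest-neighbor lookup behind this table boils down to ranking the vocabulary by cosine similarity. Here is a minimal sketch (a hypothetical most_similar helper, not the exact evaluate_word_similarity used above; it assumes the embedding matrices, word2idx, and idx2word from the earlier notebooks):
def most_similar(word, embedding_matrix, word2idx, idx2word, top_n=5):
    # Rank all vocabulary words by cosine similarity to the query word
    if word not in word2idx:
        return []
    vec = embedding_matrix[word2idx[word]]
    vec = vec / (np.linalg.norm(vec) + 1e-8)                 # normalize the query vector
    norms = np.linalg.norm(embedding_matrix, axis=1) + 1e-8  # row norms of all embeddings
    sims = embedding_matrix @ vec / norms                    # cosine similarity to every word
    ranked = sims.argsort()[::-1]                            # most similar first
    return [(idx2word[i], sims[i]) for i in ranked if idx2word[i] != word][:top_n]

# Example: most_similar("king", skipgram_learned_embeddings, word2idx, idx2word)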
Key Takeaways
1. Initial vs. Trained Embeddings
- The initial embeddings (both for Skip-gram and CBOW) show no meaningful semantic grouping: top matches are random or nonsensical (e.g., "woman" → "halogens", "queen" → "mnd").
- After training, both models clearly organize the space: "king" is now surrounded by names like "edward" and "henry", "france" by "serbia" and "mauritania", and so on.
2. Skip-gram vs. CBOW
- Skip-gram often captures sharper, more specific associations: "france" → "serbia" (0.83), "baptiste" (0.82), "italy" (0.79); "berlin" → "weimar" (0.87), "das" (0.79), "norway" (0.78), "liberation" (0.78).
- CBOW finds more broadly related concepts, with smoother clusters and more general co-occurrence patterns.
- In some cases, CBOW even returns slightly more abstract matches (e.g., "year" → "days", "minutes", "propagation").
Overall Evaluation
Both models have successfully transformed random vectors into semantically meaningful embeddings, even with limited training data. With a larger corpus and more epochs, we could expect noticeably better results. The Skip-gram model leans toward precise relationships, while CBOW captures softer semantic themes.
This confirms that even with a small embedding dimension and moderate training time, our models learn to encode real-world language patterns, simply by observing word co-occurrence.
Up next: let's push further with analogies and visualizations!
def word_analogy(word1, word2, word3, embedding_matrix, word2idx, idx2word, top_n=5):
"""
Solve word analogies like "king - man + woman = queen".
Args:
word1, word2, word3: Words in the analogy "word1 - word2 + word3 = ?"
embedding_matrix: Matrix of word embeddings
word2idx: Word to index mapping
idx2word: Index to word mapping
top_n: Number of results to return
Returns:
List of (word, similarity) tuples
"""
for word in [word1, word2, word3]:
if word not in word2idx:
print(f"Word '{word}' not in vocabulary")
return []
# Get word vectors
vec1 = embedding_matrix[word2idx[word1]]
vec2 = embedding_matrix[word2idx[word2]]
vec3 = embedding_matrix[word2idx[word3]]
# Calculate target vector: vec1 - vec2 + vec3
target_vec = vec1 - vec2 + vec3
target_vec = target_vec / np.linalg.norm(target_vec) # Normalize
# Calculate similarities
similarities = np.dot(embedding_matrix, target_vec) / (
np.linalg.norm(embedding_matrix, axis=1) + 1e-8)
# Get top results (excluding input words)
results = []
sorted_indices = similarities.argsort()[::-1]
for idx in sorted_indices:
word = idx2word[idx]
if word != word1 and word != word2 and word != word3:
results.append((word, similarities[idx]))
if len(results) >= top_n:
break
return results
def evaluate_word_analogies(analogies, embedding_matrices, names, word2idx, idx2word):
"""
Evaluate word analogies across different embedding matrices.
Args:
analogies: List of (word1, word2, word3, expected) tuples
embedding_matrices: List of embedding matrices to compare
names: Names for each embedding matrix
word2idx, idx2word: Mappings between words and indices
"""
results = []
for word1, word2, word3, expected in analogies:
analogy_str = f"{word1} - {word2} + {word3} = ?"
result = {'analogy': analogy_str, 'expected': expected}
for embedding_matrix, name in zip(embedding_matrices, names):
# Skip initial random embeddings for analogies (they won't be meaningful)
if 'Initial' in name:
result[name] = 'N/A'
continue
answers = word_analogy(word1, word2, word3, embedding_matrix, word2idx, idx2word)
if answers:
# Check if expected word is in top results
expected_in_top = any(word == expected for word, _ in answers)
top_answers = ', '.join([f"{w} ({s:.2f})" for w, s in answers[:3]])
result[name] = top_answers + (" ✓" if expected_in_top else "")
else:
result[name] = "Words not in vocabulary"
results.append(result)
return pd.DataFrame(results)
# Test analogies
analogies = [
# Capital cities
("france", "paris", "italy", "rome"),
("germany", "berlin", "england", "london"),
# Gender relationships
("king", "man", "woman", "queen"),
("man", "he", "she", "woman"),
# Verb tenses and comparatives
("walking", "walk", "run", "running"),
("good", "better", "bad", "worse"),
# Semantic relationships
("small", "smaller", "big", "bigger"),
("eat", "food", "drink", "water"),
# Comparative and superlative
("good", "best", "bad", "worst"),
("big", "biggest", "small", "smallest")
]
# Evaluate analogies on trained embeddings
analogy_results = evaluate_word_analogies(
analogies,
[skipgram_learned_embeddings, cbow_learned_embeddings],
["Skip-gram", "CBOW"],
word2idx, idx2word
)
# Display the results
print("Word Analogy Evaluation:")
display(analogy_results)
Word Analogy Evaluation:
| | analogy | expected | Skip-gram | CBOW |
|---|---|---|---|---|
0 | france - paris + italy = ? | rome | emirates (0.80), expelled (0.77), farmers (0.76) | rescued (0.85), ruler (0.83), motto (0.83) |
1 | germany - berlin + england = ? | london | constantinople (0.81), ornaments (0.80), until... | tent (0.81), then (0.80), nile (0.79) |
2 | king - man + woman = ? | queen | henry (0.83), connecticut (0.82), signer (0.80) | rider (0.84), satrap (0.83), martyr (0.79) |
3 | man - he + she = ? | woman | aphrodite (0.86), loved (0.86), adonis (0.84) | get (0.82), sura (0.77), going (0.76) |
4 | walking - walk + run = ? | running | indicator (0.74), slopes (0.70), photos (0.69) | autobiographical (0.69), filmed (0.66), kalash... |
5 | good - better + bad = ? | worse | architectures (0.92), maia (0.86), addicted (0... | obsession (0.78), gibson (0.77), chair (0.77) |
6 | small - smaller + big = ? | bigger | fourth (0.79), couple (0.79), fly (0.78) | stewart (0.82), leopold (0.80), clarke (0.78) |
7 | eat - food + drink = ? | water | drinking (0.82), coke (0.80), lead (0.78) | gothic (0.89), blockers (0.83), composers (0.82) |
8 | good - best + bad = ? | worst | tired (0.85), admiration (0.84), entity (0.81) | cease (0.84), glorious (0.84), altruist (0.83) |
9 | big - biggest + small = ? | smallest | long (0.86), photon (0.80), aminoacids (0.79) | recounts (0.82), christie (0.78), granite (0.71) |
6.3 Word Analogy Evaluation
Word analogy tasks go beyond simple similarity: they test whether embeddings capture relational structure in a way that allows consistent vector arithmetic.
The classic example is "king - man + woman = ?", where the ideal answer is "queen".
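With the word_analogy helper defined above, this exact analogy can be run against the trained Skip-gram embeddings (an illustrative call; the neighbors it returns depend on the training run):
# Illustrative sanity check of the classic analogy on the trained Skip-gram embeddings
for word, score in word_analogy("king", "man", "woman", skipgram_learned_embeddings, word2idx, idx2word):
    print(f"{word}: {score:.2f}")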
What Are We Testing?
These evaluations check if:
- Semantic relationships are encoded as vector differences
- The same "offset" (e.g., gender, country-capital) applies across different word pairs (a quick check of this idea appears right after the list)
- The model can retrieve the correct word via nearest-neighbor search in the embedding space
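The shared-offset idea can also be probed directly by comparing two relation vectors. Below is a minimal sketch (a hypothetical relation_cosine helper, assuming the trained embedding matrices and word2idx from above):
def relation_cosine(pair_a, pair_b, embedding_matrix, word2idx):
    # Cosine similarity between the offset vectors of two word pairs,
    # e.g. ("paris", "france") vs ("berlin", "germany")
    (a1, a2), (b1, b2) = pair_a, pair_b
    if any(w not in word2idx for w in (a1, a2, b1, b2)):
        return None
    off_a = embedding_matrix[word2idx[a1]] - embedding_matrix[word2idx[a2]]
    off_b = embedding_matrix[word2idx[b1]] - embedding_matrix[word2idx[b2]]
    return float(np.dot(off_a, off_b) / (np.linalg.norm(off_a) * np.linalg.norm(off_b) + 1e-8))

# If both pairs encode the same capital-country relation, this should be clearly positive:
# relation_cosine(("paris", "france"), ("berlin", "germany"), skipgram_learned_embeddings, word2idx)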
Key Observations from Our Results
1. Models Learn Some Relational Structure
- Both Skip-gram and CBOW capture certain relationships, though the expected word does not always appear at the top.
- Examples like "eat - food + drink" returning "drinking" (Skip-gram) suggest partial relational understanding.
- Analogies involving verb tenses or comparatives are harder, but the models sometimes retrieve related concepts (e.g., "good - best + bad" → "tired" or "cease").
2. Skip-gram vs. CBOW
- Skip-gram tends to perform better on analogies, especially for precise relationships like gender or verb forms.
- CBOW retrieves more abstract or loosely related terms, occasionally hitting the right direction but missing specificity.
- Neither model solves all analogies perfectly, which is expected given the data size and the low embedding dimensionality (16d).
3. Limitations to Keep in Mind
- The vocabulary is limited, so many ideal targets (e.g., "rome", "queen", "running") may be missing or underrepresented.
- Embedding quality improves with larger corpora, longer training, and higher dimensions.
- Still, these results give us a powerful diagnostic tool to evaluate what the model is capturing.
Conclusion
While we're not seeing state-of-the-art analogy solving here, this evaluation demonstrates that even small Word2Vec models can encode meaningful relationships:
- Some analogies are partially captured
- Embeddings reflect directional meaning shifts
- We now have a way to benchmark embedding quality intrinsically (see the accuracy sketch below)
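One way to compress this intrinsic benchmark into a single number is a top-N analogy accuracy over the test set. Here is a sketch (a hypothetical analogy_accuracy helper that reuses the analogies list and the word_analogy function defined above):
def analogy_accuracy(analogies, embedding_matrix, word2idx, idx2word, top_n=5):
    # Fraction of analogies whose expected word appears among the top_n predictions
    hits, attempted = 0, 0
    for w1, w2, w3, expected in analogies:
        answers = word_analogy(w1, w2, w3, embedding_matrix, word2idx, idx2word, top_n=top_n)
        if not answers:
            continue  # skip analogies with out-of-vocabulary words
        attempted += 1
        if any(word == expected for word, _ in answers):
            hits += 1
    return hits / attempted if attempted else 0.0

# Example comparison of the two trained models on the same analogy set:
# print("Skip-gram:", analogy_accuracy(analogies, skipgram_learned_embeddings, word2idx, idx2word))
# print("CBOW:", analogy_accuracy(analogies, cbow_learned_embeddings, word2idx, idx2word))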
As a next step, we can visualize the embedding space to get an intuitive sense of the structure that these analogies hint at.
def plot_semantic_clusters(words_by_category, embedding_matrix, word2idx, title):
"""
Plot words from different semantic categories in 2D space.
Args:
words_by_category: Dictionary mapping categories to lists of words
embedding_matrix: Word embedding matrix
word2idx: Word to index mapping
title: Plot title
"""
from sklearn.manifold import TSNE
import matplotlib.colors as mcolors
# Colors for different categories
colors = list(mcolors.TABLEAU_COLORS)
# Collect embeddings and labels for words that are in vocabulary
embeddings = []
labels = []
categories = []
for category, words in words_by_category.items():
for word in words:
if word in word2idx:
embeddings.append(embedding_matrix[word2idx[word]])
labels.append(word)
categories.append(category)
if not embeddings:
print("No words found in vocabulary")
return
# Convert to array
embeddings = np.array(embeddings)
# Use t-SNE to reduce to 2 dimensions
tsne = TSNE(n_components=2, random_state=42, perplexity=min(30, len(embeddings)-1))
reduced_embeddings = tsne.fit_transform(embeddings)
# Plot
plt.figure(figsize=(12, 8))
category_handles = []
# Get unique categories and sort them
unique_categories = sorted(set(categories))
for i, category in enumerate(unique_categories):
color = colors[i % len(colors)]
indices = [j for j, cat in enumerate(categories) if cat == category]
scatter = plt.scatter(
reduced_embeddings[indices, 0],
reduced_embeddings[indices, 1],
label=category,
color=color,
alpha=0.7
)
category_handles.append(scatter)
# Add labels for each point
for j in indices:
plt.annotate(
labels[j],
(reduced_embeddings[j, 0], reduced_embeddings[j, 1]),
xytext=(5, 2),
textcoords='offset points',
fontsize=9
)
plt.title(title)
plt.legend(handles=category_handles, loc='best')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()
# Define semantic categories for testing
semantic_categories = {
'Countries': ['france', 'germany', 'italy', 'spain', 'england', 'russia', 'china', 'japan', 'canada', 'australia'],
'Cities': ['paris', 'berlin', 'rome', 'madrid', 'london', 'moscow', 'beijing', 'tokyo', 'toronto', 'sydney'],
'Numbers': ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten'],
'Family': ['father', 'mother', 'son', 'daughter', 'brother', 'sister', 'uncle', 'aunt', 'grandfather', 'grandmother'],
'Colors': ['red', 'blue', 'green', 'yellow', 'black', 'white', 'purple', 'orange', 'brown', 'pink'],
'Transportation': ['car', 'bus', 'train', 'plane', 'boat', 'bicycle', 'truck', 'motorcycle', 'ship', 'helicopter']
}
# Plot for initial vs trained embeddings
print("Initial Random Embeddings Clustering:")
plot_semantic_clusters(semantic_categories, skipgram_initial_embeddings, word2idx, "Initial Random Embeddings - Semantic Categories")
print("\nSkip-gram Trained Embeddings Clustering:")
plot_semantic_clusters(semantic_categories, skipgram_learned_embeddings, word2idx, "Skip-gram Trained Embeddings - Semantic Categories")
print("\nCBOW Trained Embeddings Clustering:")
plot_semantic_clusters(semantic_categories, cbow_learned_embeddings, word2idx, "CBOW Trained Embeddings - Semantic Categories")
Initial Random Embeddings Clustering:
Skip-gram Trained Embeddings Clustering:
CBOW Trained Embeddings Clustering:
6.4 Semantic Clustering Analysis
The plots above visualize how word embeddings evolve before and after training, using t-SNE to project them into 2D space. Each word belongs to a predefined semantic category (e.g., cities, family, colors), and colors indicate groupings.
What We're Looking For:
- Category Clustering: Do words from the same group stay close together?
- Semantic Proximity: Are related categories near each other in the space?
- Transformation Through Training: Does the embedding space become more meaningful? (A small numeric check follows this list.)
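As a numeric complement to the plots, we can compare the average cosine similarity within categories against the average across categories; trained embeddings should show a clear gap between the two. A minimal sketch (a hypothetical category_cohesion helper, assuming semantic_categories, word2idx, and the embedding matrices defined above):
def category_cohesion(words_by_category, embedding_matrix, word2idx):
    # Average within-category vs. across-category cosine similarity
    cats = {}
    for cat, words in words_by_category.items():
        vecs = [embedding_matrix[word2idx[w]] for w in words if w in word2idx]
        if len(vecs) > 1:  # need at least two in-vocabulary words per category
            vecs = np.array(vecs)
            cats[cat] = vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-8)
    if len(cats) < 2:
        return None  # not enough categories to compare
    intra, inter = [], []
    names = list(cats)
    for i, a in enumerate(names):
        sims = cats[a] @ cats[a].T
        intra.append((sims.sum() - np.trace(sims)) / (sims.size - len(sims)))  # off-diagonal mean
        for b in names[i + 1:]:
            inter.append((cats[a] @ cats[b].T).mean())
    return float(np.mean(intra)), float(np.mean(inter))

# Example: category_cohesion(semantic_categories, skipgram_learned_embeddings, word2idx)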
Initial Random Embeddings
- Words are randomly scattered, with no visible structure
- No evidence of clustering by meaning or category
- Semantic categories are completely blended and unorganized
Skip-gram Trained Embeddings
- Clear clustering by semantic category (e.g., countries, numbers, colors, transportation)
- Related categories (e.g., cities and countries) are positioned near each other
- Numbers form a tight group: "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"
- Shows sharper boundaries between categories, reflecting Skip-gram's tendency toward more precise local relationships
CBOW Trained Embeddings
- Strong semantic grouping, though with slightly more blended transitions between clusters
- Categories like colors, family, and transportation still stand out
- Cities and countries are often close, but with softer separation than Skip-gram
- This reflects CBOW's nature: averaging over the context window produces smoother, more globally coherent embeddings
Takeaway
These plots visually confirm that:
- Training drastically transforms the embedding space from random noise to meaningful structure
- Both models learn semantics from context alone
- Proximity now encodes meaning, and category membership becomes visually detectable
This clustering is a strong sign that our Word2Vec models have successfully learned to represent semantic relationships in vector space.