
Session 3: Word Embeddings

πŸŽ“ Course Materials

πŸ“‘ Slides

Download Session 3 Slides (PDF)

πŸ““ Notebooks


πŸš€ Session 3: Word Embeddings

In this third session, we explore how words can be mathematically represented and why this is essential in any NLP pipeline. We trace the journey from traditional sparse one-hot encodings and TF-IDF vectors to powerful dense embeddings like Word2Vec and GloVe, and finally to context-aware models like ELMo and BERT.

We also see how these embeddings are evaluated and how they can be applied to downstream NLP tasks such as sentiment analysis, named entity recognition (NER), and question answering.

🎯 Learning Objectives

  1. Understand the limitations of traditional word representations (e.g., sparsity, context insensitivity).
  2. Learn how dense vector embeddings solve these problems and how to train them.
  3. Explore Word2Vec architectures (Skip-gram and CBOW) and techniques like negative sampling.
  4. Evaluate embeddings both intrinsically (e.g., word similarity, analogy) and extrinsically (e.g., classification).
  5. Discover the next evolution: contextual embeddings with ELMo, including how to pretrain and fine-tune them.

πŸ“š Topics Covered

Static Word Embeddings

  • One-hot, TF-IDF: Why we moved beyond them.
  • Word2Vec (Skip-gram, CBOW) and the training process.
  • Negative Sampling: how to make training efficient by contrasting true (target, context) pairs with randomly sampled noise words (see the sketch after this list).
  • GloVe: A count-based alternative to Word2Vec.
  • FastText: Subword-level embeddings to deal with rare words and misspellings.
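
For concreteness, here is a minimal NumPy sketch of skip-gram training with negative sampling. The toy corpus, window size, embedding dimension, number of negatives, and learning rate are illustrative assumptions, not values from the session materials.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus and vocabulary (illustrative assumption).
corpus = "the king spoke to the queen about the river bank".split()
vocab = sorted(set(corpus))
word2id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 16                 # vocabulary size, embedding dimension

W_in = rng.normal(0, 0.1, (V, D))     # target (input) embeddings
W_out = rng.normal(0, 0.1, (V, D))    # context (output) embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(target, context, k=5, lr=0.05):
    """One skip-gram step with negative sampling: pull the true
    (target, context) pair together, push k sampled noise words away."""
    t, c = word2id[target], word2id[context]
    negatives = rng.integers(0, V, size=k)   # noise words (uniform here, for simplicity)
    v_t = W_in[t].copy()

    # Positive pair: gradient of -log sigmoid(v_t . u_c)
    score = sigmoid(v_t @ W_out[c])
    grad_t = (score - 1.0) * W_out[c]
    W_out[c] -= lr * (score - 1.0) * v_t

    # Negative pairs: gradient of -log sigmoid(-v_t . u_n)
    for n in negatives:
        score_n = sigmoid(v_t @ W_out[n])
        grad_t += score_n * W_out[n]
        W_out[n] -= lr * score_n * v_t

    W_in[t] -= lr * grad_t

# Slide a +/-2 word window over the corpus for a few passes.
window = 2
for _ in range(50):
    for i, target in enumerate(corpus):
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if i != j:
                sgns_step(target, corpus[j])
```

A real Word2Vec implementation samples negatives from a smoothed unigram distribution (frequencies raised to the 0.75 power) and excludes the true context word; the sampling here is kept uniform to keep the sketch short.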

Evaluating Word Embeddings

  • Intrinsic evaluations (see the snippet after this list):
      • Word similarity (e.g., cosine similarity between "king" and "queen").
      • Word analogy ("man" : "woman" :: "king" : "queen").
  • Extrinsic evaluations:
      • How well embeddings improve downstream tasks such as text classification or POS tagging.
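
As a concrete illustration of the intrinsic tests, the snippet below loads pretrained GloVe vectors through gensim and runs a similarity and an analogy query. The particular model name (`glove-wiki-gigaword-100` from gensim-data) is a convenient assumption; any set of KeyedVectors works the same way.

```python
import gensim.downloader as api

# Downloads the pretrained GloVe vectors on the first run.
wv = api.load("glove-wiki-gigaword-100")   # a gensim KeyedVectors object

# Intrinsic test 1: word similarity (cosine similarity between vectors).
print(wv.similarity("king", "queen"))      # a high value means the vectors are close

# Intrinsic test 2: word analogy  man : woman :: king : ?
# Solved as vec(king) - vec(man) + vec(woman), then nearest neighbours.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```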

Contextual Word Embeddings

  • Why static vectors fall short (e.g., "bank" in β€œriver bank” vs. β€œbank account”).
  • Introduction to ELMo (Peters et al., 2018).
  • Bidirectional Language Modeling using LSTMs.
  • How ELMo generates different embeddings for the same word in different contexts (a stand-in sketch with BERT follows this list).
  • Using ELMo for transfer learning in real-world NLP tasks (e.g., sentiment classification).
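
To make the "river bank" vs. "bank account" example concrete, the sketch below compares contextual vectors for the word "bank" in two sentences. It uses a BERT model via the Hugging Face transformers library as a convenient stand-in for ELMo (the point is the same: the vector depends on the sentence); the model name and example sentences are assumptions for illustration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Return the contextual vector of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, hidden_size)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

v_river = bank_vector("She sat on the river bank and watched the water.")
v_money = bank_vector("He opened a savings account at the bank.")

# A static embedding would give identical vectors for both uses of "bank";
# a contextual model gives noticeably different ones.
cos = torch.nn.functional.cosine_similarity(v_river, v_money, dim=0)
print(f"cosine similarity between the two 'bank' vectors: {cos.item():.2f}")
```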

🧠 Key Takeaways

| Aspect | Static Embeddings | Contextual Embeddings |
| --- | --- | --- |
| Meaning based on context? | ❌ Same vector regardless of context | βœ… Different vector per use |
| Polysemy handling | ❌ No | βœ… Yes |
| Requires a large corpus? | βœ… Usually | βœ… Definitely |
| Adaptable to tasks? | ⚠️ Not easily | βœ… Via fine-tuning |

πŸ“– Additional Resources

  • Jay Alammar (2017): Visual Introduction to Word Embeddings – blog post. Excellent visuals for understanding Word2Vec and GloVe.

  • Sebastian Ruder (2017): On Word Embeddings – Part 2: Approximating Co-occurrence Matrices – blog post. Detailed breakdown of how different embedding models compare.

  • Mikolov et al. (2013): Efficient Estimation of Word Representations in Vector Space – paper. The original Word2Vec paper, introducing the Skip-gram and CBOW models.

  • Pennington et al. (2014): GloVe: Global Vectors for Word Representation – paper. A count-based embedding approach from the Stanford NLP group.

  • Joulin et al. (2016): Bag of Tricks for Efficient Text Classification (FastText) – paper. A very practical take on embeddings using subword units.

  • Peters et al. (2018): Deep Contextualized Word Representations – paper. The ELMo paper, showing how contextual embeddings outperform static ones on many tasks.


πŸ’» Practical Components

  • From-Scratch Word2Vec: We walk through how Skip-gram is trained using (target, context) word pairs and how to integrate negative sampling.
  • Embedding Visualizations: Use t-SNE or PCA to project high-dimensional embeddings and see how similar words cluster (see the sketch after this list).
  • Text Classification with Embeddings: Test embeddings in real classification tasks with logistic regression or LSTMs.
  • Using Pretrained ELMo Embeddings: Fine-tune contextual embeddings on your own dataset.
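
For the visualization exercise, a minimal sketch along the following lines works; it reuses the pretrained gensim GloVe vectors from the evaluation snippet and a hand-picked word list, both of which are assumptions for illustration.

```python
import gensim.downloader as api
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

wv = api.load("glove-wiki-gigaword-100")

# Hand-picked words so that semantic clusters are easy to spot.
words = ["king", "queen", "prince", "princess",
         "apple", "banana", "orange", "grape",
         "paris", "london", "berlin", "madrid"]
X = np.stack([wv[w] for w in words])

# Project the 100-d vectors to 2-D; perplexity must stay below the number of points.
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X)

plt.figure(figsize=(6, 6))
plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y))
plt.title("t-SNE projection of GloVe embeddings")
plt.show()
```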