Session 3: Word Embeddings
Course Materials
Slides
Download Session 3 Slides (PDF)
Notebooks
- Word2Vec from Scratch - with negative sampling
- Embedding Evaluation: Intrinsic and Extrinsic
- Classification with Embeddings
Session 3: Word Embeddings
In this third session, we explore how words can be mathematically represented and why this is essential in any NLP pipeline. We trace the journey from traditional sparse one-hot encodings and TF-IDF vectors to powerful dense embeddings like Word2Vec and GloVe, and finally to context-aware models like ELMo and BERT.
We also see how these embeddings are evaluated and how they can be applied to downstream NLP tasks like sentiment analysis, NER, or question answering.
Learning Objectives
- Understand the limitations of traditional word representations (e.g., sparsity, context insensitivity).
- Learn how dense vector embeddings solve these problems and how to train them.
- Explore Word2Vec architectures (Skip-gram and CBOW) and techniques like negative sampling.
- Evaluate embeddings both intrinsically (e.g., word similarity, analogy) and extrinsically (e.g., classification).
- Discover the next evolution: contextual embeddings with ELMo, including how to pretrain and fine-tune them.
Topics Covered
Static Word Embeddings
- One-hot, TF-IDF: Why we moved beyond them.
- Word2Vec (Skip-gram, CBOW) and the training process.
- Negative Sampling: How to make training efficient (a minimal update-step sketch follows this list).
- GloVe: A count-based alternative to Word2Vec.
- FastText: Subword-level embeddings to deal with rare words and misspellings.
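To make the Skip-gram objective and negative sampling concrete before the notebook, here is a minimal NumPy sketch of a single training update. Everything in it (vocabulary size, embedding dimension, learning rate, the sampled word IDs) is an illustrative placeholder, not code taken from the course notebook.

```python
# A minimal sketch of one Skip-gram negative-sampling (SGNS) update in plain NumPy.
# All sizes and hyperparameters (VOCAB_SIZE, EMBED_DIM, NUM_NEG, lr) are illustrative.
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, EMBED_DIM, NUM_NEG, lr = 1000, 50, 5, 0.025

# Each word gets two vectors: one as a target ("input") and one as a context ("output").
W_in = rng.normal(scale=0.01, size=(VOCAB_SIZE, EMBED_DIM))
W_out = rng.normal(scale=0.01, size=(VOCAB_SIZE, EMBED_DIM))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(target, context, neg_samples):
    """One SGD step on the observed pair (target, context) plus sampled negatives."""
    v_t = W_in[target]                              # target vector, shape (EMBED_DIM,)
    ids = np.concatenate(([context], neg_samples))  # 1 true context + NUM_NEG negatives
    labels = np.zeros(len(ids)); labels[0] = 1.0    # 1 = real pair, 0 = negative sample
    u = W_out[ids]                                  # context vectors, (1 + NUM_NEG, EMBED_DIM)
    scores = sigmoid(u @ v_t)                       # predicted probability of "real pair"
    grad = (scores - labels)[:, None]               # gradient of the binary cross-entropy
    W_in[target] -= lr * (grad * u).sum(axis=0)     # update the target vector
    W_out[ids] -= lr * grad * v_t                   # update context and negative vectors

# Toy usage: pretend word 3 was observed near word 7, with 5 randomly drawn negatives.
sgns_step(target=3, context=7, neg_samples=rng.integers(0, VOCAB_SIZE, size=NUM_NEG))
```

In practice the negatives are drawn from a smoothed unigram distribution rather than uniformly, and usually only the input matrix is kept as the final embedding table.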
Evaluating Word Embeddings
- Intrinsic evaluations:
  - Word similarity (e.g., cosine similarity between "king" and "queen").
  - Word analogy ("man" : "woman" :: "king" : "queen"); both checks are sketched in code after this list.
- Extrinsic evaluations:
  - How well embeddings help in downstream tasks like classification or POS tagging.
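The snippet below sketches the two intrinsic checks. It assumes an `embeddings` dictionary mapping words to NumPy vectors (e.g., loaded from pretrained GloVe files); since no real vectors are bundled here, the usage lines are left commented.

```python
# Hedged sketch of the intrinsic evaluations; `embeddings` is assumed to be a
# dict of word -> NumPy vector (e.g., loaded from pretrained GloVe), not provided here.
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors (1.0 = same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(query_vec, embeddings, exclude=(), topn=3):
    """Words whose vectors are closest to query_vec by cosine similarity."""
    scored = [(w, cosine(query_vec, v)) for w, v in embeddings.items() if w not in exclude]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]

# Word similarity: how close are "king" and "queen"?
# print(cosine(embeddings["king"], embeddings["queen"]))

# Word analogy: man : woman :: king : ?  ->  king - man + woman should land near "queen".
# analogy_vec = embeddings["king"] - embeddings["man"] + embeddings["woman"]
# print(most_similar(analogy_vec, embeddings, exclude={"king", "man", "woman"}))
```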
Contextual Word Embeddings
- Why static vectors fall short (e.g., "bank" in "river bank" vs. "bank account").
- Introduction to ELMo (Peters et al., 2018).
- Bidirectional Language Modeling using LSTMs.
- How ELMo generates different embeddings for the same word in different contexts (illustrated with a toy BiLSTM sketch after this list).
- Using ELMo for transfer learning in real-world NLP tasks (e.g., sentiment classification).
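The sketch below is not ELMo itself, only a toy illustration of the underlying idea: a bidirectional LSTM (here tiny and untrained, with a made-up vocabulary) produces a vector for each token that depends on the whole sentence, so the word "bank" comes out differently in "river bank" and "bank account".

```python
# Toy illustration (not ELMo): a tiny, untrained bidirectional LSTM whose per-token
# outputs depend on the surrounding sentence. Vocabulary and sizes are made up.
import torch
import torch.nn as nn

vocab = {"the": 0, "river": 1, "bank": 2, "account": 3}
emb = nn.Embedding(len(vocab), 16)                 # a static lookup, as in Word2Vec
bilstm = nn.LSTM(input_size=16, hidden_size=16, bidirectional=True, batch_first=True)

def contextual_vectors(tokens):
    ids = torch.tensor([[vocab[t] for t in tokens]])   # shape (1, seq_len)
    out, _ = bilstm(emb(ids))                          # shape (1, seq_len, 2 * hidden)
    return out[0]                                      # one vector per token position

v1 = contextual_vectors(["the", "river", "bank"])[2]   # "bank" next to "river"
v2 = contextual_vectors(["the", "bank", "account"])[1] # "bank" next to "account"

# A static lookup table would give cosine similarity exactly 1.0 for the two "bank"
# occurrences; the context-dependent outputs differ.
print(torch.cosine_similarity(v1, v2, dim=0).item())
```

ELMo builds on the same principle, but with a deep bidirectional language model pretrained on a large corpus and a learned weighting of its layers.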
Key Takeaways
| Aspect | Static Embeddings | Contextual Embeddings |
|---|---|---|
| Meaning Based on Context? | ❌ Same vector regardless of context | ✅ Different vector per occurrence |
| Polysemy Handling | ❌ No | ✅ Yes |
| Requires Large Corpus? | ✅ Usually | ✅ Definitely |
| Adaptable to Tasks? | ⚠️ Not easily | ✅ Via fine-tuning |
Bibliography & Recommended Reading
- Jay Alammar (2017): Visual Introduction to Word Embeddings (blog post). Excellent visuals for understanding Word2Vec and GloVe.
- Sebastian Ruder (2017): On Word Embeddings, Part 2: Approximating Co-occurrence Matrices (blog post). Detailed breakdown of how different embedding models compare.
- Mikolov et al. (2013): Efficient Estimation of Word Representations in Vector Space (paper). The original Word2Vec paper introducing the Skip-gram and CBOW models.
- Pennington et al. (2014): GloVe: Global Vectors for Word Representation (paper). A count-based embedding approach from the Stanford NLP group.
- Joulin et al. (2016): Bag of Tricks for Efficient Text Classification (FastText) (paper). A very practical take on embeddings using subword units.
- Peters et al. (2018): Deep Contextualized Word Representations (paper). The ELMo paper, showing how dynamic embeddings outperform static ones on many tasks.
Practical Components
- Word2Vec from Scratch: We walk through how Skip-gram is trained on target/context word pairs and how to integrate negative sampling.
- Embedding Visualizations: Use t-SNE or PCA to project high-dimensional embeddings and see how similar words cluster.
- Text Classification with Embeddings: Test embeddings in real classification tasks with logistic regression or LSTMs (a minimal averaged-embedding sketch follows this list).
- Using Pretrained ELMo Embeddings: Fine-tune contextual embeddings on your own dataset.
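As a minimal sketch of that classification pipeline (with a two-example placeholder dataset and random vectors standing in for real pretrained embeddings such as GloVe), each text is represented by the average of its word vectors and passed to a scikit-learn logistic regression:

```python
# Minimal sketch: average word vectors per document, then train a linear classifier.
# The "embeddings" here are random stand-ins; the notebook would load pretrained vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression

EMBED_DIM = 50
rng = np.random.default_rng(0)

# Placeholder data and vectors, only to make the example self-contained.
texts = ["great movie loved it", "terrible plot boring acting"]
labels = [1, 0]
embeddings = {w: rng.normal(size=EMBED_DIM)
              for w in set(" ".join(texts).split())}

def doc_vector(text):
    """Average the vectors of known words; zeros if no word is in the vocabulary."""
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(EMBED_DIM)

X = np.stack([doc_vector(t) for t in texts])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))   # sanity check on the (tiny) training data itself
```

Swapping the logistic regression for an LSTM over the token vectors (rather than their average) keeps word order, which usually helps on longer texts.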