Session 5: Transformers & BERT
Course Materials
Slides
Download Session 5 Slides (PDF)
Notebooks
- Implementing BERT with Hugging Face Transformers
- Visualizing Attention Mechanisms
- Comparing LSTM vs. BERT vs. TinyBERT vs. ModernBERT
Session 5: Attention, Transformers, and BERT
In this fifth session, we move from traditional sequence models to the architecture that revolutionized NLP: the Transformer. We analyze how attention mechanisms overcame the long-distance dependency and sequential-processing limitations of RNNs, and how BERT, built on the Transformer encoder, became the new backbone of language understanding.
We also explore fine-tuning BERT for downstream tasks, and examine several variants (e.g., SciBERT, XLM-T, ModernBERT) tailored for specific domains or efficiency needs.
Learning Objectives
- Identify the limitations of RNNs and understand why attention mechanisms were introduced.
- Understand the full Transformer architecture including self-attention and feed-forward components.
- Grasp the innovations of BERT: bidirectionality, MLM, and NSP.
- Learn to fine-tune BERT for real tasks (NER, classification, QA).
- Explore extensions and variants like DistilBERT, SciBERT, XtremeDistil, and ModernBERT.
Topics Covered
Attention & Transformers
- Limitations of RNNs (sequential processing, long-distance dependencies).
- Attention Mechanism: Query-Key-Value, dynamic focus, soft memory (a minimal sketch follows this list).
- Self-Attention: Core of the Transformer, where all tokens attend to all others.
- Multi-Head Attention: Capture different representation subspaces.
- Transformer Architecture: Encoder-decoder stack, position encoding, full parallelization.
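To make the Query-Key-Value mechanics in the list above concrete, here is a minimal sketch of scaled dot-product attention in plain PyTorch (the toy dimensions and random inputs are illustrative, not tied to any particular model):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a batch of sequences."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, seq, seq) similarity scores
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)             # each row sums to 1: soft focus over tokens
    return weights @ v, weights

# Toy example: batch of 1, sequence of 4 tokens, model dimension 8.
# In self-attention, Q, K, and V are all projections of the same token embeddings.
q = k = v = torch.randn(1, 4, 8)
output, weights = scaled_dot_product_attention(q, k, v)
print(output.shape, weights.shape)  # torch.Size([1, 4, 8]) torch.Size([1, 4, 4])
```

Multi-head attention simply runs several of these attention computations in parallel on lower-dimensional projections and concatenates the results.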
BERT: Bidirectional Encoder Representations from Transformers
- BERT architecture: 12–24 layers, multi-head attention, 110M+ parameters.
- Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
- Tokenization strategies (WordPiece, special tokens); see the sketch after this list.
- Fine-tuning BERT for:
  - Classification
  - Token-level tasks (e.g., NER, QA)
- Performance on benchmarks (GLUE, SQuAD).
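To illustrate WordPiece tokenization, the special tokens, and masked language modeling in practice, the sketch below uses the Hugging Face `transformers` library with the standard `bert-base-uncased` checkpoint (the checkpoint and example sentences are illustrative choices):

```python
from transformers import AutoTokenizer, pipeline

# WordPiece tokenization: the tokenizer adds [CLS]/[SEP] special tokens,
# and rare words are split into "##"-prefixed subword pieces.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer("Transformers revolutionized natural language processing.")
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))

# Masked Language Modeling: BERT predicts the token hidden behind [MASK].
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```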
BERT Variants and Extensions
- SciBERT for scientific text understanding.
- EconBERTa for named entity recognition in economic research.
- XLM-T for multilingual social media analysis.
- XtremeDistilTransformer: BERT distilled for efficiency.
- ModernBERT (2024): Faster, longer-context, flash attention, rotary embeddings.
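Because these variants all expose the same Hugging Face `transformers` interface, switching between them is mostly a matter of changing the checkpoint name. A minimal sketch follows; the Hub identifiers below are the ones commonly published for these models, but verify them on the Hugging Face Hub before use (ModernBERT also requires a recent `transformers` release):

```python
from transformers import AutoModel, AutoTokenizer

# Hub checkpoint names (assumed identifiers; verify on huggingface.co before use)
checkpoints = {
    "SciBERT": "allenai/scibert_scivocab_uncased",
    "XLM-T": "cardiffnlp/twitter-xlm-roberta-base",
    "ModernBERT": "answerdotai/ModernBERT-base",
}

for name, ckpt in checkpoints.items():
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModel.from_pretrained(ckpt)
    n_params = sum(p.numel() for p in model.parameters()) / 1e6
    print(f"{name}: {n_params:.0f}M parameters, vocab size {tokenizer.vocab_size}")
```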
Key Takeaways
Architecture | Sequential? | Long-Context Friendly | Fine-Tunable | Efficient Inference |
---|---|---|---|---|
LSTM | Yes | No | Not typically (trained from scratch) | Moderate |
Transformer | No | Yes | Yes | Yes |
BERT | No | Limited (512 tokens) | Yes | Moderate |
ModernBERT | No | Yes (8k tokens) | Yes | Yes |
Bibliography & Recommended Reading
- Vaswani et al. (2017). Attention Is All You Need (Paper). The foundation of the Transformer model.
- Alammar (2018). The Illustrated Transformer (Blog post). A highly visual explanation of attention and Transformer layers.
- Devlin et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Paper). The original BERT paper introducing MLM and NSP.
- Warner et al. (2024). ModernBERT (Paper). A modern rethinking of BERT optimized for efficiency and long-context modeling.
- Rogers et al. (2020). A Primer in BERTology (Paper). Analysis and interpretability of BERT's internal behavior.
Practical Components
- Hugging Face BERT: Load, fine-tune, and evaluate BERT on classification or QA tasks (a minimal fine-tuning sketch follows this list).
- Attention Visualization: See how attention heads behave using heatmaps and interpret interactions between tokens (an extraction sketch also follows this list).
- Model Benchmarking: Compare inference time, memory use, and accuracy of LSTM, BERT, TinyBERT, and ModernBERT.
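The notebooks walk through fine-tuning in detail; as a rough outline, a minimal classification fine-tuning loop with the `Trainer` API looks like the sketch below. The SST-2 dataset, hyperparameters, subsampling, and output directory are placeholder choices for a quick demo, and the sketch assumes the `datasets` and `transformers` packages (plus `accelerate` for the Trainer):

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder task: binary sentiment classification on GLUE SST-2
dataset = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

# BERT encoder with a freshly initialized classification head on top of [CLS]
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

args = TrainingArguments(
    output_dir="bert-sst2-demo",        # placeholder output directory
    per_device_train_batch_size=16,
    num_train_epochs=1,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"].shuffle(seed=42).select(range(2000)),  # subsample for speed
    eval_dataset=dataset["validation"],
)
trainer.train()
print(trainer.evaluate())  # reports eval loss; add a compute_metrics fn for accuracy
```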
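For the attention-visualization exercise, the raw attention weights can be pulled out of any BERT-style model by requesting `output_attentions=True`; the heatmap plotting itself is left to the notebook. A minimal extraction sketch (the example sentence is arbitrary):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("The animal didn't cross the street because it was too tired.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
last_layer = outputs.attentions[-1][0]          # drop the batch dimension
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(len(outputs.attentions), last_layer.shape)
# last_layer[h] is a seq_len x seq_len matrix for head h,
# ready to plot as a heatmap labelled with `tokens` on both axes.
```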