Session 1: Introduction to the class
📚 Course Materials
📄 Slides:
Download Session 1 Slides (PDF)
📓 Notebooks:
- Python 101
- Baseline with regexes and spaCy
- TF-IDF: how to judge its quality?
- BM25: a better TF-IDF, judge through different metrics
📘 Session 1: Baselines and Sparse Representations
In our first session, I'll introduce you to baseline approaches: simple yet powerful starting points for many NLP tasks. These baselines serve as reference points, helping you measure whether your more sophisticated models actually bring improvements. We'll also explore the concept of sparse representations, such as bag-of-words or TF-IDF, which have been fundamental in text analysis for years.
🎯 Learning Objectives
- Grasp the challenges in processing natural language data.
- Understand and create sparse vector representations of text.
- Explore baseline models for typical NLP tasks.
- Learn about model evaluation and choosing the right metrics.
- Gain hands-on practice by building basic NLP pipelines.
Even if you have little background in NLP, don't worry. This session walks you through each concept step by step and shows you how to implement them practically in Python.
📝 Topics Covered
📊 Baseline & Evaluations
We'll discuss how to set up clear baselines for NLP tasks, starting with basic data cleaning, tokenization, and simple modeling like bag-of-words classifiers. This section underlines why evaluating your model with metrics such as accuracy, F1-score, or BLEU is crucial to understanding how well it performs and whether advanced methods are truly an improvement.
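To make the evaluation side concrete, here is a minimal pure-Python sketch of accuracy and F1-score computed by hand, applied to a majority-class baseline. The toy labels and the binary-label setting are assumptions made for illustration, not part of the session data:

```python
# Minimal sketch of two evaluation metrics: accuracy and F1-score.
# Toy binary labels below are invented for illustration.

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# A majority-class baseline: always predict the most frequent gold label.
gold = [1, 1, 0, 1, 0, 0, 1, 1]
baseline = [1] * len(gold)

print(accuracy(gold, baseline))  # 0.625
print(f1_score(gold, baseline))  # ~0.769: recall is perfect, precision is not
```

Note how the baseline's F1 exceeds its accuracy here: comparing several metrics on the same predictions is exactly why choosing the right one matters.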
🛠️ TF-IDF and Improvements
I'll walk you through the TF-IDF (Term Frequency-Inverse Document Frequency) technique, explaining why it's such a popular step beyond a raw bag-of-words. From there, we'll dive into tweaks and improvements, including dimensionality reduction techniques, vector space models, and uses in information retrieval and text classification.
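As a sketch of the core idea, here is a tiny TF-IDF vectorizer in plain Python. The toy corpus, raw-count tf, and log(N/df) idf are illustrative assumptions; real implementations (e.g. scikit-learn's) differ in smoothing and normalization:

```python
# A minimal TF-IDF sketch: tf = raw term count in the document,
# idf = log(N / df) with df = number of documents containing the term.
import math
from collections import Counter

def tfidf(corpus):
    """Return one {term: weight} dict per document."""
    n_docs = len(corpus)
    tokenized = [doc.lower().split() for doc in corpus]
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in tokenized for term in set(doc))
    return [
        {term: count * math.log(n_docs / df[term])
         for term, count in Counter(doc).items()}
        for doc in tokenized
    ]

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
weights = tfidf(corpus)
# "the" appears in two of the three documents, so its idf is low;
# "cat" appears in only one, so it ends up with a higher weight.
print(weights[0]["cat"] > weights[0]["the"])  # True
```

This already shows TF-IDF's main effect: frequent-everywhere terms are down-weighted, distinctive terms are promoted, which is exactly the behavior BM25 later refines.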
📖 Bibliography & Recommended Reading
- Hastie et al. "The Elements of Statistical Learning." - Book A foundational text on the principles of machine learning. I really encourage you to read it; it is often called the bible of machine learning.
- Zinkevich et al. (2022). "Google's Best Practices for ML Engineering." - Blog Post Offers guidelines on how to design, deploy, and maintain machine learning systems effectively.
- Rudi Seitz (2020). "Understanding TF-IDF and BM-25." - Blog Post A comprehensive guide to TF-IDF, its limitations, and why BM-25 is a better alternative.
- Van Rijsbergen, C. J. (1979). "Information Retrieval (2nd ed.)." - Book A foundational text on the principles of information retrieval systems.
- Gupta et al. (2014). "Improved Pattern Learning for Bootstrapped Entity Extraction." - Paper Discusses pattern-based bootstrapping approaches to entity extraction.
- Wang et al. (2019). "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding." - Paper | GLUE Benchmark Proposes a widely adopted multi-task benchmark for evaluating NLP models.
- Strubell et al. (2019). "Energy and Policy Considerations for Deep Learning in NLP." - Paper Investigates the environmental impact of large-scale NLP model training.
- Dodge et al. (2022). "Software Carbon Intensity (SCI)." - Paper A framework for measuring the carbon intensity of software solutions.
- Sheng et al. (2019). "The Woman Worked as a Babysitter: On Biases in Language Generation." - Paper Examines bias in language models via prompts and generated outputs.
- Hu et al. (2020). "XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization." - Paper A benchmark covering a range of cross-lingual transfer tasks.
These readings offer both historical perspectives and modern insights into how NLP has evolved and why these methods work.
💻 Practical Components
Finally, youโll put theory into practice by:
- Building a basic text-processing pipeline (regexes, spaCy, etc.).
- Implementing your own TF-IDF vectorizer or using existing libraries like scikit-learn.
- Running a simple text classification experiment on a real dataset.
- Comparing baseline results with more advanced approaches to see how improvements stack up.
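As a taste of the first step above, here is a minimal regex-based preprocessing sketch. The stop-word list and the token pattern are illustrative assumptions; the notebooks also show a more robust spaCy-based version:

```python
# A tiny regex-based text-processing pipeline:
# lowercase, extract word tokens, drop stop words.
import re

# Illustrative stop-word list (an assumption, not a standard set).
STOPWORDS = {"the", "a", "an", "is", "on", "of"}

def preprocess(text):
    """Return lowercase word tokens with stop words removed."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]

print(preprocess("The cat sat on the mat."))  # ['cat', 'sat', 'mat']
```

Even a pipeline this small makes for a useful baseline: its output can feed directly into the bag-of-words and TF-IDF models built later in the session.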