Resources

Recommended Materials

A few newsletters/blogs on NLP and AI you can subscribe to

Sebastian Ruder: NLP News: Deep dives and curated highlights on the latest NLP research, models, and trends.
Melanie Mitchell: AI: A Guide for Thinking Humans: Critical, accessible essays on AI progress, reasoning, and the limits of current models.
Jay Alammar: Language Models & Co.: Visual, intuitive explanations of LLMs, transformers, and applied NLP concepts.
Sebastian Raschka: Ahead of AI: Technical breakdowns of new papers, training techniques, and open-source LLM tooling.
Andriy Burkov: The Artificial Intelligence: Weekly summary of practical ML and AI news, papers, and engineering insights.
Gary Marcus: Marcus on AI: Skeptical commentary on AI hype, deep learning's limitations, and policy implications.
Andrej Karpathy: Blog: Hands-on essays and tutorials on neural networks, training dynamics, and LLM internals.
Lilian Weng: Lil'Log: In-depth technical posts on RL, LLMs, hallucinations, and agent architectures.
Chip Huyen: Practical writing on MLOps, AI engineering, and building real-world ML systems.
Jack Clark: Import AI: Weekly roundup of AI research, policy, and geopolitics.
Andrew Ng: The Batch: Accessible weekly digest of AI news, research, and industry trends.
Emily M. Bender & Alex Hanna: Mystery AI Hype Theater 3000: Sharp, linguistically grounded takedowns of AI hype and overclaims about LLMs.
Rachel Thomas: fast.ai blog: Essays on AI ethics, education, inclusion, and debunking conventional ML wisdom.

Neural Networks, BERT, attention, Transformers, Word Embeddings, LLMs

Hastie, Tibshirani & Friedman (2009) The Elements of Statistical Learning (2nd ed.): Foundational textbook covering supervised learning, regularization, trees, and ensembles.
Van Rijsbergen, C. J. (1979) Information Retrieval (2nd ed.), Butterworth-Heinemann: Classic textbook introducing precision/recall, indexing, and probabilistic retrieval models.
Wang et al. (2019) GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding: Introduces a 9-task benchmark suite designed to evaluate general-purpose language understanding across diverse NLU problems.
Hu et al. (2020) XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization: Evaluates cross-lingual transfer of multilingual models across 40 languages and 9 tasks.
Strubell et al. (2019) Energy and Policy Considerations for Deep Learning in NLP: Quantifies the financial and environmental cost of training large NLP models and calls for efficiency-aware research.
Dodge et al. (2022) Measuring the Carbon Intensity of AI in Cloud Instances: Measures emissions of cloud-based AI workloads and proposes practices for reducing carbon footprint.
Sheng et al. (2019) The Woman Worked as a Babysitter: On Biases in Language Generation: Shows that language models generate systematically biased completions across gender, race, and sexual orientation.
Gupta & Manning (2014) Improved Pattern Learning for Bootstrapped Entity Extraction: Improves bootstrapped entity extraction by jointly learning patterns and entities with better scoring.
Dou & Neubig (2021) Word Alignment by Fine-tuning Embeddings on Parallel Corpora: Uses fine-tuned multilingual embeddings to produce state-of-the-art word alignments without supervision.
Karpathy, Andrej (2016) Yes you should understand Backprop: Argues that understanding backpropagation matters because leaky abstractions in autodiff cause real bugs.
Karpathy, Andrej (2015) The Unreasonable Effectiveness of Recurrent Neural Networks: Demonstrates that character-level RNNs can generate surprisingly coherent text across many domains.
Olah, Christopher (2015) Understanding LSTM Networks: Visual, intuitive walkthrough of how LSTM gates manage long-range dependencies.
Olah & Carter (2016) Attention and Augmented Recurrent Neural Networks: Surveys attention, neural Turing machines, and other mechanisms that augment RNN capabilities.
Mikolov et al. (2013) Efficient Estimation of Word Representations in Vector Space: Introduces Word2Vec (CBOW and skip-gram) for learning word embeddings cheaply at scale.
Pennington et al. (2014) GloVe: Global Vectors for Word Representation: Learns word vectors by factorizing global word co-occurrence statistics.
Bojanowski et al. (2017) Enriching Word Vectors with Subword Information: FastText: represents words as bags of character n-grams to handle morphology and OOV tokens.
Peters et al. (2018) Deep Contextualized Word Representations: ELMo: produces context-dependent word representations from a deep bidirectional language model.
Howard & Ruder (2018) Universal Language Model Fine-tuning for Text Classification: ULMFiT: introduces a transfer learning recipe for fine-tuning language models on downstream NLP tasks.
Devlin et al. (2019) [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://aclanthology.org/N19-1423/): Pre-trains deep bidirectional transformers via masked language modeling, setting a new state of the art on 11 NLP tasks.
Alammar, Jay (2018) The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning): Visual explanation of how contextual pre-training reshaped NLP transfer learning.
Vaswani et al. (2017) Attention Is All You Need: Introduces the Transformer architecture, replacing recurrence with self-attention.
Uszkoreit, Jakob (2017) Transformer: A Novel Neural Network Architecture for Language Understanding: Google blog post explaining the intuition behind the Transformer.
Alammar, Jay (2018) The Illustrated Transformer: Step-by-step visual breakdown of self-attention and the Transformer encoder/decoder.
Adaloglou, Nikolas (2020) How Transformers work in deep learning and NLP: an intuitive introduction: Intuitive introduction to attention, positional encodings, and Transformer mechanics.
Liu et al. (2019) RoBERTa: A Robustly Optimized BERT Pretraining Approach: Shows that BERT was undertrained and that longer training, more data, and dropping NSP yield stronger results.
Wolf et al. (2020) HuggingFace's Transformers: State-of-the-art Natural Language Processing: Presents the open-source library that standardized access to pre-trained transformer models.
Sun et al. (2019) How to Fine-Tune BERT for Text Classification?: Empirical study of fine-tuning strategies, learning rates, and layer-wise schedules for BERT.
Brown et al. (2020) Language Models are Few-Shot Learners: GPT-3: shows that scaling autoregressive LMs to 175B parameters enables strong few-shot in-context learning.
Gao et al. (2021) Making Pre-trained Language Models Better Few-shot Learners: LM-BFF: improves few-shot fine-tuning via prompt-based learning and automatic demonstration selection.
Gao, Tianyu (2021) Prompting: Better Ways of Using Language Models for NLP Tasks: Survey-style article explaining prompt-based methods and their relationship to fine-tuning.
Schick & Schütze (2021) Generating Datasets with Pretrained Language Models: Uses generative LMs to synthesize labeled training data for sentence-level tasks without human annotation.
Schick & Schütze (2021) Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference: PET: reformulates classification as cloze tasks to leverage MLM knowledge in few-shot settings.
Bender et al. (2021) On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜: Critiques the risks of ever-larger LMs: environmental cost, bias amplification, and illusion of understanding.
Kirk et al. (2021) Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models: Audits GPT-2 for occupational stereotypes across intersectional demographic groups.
Schick et al. (2021) Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP: Shows pre-trained LMs can identify and reduce their own biased outputs at decoding time.
Le Scao et al. (2022) BLOOM: A 176B-Parameter Open-Access Multilingual Language Model: Releases an open multilingual LLM trained collaboratively across 46 languages.
Suau et al. (2022) Self-conditioning Pre-Trained Language Models: Identifies expert neurons inside LMs and uses them to control generation without fine-tuning.
Agüera y Arcas (2022) Do Large Language Models Understand Us?: Argues that LaMDA-style models exhibit forms of understanding that challenge naive Chinese-room critiques.
Touvron et al. (2023) LLaMA: Open and Efficient Foundation Language Models: Trains competitive 7B–65B foundation LLMs using only public data with strong inference efficiency.
Manakul et al. (2023) SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models: Detects LLM hallucinations by sampling multiple responses and measuring their consistency, with no external resources needed.
Al-Kaswan & Izadi (2023) The (Ab)use of Open Source Code to Train Large Language Models: Discusses copyright, licensing, and ethical issues of training LLMs on open-source code repositories.
Luccioni et al. (2024) Power Hungry Processing: Watts Driving the Cost of AI Deployment?: Measures inference-time energy and emissions across tasks, finding generation is far costlier than discriminative tasks.
Yao et al. (2023) ReAct: Synergizing Reasoning and Acting in Language Models: Interleaves chain-of-thought reasoning with tool-use actions, improving factuality and task success.
Huyen, Chip (2025) AI Engineering: Practical guide to building applications on top of foundation models, covering evaluation, deployment, and feedback loops.
Warner et al. (2024) Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Fine Tuning and Inference: ModernBERT: a refreshed encoder with long-context support, faster inference, and stronger downstream performance.
Chen et al. (2024) What is the Role of Small Models in the LLM Era: A Survey: Surveys when small models complement or replace LLMs, covering distillation, ensembling, and routing.
Weng, Lilian (2024) Extrinsic Hallucinations in LLMs: Survey blog post on hallucination types, causes, evaluation metrics, and mitigation strategies.
Mitchell, Melanie (2025) LLMs and World Models: Examines whether LLMs build genuine world models or rely on shallow heuristics.
Vafa et al. (2024) Evaluating the World Model Implicit in a Generative Model (https://dl.acm.org/doi/abs/10.5555/3737916.3738762): Proposes new metrics showing generative models can perform well while harboring incoherent implicit world models.
Feng et al. (2024) Were RNNs All We Needed?: Revisits minimal LSTMs/GRUs and shows simplified, parallelizable variants rival modern architectures.
An et al. (2025) Measuring Gender and Racial Biases in Large Language Models: Intersectional Evidence from Automated Resume Evaluation: Audits LLM-based resume screening and finds intersectional gender × race disparities in hiring recommendations.
Haim et al. (2025) What's in a Name? Auditing Large Language Models for Race and Gender Bias: Uses name perturbations to surface racial and gender bias in LLM advice across high-stakes scenarios.
Bai et al. (2025) Explicitly Unbiased Large Language Models Still Form Biased Associations: Shows LLMs that pass explicit bias tests still encode stereotypical associations measurable via implicit-association probes.
Hartzog (2026) How AI Destroys Institutions: Argues AI systems erode the procedural and trust foundations that make institutions work.

Papers cited in the slides (added)

Luhn, H. P. (1957) A Statistical Approach to Mechanized Encoding and Searching of Literary Information: Early work on term-frequency statistics for indexing and search.
Spärck Jones, K. (1972) A Statistical Interpretation of Term Specificity and Its Application in Retrieval: Introduces inverse document frequency, the IDF in TF-IDF.
Nesterov, Y. (1983) A Method for Solving the Convex Programming Problem with Convergence Rate O(1/k²): Nesterov accelerated gradient, a look-ahead variant of momentum.
Rumelhart et al. (1986) Learning Representations by Back-propagating Errors: Popularizes backpropagation for training multilayer neural networks.
LeCun et al. (1989) Backpropagation Applied to Handwritten Zip Code Recognition: Foundational convolutional neural network trained end-to-end with backprop.
Hochreiter & Schmidhuber (1997) Long Short-Term Memory: Original LSTM, addressing vanishing gradients in RNNs.
Beyer et al. (1999) When Is "Nearest Neighbor" Meaningful?: Curse of dimensionality: distance contrast vanishes in high dimensions.
Qian, N. (1999) On the Momentum Term in Gradient Descent Learning Algorithms: Analyzes the momentum term that accelerates gradient descent.
Fei-Fei et al. (2006) One-Shot Learning of Object Categories: Foundational one-shot learning work from computer vision.
Manning, Raghavan & Schütze (2008) Introduction to Information Retrieval: Standard reference for vector-space retrieval, TF-IDF, and evaluation.
Chang et al. (2008) Importance of Semantic Representation: Dataless Classification: Early dataless text classification using semantic label representations.
Glorot & Bengio (2010) Understanding the Difficulty of Training Deep Feedforward Neural Networks: Xavier/Glorot initialization for stable deep-network training.
Duchi et al. (2011) Adaptive Subgradient Methods for Online Learning and Stochastic Optimization: Adagrad, with per-parameter adaptive learning rates.
Pascanu et al. (2013) On the Difficulty of Training Recurrent Neural Networks: Analyzes vanishing/exploding gradients and proposes gradient clipping.
Graves, A. (2013) Generating Sequences With Recurrent Neural Networks: LSTM-based sequence generation, including peephole connections.
Mikolov et al. (2013) Distributed Representations of Words and Phrases and Their Compositionality: Companion Word2Vec paper introducing skip-gram with negative sampling.
Bahdanau et al. (2014) Neural Machine Translation by Jointly Learning to Align and Translate: The original attention mechanism for sequence-to-sequence models.
Cho et al. (2014) Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation: Introduces the GRU as a simpler alternative to LSTM.
Goodfellow et al. (2014) Explaining and Harnessing Adversarial Examples: Adversarial perturbations that reveal model brittleness.
Kingma & Ba (2015) Adam: A Method for Stochastic Optimization: The Adam optimizer combining momentum and adaptive learning rates.
Hinton et al. (2015) Distilling the Knowledge in a Neural Network: Knowledge distillation from a large teacher into a smaller student.
Schroff et al. (2015) FaceNet: A Unified Embedding for Face Recognition and Clustering: Introduces the triplet loss used in contrastive few-shot learning.
He et al. (2015) Deep Residual Learning for Image Recognition: Residual connections; the "degradation problem" generalizes to deep nets.
Han et al. (2015) Learning Both Weights and Connections for Efficient Neural Networks: Magnitude-based pruning for compact, efficient networks.
Schakel & Wilson (2015) Measuring Word Significance Using Distributed Representations of Words: Shows embedding norm grows with word frequency and significance.
Bolukbasi et al. (2016) Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings: Surfaces and debiases gender stereotypes in word embeddings.
Molchanov et al. (2016) Pruning Convolutional Neural Networks for Resource Efficient Inference: Iterative pruning guided by a Taylor-expansion importance criterion.
Ruder, S. (2016) An Overview of Gradient Descent Optimization Algorithms: Survey of SGD variants (Momentum, Nesterov, Adagrad, RMSprop, Adam).
Mu & Viswanath (2018) All-but-the-Top: Simple and Effective Postprocessing for Word Representations: Removing top principal components (mostly frequency) improves embeddings.
Caliskan et al. (2017) Semantics Derived Automatically from Language Corpora Contain Human-like Biases: Shows word embeddings encode documented human biases (WEAT).
Howard et al. (2017) MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications: Depthwise-separable convolutions for efficient on-device inference.
McCann et al. (2017) Learned in Translation: Contextualized Word Vectors (CoVe): Contextual word vectors derived from a machine-translation encoder.
Zhao et al. (2017) Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints: Shows models amplify dataset bias and constrains inference to reduce it.
Peters et al. (2017) Semi-supervised Sequence Tagging with Bidirectional Language Models: LM-augmented sequence tagging, a precursor to ELMo.
Garg et al. (2018) Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes: Uses embeddings across decades to track shifting social stereotypes.
Jacob et al. (2018) Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference: Integer quantization scheme for efficient inference.
Eubanks, V. (2018) Automating Inequality: How automated decision systems harm poor and working-class people.
Prates et al. (2019) Assessing Gender Bias in Machine Translation: Documents gender-stereotyped defaults in MT from gender-neutral languages.
Sap et al. (2019) The Risk of Racial Bias in Hate Speech Detection: Shows toxicity classifiers over-flag African-American English.
Beltagy et al. (2019) SciBERT: A Pretrained Language Model for Scientific Text: Domain-adapted BERT for scientific literature.
Yang et al. (2019) XLNet: Generalized Autoregressive Pretraining for Language Understanding: Permutation language modeling combining autoregressive and bidirectional context.
Lewis et al. (2019) BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension: Denoising autoencoder pretraining for generation and comprehension.
Sanh et al. (2019) DistilBERT, a Distilled Version of BERT: A smaller, faster BERT trained by knowledge distillation.
Shen et al. (2019) Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT: Mixed-precision quantization of BERT guided by Hessian information.
Ethayarajh (2019) How Contextual are Contextualized Word Representations?: Documents anisotropy: contextual embeddings occupy a narrow cone.
West et al. (2019) Discriminating Systems: Gender, Race and Power in AI: Links workforce diversity gaps to biased AI systems.
Radford et al. (2019) Language Models are Unsupervised Multitask Learners (GPT-2): Scaling autoregressive LMs for zero-shot transfer.
Reimers & Gurevych (2019) Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks: Siamese fine-tuning for semantically meaningful sentence embeddings.
Koenecke et al. (2020) Racial Disparities in Automated Speech Recognition: Measures large ASR word-error-rate gaps between Black and white speakers.
Joshi et al. (2020) The State and Fate of Linguistic Diversity and Inclusion in the NLP World: Quantifies how few languages NLP research actually serves.
Blodgett et al. (2020) Language (Technology) is Power: A Critical Survey of "Bias" in NLP: Surveys 146 papers and critiques vague, mismatched notions of bias.
Barocas, Hardt & Narayanan (2019) Fairness and Machine Learning: Textbook on statistical non-discrimination criteria and their limits.
Benjamin, R. (2019) Race After Technology: Argues technology can encode and amplify racial hierarchies.
Green, B. (2019) "Good" Isn't Good Enough: Critiques "AI for social good" framing and calls for political reflexivity.
Su et al. (2021) Whitening Sentence Representations for Better Semantics and Faster Retrieval: Whitening removes anisotropy and reduces embedding dimensionality.
Timkey & van Schijndel (2021) All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality: A few rogue dimensions dominate cosine similarity; standardization fixes it.
Nozza et al. (2021) HONEST: Measuring Hurtful Sentence Completion in Language Models: A multilingual benchmark for hurtful completions in masked LMs.
Shin et al. (2021) Constrained Language Models Yield Few-Shot Semantic Parsers: Constrained decoding turns LLMs into few-shot semantic parsers.
Mukherjee et al. (2021) XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation: Task-agnostic distillation producing compact multilingual encoders.
Barbieri et al. (2022) XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis: Multilingual social-media transformer for sentiment tasks.
Wei et al. (2022) Chain-of-Thought Prompting Elicits Reasoning in Large Language Models: Intermediate reasoning steps unlock multi-step problem solving.
Gao et al. (2022) Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE): Hypothetical document embeddings for zero-shot dense retrieval.
Hu et al. (2021) LoRA: Low-Rank Adaptation of Large Language Models: Low-rank adapters for parameter-efficient fine-tuning.
Gao et al. (2021) SimCSE: Simple Contrastive Learning of Sentence Embeddings: Contrastive objective yielding isotropic, high-quality sentence embeddings.
He et al. (2021) DeBERTa: Decoding-enhanced BERT with Disentangled Attention: Disentangled content/position attention improving over BERT and RoBERTa.
Lasri et al. (2023) EconBERTa: Towards Robust Extraction of Named Entities in Economics: Domain-adapted encoder and the ECON-IE dataset for economics NER.
Liu et al. (2023) A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration (DyLAN): Multi-agent collaboration with automatic team optimization.
Liu et al. (2023) What Makes Good Data for Alignment? (DEITA): Automatic data selection that matches SOTA alignment with far less SFT data.
Tie et al. (2025) A Survey on Post-training of Large Language Models: Survey of alignment, fine-tuning, and reasoning advances in LLMs.
