Resources
Recommended Materials
A few newsletters/blogs on NLP and AI you can subscribe to
Sebastian Ruder: NLP News : Deep dives and curated highlights on the latest NLP research, models, and trends.
Melanie Mitchell: AI: A Guide for Thinking Humans : Critical, accessible essays on AI progress, reasoning, and the limits of current models.
Jay Alammar: Language Models & Co. : Visual, intuitive explanations of LLMs, transformers, and applied NLP concepts.
Sebastian Raschka: Ahead of AI : Technical breakdowns of new papers, training techniques, and open-source LLM tooling.
Andriy Burkov: The Artificial Intelligence : Weekly summary of practical ML and AI news, papers, and engineering insights.
Gary Marcus: Marcus on AI : Skeptical commentary on AI hype, deep learning's limitations, and policy implications.
Andrej Karpathy: Blog : Hands-on essays and tutorials on neural networks, training dynamics, and LLM internals.
Lilian Weng: Lil'Log : In-depth technical posts on RL, LLMs, hallucinations, and agent architectures.
Chip Huyen : Practical writing on MLOps, AI engineering, and building real-world ML systems.
Jack Clark: Import AI : Weekly roundup of AI research, policy, and geopolitics.
Andrew Ng: The Batch : Accessible weekly digest of AI news, research, and industry trends.
Emily M. Bender & Alex Hanna: Mystery AI Hype Theater 3000 : Sharp, linguistically grounded takedowns of AI hype and overclaims about LLMs.
Rachel Thomas: fast.ai blog : Essays on AI ethics, education, inclusion, and debunking conventional ML wisdom.
Hastie, Tibshirani & Friedman, The Elements of Statistical Learning : Foundational textbook covering supervised learning, regularization, trees, and ensembles.
Van Rijsbergen, C. J. (1979). Information Retrieval (2nd ed.). Butterworth-Heinemann : Classic textbook introducing precision/recall, indexing, and probabilistic retrieval models.
Wang et al. (2019) GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding : Introduces a 9-task benchmark suite designed to evaluate general-purpose language understanding across diverse NLU problems.
Hu et al. (2020) XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization : Evaluates cross-lingual transfer of multilingual models across 40 languages and 9 tasks.
Strubell et al. (2019) Energy and Policy Considerations for Deep Learning in NLP : Quantifies the financial and environmental cost of training large NLP models and calls for efficiency-aware research.
Dodge et al. (2022) Measuring the Carbon Intensity of AI in Cloud Instances : Measures emissions of cloud-based AI workloads and proposes practices for reducing carbon footprint.
Sheng et al. (2019) The Woman Worked as a Babysitter: On Biases in Language Generation : Shows that language models generate systematically biased completions across gender, race, and sexual orientation.
Gupta & Manning (2014) Improved Pattern Learning for Bootstrapped Entity Extraction : Improves bootstrapped entity extraction by jointly learning patterns and entities with better scoring.
Dou & Neubig (2021) Word Alignment by Fine-tuning Embeddings on Parallel Corpora : Uses fine-tuned multilingual embeddings to produce state-of-the-art word alignments without supervision.
Karpathy, Andrej (2016) Yes you should understand Backprop : Argues that understanding backpropagation matters because leaky abstractions in autodiff cause real bugs.
Karpathy, Andrej (2015) The Unreasonable Effectiveness of Recurrent Neural Networks : Demonstrates that character-level RNNs can generate surprisingly coherent text across many domains.
Olah, Christopher (2015) Understanding LSTM Networks : Visual, intuitive walkthrough of how LSTM gates manage long-range dependencies.
Olah & Carter (2016) Attention and Augmented Recurrent Neural Networks : Surveys attention, neural Turing machines, and other mechanisms that augment RNN capabilities.
Mikolov et al. (2013) Efficient Estimation of Word Representations in Vector Space : Introduces Word2Vec (CBOW and skip-gram) for learning word embeddings cheaply at scale.
Pennington et al. (2014) GloVe: Global Vectors for Word Representation : Learns word vectors by factorizing global word co-occurrence statistics.
Bojanowski et al. (2017) Enriching Word Vectors with Subword Information : FastText: represents words as bags of character n-grams to handle morphology and OOV tokens.
Peters et al. (2018) Deep Contextualized Word Representations : ELMo: produces context-dependent word representations from a deep bidirectional language model.
Howard & Ruder (2018) Universal Language Model Fine-tuning for Text Classification : ULMFiT: introduces a transfer learning recipe for fine-tuning language models on downstream NLP tasks.
Devlin et al. (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (https://aclanthology.org/N19-1423/) : Pre-trains deep bidirectional transformers via masked language modeling, setting a new state of the art on 11 NLP tasks.
Alammar, Jay (2018) The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) : Visual explanation of how contextual pre-training reshaped NLP transfer learning.
Vaswani et al. (2017) Attention Is All You Need : Introduces the Transformer architecture, replacing recurrence with self-attention.
Uszkoreit, Jakob (2017) Transformer: A Novel Neural Network Architecture for Language Understanding : Google blog post explaining the intuition behind the Transformer.
Alammar, Jay (2018) The Illustrated Transformer : Step-by-step visual breakdown of self-attention and the Transformer encoder/decoder.
Adaloglou, Nikolas (2020) How Transformers work in deep learning and NLP: an intuitive introduction : Intuitive introduction to attention, positional encodings, and Transformer mechanics.
Liu et al. (2019) RoBERTa: A Robustly Optimized BERT Pretraining Approach : Shows BERT was undertrained and yields stronger results with longer training, more data, and no NSP.
Wolf et al. (2020) HuggingFace's Transformers: State-of-the-art Natural Language Processing : Presents the open-source library that standardized access to pre-trained transformer models.
Sun et al. (2019) How to Fine-Tune BERT for Text Classification? : Empirical study of fine-tuning strategies, learning rates, and layer-wise schedules for BERT.
Brown et al. (2020) Language Models are Few-Shot Learners : GPT-3: shows that scaling autoregressive LMs to 175B parameters enables strong few-shot in-context learning.
Gao et al. (2021) Making Pre-trained Language Models Better Few-shot Learners : LM-BFF: improves few-shot fine-tuning via prompt-based learning and automatic demonstration selection.
Gao, Tianyu (2021) Prompting: Better Ways of Using Language Models for NLP Tasks : Survey-style article explaining prompt-based methods and their relationship to fine-tuning.
Schick & Schütze (2021) Generating Datasets with Pretrained Language Models : Uses generative LMs to synthesize labeled training data for sentence-level tasks without human annotation.
Schick & Schütze (2021) Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference : PET: reformulates classification as cloze tasks to leverage MLM knowledge in few-shot settings.
Bender et al. (2021) On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜 : Critiques the risks of ever-larger LMs: environmental cost, bias amplification, and illusion of understanding.
Kirk et al. (2021) Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models : Audits GPT-2 for occupational stereotypes across intersectional demographic groups.
Schick et al. (2021) Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP : Shows pre-trained LMs can identify and reduce their own biased outputs at decoding time.
Le Scao et al. (2022) BLOOM: A 176B-Parameter Open-Access Multilingual Language Model : Releases an open multilingual LLM, built through a large research collaboration and trained on text in 46 natural languages.
Suau et al. (2022) Self-conditioning Pre-Trained Language Models : Identifies expert neurons inside LMs and uses them to control generation without fine-tuning.
Agüera y Arcas (2022) Do Large Language Models Understand Us? : Argues that LaMDA-style models exhibit forms of understanding that challenge naive Chinese-room critiques.
Touvron et al. (2023) LLaMA: Open and Efficient Foundation Language Models : Trains competitive 7B–65B foundation LLMs using only public data with strong inference efficiency.
Manakul et al. (2023) SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models : Detects LLM hallucinations by sampling multiple responses and measuring consistency, no external resources needed.
Al-Kaswan & Izadi (2023) The (Ab)use of Open Source Code to Train Large Language Models : Discusses copyright, licensing, and ethical issues of training LLMs on open-source code repositories.
Luccioni et al. (2024) Power Hungry Processing: Watts Driving the Cost of AI Deployment? : Measures inference-time energy and emissions across tasks, finding generation is far costlier than discriminative tasks.
Yao et al. (2023) ReAct: Synergizing Reasoning and Acting in Language Models : Interleaves chain-of-thought reasoning with tool-use actions, improving factuality and task success.
Huyen, Chip (2025) AI Engineering : Practical guide to building applications on top of foundation models, covering evaluation, deployment, and feedback loops.
Warner et al. (2024) Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Fine Tuning and Inference : ModernBERT: a refreshed encoder with long-context support, faster inference, and stronger downstream performance.
Chen et al. (2024) What is the Role of Small Models in the LLM Era: A Survey : Surveys when small models complement or replace LLMs, covering distillation, ensembling, and routing.
Weng, Lilian (2024) Extrinsic Hallucinations in LLMs : Survey blog post on hallucination types, causes, evaluation metrics, and mitigation strategies.
Mitchell, Melanie (2025) LLMs and World Models : Examines whether LLMs build genuine world models or rely on shallow heuristics.
Vafa et al. (2024) Evaluating the World Model Implicit in a Generative Model (https://dl.acm.org/doi/abs/10.5555/3737916.3738762) : Proposes new metrics showing generative models can perform well while harboring incoherent implicit world models.
Feng et al. (2024) Were RNNs All We Needed? : Revisits minimal LSTMs/GRUs and shows simplified, parallelizable variants rival modern architectures.
An et al. (2025) Measuring Gender and Racial Biases in Large Language Models: Intersectional Evidence from Automated Resume Evaluation : Audits LLM-based resume screening and finds intersectional gender × race disparities in hiring recommendations.
Haim et al. (2025) What's in a Name? Auditing Large Language Models for Race and Gender Bias : Uses name perturbations to surface racial and gender bias in LLM advice across high-stakes scenarios.
Bai et al. (2025) Explicitly Unbiased Large Language Models Still Form Biased Associations : Shows LLMs that pass explicit bias tests still encode stereotypical associations measurable via implicit-association probes.
Hartzog (2026) How AI Destroys Institutions : Argues AI systems erode the procedural and trust foundations that make institutions work.