CMSC 848O, Spring 2025, UMD
Schedule
Reload this page to make sure you're seeing the latest version.
Week 1 (1/27-29): introduction, neural language models
- Course introduction // [slides]
- No associated readings!
- Language model basics // [slides]
- [reading] Jurafsky & Martin, 3.1-3.5 (language modeling)
- [reading] Jurafsky & Martin, 7 (neural language models)
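To make the language-modeling readings concrete, here is a minimal, unofficial sketch (not course-provided code) of a maximum-likelihood bigram language model in the spirit of Jurafsky & Martin chapter 3; the toy corpus and function names are illustrative only.

```python
from collections import defaultdict, Counter
import math

# toy corpus with sentence-boundary markers, as in Jurafsky & Martin ch. 3
corpus = [["<s>", "the", "cat", "sat", "</s>"],
          ["<s>", "the", "dog", "sat", "</s>"]]

# maximum-likelihood bigram counts
bigram_counts = defaultdict(Counter)
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        bigram_counts[prev][cur] += 1

def bigram_prob(prev, cur):
    """P(cur | prev) by maximum likelihood (no smoothing)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][cur] / total

def sentence_logprob(sent):
    """Chain rule: log P(w_1..w_n) = sum_i log P(w_i | w_{i-1})."""
    return sum(math.log(bigram_prob(p, c)) for p, c in zip(sent, sent[1:]))

print(sentence_logprob(["<s>", "the", "cat", "sat", "</s>"]))  # log(1 * 0.5 * 1 * 1)
```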
Week 2 (2/4-6): Attention, Transformers, scaling
- Recurrent neural networks and attention // [notes]
- [reading] Vaswani et al., NeurIPS 2017 (paper that introduced Transformers)
- [optional reading] An easy-to-read blog post on attention
- [optional reading] Training Compute-Optimal Large Language Models (Hoffmann et al., 2022)
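Below is a small, self-contained sketch of the scaled dot-product attention defined in the Vaswani et al. (2017) reading, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. The NumPy implementation and shapes are illustrative, not taken from any course materials.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Vaswani et al., 2017)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # (n_q, n_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # row-wise softmax
    return weights @ V                                       # (n_q, d_v)

# toy usage: 4 query positions attend over 6 key/value positions
rng = np.random.default_rng(0)
out = scaled_dot_product_attention(rng.normal(size=(4, 8)),
                                   rng.normal(size=(6, 8)),
                                   rng.normal(size=(6, 8)))
print(out.shape)  # (4, 8)
```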
Week 3 (2/11-13): LLM post-training, usage, and evaluation
- Transformer language models // [notes]
- LLM pretraining and post-training // [slides]
- [reading] Instruction tuning (Wei et al., 2022, FLAN)
- [reading] Reinforcement learning from human feedback (Ouyang et al., 2022, RLHF)
Week 4 (2/18-20): LLM scaling and usage
- LLM usage and applications
- [reading] Lilian Weng's blogpost on prompt engineering
- [optional reading] Judging LLM-as-a-Judge (MT-Bench, Zheng et al., NeurIPS 2023)
- Scaling laws of LLMs // [slides] // [notes]
- [reading] Scaling Laws for Neural Language Models (Kaplan et al., 2020)
- [reading] Training Compute-Optimal Large Language Models (Hoffmann et al., 2022)
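As a rough illustration of the compute-optimal trade-off discussed in the scaling-law readings above, the sketch below uses the common approximations C ≈ 6·N·D training FLOPs and D ≈ 20·N tokens per parameter. Both are rules of thumb derived from Hoffmann et al. (2022), not the papers' fitted laws.

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Rough compute-optimal split under C ~= 6*N*D and D ~= 20*N
    (approximations to Hoffmann et al., 2022, not the paper's exact fits)."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# a Chinchilla-scale budget of ~5.76e23 FLOPs gives roughly 70B params / 1.4T tokens
print(chinchilla_optimal(5.76e23))
```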
Week 5 (2/25-27): Tokenization and position encoding
- Tokenization // [slides]
- [reading] Neural Machine Translation... with Subword Units (Sennrich et al., ACL 2016)
- [reading] ByT5: Towards a token-free future... (Xue et al., 2021)
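A compact, illustrative re-implementation of the byte-pair-encoding merge loop described in Sennrich et al. (2016): repeatedly count adjacent symbol pairs and merge the most frequent one. The toy vocabulary and helper names are made up for the example.

```python
from collections import Counter

def merge_word(word, pair):
    # merge every adjacent occurrence of `pair` inside one space-separated word
    syms, out, i = word.split(), [], 0
    while i < len(syms):
        if i + 1 < len(syms) and (syms[i], syms[i + 1]) == pair:
            out.append(syms[i] + syms[i + 1]); i += 2
        else:
            out.append(syms[i]); i += 1
    return " ".join(out)

def learn_bpe(vocab, num_merges):
    # vocab maps space-separated symbol sequences to frequencies, e.g. {"l o w </w>": 5}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            syms = word.split()
            for p in zip(syms, syms[1:]):
                pairs[p] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                    # most frequent adjacent pair
        vocab = {merge_word(w, best): f for w, f in vocab.items()}
        merges.append(best)
    return merges

print(learn_bpe({"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6}, 3))
```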
- Position embeddings // [notes] // [slides]
- [reading] Rotary position embeddings (RoPE, Su et al., 2021)
- [reading] NoPE (no position embeddings, Kazemnejad et al., 2023)
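A short sketch of rotary position embeddings (Su et al., 2021): each (even, odd) pair of feature dimensions is rotated by a position-dependent angle, so query-key dot products depend only on relative offsets. Function names and shapes here are illustrative.

```python
import numpy as np

def apply_rope(x, base=10000.0):
    """Rotate each (even, odd) dimension pair of x (seq_len, d), d even."""
    seq_len, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)          # theta_i = base^(-2i/d)
    ang = np.outer(np.arange(seq_len), inv_freq)          # (seq_len, d/2) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# after RoPE, the dot product of a query and key depends only on their relative offset
q, k = apply_rope(np.ones((16, 8))), apply_rope(np.ones((16, 8)))
print(np.round(q[3] @ k[1], 4) == np.round(q[10] @ k[8], 4))  # same offset of 2 -> True
```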
Week 6 (3/4-6): LLM analysis & attention variants
- Analysis (3/4)
- [reading] Lost in the Middle: How Language Models Use Long Contexts (Liu et al., TACL 2023, analysis)
- [reading] Massive Activations in Large Language Models (Sun et al., COLM 2024, analysis)
- Attention variants (3/6)
- [reading] Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention (Munkhdalai et al., arXiv 2024, attention variant)
- [reading] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (Ainslie et al., EMNLP 2023, attention variant)
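An illustrative (deliberately unoptimized) sketch of grouped-query attention in the sense of Ainslie et al. (2023): groups of query heads share a single key/value head, shrinking the KV cache. Head counts and shapes are made up for the example.

```python
import numpy as np

def grouped_query_attention(Q, K, V, n_query_heads, n_kv_heads):
    """Q: (n_query_heads, seq, d); K, V: (n_kv_heads, seq, d). Each group of
    query heads attends with one shared key/value head."""
    group = n_query_heads // n_kv_heads
    outs = []
    for h in range(n_query_heads):
        kv = h // group                          # shared KV head for this query head
        scores = Q[h] @ K[kv].T / np.sqrt(Q.shape[-1])
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        outs.append(w @ V[kv])
    return np.stack(outs)                        # (n_query_heads, seq, d)

rng = np.random.default_rng(0)
out = grouped_query_attention(rng.normal(size=(8, 5, 16)),
                              rng.normal(size=(2, 5, 16)),
                              rng.normal(size=(2, 5, 16)), 8, 2)
print(out.shape)  # (8, 5, 16)
```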
Week 7 (3/11-13): Data & efficient implementations
- Data (3/11)
- [reading] Data Engineering for Scaling Language Models to 128K Context (Fu et al., ICML 2024, data)
- [reading] What is Wrong with Perplexity for Long-context Language Modeling? (Fang et al., ICLR 2025, data)
- Efficient implementations (3/13)
- [reading] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Dao et al., NeurIPS 2022, efficient implementation)
- [reading] Ring Attention with Blockwise Transformers for Near-Infinite Context (Liu et al., ICLR 2024, efficient implementation)
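A sketch of the running ("online") softmax that underlies FlashAttention: attention for one query is computed block-by-block over keys/values while rescaling partial sums, so the full score matrix is never materialized. This shows the algorithmic idea only, not the fused, IO-aware kernel from the paper.

```python
import numpy as np

def streaming_attention(q, K, V, block=64):
    """Softmax attention for a single query q over K/V, processed in blocks."""
    m, l = -np.inf, 0.0                      # running max and running normalizer
    acc = np.zeros(V.shape[1])
    for s in range(0, K.shape[0], block):
        scores = K[s:s + block] @ q / np.sqrt(q.shape[0])
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new)            # rescale previously accumulated sums
        p = np.exp(scores - m_new)
        acc = acc * scale + p @ V[s:s + block]
        l = l * scale + p.sum()
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=16), rng.normal(size=(200, 16)), rng.normal(size=(200, 16))
# reference: full softmax attention for the same query
s = K @ q / np.sqrt(q.shape[0])
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
print(np.allclose(streaming_attention(q, K, V), ref))  # True
```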
Week 8 (3/18-20): spring break (no class)
Week 9 (3/25-27): Efficient inference & evaluation
- Efficient inference (3/25)
- [reading] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models (Zhang et al., NeurIPS 2023, efficient inference)
- [reading] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding (Sun et al., COLM 2024, efficient inference)
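A toy version of the H2O cache-eviction idea (Zhang et al., 2023): keep a recent window plus the "heavy hitter" tokens with the largest accumulated attention mass, and drop the rest of the KV cache. The budget split and function below are illustrative, not the paper's implementation.

```python
import numpy as np

def h2o_keep_indices(attn_weights, budget, recent=8):
    """attn_weights: (num_queries, num_cached) attention from decoding so far.
    Return the cached positions to keep: a recent window plus heavy hitters."""
    num_cached = attn_weights.shape[1]
    scores = attn_weights.sum(axis=0)                            # accumulated attention per token
    keep = set(range(max(0, num_cached - recent), num_cached))   # always keep recent tokens
    for idx in np.argsort(-scores):                              # then heaviest hitters up to budget
        if len(keep) >= budget:
            break
        keep.add(int(idx))
    return sorted(keep)

rng = np.random.default_rng(0)
w = rng.random((4, 32)); w /= w.sum(axis=1, keepdims=True)
print(h2o_keep_indices(w, budget=12))
```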
- Evaluation (3/27)
- [reading] RULER: What's the Real Context Size of Your Long-Context Language Models? (Hsieh et al., COLM 2024, evaluation)
- [reading] Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries (Vodrahalli et al., arXiv 2024, evaluation)
Week 10 (4/1-3): More evaluation & training
- Evaluation (4/1)
- [reading] Retrieval Augmented Generation or Long-Context LLMs? (Li et al., EMNLP 2024, evaluation)
- [reading] One Thousand and One Pairs: A "novel" challenge for long-context language models (Karpinska et al., EMNLP 2024, evaluation)
- Training (4/3)
- [reading] How to Train Long-Context Language Models (Effectively) (Gao et al., arXiv 2024, training)
- [reading] Qwen2.5-1M Technical Report (Qwen Team et al., arXiv 2025, training)
Week 11 (4/8-10): Exam review & exam
Week 12 (4/15-17): Novel architectures
- Novel architecture (4/15)
- [reading] Compressive Transformers for Long-Range Sequence Modelling (Rae et al., ICLR 2020, novel architecture)
- [reading] Improving language models by retrieving from trillions of tokens (Borgeaud et al., ICML 2022, novel architecture)
- Novel architecture (4/17)
- [reading] Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Gu et al., COLM 2024, novel architecture)
- [reading] Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models (De et al., arXiv 2024, novel architecture)
Week 13 (4/22-24): Test-time scaling & applications
- Test-time scaling (4/22)
- [reading] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek-AI Team et al., arXiv 2025, test-time scaling)
- [reading] s1: Simple test-time scaling (Muennighoff et al., arXiv 2025, test-time scaling)
- Applications (4/24)
- [reading] Agents' Room: Narrative Generation through Multi-step Collaboration (Huot et al., ICLR 2025, applications)
- [reading] Commit0: Library Generation from Scratch (Zhao et al., ICLR 2025, applications)
Week 14 (4/29-5/1): Flex days for new papers
Week 15 (5/6-8): Multilingual & multimodal
- Multilingual (5/6)
- [reading] A Benchmark for Learning to Translate a New Language from One Grammar Book (Tanzer et al., ICLR 2024, multilingual)
- [reading] Can LLMs Really Learn to Translate a Low-Resource Language from One Grammar Book? (Aycock et al., ICLR 2025, multilingual)
- Multimodal (5/8)
- [reading] LongVILA: Scaling Long-Context Visual Language Models for Long Videos (Chen et al., arXiv 2024, multimodal)
- [reading] Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution (Liu et al., ICLR 2025, multimodal)