CMSC 848O, Spring 2025, UMD
Schedule
Reload this page to make sure you're seeing the latest version.
Week 1 (1/27-29): introduction, neural language models
- Course introduction // [slides]
- No associated readings!
- Language model basics // [slides]
- [reading] Jurafsky & Martin, 3.1-3.5 (language modeling)
- [reading] Jurafsky & Martin, 7 (neural language models)
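To make the language-modeling readings concrete, here is a minimal, unofficial sketch (not course-provided code) of a maximum-likelihood bigram language model in the spirit of Jurafsky & Martin chapter 3; the toy corpus and function names are illustrative only.

```python
from collections import defaultdict, Counter
import math

# toy corpus with sentence-boundary markers, as in Jurafsky & Martin ch. 3
corpus = [["<s>", "the", "cat", "sat", "</s>"],
          ["<s>", "the", "dog", "sat", "</s>"]]

# maximum-likelihood bigram counts
bigram_counts = defaultdict(Counter)
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        bigram_counts[prev][cur] += 1

def bigram_prob(prev, cur):
    """P(cur | prev) by maximum likelihood (no smoothing)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][cur] / total

def sentence_logprob(sent):
    """Chain rule: log P(w_1..w_n) = sum_i log P(w_i | w_{i-1})."""
    return sum(math.log(bigram_prob(p, c)) for p, c in zip(sent, sent[1:]))

print(sentence_logprob(["<s>", "the", "cat", "sat", "</s>"]))  # log(1 * 0.5 * 1 * 1)
```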
Week 2 (2/4-6): Attention, Transformers, scaling
- Recurrent neural networks and attention // [notes]
- [reading] Vaswani et al., NeurIPS 2017 (paper that introduced Transformers)
- [optional reading] An easy-to-read blog post on attention
- [optional reading] Training Compute-Optimal Large Language Models (Hoffmann et al., 2022)
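Below is a small, self-contained sketch of the scaled dot-product attention defined in the Vaswani et al. (2017) reading, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. The NumPy implementation and shapes are illustrative, not taken from any course materials.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Vaswani et al., 2017)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # (n_q, n_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # row-wise softmax
    return weights @ V                                       # (n_q, d_v)

# toy usage: 4 query positions attend over 6 key/value positions
rng = np.random.default_rng(0)
out = scaled_dot_product_attention(rng.normal(size=(4, 8)),
                                   rng.normal(size=(6, 8)),
                                   rng.normal(size=(6, 8)))
print(out.shape)  # (4, 8)
```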
Week 3 (2/11-13): LLM post-training, usage, and evaluation
- Transformer language models // [notes]
- LLM pretraining and post-training // [slides]
- [reading] Instruction tuning (Wei et al., 2022, FLAN)
- [reading] Reinforcement learning from human feedback (Ouyang et al., 2022, RLHF)
Week 4 (2/18-20): LLM scaling and usage
- LLM usage and applications
- [reading] Lilian Weng's blogpost on prompt engineering
- [optional reading] Judging LLM-as-a-Judge (MT-Bench, Zheng et al., NeurIPS 2023)
- Scaling laws of LLMs // [slides] // [notes]
- [reading] Scaling Laws for Neural Language Models (Kaplan et al., 2020)
- [reading] Training Compute-Optimal Large Language Models (Hoffmann et al., 2022)
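As a rough illustration of the compute-optimal trade-off discussed in the scaling-law readings above, the sketch below uses the common approximations C ≈ 6·N·D training FLOPs and D ≈ 20·N tokens per parameter. Both are rules of thumb derived from Hoffmann et al. (2022), not the papers' fitted laws.

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Rough compute-optimal split under C ~= 6*N*D and D ~= 20*N
    (approximations to Hoffmann et al., 2022, not the paper's exact fits)."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# a Chinchilla-scale budget of ~5.76e23 FLOPs gives roughly 70B params / 1.4T tokens
print(chinchilla_optimal(5.76e23))
```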
Week 5 (2/25-27): Tokenization and position encoding
- Tokenization // [slides]
- [reading] Neural Machine Translation... with Subword Units (Sennrich et al., ACL 2016)
- [reading] ByT5: Towards a token-free future... (Xue et al., 2021)
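A compact, illustrative re-implementation of the byte-pair-encoding merge loop described in Sennrich et al. (2016): repeatedly count adjacent symbol pairs and merge the most frequent one. The toy vocabulary and helper names are made up for the example.

```python
from collections import Counter

def merge_word(word, pair):
    # merge every adjacent occurrence of `pair` inside one space-separated word
    syms, out, i = word.split(), [], 0
    while i < len(syms):
        if i + 1 < len(syms) and (syms[i], syms[i + 1]) == pair:
            out.append(syms[i] + syms[i + 1]); i += 2
        else:
            out.append(syms[i]); i += 1
    return " ".join(out)

def learn_bpe(vocab, num_merges):
    # vocab maps space-separated symbol sequences to frequencies, e.g. {"l o w </w>": 5}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            syms = word.split()
            for p in zip(syms, syms[1:]):
                pairs[p] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                    # most frequent adjacent pair
        vocab = {merge_word(w, best): f for w, f in vocab.items()}
        merges.append(best)
    return merges

print(learn_bpe({"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6}, 3))
```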
- Position embeddings // [notes] // [slides]
- [reading] Rotary position embeddings (RoPE, Su et al., 2021)
- [reading] NoPE (no position embeddings, Kazemnejad et al., 2023)
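A short sketch of rotary position embeddings (Su et al., 2021): each (even, odd) pair of feature dimensions is rotated by a position-dependent angle, so query-key dot products depend only on relative offsets. Function names and shapes here are illustrative.

```python
import numpy as np

def apply_rope(x, base=10000.0):
    """Rotate each (even, odd) dimension pair of x (seq_len, d), d even."""
    seq_len, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)          # theta_i = base^(-2i/d)
    ang = np.outer(np.arange(seq_len), inv_freq)          # (seq_len, d/2) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# after RoPE, the dot product of a query and key depends only on their relative offset
q, k = apply_rope(np.ones((16, 8))), apply_rope(np.ones((16, 8)))
print(np.round(q[3] @ k[1], 4) == np.round(q[10] @ k[8], 4))  # same offset of 2 -> True
```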
Week 6 (3/4-6): LLM analysis & attention variants
- Analysis (3/4)
- [reading] Lost in the Middle: How Language Models Use Long Contexts (Liu et al., TACL 2023, analysis)
- [reading] Massive Activations in Large Language Models (Sun et al., COLM 2024, analysis)
- Attention variants (3/6)
- [reading] Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention (Munkhdalai et al., arXiv 2024, attention variant)
- [reading] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (Ainslie et al., EMNLP 2023, attention variant)
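An illustrative (deliberately unoptimized) sketch of grouped-query attention in the sense of Ainslie et al. (2023): groups of query heads share a single key/value head, shrinking the KV cache. Head counts and shapes are made up for the example.

```python
import numpy as np

def grouped_query_attention(Q, K, V, n_query_heads, n_kv_heads):
    """Q: (n_query_heads, seq, d); K, V: (n_kv_heads, seq, d). Each group of
    query heads attends with one shared key/value head."""
    group = n_query_heads // n_kv_heads
    outs = []
    for h in range(n_query_heads):
        kv = h // group                          # shared KV head for this query head
        scores = Q[h] @ K[kv].T / np.sqrt(Q.shape[-1])
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        outs.append(w @ V[kv])
    return np.stack(outs)                        # (n_query_heads, seq, d)

rng = np.random.default_rng(0)
out = grouped_query_attention(rng.normal(size=(8, 5, 16)),
                              rng.normal(size=(2, 5, 16)),
                              rng.normal(size=(2, 5, 16)), 8, 2)
print(out.shape)  # (8, 5, 16)
```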
Week 7 (3/11-13): Data & efficient implementations
- Data (3/11)
- [reading] Data Engineering for Scaling Language Models to 128K Context (Fu et al., ICML 2024, data)
- [reading] What is Wrong with Perplexity for Long-context Language Modeling? (Fang et al., ICLR 2025, data)
- Efficient implementations (3/13)
- [reading] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Dao et al., NeurIPS 2022, efficient implementation)
- [reading] Ring Attention with Blockwise Transformers for Near-Infinite Context (Liu et al., ICLR 2024, efficient implementation)
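A sketch of the running ("online") softmax that underlies FlashAttention: attention for one query is computed block-by-block over keys/values while rescaling partial sums, so the full score matrix is never materialized. This shows the algorithmic idea only, not the fused, IO-aware kernel from the paper.

```python
import numpy as np

def streaming_attention(q, K, V, block=64):
    """Softmax attention for a single query q over K/V, processed in blocks."""
    m, l = -np.inf, 0.0                      # running max and running normalizer
    acc = np.zeros(V.shape[1])
    for s in range(0, K.shape[0], block):
        scores = K[s:s + block] @ q / np.sqrt(q.shape[0])
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new)            # rescale previously accumulated sums
        p = np.exp(scores - m_new)
        acc = acc * scale + p @ V[s:s + block]
        l = l * scale + p.sum()
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=16), rng.normal(size=(200, 16)), rng.normal(size=(200, 16))
# reference: full softmax attention for the same query
s = K @ q / np.sqrt(q.shape[0])
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
print(np.allclose(streaming_attention(q, K, V), ref))  # True
```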
Week 8 (3/18-20): spring break (no class)
Week 9 (3/25-27): Efficient inference & evaluation
- Efficient inference (3/25)
- [reading] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models (Zhang et al., NeurIPS 2023, efficient inference)
- [reading] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding (Sun et al., COLM 2024, efficient inference)
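A toy version of the H2O cache-eviction idea (Zhang et al., 2023): keep a recent window plus the "heavy hitter" tokens with the largest accumulated attention mass, and drop the rest of the KV cache. The budget split and function below are illustrative, not the paper's implementation.

```python
import numpy as np

def h2o_keep_indices(attn_weights, budget, recent=8):
    """attn_weights: (num_queries, num_cached) attention from decoding so far.
    Return the cached positions to keep: a recent window plus heavy hitters."""
    num_cached = attn_weights.shape[1]
    scores = attn_weights.sum(axis=0)                            # accumulated attention per token
    keep = set(range(max(0, num_cached - recent), num_cached))   # always keep recent tokens
    for idx in np.argsort(-scores):                              # then heaviest hitters up to budget
        if len(keep) >= budget:
            break
        keep.add(int(idx))
    return sorted(keep)

rng = np.random.default_rng(0)
w = rng.random((4, 32)); w /= w.sum(axis=1, keepdims=True)
print(h2o_keep_indices(w, budget=12))
```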
- Evaluation (3/27)
- [reading] RULER: What's the Real Context Size of Your Long-Context Language Models? (Hsieh et al., COLM 2024, evaluation)
- [reading] Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries (Vodrahalli et al., arXiv 2024, evaluation)
Week 10 (4/1-3): More evaluation & training
- Evaluation (4/1)
- [reading] Retrieval Augmented Generation or Long-Context LLMs? (Li et al., EMNLP 2024, evaluation)
- [reading] One Thousand and One Pairs: A "novel" challenge for long-context language models (Karpinska et al., EMNLP 2024, evaluation)
- Training (4/3)
- [reading] How to Train Long-Context Language Models (Effectively) (Gao et al., arXiv 2024, training)
- [reading] Qwen2.5-1M Technical Report (Qwen Team et al., arXiv 2025, training)
Week 11 (4/8-10): Exam review & exam
Week 12 (4/15-17): Novel architectures
- Novel architecture (4/15)
- [reading] Compressive Transformers for Long-Range Sequence Modelling (Rae et al., ICLR 2020, novel architecture)
- [reading] Improving language models by retrieving from trillions of tokens (Borgeaud et al., ICML 2022, novel architecture)
- Novel architecture (4/17)
- [reading] Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Gu et al., COLM 2024, novel architecture)
- [reading] Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models (De et al., arXiv 2024, novel architecture)
Week 13 (4/22-24): Test-time scaling & applications
- Test-time scaling (4/22)
- [reading] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek-AI Team et al., arXiv 2025, test-time scaling)
- [reading] s1: Simple test-time scaling (Muennighoff et al., arXiv 2025, test-time scaling)
- Applications (4/24)
- [reading] Agents' Room: Narrative Generation through Multi-step Collaboration (Huot et al., ICLR 2025, applications)
- [reading] Commit0: Library Generation from Scratch (Zhao et al., ICLR 2025, applications)
Week 14 (4/29-5/1): Flex days for new papers
Week 15 (5/6-8): Multilingual & multimodal
- Multilingual (5/6)
- [reading] A Benchmark for Learning to Translate a New Language from One Grammar Book (Tanzer et al., ICLR 2024, multilingual)
- [reading] Can LLMs Really Learn to Translate a Low-Resource Language from One Grammar Book? (Aycock et al., ICLR 2025, multilingual)
- Multimodal (5/8)
- [reading] LongVILA: Scaling Long-Context Visual Language Models for Long Videos (Chen et al., arXiv 2024, multimodal)
- [reading] Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution (Liu et al., ICLR 2025, multimodal)