Machine Learning at Scale


In the last few years machine learning has grown from the use of large datasets to solve specific tasks (e.g., training models to perform object classification using one million images in ImageNet) to the use of much larger datasets to train general-purpose representations that can be used to solve a broad range of tasks.  For example, foundation image models are trained using billions of image-text pairs and used to solve a wide range of image understanding problems.  Most dramatically, Large Language Models are trained using trillions of tokens and then customized to address a wide range of applications.  In this course we will study the techniques that have enabled machine learning systems to make use of such large training sets and the results produced by these systems.  For example, using very large amounts of data has shifted some of the focus of training from supervised to self-supervised learning.  Automatic data curation and cleaning have also become more pressing problems.  Much of this data has been collected from society at large, and so we will also discuss some of the societal implications of the collection and use of this data by large companies.




Students will work throughout the semester on a project.  This may take several forms. 

1)    Students may produce novel research.  In this case the final project should resemble a research paper suitable for submission to a workshop or conference, including motivation for the problem, discussion of relevant prior work, methods developed, and experimental results.

2)    Students may analyze the performance of an existing system or published research work.  This could include, for example, an analysis of bias in foundation models or an attempt to understand better what information is captured by image or video embeddings produced by foundation models.  This analysis should probably include new experimental results, though theoretical analysis is also welcome.

3)    Students may produce a research proposal.  This could take the form of an NSF-style proposal.  We can distribute sample NSF proposals to interested students.  The proposal should have a compelling overall vision, take account of all relevant past work to explain what is innovative, and provide a detailed description of proposed work.  The best proposals will often contain persuasive initial results.

Students may work on these projects in groups of up to three students.  The more students who are working on a common project, the more comprehensive it will be expected to be.


Due Dates:

Students should turn in a short (less than 5 pages) proposal for their intended project by Wed. October 16.  This is meant to help me provide feedback on proposed projects.

The final paper is due on the last day of classes, December 9.


In addition, students will be required to turn in one-page critiques of ten of the papers assigned as reading for the class.  Each critique should contain two paragraphs.  The first paragraph should summarize the paper.  The second paragraph should provide some critical insight into the paper and/or interesting discussion questions for the paper.  These critiques are due prior to the class in which the paper will be discussed; late critiques will not be accepted.  Please hand these in as printed papers prior to the start of class.  Students will be expected to be prepared to discuss their critiques in class.  Do not turn in more than one critique per class.  Do not use LLMs to generate the critiques.


Office Hours


Prof. Jacobs will have office hours on Wednesdays, 10-11, in Iribe 4240.  In addition, students should feel free to schedule meetings with Prof. Jacobs at other times.


Class Schedule


This schedule is tentative.  We welcome suggestions from the class for additional topics or papers to discuss.  We expect that things may change quite a bit.





Assigned Reading

Additional resources

1.  8/26




2. 8/28

Review of some ML concepts and history

If you aren’t completely familiar with machine learning, Hal’s book provides a good undergraduate introduction.  You should probably read it.


The Deep Learning book provides a comprehensive discussion of deep learning.  Chapter 5 provides an introduction to ML, which will serve as a reference for this lecture.  Chapters 6-9 will introduce concepts in Deep Learning that we’ll use in class. provides a short, very clear introduction to neural networks.



3. 9/4

Markov Chains and Language sequence prediction

Prediction and Entropy of Printed English, by Shannon, Bell System Technical Journal.


Efficient Estimation of Word Representations in Vector Space, by Mikolov et al., Arxiv, 2013.

The Mixture Distribution Transition Model for High Order Markov Chains and non-Gaussian Time Series by Berchtold and Raftery

4. 9/9

Transformers and word embeddings

Attention Is All You Need, by Vaswani et al., Neurips 2017.


BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, by Devlin et al., Proc. of NAACL-HLT, 2019.


Formal Algorithms for Transformers by Phuong and Hutter, Arxiv.  I highly recommend this for a clear and precise description of Transformers.

RoFormer: Enhanced transformer with Rotary Position Embedding, Su et al., Arxiv, 2023.

Electra: Pre-training text encoders as discriminators rather than generators, Clark et al., ICLR 2020.


RoBERTa: A Robustly Optimized BERT Pretraining Approach, Liu et al., Arxiv, 2019.

Bert slides

5. 9/11

LLM pretraining

Language Models are Few-Shot Learners, by Brown et al., Arxiv, 2020.

LLaMA: Open and Efficient Foundation Language Models, by Touvron et al., Arxiv 2023.


6. 9/16

Vision Transformers

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, by Dosovitskiy et al., ICLR 2021.  

Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows, by Liu et al., ICCV 2021.

Masked Autoencoders Are Scalable Vision Learners, by He et al., CVPR 2022.


Scaling Vision Transformers to 22 Billion Parameters, by Dehghani et al., Proceedings of Machine Learning Research.


7. 9/18

Self-supervised learning - Vision

A Simple Framework for Contrastive Learning of Visual Representations, by Chen, et al., ICML 2020.


Emerging Properties in Self-Supervised Vision Transformers, by Caron, et al., ICCV 2021.


Vision Transformers Need Registers, Darcet et al., ICLR 2024.

SimCLR slides

8. 9/23

Vision-Language Models

Learning Transferable Visual Models From Natural Language Supervision by Radford et al., ICML 2021.


CoCa: Contrastive Captioners are Image-Text Foundation Models, by Yu et al., TMLR, 2022.

Demystifying CLIP data, Xu et al., Arxiv 2024.

9. 9/25

Reinforcement learning

Reinforcement Learning: An Introduction, by Sutton and Barto.  Reading the first six chapters will give you a good intro.

Proximal Policy Optimization Algorithms, by Schulman et al., Arxiv, 2017.

OpenAI discussion of PPO

10. 9/30

Video Generation

Reducing Activation Recomputation in Large Transformer Models, by Korthikanti et al., Arxiv, 2022.

Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack, by Dai et al., Arxiv, 2023.


11. 10/2

LLM fine-tuning


Training language models to follow instructions with human feedback, by Ouyang et al., Arxiv 2022.

LoRA: Low-Rank Adaptation of Large Language Models, by Hu et al., ICLR, 2022.


Alpaca: A Strong, Replicable Instruction-Following Model, by Taori et al., 2023.

Learning to Summarize from Human Feedback, by Stiennon et al., Neurips 2020.  Gives more detail on how InstructGPT works.


Towards Understanding Sycophancy in Language Models, by Sharma et al., ICLR 2024.


Llama 2: Open Foundation and Fine-Tuned Chat Models by Touvron et al., Arxiv 2023.

SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions by Wang et al., ACL 2023.

The Curious Case of Neural Text Degeneration, by Holtzman et al., ICLR 2020.


12. 10/7


AI models collapse when trained on recursively generated data, by Shumailov et al., Nature, 2024.


13. 10/9

Class Cancelled




14. 10/14

Multi-modal LLMs

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, by Li et al., ICML 2023.


Flamingo: a Visual Language Model for Few-Shot Learning by Alayrac et al., Neurips 2022.

Visual Instruction Tuning, Liu et al., Neurips 2023.


The Llama 3 Herd of Models, Meta, 2024.


The Claude 3 Model Family: Opus, Sonnet, Haiku, Anthropic.


Gemini: A Family of Highly Capable Multimodal Models, DeepMind


GPT-4 Technical Report, OpenAI

15. 10/16

Understanding fine-tuning and in-context learning?


Bias, …

LIMA: Less Is More for Alignment, by Zhou et al., Neurips 2023.

Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?, by Min et al., EMNLP, 2022.

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?, by Bender et al., Proceedings of the 2021 ACM conference on fairness, accountability, and transparency. 

Women also Snowboard, by Hendricks et al., ECCV 2018


16. 10/21

Privacy and Trust in Foundation models

Open Sesame! Universal Black-box Jailbreaking of Large Language Models, by Lapid et al., ICLR Workshop on Secure and Trustworthy LLMs, 2024.

Scalable Extraction of Training Data from (Production) Language Models by Nasr, et al., Arxiv, 2023.


DECODINGTRUST: A Comprehensive Assessment of Trustworthiness in GPT Models, by Wang


Badllama 3: removing safety finetuning from Llama 3 in minutes, by Volkov, Arxiv, 2024.

To Fix Fake News, Look To Yellow Journalism


Foundational Challenges in Assuring Alignment and Safety of Large Language Models, by Anwar et al., Arxiv, 2024.



17. 10/23

Efficiency in Training

FLASHATTENTION: Fast and Memory-Efficient Exact Attention with IO-Awareness, by Dao, et al., Neurips, 2022.



LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, by Dettmers, et al., Neurips, 2022.


QLoRA: Efficient Finetuning of Quantized LLMs, by Dettmers, et al., Neurips 2023.

LLM.int8() blog

18. 10/28

Efficiency cont’d

Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer by Yang et al., Arxiv, 2022.

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, by Shazeer et al., ICLR, 2017.


19. 10/30


Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, by Shoeybi et al., Arxiv, 2020.


Ring Attention with Blockwise Transformers for Near-Infinite Context, by Liu et al., ICLR, 2024.


ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, by Rajbhandari et al., SC20, 2020.


Scaling Distributed Machine Learning with the Parameter Server, by Li et al., 11th USENIX Symposium on Operating Systems Design and Implementation, 2014.

Large Scale Distributed Deep Networks by Dean et al., Neurips 2012.

20. 11/4

Climate Foundation Models

Aurora: A Foundation Model of the Atmosphere by Bodnar et al., Arxiv 2024.

ClimaX: A foundation model for weather and climate, Nguyen et al., Arxiv 2023.

GenCast: Diffusion-based ensemble forecasting for medium-range weather, Price et al., Arxiv 2024.

GraphCast: Learning skillful medium-range global weather forecasting, Lam et al., Arxiv, 2022.


21. 11/6

Near duplicate Detection

Mining Massive Data Sets Chapter 3, by Ullman.


On the resemblance and containment of documents, by Broder, Proceedings. Compression and Complexity of SEQUENCES 1997.


A Self-Supervised Descriptor for Image Copy Detection, by Pizzi et al., CVPR, 2022.


22. 11/11

Data cleaning

DataComp-LM: In search of the next generation of training sets for language models, by Li et al., Arxiv, 2024.

Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training, by Radenovic et al., CVPR, 2023.

DINOv2: Learning Robust Visual Features without Supervision, by Oquab et al., TMLR, 2024.

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale, by Penedo et al., Arxiv, 2024.


23. 11/13

Data ownership, issues of restricted data

Surveillance Capitalism, by Zuboff, Project Syndicate, 2020.


Why Technology Favors Tyranny, by Harari, The Atlantic, 2018.


AI Art and its Impact on Artists, by Jiang et al., AIES 2023.


24. 11/18

Generative models overview

Image Generation Diffusion

Denoising Diffusion Probabilistic Models, by Ho et al., Neurips, 2020.


High-Resolution Image Synthesis with Latent Diffusion Models, by Rombach et al., CVPR 2022.


Scalable Diffusion Models with Transformers, by Peebles and Xie, ICCV 2023.

Tutorials: Arash, Yang


Trolls Used Her Face to Make Fake Porn.  There Was Nothing She Could Do, by Kraft, 2024, New York Times Magazine.


25. 11/


Generation auto-regressive

MUSE: Text to Image Generation via Masked Generative Transformers, by Chang et al., PMLR, 2023.

MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis, by Li et al., CVPR 2023.

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation, by Sun et al., Arxiv, 2024.

Autoregressive Image Generation without Vector Quantization, by Li et al., Arxiv, 2024.


26. 11/25

Video generation

Sora, by OpenAI


VideoPoet: A Large Language Model for Zero-Shot Video Generation, by Kondratyuk et al., ICML 2024.


Genie: Generative Interactive Environments, by Bruce et al., 2024.

More Sora


Video Poet web page, with videos

27. 12/2

Scaling Laws

Scaling Laws for Neural Language Models, by Kaplan, et al., Arxiv, 2020.


Training Compute-Optimal Large Language Models, by Hoffmann et al., Neurips, 2022.


28. 12/4

Chain of thought, Emergent Properties

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, by Wei et al., Neurips 2022.


Are Emergent Abilities of Large Language Models a Mirage?, by Schaeffer et al., Neurips 2023.


29. 12/9





Other Possible Topics

Distillation and/or model pruning

Other applications (autonomous driving, …)

Existential risks of AI



Other Papers of Interest

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning by Dai et al., Neurips 2023.


What learning algorithm is in-context learning?, by Akyurek et al., ICLR 2023.

The Platonic Representation Hypothesis, by Huh et al., Arxiv, 2024.


Sparks of Artificial General Intelligence: Early experiments with GPT-4, by Bubeck et al., Arxiv, 2023.