PhD Defense: Understanding and Modeling Explicit and Implicit Representations of the Visual World

Talk
Matthew Gwilliam
Time: 
03.26.2025 11:00 to 12:30
Location: 
IRB-4107

While a standard megapixel image might be worth a thousand words, it is at the same time worth more than a million pixels. Understanding the local and global structures of these pixels, and their meaning, has been a core problem since the birth of computer vision, enabling tasks like classification and image generation. Storing such massive amounts of pixels is another challenge, with hundreds of millions of terabytes of new data created every day motivating the development of powerful compression algorithms. In my research, I aim to tackle both of these problems by modeling the visual world with two main paradigms: training neural networks that model images explicitly with representation vectors, and training networks that model images implicitly in their own weights.

In this thesis, I first discuss my work on explicit unsupervised image representation and multimodal understanding. I describe a benchmark where I compare learned representations in terms of both their downstream task performance and the relationships between the embeddings themselves. I create a pipeline for generating synthetic text data to help perform better benchmarking and training of multimodal models for long video understanding.

Second, I investigate diffusion as a way to train a potential unified model for explicit image representations, one that can perform both recognition and generative tasks. I explore the capacity of pre-trained diffusion networks for recognition tasks and present a lightweight, learnable feedback mechanism to improve their performance. I take these insights back to the original generative task and adapt this feedback mechanism for fast, higher-quality image generation.

Finally, I discuss an alternative paradigm for image understanding: implicit neural representation. I provide an overview of this area, including my work on video compression. I present a framework for better understanding what these models learn. I distill best practices for their design, and develop configurations that not only optimize quality-storage trade-offs but also reduce their exceptionally long training time. To truly address the training time issue, I describe a system for real-time, high-quality video compression that adapts hypernetworks, which predict model weights from video inputs, to produce compact, high-fidelity implicit representations.
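
As a rough illustration of the implicit paradigm described above (a minimal sketch under assumed conventions, not code from the thesis), the toy PyTorch snippet below shows a coordinate-based MLP that maps (x, y, t) positions to RGB values, so a video is represented by the network's weights rather than by explicit pixels; the class and variable names here are hypothetical.

    # Minimal sketch of an implicit neural representation (illustrative only).
    import torch
    import torch.nn as nn

    class CoordinateMLP(nn.Module):
        """Maps normalized (x, y, t) coordinates to RGB values."""
        def __init__(self, in_dim=3, hidden=64, out_dim=3):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, out_dim), nn.Sigmoid(),  # RGB in [0, 1]
            )

        def forward(self, coords):
            return self.net(coords)

    # "Encoding" a video means fitting the weights to its pixels; the stored
    # bitstream is the (quantized) weights, and decoding is a forward pass.
    model = CoordinateMLP()
    coords = torch.rand(1024, 3)      # sampled (x, y, t) positions in [0, 1]
    target_rgb = torch.rand(1024, 3)  # placeholder pixel values from the video
    loss = nn.functional.mse_loss(model(coords), target_rgb)
    loss.backward()

A hypernetwork, in the sense used above, would instead take the video itself as input and predict weights like these in a single forward pass, sidestepping the long per-video optimization.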