PhD Proposal: Understanding and Modeling Explicit and Implicit Representations of the Visual World
IRB IRB-4105
While a standard megapixel image might be worth a thousand words, it is at the same time worth more than a million pixels, and videos are worth many millions more. Understanding the local and global structures of these pixels, and their meaning, has been a core problem since the birth of computer vision as a field. Storing these massive amounts of pixels is another issues -- in the social media age, and with the democratization of access to high quality cameras, hundreds of millions of terabytes of data are created every day. In my research, I aim to tackle both of these problems by designing good deep learning representations for tasks ranging from image classification, to text-conditioned generation, to even video compression.In this talk, I first discuss my work around unsupervised and multimodal image understanding. I describe a benchmark where I compare learned representations in terms of both their downstream task performance as well as by comparing the embeddings themselves. I create a pipeline for generating synthetic text data to help perform better benchmarking and training of multimodal models for long video understanding.Second, I investigate diffusion models as a sort of unified representation learner. I explore the capacity of pre-trained diffusion networks for recognition tasks and present a lightweight, learnable feedback mechanism to improve the performance. I propose to adapt that feedback mechanism for fast, higher-quality image generation.Finally, I discuss an alternative paradigm for image understanding -- implicit neural representation. I provide an overview of this area, including my works for video compression. I also present a framework for better understanding what these models learn. I propose to build a system for real-time, high quality video compression by adapting hypernetworks, which predict model weights from video inputs, to predict compact, high-fidelity implicit representations.