PhD Defense: Learning from Less Data: Perception and Synthesis
IRB 5105
https://umd.zoom.us/my/dkothandaraman?omn=98061911241
Machine learning techniques have transformed many fields, particularly computer vision. However, they typically require vast amounts of labeled data for training, which can be costly and impractical to collect. This dependence on data underscores the importance of research into data efficiency. We present our work on data-efficient deep learning in the contexts of visual perception and visual generation.
In the first part of the talk, we will present a glimpse of our work on data efficiency in visual perception. Specifically, we tackle semantic segmentation for autonomous driving under limited access to labeled data in both the target and related domains. We propose self-supervised learning solutions that improve segmentation performance in unstructured environments and adverse weather, ultimately extending to a more general approach that is on par with methods trained on immense amounts of labeled data, achieving up to 30% improvements over prior work. Next, we address data efficiency for autonomous aerial vehicles, focusing on video action recognition. Here, we integrate concepts from signal processing into neural networks, achieving both data and computational efficiency, and we propose differentiable learning methods for these representations, yielding 8-38% improvements over previous work.
In the second part of the talk, we will delve into data efficiency in visual generation. We will begin by discussing the efficient generation of aerial-view images, using pretrained models to create aerial perspectives of input scenes in a zero-shot manner. By incorporating techniques from classical computer vision and information theory, our work enables the generation of aerial images from complex, real-world inputs without requiring any 3D or paired data during training or testing. Our approach is on par with concurrent methods that use vast amounts of 3D data for training.
Next, we will focus on zero-shot personalized image and video generation, aiming to create content based on custom concepts. We propose methods that leverage prompting to generate images and videos at the intersection of the manifolds corresponding to these concepts and to pretrained models. Our work has applications in subject-driven action transfer and multi-concept video customization. These solutions are among the first in this area, showing significant improvements over baselines and related work. Our approaches are also data- and compute-efficient, relying solely on pretrained models without any additional training data. Finally, we introduce a fundamental prompting solution inspired by techniques from finance and economics, demonstrating how insights from different fields can effectively address similar mathematical challenges.