PhD Defense: Visual Content Synthesis at Scale
IRB-5105 or https://umd.zoom.us/j/9316628340
Humans love to create visual content. Every day, we take photos with smartphones, edit videos using intuitive apps, and create artworks through increasingly accessible digital tools. These widespread practices have led to an explosion of visual data shared continuously on the internet, building massive collections of images and videos that capture diverse human experiences. This enormous accumulation of visual data, together with rapid advancements in GPU computing, has become the foundation for training large-scale generative models, the key to automatically synthesizing top-tier visual content. By learning directly from these rich online visual repositories, such models internalize intricate patterns, styles, and concepts, enabling them to re-compose these elements into novel samples based on user inputs.

In this thesis, we study and design scalable generative models that digest and improve with visual data, propose evaluation metrics that precisely monitor progress, and develop applications built on these pre-trained models. The thesis begins by designing frameworks for scalable video generation models, covering both autoregressive models trained on discrete tokens produced by a discrete tokenizer and diffusion models trained directly on pixels. In addition, we develop a novel video tokenization scheme that yields more compact video representations for larger generative models to train on. Next, we perform a careful analysis of the mainstream automatic evaluation metric. In the last chapter, we study several practical scenarios for applying pre-trained large-scale generative models, with tasks that extend beyond generation and beyond the original image and video domains.