PhD Proposal: Towards Private and Trustworthy Generative Models

Talk
Yuxin Wen
Time: 04.09.2025, 11:00 to 12:30
Location: 

The rapid advancement of Generative AI has yielded transformative capabilities, yet it also introduces critical challenges to security and trustworthiness. As these models grow more powerful, their responsible deployment demands multifaceted solutions. In this proposal, we address three key dimensions of this challenge: 1) attributing generated content through robust, imperceptible watermarks; 2) revealing adversarial vulnerabilities via efficient discrete optimization; and 3) detecting and mitigating unintended memorization in diffusion models.
First, we introduce Tree-Ring Watermarks, a method that embeds invisible yet robust fingerprints into diffusion-generated images, enabling reliable provenance tracking without degrading image quality. The approach provides semantic watermarks that remain detectable even after perturbations, offering a tool for accountability in open-generation settings.

Next, we present Hard Prompts Made Easy (PEZ), a gradient-based discrete optimization framework that automates the discovery of adversarial prompts, exposing vulnerabilities in safety-aligned models. This work enables systematic auditing of content filters and alignment mechanisms.

Finally, we tackle memorization in diffusion models with Detecting, Explaining, and Mitigating Memorization, which localizes and mitigates data replication without access to the training data. Our methods show how to detect training-data regurgitation in generated samples and propose strategies that reduce privacy risks while preserving model utility.
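To make the Tree-Ring idea concrete, here is a minimal numpy sketch of its core mechanism: the watermark key lives in a low-frequency region of the Fourier spectrum of the initial noise, and detection compares that region against the key. The array sizes, disk radius, and score function here are illustrative assumptions, not the paper's implementation (which operates on diffusion latents and inverts the sampler before checking).

```python
import numpy as np

def disk_mask(shape, radius):
    """Boolean mask for a centered disk in the (fft-shifted) spectrum."""
    h, w = shape
    yy, xx = np.ogrid[:h, :w]
    return (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2

def embed_watermark(noise, key, radius=8):
    """Overwrite a low-frequency disk of the noise's spectrum with the
    key, then return to pixel space (toy stand-in for the latent)."""
    mask = disk_mask(noise.shape, radius)
    spec = np.fft.fftshift(np.fft.fft2(noise))
    spec[mask] = key[mask]
    return np.real(np.fft.ifft2(np.fft.ifftshift(spec)))

def watermark_score(noise, key, radius=8):
    """Mean spectral distance to the key inside the disk; a small score
    indicates the watermark is present."""
    mask = disk_mask(noise.shape, radius)
    spec = np.fft.fftshift(np.fft.fft2(noise))
    return float(np.mean(np.abs(spec[mask] - key[mask])))

rng = np.random.default_rng(0)
noise = rng.standard_normal((64, 64))
key = rng.standard_normal((64, 64)) + 1j * rng.standard_normal((64, 64))
marked = embed_watermark(noise, key)
print(watermark_score(marked, key) < watermark_score(noise, key))  # True
```

Because the key is written into the frequency domain rather than the pixels, the pattern is spread across the whole image, which is what lends the watermark its robustness to local perturbations.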
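The central trick in PEZ-style discrete prompt optimization can be shown with a toy example: optimize a continuous "soft" embedding, but always evaluate the loss at its projection onto the nearest vocabulary embedding, applying the resulting gradient back to the soft variable. The 1-D vocabulary, target embedding, learning rate, and loss below are hypothetical stand-ins for a real model's token table and objective.

```python
import numpy as np

# Hypothetical token-embedding table (4 tokens, 1-D embeddings) and a
# target embedding the optimizer should recover; purely illustrative.
vocab = np.array([[0.0], [1.0], [2.0], [3.0]])
target = np.array([2.9])

soft = np.array([0.2])   # continuous relaxation of a single prompt token
lr = 0.1
best_idx, best_loss = -1, np.inf
for _ in range(50):
    # Project the soft embedding onto its nearest vocabulary entry:
    # this projected "hard prompt" is what actually gets evaluated.
    idx = int(np.argmin(((vocab - soft) ** 2).sum(axis=1)))
    hard = vocab[idx]
    loss = float(((hard - target) ** 2).sum())
    if loss < best_loss:
        best_loss, best_idx = loss, idx
    # The gradient is computed at the projected point but applied to
    # the soft variable (a straight-through-style update).
    soft = soft - lr * 2.0 * (hard - target)
print(best_idx)  # token 3, whose embedding is closest to the target
```

Keeping the best hard prompt seen so far matters: the projected iterate can oscillate between neighboring tokens near a Voronoi boundary even after a good discrete solution has been found.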
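For the memorization work, one detection signal discussed in this line of research is the magnitude of the text-conditional noise-prediction difference: prompts that trigger replication tend to produce unusually strong text-conditioned guidance. The sketch below assumes precomputed conditional and unconditional noise predictions (random arrays here, standing in for a real diffusion model's outputs) and only illustrates the scoring step.

```python
import numpy as np

def memorization_score(eps_cond, eps_uncond):
    """Per-prompt detection metric: magnitude of the difference between
    text-conditional and unconditional noise predictions, averaged over
    samples/timesteps (first axis)."""
    return float(np.mean(np.linalg.norm(eps_cond - eps_uncond, axis=-1)))

rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((10, 16))                  # unconditional
eps_ordinary = eps_uncond + 0.1 * rng.standard_normal((10, 16))
eps_memorized = eps_uncond + 3.0 * rng.standard_normal((10, 16))

# A prompt behaving like a memorized one scores far higher.
print(memorization_score(eps_ordinary, eps_uncond) <
      memorization_score(eps_memorized, eps_uncond))  # True
```

A score of this shape is attractive in practice because it needs no access to the training data: everything is computed from the model's own predictions at generation time.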