PhD Proposal: Building Safe and Trustworthy Generative Models
The rapid advancement of Generative Artificial Intelligence (GenAI) in recent years has created disruptions that affect everyday life. As these systems grow more powerful, the need to develop safe and trustworthy AI systems has never been greater. Building such systems, however, requires a multifaceted approach. In this proposal, we discuss three facets of this challenge: 1) building new evaluation frameworks to understand the bounds of what these powerful systems can and cannot do, 2) aligning these models with human-centric values, such as the ability to respond to queries in a manner agreeable to humans, and 3) analyzing the harmful behaviors of language models through adversarial attacks.
We will first outline procedures to expand current methods of evaluation. Specifically, we propose extending labeled evaluations, which face challenges such as limited size and data contamination, to unlabeled data in a ``self-supervised'' manner. We propose a framework, self-sensitivity evaluation, inspired by self-supervised training, which analyzes the sensitivity or invariance of language models to transformations applied to input text. Next, we will discuss making the most of small, non-ideal supervised fine-tuning datasets. We propose a regularizer called Noisy Embedding Finetuning, or NEFTune, that significantly improves the quality of the model's responses to instructions. Finally, we will explore discrete optimizers for adversarial attacks that can also be used to optimize text prompts for image generation and other language modeling tasks. We call this method the PEZ optimizer. We will demonstrate how this discrete optimizer can be used to generate copyright-protected images with multimodal models.
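To make the self-sensitivity idea concrete, below is a minimal sketch, assuming a Hugging Face causal language model and a toy word-order transformation (both are illustrative choices, not the proposal's exact metric or transformation set). It scores sensitivity as the drop in average token log-likelihood after the transformation is applied.

\begin{verbatim}
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative self-sensitivity check: measure how much the model's
# likelihood changes under a simple text transformation.
model_name = "gpt2"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def avg_log_likelihood(text: str) -> float:
    """Mean per-token log-likelihood of the text under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy
    return -loss.item()

def shuffle_words(text: str, seed: int = 0) -> str:
    """A toy transformation: permute the word order."""
    gen = torch.Generator().manual_seed(seed)
    words = text.split()
    perm = torch.randperm(len(words), generator=gen).tolist()
    return " ".join(words[i] for i in perm)

passage = "The quick brown fox jumps over the lazy dog."
sensitivity = avg_log_likelihood(passage) - avg_log_likelihood(shuffle_words(passage))
print(f"Sensitivity to word-order shuffling: {sensitivity:.3f}")
\end{verbatim}

For a meaning-preserving transformation (e.g., a paraphrase), the same gap would instead be read as a measure of invariance.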
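Similarly, the regularizer behind NEFTune can be sketched as adding scaled uniform noise to the input token embeddings during fine-tuning only. In the sketch below, the function name, the noise scale alpha, and its default value are assumptions for illustration rather than the proposal's tuned settings.

\begin{verbatim}
import torch

def noisy_embeddings(embeds: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Add scaled uniform noise to token embeddings during fine-tuning.

    Noise is drawn from Uniform(-1, 1) and scaled by alpha / sqrt(L * d),
    where L is the sequence length and d is the embedding dimension.
    The shape of embeds is assumed to be (batch, L, d).
    """
    _, seq_len, dim = embeds.shape
    scale = alpha / (seq_len * dim) ** 0.5
    noise = torch.empty_like(embeds).uniform_(-1.0, 1.0)
    return embeds + scale * noise
\end{verbatim}

In a training loop, this would perturb the embedded instruction-tuning batch before the forward pass; evaluation and generation would use the clean embeddings.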