PhD Defense: Robustness, Detectability, and Data Privacy in AI

Talk
Vinu Sankar Sadasivan
Time: 05.20.2025, 13:00 to 15:00

Recent rapid advancements in Artificial Intelligence (AI) have led to its widespread adoption in applications ranging from autonomous systems to multimodal content generation. However, these models are vulnerable to a range of security and safety problems. These pitfalls can allow attackers to jailbreak AI systems, coercing them into performing harmful tasks or leaking sensitive information. As AI models are increasingly deployed in sensitive applications, such as autonomous robots and healthcare devices, the importance of AI safety is growing. To address these issues in today’s AI systems, it is critical to study their vulnerabilities. In this thesis, we explore fundamental aspects of AI safety and security, including robustness, detectability, and data privacy. By analyzing the challenges adversaries pose, we aim to contribute toward safer and more secure AI systems.
First, we examine the robustness of discriminative models, identifying blind spots in computer vision models. We discover the existence of high-confidence star-like paths in the input space, along which models consistently predict the same label despite the addition of significant noise; a small probing sketch illustrating this idea follows below. We also develop provable robustness certificates for streaming classification models. Next, we analyze the challenges in detecting AI-generated content across text and image modalities. Through empirical and theoretical studies, we find that reliably detecting AI-generated content is difficult, especially as generative models improve over time. We demonstrate attacks on a wide range of text and image detectors, showing that adversaries can efficiently break these systems to induce both type-I and type-II errors.
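As an illustration of the kind of probing that can expose such high-confidence paths, the sketch below walks a straight line from an input toward a heavily noised copy and records whether the predicted label and its confidence persist. This is a generic illustration under assumed placeholders (a toy classifier, a random input, and a random noise direction), not the exact procedure from the thesis.

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Placeholder classifier; any trained image classifier could be substituted.
model = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(3 * 32 * 32, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 10),
).eval()

x = torch.rand(1, 3, 32, 32)   # placeholder "clean" input
noise = torch.randn_like(x)    # direction of added noise

with torch.no_grad():
    base_label = model(x).argmax(dim=1)
    for step in range(11):
        alpha = step / 10.0
        probe = x + alpha * noise                    # point along the linear path
        probs = F.softmax(model(probe), dim=1)
        conf, label = probs.max(dim=1)
        print(f"alpha={alpha:.1f}  confidence={conf.item():.3f}  "
              f"same_label={bool(label == base_label)}")

With a trained model, a path along which same_label stays True and the confidence stays high even for large alpha is the kind of blind spot described above.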
We then investigate the robustness of generative Large Language Models (LLMs), focusing on adversarial attacks and hallucinations. Our fast, beam search-based adversarial attacks can compromise LLMs in under a GPU minute; a hedged sketch of this style of attack follows below. We also demonstrate that multimodal LLMs expose a larger attack surface and are more vulnerable to adversarial attacks in the real world. These attacks can jailbreak LLMs, induce hallucinations, and mount privacy attacks. To mitigate these risks, we propose a suite of methods for detecting hallucinations in both white-box and black-box settings by analyzing the internal dynamics and output probabilities of LLMs. Lastly, we address user privacy concerns by proposing a method to create unlearnable datasets through data poisoning. This method, which is model-agnostic and computationally efficient, ensures that datasets used without user consent cannot be effectively learned by AI systems.
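To make the flavor of a beam search-based attack concrete, below is a minimal, hypothetical sketch that appends a short suffix to a prompt, using beam search to raise the model's probability of a chosen target continuation. It illustrates the general idea only and is not the thesis's algorithm; the Hugging Face model ("gpt2"), the prompt, the target string, and the beam parameters are all placeholder assumptions, and the benign target here stands in for whatever behavior an attacker would actually score.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")              # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Write a short story about"                     # placeholder prompt
target = " the end"                                      # placeholder target continuation
prompt_ids = tok(prompt, return_tensors="pt").input_ids[0]
target_ids = tok(target, return_tensors="pt").input_ids[0]

def target_logprob(prefix_ids):
    """Log-probability of the target continuation given a token prefix."""
    ids = torch.cat([prefix_ids, target_ids]).unsqueeze(0)
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits[0], dim=-1)
    start = prefix_ids.shape[0]
    return sum(logprobs[start + i - 1, t].item() for i, t in enumerate(target_ids))

beam_width, suffix_len, k = 3, 4, 20
beam = [(prompt_ids, target_logprob(prompt_ids))]

for _ in range(suffix_len):
    expansions = []
    for prefix_ids, _ in beam:
        # Expand each beam entry with the model's top-k next-token candidates,
        # which keeps the appended suffix relatively fluent.
        with torch.no_grad():
            next_logits = model(prefix_ids.unsqueeze(0)).logits[0, -1]
        for tid in torch.topk(next_logits, k).indices:
            new_ids = torch.cat([prefix_ids, tid.view(1)])
            expansions.append((new_ids, target_logprob(new_ids)))
    # Keep only the highest-scoring prefixes for the next round.
    beam = sorted(expansions, key=lambda e: e[1], reverse=True)[:beam_width]

best_ids, best_score = beam[0]
print(tok.decode(best_ids), f"\n[target log-prob: {best_score:.2f}]")

Restricting candidates to tokens the model itself rates as likely tends to keep the suffix readable, which is part of what makes beam-style searches fast and hard to filter; whether this matches the thesis's exact design choices is not specified in the abstract.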
In summary, this thesis analyzes a variety of attack frameworks to understand the robustness and security vulnerabilities of AI models. We hope the insights provided by this work will be useful in designing safer, more secure AI systems.