PhD Proposal: Towards Trustworthy AI: Model Interpretability, Failure Mitigation, Alignment, and Verification

Talk
Atoosa Malemir Chegini
Time: 04.18.2025, 15:00 to 17:00

As AI systems are increasingly deployed in real-world settings, ensuring their trustworthiness, interpretability, and reliability remains a critical challenge. Models often exhibit systematic failures, lack transparency, misalign with human intent, and struggle with verifiable reasoning, limiting their robustness in practice. This proposal introduces methods to address these issues by improving failure detection, interpretability, alignment, and structured verification.
We first tackle model failures: deep learning models often rely on spurious correlations, leading to incorrect generalization. To mitigate this, we develop a CLIP-based framework that detects failure patterns and generates synthetic data to improve robustness. Next, we investigate model inversion techniques that reveal how CLIP models encode biases and hidden associations, improving interpretability. To enhance alignment, we introduce SALSA, a reinforcement learning framework that uses weight-space-averaged models to improve exploration in RLHF, yielding better generalization and closer alignment with human values. Finally, we present RePanda, a structured fact-verification framework that translates natural-language claims into executable pandas queries, enabling transparent and verifiable reasoning over tabular data.
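As a rough illustration of the weight-space averaging idea behind SALSA, the sketch below averages the parameters of several checkpoints element-wise ("model souping"). This is a minimal, hypothetical sketch, not the paper's implementation: real RLHF code operates on framework tensors, and the function and parameter names here are invented for the example.

```python
# Hypothetical sketch: element-wise averaging of model checkpoints.
# Plain lists of floats stand in for parameter tensors; a real system
# would average framework tensors from saved state dicts.

def average_weights(checkpoints):
    """Element-wise mean of same-shaped parameter dicts."""
    averaged = {}
    for name in checkpoints[0]:
        values = [ckpt[name] for ckpt in checkpoints]
        # zip(*values) iterates over corresponding entries across checkpoints
        averaged[name] = [sum(column) / len(column) for column in zip(*values)]
    return averaged

# Two toy "checkpoints" with a single parameter vector each.
ckpt_a = {"layer.weight": [1.0, 2.0, 3.0]}
ckpt_b = {"layer.weight": [3.0, 4.0, 5.0]}
soup = average_weights([ckpt_a, ckpt_b])
```

The averaged model can then serve as a stronger starting point or reference policy during RLHF, which is where the claimed exploration benefit comes from.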
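To make the RePanda idea concrete: the verdict for a claim comes from executing a generated query against the table, not from a model's free-form answer, so the query itself is inspectable evidence. The sketch below is a hypothetical stand-in; pandas is replaced by a list of dicts to keep it dependency-free, and the claim, table, and translation are invented for illustration.

```python
# Hypothetical illustration of structured fact verification:
# a natural-language claim is mapped to an executable query, and the
# truth value is obtained by running that query over the table.

table = [
    {"country": "France", "capital": "Paris", "population_m": 68},
    {"country": "Japan", "capital": "Tokyo", "population_m": 124},
]

# Claim: "Japan's population is over 100 million."
# A RePanda-style translator would emit pandas code such as:
#   df.loc[df["country"] == "Japan", "population_m"].iloc[0] > 100
# Equivalent query over the stand-in table:
def query(rows):
    japan = next(r for r in rows if r["country"] == "Japan")
    return japan["population_m"] > 100

verdict = query(table)  # the executed query doubles as the verification trace
```

Because the query is explicit code, a reviewer can re-run it or audit it line by line, which is the transparency property the abstract highlights.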