PhD Proposal: A Principled Approach to AI Alignment: Theoretical Foundations and Practical Algorithms

Talk
Souradip Chakraborty
Time: 
04.18.2025 15:00 to 16:30
Location: 
IRB-4109

Abstract:
As AI agents are increasingly deployed in high-stakes applications, principled and robust alignment with human preferences becomes essential. This proposal advances the theoretical foundations of AI alignment by addressing three core challenges: (1) distribution shift in online alignment, (2) equitable alignment under preference diversity, and (3) efficient personalization at inference time.

Bilevel RLHF: First, we address a critical issue in online RLHF: the failure to capture the entanglement between reward learning and policy optimization, which leads to distribution shift and suboptimal alignment. We propose an efficient bilevel optimization framework that models this interdependence and ensures stable alignment with provable guarantees and improved empirical performance.

MaxMin RLHF: Second, we address the challenge of pluralistic AI alignment by deriving an impossibility result for single-utility RLHF, showing its limitations in representing diverse human preferences. As an equitable solution, we propose MaxMin RLHF, inspired by the egalitarian principle in social choice theory, to better represent diverse human preferences.

Transfer Q*: Finally, we address efficient inference-time alignment for real-time personalization without costly fine-tuning. We propose Transfer Q*, a principled controlled-decoding algorithm that uses aligned base models to estimate optimal value functions for new rewards, yielding provably efficient, high-quality alignment with effective personalization on real-world tasks.

Together, these contributions provide a principled foundation for building safe, scalable, and fair alignment systems.
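To make the first contribution concrete, the display below is a schematic bilevel formulation of online RLHF consistent with the abstract's description; the notation, the loss \(\ell\) (e.g., a Bradley-Terry logistic loss), and the KL coefficient \(\beta\) are illustrative assumptions, not taken from the proposal. The upper level fits the reward \(r_\theta\) on preferences collected from the policy \(\pi^{*}_{\theta}\), which is itself the solution of the lower-level KL-regularized policy optimization against \(r_\theta\):

\[
\min_{\theta}\; \mathbb{E}_{(x,\,y_w \succ y_l)\,\sim\, \mathcal{D}(\pi^{*}_{\theta})}\Big[\ell\big(r_{\theta}(x,y_w)-r_{\theta}(x,y_l)\big)\Big]
\quad\text{s.t.}\quad
\pi^{*}_{\theta} \;=\; \arg\max_{\pi}\; \mathbb{E}_{x,\;y\sim\pi(\cdot\mid x)}\big[r_{\theta}(x,y)\big]\;-\;\beta\,\mathrm{KL}\big(\pi\,\|\,\pi_{\mathrm{ref}}\big).
\]

The dependence of the sampling distribution \(\mathcal{D}(\pi^{*}_{\theta})\) on the lower-level solution is exactly the entanglement the abstract refers to; treating the two levels independently is what induces distribution shift.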
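The egalitarian objective behind MaxMin RLHF can likewise be sketched as a max-min over per-group rewards; here \(r_1,\dots,r_N\) denote reward models for \(N\) user groups, and \(\beta\), \(\pi_{\mathrm{ref}}\) are the usual KL-regularization quantities (again illustrative notation rather than the proposal's exact objective):

\[
\pi^{*} \;=\; \arg\max_{\pi}\;\Big[\, \min_{i \in [N]}\; \mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi(\cdot\mid x)}\big[r_{i}(x,y)\big] \;-\; \beta\,\mathrm{KL}\big(\pi\,\|\,\pi_{\mathrm{ref}}\big) \,\Big].
\]

Maximizing the expected reward of the worst-off group is the egalitarian criterion from social choice theory that the abstract cites, in contrast to a single aggregate utility.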
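Finally, a minimal sketch of what a Transfer Q*-style controlled-decoding step could look like, assuming Hugging Face-style causal language models. Everything here is illustrative: `reward_fn` (the new target reward), `alpha`, `k`, the single-rollout value estimate, and all function names are assumptions rather than the proposal's algorithm. The idea shown is using an already-aligned baseline model to estimate action values for a new reward and re-weighting the reference model's next-token distribution with them.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def estimate_q(baseline_model, tokenizer, prefix_ids, candidate_ids,
               reward_fn, max_new_tokens=64):
    """Estimate Q(s, z) for each candidate next token z by rolling out the
    aligned baseline model from (prefix, z) and scoring the completion with
    the target reward. One rollout per candidate keeps the sketch simple;
    averaging several rollouts would reduce variance."""
    q_values = []
    for z in candidate_ids:
        ids = torch.cat([prefix_ids, z.view(1, 1)], dim=-1)
        rollout = baseline_model.generate(ids, max_new_tokens=max_new_tokens,
                                          do_sample=True)
        q_values.append(reward_fn(tokenizer.decode(rollout[0])))
    return torch.tensor(q_values)

@torch.no_grad()
def transfer_q_step(ref_model, baseline_model, tokenizer, prefix_ids,
                    reward_fn, k=10, alpha=1.0):
    """One decoding step: re-weight the reference model's top-k tokens by
    their estimated values, i.e. pick the argmax of
    log pi_ref(z|s) + alpha * Q_hat(s, z)."""
    logits = ref_model(prefix_ids).logits[0, -1]   # next-token logits
    top = torch.topk(logits, k)                    # restrict to top-k tokens
    q_hat = estimate_q(baseline_model, tokenizer, prefix_ids,
                       top.indices, reward_fn)
    scores = F.log_softmax(top.values, dim=-1) + alpha * q_hat
    next_id = top.indices[torch.argmax(scores)]
    return torch.cat([prefix_ids, next_id.view(1, 1)], dim=-1)
```

Repeating `transfer_q_step` until an end-of-sequence token yields a response steered toward `reward_fn` with no fine-tuning; the quality of the value estimate hinges on the baseline model already being aligned to a related reward.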