PhD Proposal: Advancing Audio Processing in the Age of Large Language Models

Talk
Sreyan Ghosh
Time: 02.25.2025 14:00 to 15:30
Location: IRB IRB-4109

Audio understanding is crucial for effective communication and decision-making, yet it has lagged behind language and vision due to data scarcity and the complexity of audio signals. While recent advances in Large Language Models (LLMs) have improved performance on tasks such as Automatic Speech Recognition (ASR), audio captioning, and open-ended question answering, these models still struggle with expert-level audio reasoning. Our research addresses this gap by enhancing audio perception and reasoning in LLMs through synthetic data, novel neural architectures, and improved audio representations.
In this talk, I will present our work on improving LLMs’ ability to perceive and reason about audio through better audio representations and synthetic data. I will also discuss our recent advances in understanding long and complex audio, which requires capturing temporal and contextual dependencies in auditory environments. Together, these methods strengthen fundamental audio tasks and advance expert-level audio reasoning and contextual comprehension in audio-language models.