Exploring Audio AI with CS Ph.D. Student Sreyan Ghosh

Ghosh shares insights on his latest research and the potential of multimodal reasoning.

When Sreyan Ghosh began his journey in computer science, he found himself drawn to a topic with limited academic visibility in the United States—speech processing. After completing his undergraduate studies in India and working with several research labs there, Ghosh sought a program that would provide him with the flexibility to pursue emerging areas in artificial intelligence. That search led him to the University of Maryland’s Department of Computer Science, where he now conducts research under Distinguished University Professor Dinesh Manocha.

Ghosh’s latest project, Audio Flamingo 2, builds on prior developments in speech and audio processing, advancing audio-language modeling by focusing on expert reasoning and long-form audio. His work, which intersects generative modeling and multimodal AI, has garnered attention at conferences such as NVIDIA’s GPU Technology Conference (GTC).

In this Q&A, Ghosh reflects on his decision to pursue a Ph.D. at UMD, the evolution of his research interests and the technical and practical implications of Audio Flamingo 2.

Q: What initially drew you to pursue a Ph.D. in computer science at UMD?

A: After graduation, I began researching speech processing. In the U.S., there aren’t many academic labs focusing on that area. It’s fairly niche. While I was in India, I worked with several labs, and when I expressed interest in pursuing graduate studies in the U.S., they recommended the University of Maryland (UMD). Here, we have faculty like Professors Ramani Duraiswami and Dinesh Manocha, both of whom have significant experience in audio processing. What made UMD stand out to me was that Professor Manocha also offered flexibility in choosing research topics. Some other programs were more narrowly focused.

Q: How has your time at UMD shaped the researcher you’ve become?

A: It’s been a great environment. One important thing for any researcher, especially in AI, is the ability to work on topics that evolve quickly. What’s considered relevant can shift dramatically in six months. UMD gave me the space to explore those changes. Professor Manocha supported me even when I switched directions. That freedom, combined with strong collaboration opportunities here, has helped me develop both technically and as part of a research team.

Q: What areas have you focused on throughout your Ph.D., and how did you become interested in audio reasoning specifically?

A: Before starting my Ph.D., I primarily worked on speech processing in low-resource environments, aiming to enhance systems for languages with limited data. Once I joined UMD, I started exploring whether generative models could be used to create synthetic data to support those low-resource cases. Eventually, those two areas—synthetic data and speech processing—merged into what is now known as audio-language modeling. My main question became: Can we utilize the knowledge captured by language models, trained on abundant data, to enhance audio modeling, which has limited data? That led to my interest in audio reasoning, which goes beyond simple tasks like transcription and tackles more complex audio-based questions.

Q: For those unfamiliar, what is your latest project, Audio Flamingo 2, and what makes it stand out in the field of audio-language modeling?

A: Audio Flamingo 2 is a large audio-language model with around 3 billion parameters. It processes sounds and music, and it can answer complex questions about them, beyond simple captioning. What makes it unique is its focus on reasoning. For example, you can play a piece of music and ask the model to identify the country from which the music style originated. It also supports long audio inputs—up to five minutes, whereas most models cap out at 30 seconds. We developed custom datasets and new encoders and used a cross-attention architecture to make this possible.
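To make the interaction pattern concrete, here is a minimal, hypothetical sketch of how a query to an audio-language model of this kind might look. The model object, its generate() method and the 16 kHz sampling rate are assumptions for illustration, not the released Audio Flamingo 2 API.

```python
# Hypothetical usage sketch of an audio question-answering model of this kind.
# The model object and its generate() method are placeholders, not the real
# Audio Flamingo 2 interface; the 16 kHz sampling rate is an assumption.
import librosa


def ask_about_audio(model, audio_path: str, question: str) -> str:
    # Load up to five minutes of audio (the long-audio limit mentioned above).
    waveform, sr = librosa.load(audio_path, sr=16000, duration=300)
    # The model takes a raw waveform plus a free-form question and returns text.
    return model.generate(audio=waveform, prompt=question)


# A reasoning-style question, rather than a simple caption request:
# answer = ask_about_audio(model, "folk_piece.wav",
#                          "Which country does this style of music come from?")
```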

Q: Why do you think this area of AI research is important?

A: Audio is a natural signal—like vision. Unlike text, which is a human-constructed means of communication, sounds and speech occur naturally. Most audio research focuses on speech, but I believe we should expand our focus beyond that. If we’re building systems that interact with the real world, they need to understand all kinds of sounds—music, background noise and so on. That’s where audio-language modeling becomes important.

Q: How did your team achieve the model’s level of reasoning and efficiency?

A: One of the key design decisions was using a cross-attention architecture. Earlier models often employed prefix-based methods, but cross-attention, as used in vision-language models like Flamingo, proved more effective for us in the audio domain. We also created a large dataset comprising over 5 million audio-based question-answer pairs, specifically focused on reasoning tasks. We experimented extensively with training schedules, architectures and input types—burning a significant number of GPU hours in the process. For long audio, we created custom data from scratch using online sources and structured it for complex reasoning.
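As a rough illustration of the design difference Ghosh describes, here is a minimal PyTorch sketch, with hypothetical module names and dimensions, contrasting cross-attention conditioning (text tokens attending to audio-encoder features) with a prefix-based approach that simply prepends audio features to the text sequence. This is a sketch of the general technique, not the Audio Flamingo 2 implementation.

```python
# Minimal, illustrative comparison of prefix-based vs. cross-attention conditioning.
# All module names, shapes and dimensions are hypothetical.
import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    """Text hidden states attend to audio-encoder features via cross-attention."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_hidden: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        # Queries come from the language model; keys/values come from the audio encoder.
        attended, _ = self.attn(query=text_hidden, key=audio_feats, value=audio_feats)
        return self.norm(text_hidden + attended)  # residual connection


def prefix_condition(text_hidden: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
    # Prefix-based conditioning: prepend audio features to the text sequence and
    # let the language model's self-attention mix them (longer sequence, no
    # dedicated audio pathway).
    return torch.cat([audio_feats, text_hidden], dim=1)


if __name__ == "__main__":
    text = torch.randn(1, 32, 512)    # 32 text tokens
    audio = torch.randn(1, 300, 512)  # 300 audio frames from a long clip
    block = CrossAttentionBlock()
    print(block(text, audio).shape)            # torch.Size([1, 32, 512])
    print(prefix_condition(text, audio).shape)  # torch.Size([1, 332, 512])
```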

Q: What real-world applications do you envision for this technology?

A: One important application is generating metadata or captions for unlabeled audio. To train generative models, such as those for text-to-music generation, you need labeled data. Our model can help label large volumes of audio, making it easier to train those systems. Companies like Adobe have used our earlier models for tasks such as audio quality estimation and identifying background noise for enhancement purposes. Other use cases include music recommendation, audio analysis and accessibility tools that require audio understanding.

Q: How did it feel to see the project demoed at GTC, one of the largest AI conferences in the world?

A: It was good to see the interest in the project. We demonstrated both parts of the system—analysis and generation. We showed how Audio Flamingo 2 can be used to caption unlabeled sounds and then train a generative model on those captions. We also had expert musicians at the booth who provided valuable feedback, which was greatly appreciated.

Q: What are you hoping to explore next in your research?

A: We’re already working on the next version of Audio Flamingo. First, we want to incorporate speech understanding, which wasn’t included in the earlier versions. We’ve seen that speech modeling can actually improve music and sound modeling. We also aim to expand the model’s capabilities to handle even longer audio, ranging from 20 to 30 minutes, and support multiple audio inputs. For example, musicians may want to compare multiple audio samples to identify rhythm or beat patterns. We’re also exploring whether the model itself can generate music or speech—essentially becoming a kind of “omni-model” for audio.

—Story by Samuel Malede Zewdu, CS Communications 

The Department welcomes comments, suggestions and corrections.  Send email to editor [-at-] cs [dot] umd [dot] edu.