UMD Team to Lead Six-Week Workshop on Advancing Audio-Based AI Models
Ramani Duraiswami, a professor of computer science at the University of Maryland, has been selected to lead one of three research topics at a workshop organized by Johns Hopkins University’s Center for Language and Speech Processing (CLSP).
The event, part of a series that has been held annually for more than 30 years, will take place in Brno, Czechia, from June 23 to August 1. It will be co-hosted by the Brno University of Technology and Phonexia, a speech recognition software company.
Selected through a competitive bidding process, Duraiswami’s project focuses on bridging the gap between the potential and current capabilities of Large Audio-Language Models (LALMs), which process speech, sound and music inputs alongside language.
His team’s prior work in the field includes building their own LALM and developing the first-ever open-source benchmark tailored for multimodal audio understanding. Using the benchmark, they tested LALMs from companies like Google and OpenAI and revealed significant limitations, with even state-of-the-art models achieving only 53% accuracy on complex audio reasoning tasks.
This deficiency stems from the fact that research on audio-based AI has lagged behind research on modalities such as language and vision, primarily because of a lack of large training datasets and of benchmarks for assessing advanced audio processing capabilities.
This workshop provides an opportunity to address these challenges by fostering collaboration between students and researchers from various disciplines. Participants from several universities and industries in the U.S., Europe and Asia will spend six weeks working together to produce improved learning architectures, training methodologies, and benchmarks.
“Our ultimate goal is to develop a model that can analyze nuanced details in audio,” says Duraiswami, who has an appointment in the University of Maryland Institute for Advanced Computer Studies (UMIACS). “For example, if it listens to a meeting, it should detect emotion in the voices, understand who’s taking turns, and distinguish who’s aggressive.”