Alan Liu Receives NSF CAREER Award to Improve Monitoring of AI and Cloud Systems
Alan Liu, an assistant professor of computer science at the University of Maryland, has received a National Science Foundation (NSF) Faculty Early Career Development (CAREER) Program award to advance research on telemetry and observability for large-scale AI and cloud infrastructure.
Liu, who also has appointments in the University of Maryland Institute for Advanced Computer Studies (UMIACS), Maryland Cybersecurity Center (MC2) and the Artificial Intelligence Interdisciplinary Institute at Maryland (AIM), is the principal investigator on the award, which is expected to total about $700,000 over the next five years.
The NSF CAREER award supports early-career faculty who have the potential to serve as academic role models in research and education and to lead advances in their department’s or organization’s mission.
“I am deeply honored to receive this support from the National Science Foundation,” Liu said. “This award is especially meaningful because it recognizes the growing importance of building the systems foundation needed for robust and trustworthy AI. As AI infrastructure becomes increasingly central to science, education, health care, commerce and everyday life, I believe it is critical that we develop new ways to observe, understand and manage these complex systems at scale.”
Liu’s research develops telemetry and observability systems for large-scale compute infrastructure. Modern AI services depend on networks of servers, accelerators, storage systems and software components that must work together under changing workloads.
Those systems generate large amounts of operational data, including signals related to network traffic, resource use, failures, bottlenecks and attacks. As AI and cloud infrastructure grows, those signals are becoming increasingly difficult to monitor through traditional approaches.
Liu’s NSF-supported project places approximation at the center of observability. Instead of collecting and analyzing every data point, the project will develop compact and uncertainty-aware summaries that preserve the information most needed for real-time decision-making.
Those summaries could help system operators understand what is happening across large-scale infrastructure, diagnose problems before they spread and make trade-offs among accuracy, speed and cost. The project will also produce open-source software, documentation, educational materials and research artifacts for students, researchers and practitioners.
“I am excited that this project will allow my group to explore bold new ideas in telemetry and observability while also producing practical tools, open-source artifacts and educational materials for the broader community,” Liu said. “I am also grateful to my students, collaborators, mentors and colleagues at UMD, whose support and intellectual energy make this work possible.”
Before joining UMD, Liu was an assistant professor of electrical and computer engineering at Boston University and a postdoctoral researcher at Carnegie Mellon University. He earned his doctorate in computer science from Johns Hopkins University in 2019.
His work has received best paper awards at USENIX FAST and the IEEE International Conference on Cloud Engineering, as well as interdisciplinary recognitions from the ACM Symposium on Theory of Computing and the USENIX Annual Technical Conference.
Award Information:
"Approximation-First Telemetry for Hyperscale Networked Systems" is supported by NSF grant #2544434 from the NSF’s Division of Information & Intelligent Systems.
—Story by Samuel Malede Zewdu, CS Communications
