Understanding documents, from reading academic papers and financial reports to editing posters and presentations, is essential but challenging: documents often combine complex layout structures, visual elements, and text. Recently, AI agents powered by multimodal large language models (MLLMs) have shown promising results in interacting with complex document content. In this talk, I will present recent advances in developing AI agents for document understanding, focusing on two major tasks: answering questions about documents (Document VQA) and editing documents based on user requests (Document Editing). I will conclude by discussing the remaining challenges and future directions for both tasks.
Jihyung Kil is a Research Scientist at Adobe Research. He earned his Ph.D. in Computer Science from The Ohio State University, where he worked with Wei-Lun (Harry) Chao. Prior to joining Adobe, he interned at Google Research (now DeepMind) and Amazon. His research interests include Vision and Language, with a recent focus on multimodal document understanding and Web/GUI agents.