Understanding documents, from reading academic papers and financial reports to editing posters and presentations, is essential but challenging: documents often combine complex layout structures, visual elements, and text. Recently, AI agents powered by multimodal large language models (MLLMs) have shown promising results in interacting with complex document content. In this talk, I will present recent advances in developing AI agents for document understanding, focusing on two major tasks: answering questions about documents (Document VQA) and editing documents based on user requests (Document Editing). I will conclude by discussing the remaining challenges and future directions for both tasks.
Jihyung Kil is a Research Scientist at Adobe Research. He earned his Ph.D. in Computer Science from The Ohio State University, where he worked with Wei-Lun (Harry) Chao. Prior to joining Adobe, he interned at Google Research (now DeepMind) and Amazon. His research interests include Vision and Language, with a recent focus on multimodal document understanding and Web/GUI agents.