PhD Proposal: Interactive Vision Systems with Visual and Textual Prompts
Location: IRB-3137 or via Zoom: https://umd.zoom.us/my/chuonghm
In this talk, I will share how my research explores a major shift in computer vision: from static, model-driven systems to interactive frameworks guided by user input. My work focuses on both visual and textual prompting strategies that make models more effective, intuitive, and adaptable.
First, I will present two visual prompting methods: SimpSON, which enables segmentation of multiple similar objects with a single click, and MaGGIe, which uses coarse masks to handle ambiguity in instance matting for multi-person scenes. Both approaches emphasize minimal user input, fast performance, and generalization across diverse scenarios.
Then, I will turn to interactive textual prompting with CoLLM, a language-driven retrieval framework that captures complex user intent without relying on manually annotated triplets. I will also introduce new datasets, generated with large language models, that support this line of work.
Overall, my research highlights how user interactions, both visual and textual, can transform vision models into more precise, efficient, and user-friendly systems.