PhD Proposal: Translating Natural Language to Visually Grounded Verifiable Plans

Talk
Angelos Mavrogiannis
Time: 11.08.2024 10:00 to 12:00
Location: IRB-4105

http://umd.zoom.us/my/angelosm
To be useful in household environments, robots may need to understand natural language in order to parse and execute verbal commands from novice users. This is a challenging problem that requires mapping linguistic constituents to physical entities and, at the same time, orchestrating an action plan that uses these entities to complete a task. Planning problems that previously relied on querying manually crafted knowledge bases can now leverage Large Language Models (LLMs) as a source of commonsense reasoning to map high-level instructions to action plans. However, the produced plans often suffer from model hallucinations, ignore action preconditions, or omit essential intermediate actions under the assumption that users can infer them from context and prior experience.

In this proposal, we present our work on translating natural language instructions to visually grounded verifiable plans. First, we motivate the use of classical formalisms such as Linear Temporal Logic (LTL) to verify LLM-generated action plans. By expressing these plans in a formal language that adheres to a set of rules and specifications, we can generate discrete robot controllers with provable performance guarantees. Second, we focus on grounding linguistic instructions in visual sensory information and find that Vision Language Models (VLMs) often struggle to identify non-visual attributes. Our key insight is that non-visual attribute detection can be effectively achieved through active perception guided by visual reasoning. To this end, we present a perception-action API that consists of perceptual and motoric functions. When prompted with this API and a natural language query, an LLM generates a program to actively identify attributes given an input image. Third, we present ongoing work using the Planning Domain Definition Language (PDDL) as an action representation. By binding perceptual functions to action preconditions and effects explicitly modeled in the PDDL domain, we visually validate successful action execution at runtime, producing visually grounded verifiable action plans.
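
As a rough illustration of the kind of verification the first part describes, the sketch below checks a hypothetical LLM-generated plan against two finite-trace LTL-style patterns: a precedence constraint and a response constraint. The plan, the action names, and the checker functions are illustrative assumptions for this announcement, not the proposal's actual formalization or toolchain.

    from typing import List

    def response_holds(trace: List[str], trigger: str, response: str) -> bool:
        """Finite-trace check of the LTL response pattern G(trigger -> F response):
        every occurrence of `trigger` is eventually followed by `response`."""
        for i, action in enumerate(trace):
            if action == trigger and response not in trace[i + 1:]:
                return False
        return True

    def precedence_holds(trace: List[str], first: str, second: str) -> bool:
        """Finite-trace check of the precedence pattern (!second W first):
        `second` never occurs before the first occurrence of `first`."""
        for action in trace:
            if action == second:
                return False
            if action == first:
                return True
        return True  # vacuously satisfied if `second` never occurs

    # Hypothetical LLM-generated plan for "put the milk in the fridge".
    plan = ["pick_up(milk)", "open(fridge)", "place(milk, fridge)", "close(fridge)"]

    # Illustrative specifications the plan should satisfy.
    assert precedence_holds(plan, "open(fridge)", "place(milk, fridge)")
    assert response_holds(plan, "open(fridge)", "close(fridge)")

A plan that placed the milk before opening the fridge would violate the precedence check, flagging exactly the kind of omitted intermediate action mentioned above.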
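
The second part refers to a perception-action API that an LLM is prompted with to write programs for non-visual attribute detection. The sketch below shows what such an API and a generated program could look like; every class, method, and threshold here is a hypothetical stand-in rather than the actual interface.

    class Robot:
        """Hypothetical perception-action API; in a real system the perceptual
        functions would call perception models and the motoric functions would
        trigger motion primitives on the robot."""

        # --- perceptual functions ---
        def detect(self, obj: str) -> bool:
            return True    # placeholder: would run an open-vocabulary detector

        def weigh(self, obj: str) -> float:
            return 0.05    # placeholder: would read a wrist force/torque sensor (kg)

        # --- motoric functions ---
        def pick_up(self, obj: str) -> None:
            print(f"picking up {obj}")    # placeholder: grasping primitive

        def put_down(self, obj: str) -> None:
            print(f"putting down {obj}")  # placeholder: placement primitive

    # A program an LLM might generate for the query "Is the milk carton empty?"
    # Emptiness is a non-visual attribute, so the program acts (picks the carton
    # up and weighs it) instead of relying on the image alone.
    def is_carton_empty(robot: Robot) -> bool:
        if not robot.detect("milk carton"):
            raise RuntimeError("milk carton not found in the scene")
        robot.pick_up("milk carton")
        empty = robot.weigh("milk carton") < 0.1  # assumed empty-carton threshold
        robot.put_down("milk carton")
        return empty

    print(is_carton_empty(Robot()))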
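
For the third part, one way to realize the binding of perceptual functions to PDDL preconditions and effects is to map grounded predicates to yes/no perceptual queries that are checked before and after execution. The following sketch illustrates that idea under assumed predicate names and a placeholder VLM query function; it is not the proposed system, where the predicates would come from an explicit PDDL domain.

    from dataclasses import dataclass
    from typing import Callable, Dict, List

    def vlm_query(question: str) -> bool:
        """Placeholder for a VLM call answering a yes/no question about the
        current camera image."""
        return True

    # Grounded PDDL-style predicates mapped to perceptual checks (assumed names).
    predicate_checks: Dict[str, Callable[[], bool]] = {
        "(opened fridge)": lambda: vlm_query("Is the fridge door open?"),
        "(inside milk fridge)": lambda: vlm_query("Is the milk inside the fridge?"),
    }

    @dataclass
    class Action:
        name: str
        preconditions: List[str]
        effects: List[str]

    def execute_and_verify(action: Action, execute: Callable[[], None]) -> bool:
        """Perceptually check preconditions, run the action, then verify effects."""
        if not all(predicate_checks[p]() for p in action.preconditions):
            return False  # a precondition does not visually hold; do not execute
        execute()
        return all(predicate_checks[e]() for e in action.effects)

    place_milk = Action(
        name="place-milk-in-fridge",
        preconditions=["(opened fridge)"],
        effects=["(inside milk fridge)"],
    )
    print(execute_and_verify(place_milk, lambda: print("placing milk in fridge")))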