Poster in Workshop: 6th Robot Learning Workshop: Pretraining, Fine-Tuning, and Generalization with Large Scale Models
MultiReAct: Multimodal Tools Augmented Reasoning-Acting Traces for Embodied Agent Planning
Zhouliang Yu · Jie Fu · Yao Mu · Chenguang Wang · Lin Shao · Yaodong Yang
Keywords: [ Multimodal Learning ] [ Large Language Models ] [ Embodied Planning ]
Large Language Models (LLMs) have demonstrated impressive proficiency in tasks involving simple reasoning. However, they face significant challenges when confronted with longer-horizon tasks described by abstract instructions. These challenges stem from two main limitations. First, text-only LLMs struggle to cope with the demands of complex embodied tasks that require nuanced multimodal reasoning. Second, LLMs have difficulty recognizing and autonomously recovering from intermediate execution failures. To overcome these limitations and enhance the planning capabilities of LLMs in embodied scenarios, we propose a novel approach called MultiReAct. Our framework makes the following contributions. We apply a parameter-efficient adaptation to a pre-trained visual language model, enabling it to tackle embodied planning tasks by converting visual demonstrations into sequences of actionable language commands. By leveraging CLIP as a reward model, we identify instances of sub-instruction execution failure, significantly increasing the success rate of achieving the final objective. We introduce an adaptable paradigm for embodied planning through in-context learning from demonstrations, independent of the specific Visual Language Model (VLM) and low-level actor; our framework supports two distinct low-level actors, an imitation learning agent and a code-generation-based actor. We apply the MultiReAct framework to a diverse set of long-horizon planning tasks and demonstrate superior performance compared to previous LLM-based methods. Extensive experimental results underscore the effectiveness of our approach for long-horizon embodied planning.
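To make the CLIP-as-reward-model idea above concrete, the following is a minimal sketch (not the authors' released code): it scores the agreement between the current visual observation and the active sub-instruction with CLIP image-text similarity and flags a failure when the score falls below a threshold. The checkpoint name, the threshold value, and the helper names `clip_reward` and `detect_failure` are illustrative assumptions.

```python
# Sketch: CLIP image-text similarity as a proxy reward for detecting
# sub-instruction execution failure. Checkpoint and threshold are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_reward(observation: Image.Image, sub_instruction: str) -> float:
    """Cosine similarity between the current observation and the sub-instruction text."""
    inputs = processor(text=[sub_instruction], images=observation,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

def detect_failure(observation: Image.Image, sub_instruction: str,
                   threshold: float = 0.25) -> bool:
    """Flag the sub-instruction as failed when the CLIP reward is below the
    threshold (the value 0.25 is a placeholder and would need task-specific tuning)."""
    return clip_reward(observation, sub_instruction) < threshold
```

In a planning loop under these assumptions, `detect_failure` would be called after each executed sub-instruction so the planner can re-plan or retry the step instead of silently propagating the failure to later steps.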