Poster
in
Workshop: Workshop on Open-World Agents: Synnergizing Reasoning and Decision-Making in Open-World Environments (OWA-2024)
Chain-of-Imagination for Reliable Instruction Following in Decision Making
Enshen Zhou · Yiran Qin · Zhenfei (Jeremy) Yin · Yuzhou Huang · Ruimao Zhang · Lu Sheng · Yu Qiao · Jing Shao
Keywords: [ multimodal large language model ] [ low-level control ] [ image generation ] [ instruction following ] [ Chain-of-Imagination ]
Enabling the embodied agent to imagine step-by-step the future states and sequentially approach these situation-aware states can enhance its capability to make reliable action decisions from textual instructions. In this work, we introduce a simple but effective mechanism called Chain-of-Imagination (CoI), which repeatedly employs a Multimodal Large Language Model (MLLM) equipped diffusion model to facilitate imagining and acting upon the series of intermediate situation-aware visual sub-goals one by one, resulting in more reliable instruction-following capability. Based on the CoI mechanism, we propose an embodied agent DecisionDreamer as the low-level controller that can be adapted to different open-world scenarios. Extensive experiments demonstrate that DecisionDreamer can achieve more reliable and accurate decision-making and significantly outperform the state-of-the-art generalist agents in the Minecraft and CALVIN sandbox simulators, regarding the instruction-following capability. For more demos, please see https://sites.google.com/view/decisiondreamer.