Poster
in
Workshop: Workshop on Open-World Agents: Synergizing Reasoning and Decision-Making in Open-World Environments (OWA-2024)
Integrating Visual and Linguistic Instructions for Context-Aware Navigation Agents
Suhwan Choi · Yongjun Cho · Minchan Kim · Jaeyoon Jung · Myunchul Joe · Park Yu Been · Minseo Kim · Sungwoong Kim · Sungjae Lee · Whiseong Park · Jiwan Chung · Youngjae Yu
Keywords: [ Imitation Learning ] [ Vision-Language-Action (VLA) Models ] [ Multimodal Instruction Following ]
Sun 15 Dec 9 a.m. PST — 5:15 p.m. PST
Real-life robot navigation involves more than simply reaching a destination; it requires optimizing movements while considering scenario-specific goals. Humans often express these goals through abstract cues, such as verbal commands or rough sketches. While this guidance may be vague or noisy, we still expect robots to navigate as intended. For robots to interpret and execute these abstract instructions in line with human expectations, they need to share a basic understanding of navigation concepts with humans. To address this challenge, we introduce CANVAS, a novel framework that integrates both visual and linguistic instructions for commonsense-aware navigation. CANVAS leverages imitation learning, enabling robots to learn from human navigation behavior. We also present COMMAND, a comprehensive dataset that includes human-annotated navigation results spanning over 48 hours and 219 kilometers, specifically designed to train commonsense-aware navigation systems in simulated environments. Our experiments demonstrate that CANVAS outperforms the strong rule-based ROS NavStack system across all environments, excelling even with noisy instructions. In particular, in the orchard environment where ROS NavStack achieved a 0% success rate, CANVAS reached a 67% success rate. CANVAS also closely aligns with human demonstrations and commonsense constraints, even in unseen environments. Moreover, real-world deployment of CANVAS shows impressive Sim2Real transfer, with a total success rate of 69%, highlighting the potential of learning from human demonstrations in simulated environments for real-world applications.
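To make the imitation-learning setup described above concrete, here is a minimal, hypothetical sketch of behavior cloning for a policy conditioned on a camera view, a rough sketch instruction, and a language command. This is not the authors' CANVAS implementation or the COMMAND data format: PyTorch, the toy encoders, the discretized waypoint output, and every class, tensor shape, and field name below are illustrative assumptions.

```python
# Hypothetical sketch of imitation learning (behavior cloning) for a multimodal
# navigation policy. NOT the CANVAS implementation; all names and shapes are assumed.
import torch
import torch.nn as nn

class ToyMultimodalPolicy(nn.Module):
    """Fuses a camera image, a rough route sketch, and a language instruction,
    then predicts the next waypoint as a discrete class (assumed output format)."""
    def __init__(self, vocab_size=1000, num_waypoint_bins=256, dim=256):
        super().__init__()
        self.image_encoder = nn.Sequential(          # stand-in for a pretrained vision backbone
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
        self.sketch_encoder = nn.Sequential(         # the visual instruction is just another image
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
        self.text_encoder = nn.EmbeddingBag(vocab_size, dim)  # stand-in for a language model
        self.head = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, num_waypoint_bins))

    def forward(self, camera, sketch, instruction_tokens):
        fused = torch.cat([self.image_encoder(camera),
                           self.sketch_encoder(sketch),
                           self.text_encoder(instruction_tokens)], dim=-1)
        return self.head(fused)  # logits over discretized waypoints

def behavior_cloning_step(policy, optimizer, batch):
    """One imitation-learning update: match the waypoint chosen by the human demonstrator."""
    logits = policy(batch["camera"], batch["sketch"], batch["instruction"])
    loss = nn.functional.cross_entropy(logits, batch["human_waypoint"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    policy = ToyMultimodalPolicy()
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
    batch = {  # random stand-in data where human demonstrations would be loaded
        "camera": torch.randn(4, 3, 96, 96),
        "sketch": torch.randn(4, 3, 96, 96),
        "instruction": torch.randint(0, 1000, (4, 12)),
        "human_waypoint": torch.randint(0, 256, (4,)),
    }
    print("loss:", behavior_cloning_step(policy, optimizer, batch))
```

The point of the sketch is only the overall pattern: encode the visual and linguistic instructions alongside the current observation, and supervise the policy's action output against human navigation demonstrations rather than a hand-designed planner.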