

Poster

Grounding Multimodal Large Language Models in Actions

Andrew Szot · Bogdan Mazoure · Harsh Agrawal · R Devon Hjelm · Zsolt Kira · Alexander Toshev

Fri 13 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

Multimodal Large Language Models (MLLMs) have demonstrated a wide range of capabilities across many domains, including Embodied AI. In this work, we study how best to ground an MLLM in different embodiments and their associated action spaces, covering both continuous and discrete actions. For continuous actions, a set of learned tokenizations that capture an action at various resolutions allows for sufficient modeling precision, yielding the best performance on downstream tasks. For discrete actions, semantically aligning these actions with the native output token space of the MLLM leads to the strongest performance. We arrive at these lessons via a thorough study of seven action grounding approaches on five different environments, encompassing over 114 embodied tasks.
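To make the idea of tokenizing a continuous action "at various resolutions" concrete, here is a minimal sketch and not the authors' method. It assumes actions normalized to [-1, 1] and uses fixed uniform grids in place of the learned tokenizations described in the abstract. The function names `encode_action` and `decode_tokens`, and the parameters `num_levels` and `bins`, are hypothetical. Each level emits one token per action dimension and quantizes whatever residual the coarser levels left behind.

```python
import numpy as np


def encode_action(action, num_levels=3, bins=8):
    """Quantize an action in [-1, 1] into one token array per level (coarse to fine).

    Assumption: fixed uniform grids stand in for learned codebooks.
    """
    residual = np.asarray(action, dtype=np.float64).copy()
    half = 1.0  # the residual at the current level lies in [-half, half]
    tokens = []
    for _ in range(num_levels):
        step = 2.0 * half / bins
        # Bin index for each dimension, clipped so the upper edge maps to the last bin.
        idx = np.clip(np.floor((residual + half) / step), 0, bins - 1).astype(np.int64)
        center = -half + (idx + 0.5) * step
        tokens.append(idx)
        residual = residual - center  # what the next, finer level must explain
        half = step / 2.0
    return tokens


def decode_tokens(tokens, bins=8):
    """Reconstruct an action by summing the bin centers chosen at each level."""
    half = 1.0
    total = np.zeros(tokens[0].shape, dtype=np.float64)
    for idx in tokens:
        step = 2.0 * half / bins
        total += -half + (idx + 0.5) * step
        half = step / 2.0
    return total


if __name__ == "__main__":
    action = np.array([0.3, -0.72, 1.0, -1.0])
    toks = encode_action(action)
    for level in range(1, len(toks) + 1):
        err = np.max(np.abs(decode_tokens(toks[:level]) - action))
        print(f"levels={level} max_abs_error={err:.6f}")
```

In this sketch, decoding only the first level gives a coarse action and each further level shrinks the worst-case error by a factor of `bins`, which is one way a small token vocabulary can still reach fine precision. A learned variant would replace the uniform grids with codebooks fit to data.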
