Oral in Workshop: Workshop on Open-World Agents: Synergizing Reasoning and Decision-Making in Open-World Environments (OWA-2024)
ShowUI: One Vision-Language-Action Model for Generalist GUI Agent
Kevin Qinghong Lin · Linjie Li · Difei Gao · Zhengyuan Yang · Zechen Bai · Weixian Lei · Lijuan Wang · Mike Zheng Shou
Keywords: [ GUI Agent ] [ Vision-Language-Action Models ] [ Graphical User Interface ] [ Language Agent ] [ Computer Usage ] [ Human Workflow Automation ]
Sun 15 Dec 9 a.m. PST — 5:15 p.m. PST
Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with digital tasks. While recent Large Language Models (LLMs) and Large Multimodal Models (LMMs) have been used to build autonomous agents capable of solving complex tasks, they often rely on closed-source, API-based solutions and exhibit limitations in GUI-specific interactions. Inspired by the success of Vision-Language-Action (VLA) models in embodied environments, we explore their potential in the digital GUI world. In this work, we develop a recipe for training a VLA for GUI agents and introduce ShowUI, a 4.2B-parameter model built on Phi-3.5-vision-instruct. By leveraging scalable GUI visual data (e.g., screenshots with action trajectories), we aim to develop a generalist GUI agent with capabilities across diverse dimensions: grounding, navigation, and understanding. ShowUI supports various platforms, including websites, desktops, and mobile phones, and accommodates diverse visual inputs such as single images, multi-frame sequences, and videos. We show that ShowUI achieves strong results across multiple benchmarks, including ScreenSpot, Mind2Web, AITW, AITZ, GUI-Odyssey, and GUI-World. We provide extensive experiments analyzing the impact of different training corpora and model design decisions on downstream tasks. The model, code, and data will be open-sourced at https://github.com/showlab/ShowUI.