Poster
in
Workshop: Workshop on Video-Language Models
GUI-WORLD: A GUI-oriented Video Dataset for Multimodal LLM-based Agents
Dongping Chen · Yue Huang · Siyuan Wu · Jingyu Tang · Huichi Zhou · Qihui Zhang · Zhigang He · Yilin Bai · Gao Chujie · Liuyi Chen · Yiqiang Li · Chenlong Wang · Yue Yu · Tianshuo Zhou · Zhen Li · Yi Gui · Yao Wan · Pan Zhou · Jianfeng Gao · Lichao Sun
Recently, Multimodal Large Language Models (MLLMs) have been used as agents to control keyboard and mouse inputs by directly perceiving the Graphical User Interface (GUI) and generating corresponding code. However, current agents primarily exhibit excellent understanding capabilities in static environments and are predominantly applied in relatively simple domains, such as Web or mobile interfaces. We argue that a robust GUI agent should be capable of perceiving temporal information on the GUI, including dynamic Web content and multi-step tasks. Additionally, it should possess a comprehensive understanding across various GUI scenarios, including desktop software and multi-window interactions. To this end, this paper introduces a new dataset, termed GUI-World, which features meticulously crafted Human-MLLM annotations, extensively covering six GUI scenarios and eight types of GUI-orientated questions in three formats. We evaluate the capabilities of current state-of-the-art MLLMs, including ImageLLMs and VideoLLMs, in understanding various types of GUI content, especially dynamic and sequential content. Our findings reveal that ImageLLMs struggle with dynamic GUI content without manually annotated keyframes or operation history. On the other hand, VideoLLMs fall short in all GUI-orientated tasks given the sparse of GUI video dataset. Based on GUI-World, we take the initial step of leveraging a fine-tuned VideoLLM as a GUI agent, demonstrating an improved understanding of various GUI tasks. However, due to the limitations in the performance of base LLMs, we conclude that using VideoLLMs as GUI agents remains a significant challenge. We believe our work provides valuable insights for future research in dynamic GUI content understanding.