Spotlight Poster

Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?

Ruisheng Cao · Fangyu Lei · Haoyuan Wu · Jixuan Chen · Yeqiao Fu · Hongcheng Gao · Xinzhuang Xiong · Hanchong Zhang · Wenjing Hu · Yuchen Mao · Tianbao Xie · Hongshen Xu · Danyang Zhang · Sida Wang · Ruoxi Sun · Pengcheng Yin · Caiming Xiong · Ansong Ni · Qian Liu · Victor Zhong · Lu Chen · Kai Yu · Tao Yu

West Ballroom A-D #5301
Thu 12 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

Data science and engineering workflows often involve multiple steps, from data warehousing to data orchestration, requiring code writing in languages like SQL and Python and extensive GUI operations in professional enterprise data software systems such as BigQuery, dbt, and Airbyte. With the rapid progress of VLMs in multimodal understanding and code generation, VLM-based agents have the potential to automate these workflows, enhancing productivity for data scientists and engineers while democratizing large data access. To this end, we introduce Spider2-V, the first multimodal agent benchmark of 494 real-world tasks in a real computer environment, covering the entire data workflow and spanning 20 enterprise-level professional applications. These tasks, derived from real-world use cases, evaluate a multimodal agent's ability to perform user data tasks by writing code and managing the GUI in enterprise data software systems. To ensure reproducible and reliable experiments with these enterprise data applications, we develop a set of automatic task setup configurations and customized evaluation metrics for each task. Furthermore, we supplement multimodal agents with a comprehensive document warehouse of these enterprise data software systems. Our empirical evaluation reveals that existing state-of-the-art LLM/VLM-based agents show promise but fall short of achieving full data workflow automation (14% success). Even with step-by-step guidance, these agents underperform in fine-grained knowledge-intensive GUI actions (20.2%) and tasks requiring real accounts (11.3%). Extensive analysis of Spider2-V paves the way for practical multimodal agents to revolutionize data science and engineering workflow automation.
