Poster in Workshop: Workshop on Open-World Agents: Synergizing Reasoning and Decision-Making in Open-World Environments (OWA-2024)

SPA-BENCH: A COMPREHENSIVE BENCHMARK FOR SMARTPHONE AGENT EVALUATION

Jingxuan Chen · Derek Yuen · Bin Xie · Yuhao Yang · Gongwei Chen · Zhihao Wu · Li Yixing · Xurui Zhou · Weiwen Liu · Shuai Wang · Rui Shao · Liqiang Nie · Yasheng Wang · Jianye Hao · Jun Wang · Kun Shao

Keywords: [ Benchmark ] [ AI Agent ] [ Smartphone Control ]


Abstract:

Smartphone agents are increasingly important for helping users control devices efficiently, with (Multimodal) Large Language Model (MLLM)-based agents emerging as key contenders. Fairly comparing these agents is essential but challenging, requiring a diverse task scope, the integration of agents with different implementations, and a generalisable evaluation pipeline to assess their strengths and weaknesses. In this paper, we present SPA-BENCH, a comprehensive SmartPhone Agent Benchmark designed to evaluate (M)LLM-based agents in an end-to-end setting. SPA-BENCH offers three key contributions: (1) A diverse set of tasks covering system and third-party apps in both English and Chinese, focusing on features used in daily routines; (2) A plug-and-play framework enabling real-time agent interaction with Android devices, integrating over 10 agents with the flexibility to add more, regardless of their underlying models or how they interact with the environment; (3) A novel evaluation pipeline that assesses agent performance across multiple dimensions, using coarse-to-fine success detection alongside completion- and consumption-related metrics. Our extensive experiments across tasks and agents reveal challenges like interpreting mobile user interfaces, action grounding, memory retention, and resource consumption. We propose future research directions to ease these difficulties, moving closer to real-world smartphone agent applications.
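
To make the plug-and-play idea concrete, below is a minimal sketch, not SPA-BENCH's actual API, of how an agent-agnostic harness might drive a real Android device in an end-to-end loop. The Agent class, run_episode loop, and action schema are hypothetical illustrations; the ADB commands used (adb exec-out screencap -p, adb shell input tap) are standard Android tooling.

"""Hypothetical sketch of a plug-and-play smartphone-agent harness.
Assumes an Android device reachable via ADB; names below are
illustrative, not SPA-BENCH's real interface."""
import subprocess
from abc import ABC, abstractmethod


def capture_screenshot(serial: str) -> bytes:
    # `adb exec-out screencap -p` streams the current screen as PNG bytes.
    return subprocess.run(
        ["adb", "-s", serial, "exec-out", "screencap", "-p"],
        check=True, capture_output=True,
    ).stdout


def tap(serial: str, x: int, y: int) -> None:
    # `adb shell input tap` injects a touch event at pixel (x, y).
    subprocess.run(
        ["adb", "-s", serial, "shell", "input", "tap", str(x), str(y)],
        check=True,
    )


class Agent(ABC):
    """Adapter boundary: any (M)LLM-based agent is wrapped so the harness
    only sees observations in and device actions out, regardless of the
    agent's underlying model or prompting scheme."""

    @abstractmethod
    def step(self, screenshot: bytes, instruction: str) -> dict:
        """Return an action, e.g. {"type": "tap", "x": 540, "y": 1200}
        or {"type": "done"}."""


def run_episode(agent: Agent, serial: str, instruction: str,
                max_steps: int = 20) -> list:
    # Observe-act loop; the trajectory would later feed success detection
    # and completion-/consumption-related metrics.
    trajectory = []
    for _ in range(max_steps):
        obs = capture_screenshot(serial)
        action = agent.step(obs, instruction)
        trajectory.append(action)
        if action["type"] == "done":
            break
        if action["type"] == "tap":
            tap(serial, action["x"], action["y"])
    return trajectory

Keeping the harness behind a single step(observation, instruction) boundary is one way to integrate many agents with different implementations, since each only needs a thin adapter to this interface.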
