Poster in Workshop: Workshop on Open-World Agents: Synergizing Reasoning and Decision-Making in Open-World Environments (OWA-2024)
VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks
Lawrence Jang · Yinheng Li · Charles Ding · Justin Lin · Paul Pu Liang · Dan Zhao · Rogerio Bonatti · Kazuhito Koishida
Keywords: [ agents ] [ long context models ] [ web agents ] [ video understanding ]
Videos are often used to learn or extract the information needed to complete tasks in ways that text and static imagery alone cannot provide. However, many existing agent benchmarks neglect long-context video understanding, focusing instead on text or static image inputs. To bridge this gap, we introduce VideoWebArena (VideoWA), a benchmark for evaluating the capabilities of long-context multimodal agents on video-understanding web tasks. VideoWA consists of 2,021 web agent tasks based on manually crafted video tutorials, which total almost four hours of content. For our benchmark, we define a taxonomy of long-context video-based agent tasks with two main areas of focus: skill retention and factual retention. Skill retention tasks evaluate whether an agent can use a given human demonstration to complete a task efficiently, while factual retention tasks evaluate whether an agent can retrieve instruction-relevant information from a video to complete a task. We find that the best model achieves 13.3% success on factual retention tasks and 45.8% on factual retention QA pairs, far below human performance at 73.9% and 79.3%, respectively. On skill retention tasks, long-context models perform worse with tutorials than without, exhibiting a 5% performance decrease on WebArena tasks and a 10.3% decrease on VisualWebArena tasks. Our work highlights the need to improve the agentic abilities of long-context multimodal models and provides a testbed for future development of long-context video agents.
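For concreteness, below is a minimal, hypothetical sketch of how the two-way task taxonomy (skill retention vs. factual retention) and the success-rate metric described above might be represented. The class, field, and function names here are illustrative assumptions for exposition, not the benchmark's actual API.

```python
from dataclasses import dataclass
from enum import Enum

class TaskType(Enum):
    """The two task families in the VideoWA taxonomy (per the abstract)."""
    SKILL_RETENTION = "skill_retention"      # use a human tutorial demonstration to complete a task efficiently
    FACTUAL_RETENTION = "factual_retention"  # retrieve instruction-relevant facts from a video to complete a task

@dataclass
class VideoWATask:
    """Hypothetical task record; fields are assumptions for illustration."""
    task_id: str
    task_type: TaskType
    instruction: str   # natural-language goal given to the agent
    video_path: str    # tutorial video the agent may consult
    start_url: str     # web page where the episode begins

def success_rate(results: list[bool]) -> float:
    """Fraction of tasks completed successfully, as reported in the abstract."""
    return sum(results) / len(results) if results else 0.0

if __name__ == "__main__":
    # Reproducing the abstract's headline gap on factual retention tasks:
    # best model 13.3% vs. human 73.9%.
    model_factual, human_factual = 0.133, 0.739
    print(f"gap to human on factual retention: {human_factual - model_factual:.1%}")
```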