Poster
MARPLE: A Benchmark for Long-Horizon Inference
Emily Jin · Zhuoyi Huang · Jan-Philipp Fraenken · Weiyu Liu · Hannah Cha · Erik Brockbank · Sarah Wu · Ruohan Zhang · Jiajun Wu · Tobias Gerstenberg
Reconstructing past events requires reasoning across long time horizons, drawing upon diverse evidence such as visual, language, and auditory cues, as well as prior knowledge about the world and human behavior. We introduce MARPLE, a benchmark for evaluating long-horizon inference capabilities using multi-modal evidence. Our benchmark features agents interacting in simulated households, supporting vision, language, and auditory stimuli as well as procedurally generated environments and agent behaviors. Inspired by classic "whodunit" stories, we ask AI models and human participants to infer which agent caused a change in the environment based on a step-by-step replay of what actually happened. The goal is to correctly identify the culprit as early as possible. Our findings show that human participants outperform both traditional Monte Carlo simulation methods and an LLM baseline (GPT-4) on this task. Compared to humans, traditional inference models demonstrate lower robustness and performance, while GPT-4 exhibits difficulties in comprehending environmental changes. We further analyze factors that influence inference performance and ablate different modes of evidence, finding that all modes are valuable in improving performance. Overall, our experiments demonstrate that the long-horizon, multimodal inference tasks in our benchmark present a challenge to current models.
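The "whodunit" setup described above, identifying the culprit as early as possible from a step-by-step replay, can be framed as sequential Bayesian inference over suspect agents. The following is a minimal illustrative sketch, not the benchmark's actual method: the agent names, per-step likelihoods, and threshold are hypothetical, standing in for evidence scores that a Monte Carlo simulation baseline might produce by rolling out each agent's behavior model.

```python
# Hedged sketch: sequential "whodunit" inference as a Bayesian update over suspects.
# All agent names and numbers are hypothetical illustrations, not MARPLE data.

def update_posterior(prior, likelihoods):
    """One Bayesian step: weight the prior by each agent's evidence likelihood, renormalize."""
    unnorm = {a: prior[a] * likelihoods[a] for a in prior}
    z = sum(unnorm.values())
    return {a: p / z for a, p in unnorm.items()}

# Uniform prior over two suspect agents.
posterior = {"agent_A": 0.5, "agent_B": 0.5}

# Hypothetical per-step likelihoods of the observed evidence under each agent's
# behavior model (e.g. estimated from Monte Carlo rollouts in the simulated household).
steps = [
    {"agent_A": 0.6, "agent_B": 0.4},
    {"agent_A": 0.7, "agent_B": 0.3},
    {"agent_A": 0.8, "agent_B": 0.2},
]

THRESHOLD = 0.9  # hypothetical confidence level for an "early" identification
culprit, step_identified = None, None
for i, lik in enumerate(steps, start=1):
    posterior = update_posterior(posterior, lik)
    best = max(posterior, key=posterior.get)
    if posterior[best] >= THRESHOLD:
        culprit, step_identified = best, i
        break
```

Under this framing, a stronger model is one whose posterior crosses the threshold after fewer replay steps, which matches the benchmark's goal of identifying the culprit as early as possible.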