

Poster

Animal-Bench: Benchmarking Multimodal Video Models for Animal-centric Video Understanding

Yinuo Jing · Ruxu Zhang · Kongming Liang · Yongxiang Li · Zhongjiang He · Zhanyu Ma · Jun Guo

Wed 11 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

With the emergence of large pre-trained multimodal video models, multiple benchmarks have been proposed to evaluate model capabilities. However, most of these benchmarks are human-centric, with evaluation data and tasks centered on human applications. Animals are an integral part of the natural world, and animal-centric video understanding is crucial for animal welfare and conservation efforts. Yet existing benchmarks overlook animal-focused evaluation, limiting the applicability of these models. To address this gap, our work establishes an animal-centric benchmark, Animal-Bench, enabling a broader assessment of model capabilities in real-world contexts and overcoming the agent bias of previous benchmarks. Animal-Bench includes 13 tasks, encompassing both common tasks shared with humans and specialized tasks relevant to animal conservation, spanning 7 major animal categories and 822 species, for a total of 52,236 data entries. To construct this benchmark, we defined an animal-centered task system and proposed an automated pipeline for animal-centric data processing. To further validate the robustness of models against real-world challenges, we use a video-editing approach to simulate realistic conditions such as weather changes and shooting-parameter variations caused by animal movement. We evaluated 8 current multimodal video models on our benchmark and found considerable room for improvement. We hope our work provides insights for the community and opens up new avenues for research on multimodal video models.
