Oral in Workshop on Video-Language Models
Taskverse: A Benchmark Generation Engine for Multi-modal Language Model
Jieyu Zhang · Weikai Huang · Zixian Ma · Oscar Michel · Dong He · Tanmay Gupta · Wei-Chiu Ma · Ali Farhadi · Aniruddha Kembhavi · Ranjay Krishna
Benchmarks for large multimodal language models (MLMs) increasingly assess the general capabilities of models all at once, rather than evaluating a specific capability. As a result, when developers seek to identify which model to use for their application, they are often overwhelmed by the number of benchmarks and remain uncertain about which benchmark results best reflect their specific use case. This paper introduces Taskverse, a benchmark generation engine that produces benchmarks tailored to different user needs. Taskverse maintains an extendable taxonomy of visual assets, including 113K images, 10K videos, 2K 3D object assets, over 365 object categories, 655 attributes, and 335 relationships, and can programmatically generate over 750 million Image/VideoQA questions. Additionally, it answers user queries about MLM performance efficiently within a given computational budget by employing query approximation algorithms based on interactive learning. With Taskverse, we can answer specific, fine-grained user queries such as: "Which model is the best VideoQA model for recognizing color, using at most 1,000 inference calls?"
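The combinatorial scale of programmatic question generation from a taxonomy can be illustrated with a minimal, hypothetical sketch. The miniature taxonomy, template strings, and the `generate_questions` helper below are illustrative assumptions, not Taskverse's actual assets or API: pairing object categories and attributes with question templates already multiplies into many distinct QA questions.

```python
from itertools import product

# Hypothetical miniature taxonomy (illustrative only, not Taskverse's actual assets)
categories = ["dog", "car", "chair"]
attributes = ["red", "small", "wooden"]
templates = [
    "Is there a {attr} {cat} in the image?",
    "Is the {cat} {attr}?",
]

def generate_questions(categories, attributes, templates):
    """Programmatically instantiate QA questions from a small taxonomy.

    Every (category, attribute) pair is filled into every template,
    so the question count grows multiplicatively with taxonomy size.
    """
    questions = []
    for cat, attr in product(categories, attributes):
        for tmpl in templates:
            questions.append(tmpl.format(cat=cat, attr=attr))
    return questions

qs = generate_questions(categories, attributes, templates)
print(len(qs))  # 3 categories x 3 attributes x 2 templates = 18
```

With hundreds of categories, attributes, and relationships plus 113K images and 10K videos to ground the questions in, the same multiplicative growth plausibly reaches the hundreds of millions of questions the abstract reports.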