

Poster

CoMix: A Comprehensive Benchmark for Multi-Task Comic Understanding

Emanuele Vivoli · Marco Bertini · Dimosthenis Karatzas

West Ballroom A-D #5404
Fri 13 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

We introduce a novel benchmark – CoMix – designed to evaluate the multi-task capabilities of models in the realm of comic analysis. Unlike existing benchmarks that focus on individual tasks (e.g., object detection, text recognition), CoMix targets a diverse set of tasks, including detection (panels, characters, faces, text), speaker identification (character-to-text link), character re-identification (character clustering), character naming, panel-text sorting, and dialog generation. Our benchmark comprises a curated collection of existing datasets with single-task annotations, expanded with multi-task annotations. To address the predominance of manga-style data, we added a new set of comic-style books named Comics300, which significantly enriches the diversity of comic styles. CoMix integrates existing datasets with the newly introduced ones, ensuring standardized annotations across all tasks. This addresses key challenges in the field, such as limited datasets, inconsistent annotations, inaccessible model weights, and non-comparable results due to varied train/test splits and metrics. The benchmark is designed to assess pre-trained models in zero-shot, few-shot, and limited fine-tuning regimes, probing their transfer capabilities across different comic styles and tasks. The fine-tuning and validation splits of the benchmark are publicly available for research. Human baseline results compared to state-of-the-art models show a substantial gap in performance, highlighting significant opportunities for advancement in comic understanding. The dataset, baselines, and code are available at the repository link. This initiative sets a new standard for comprehensive comic analysis, providing a common benchmark for the community to evaluate on a large, varied test set.
