Poster in Workshop: Compositional Learning: Perspectives, Methods, and Paths Forward
A Multimodal Chain of Tools for Described Object Detection
Kwanyong Park · Youngwan Lee · Yong-Ju Lee
Keywords: [ Described Object Detection ]
Described object detection (DOD) is a promising direction for fine-grained, human-interactive visual recognition, where the goal is to detect target objects that match a given language description. Despite significant advances in language-based object detection, current models still struggle with complex descriptions because their compositional understanding is limited. To address this issue, we propose a multimodal chain-of-tools (MCoTs) framework that integrates specialized tools to handle the two core functionalities of the DOD task: localization and compositional reasoning. Specifically, we decompose the complex DOD task into a series of subtasks, each handled by a specialized tool, including a detector and a multimodal large language model (MLLM). This simple yet effective MCoTs framework yields significant performance improvements on the challenging D3 benchmark without any additional training.
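The decomposition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the subtasks, tool interfaces, or prompts, so the names `Candidate`, `chain_of_tools`, `toy_detect`, and `toy_verify` are hypothetical, as is the assumption that the object category is supplied separately from the full description. The sketch only shows the general pattern of one tool proposing boxes (localization) and a second tool accepting or rejecting each box against the description (compositional reasoning). Toy stand-ins replace the real detector and MLLM so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass(frozen=True)
class Candidate:
    """A detected region together with its label and confidence."""
    box: Box
    label: str
    score: float


# Hypothetical tool interfaces (assumptions, not taken from the paper):
#   a detector maps (image, category) to candidate boxes,
#   a verifier maps (image, box, description) to a yes/no decision.
Detector = Callable[[Any, str], List[Candidate]]
Verifier = Callable[[Any, Box, str], bool]


def chain_of_tools(image: Any, category: str, description: str,
                   detect: Detector, verify: Verifier) -> List[Candidate]:
    """Run the two subtasks in sequence, with no training involved."""
    # Subtask 1 (localization): propose every instance of the category.
    candidates = detect(image, category)
    # Subtask 2 (compositional reasoning): keep only the candidates that
    # the verifier judges to satisfy the full description.
    return [c for c in candidates if verify(image, c.box, description)]


# Toy stand-ins for the real tools. The "image" here is just a list of
# annotated objects so that the example needs no model weights.
def toy_detect(image: Any, category: str) -> List[Candidate]:
    return [Candidate(obj["box"], category, 1.0)
            for obj in image if obj["category"] == category]


def toy_verify(image: Any, box: Box, description: str) -> bool:
    obj = next(o for o in image if o["box"] == box)
    return description in obj["facts"]


toy_image = [
    {"box": (0, 0, 10, 10), "category": "dog",
     "facts": {"a dog on a leash"}},
    {"box": (20, 0, 30, 10), "category": "dog",
     "facts": {"a dog not on a leash"}},
    {"box": (40, 0, 50, 10), "category": "cat",
     "facts": {"a cat not on a leash"}},
]

result = chain_of_tools(toy_image, "dog", "a dog not on a leash",
                        toy_detect, toy_verify)
```

Here `result` holds the single dog whose annotation matches the description. In a real system, `toy_detect` would be replaced by an off-the-shelf detector and `toy_verify` by a query to an MLLM about the cropped or highlighted region; both replacements are left open because the abstract does not describe them.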