

Poster

FlexCap: Describe Anything in Images in Controllable Detail

Debidatta Dwibedi · Vidhi Jain · Jonathan Tompson · Andrew Zisserman · Yusuf Aytar

East Exhibit Hall A-C #3709
Wed 11 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract:

We introduce FlexCap, a vision-language model that generates region-specific descriptions of varying lengths. FlexCap is trained to produce length-conditioned captions for input boxes, enabling control over information density, with descriptions ranging from concise object labels to detailed captions. To achieve this, we create large-scale training datasets of image region descriptions with varying lengths from captioned web images. We demonstrate FlexCap’s effectiveness in several applications. First, it achieves strong performance in dense captioning tasks on the Visual Genome dataset. Second, we show how FlexCap’s localized descriptions can serve as input to a large language model to create a visual question answering (VQA) system, achieving state-of-the-art zero-shot performance on multiple VQA benchmarks. Our experiments illustrate FlexCap’s utility for tasks including image labeling, object attribute recognition, and visual dialog. Project webpage: https://flex-cap.github.io.
