

Poster

FlexCap: Describe Anything in Images in Controllable Detail

Debidatta Dwibedi · Vidhi Jain · Jonathan Tompson · Andrew Zisserman · Yusuf Aytar

Wed 11 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract:

We introduce a versatile flexible-captioning vision-language model called FlexCap, capable of generating region-specific descriptions of varying lengths. It is trained to produce length-conditioned captions for input boxes, which allows control over the information density of its output, with descriptions ranging from concise object labels to detailed captions. To achieve this, we create large-scale training datasets of image region descriptions of varying length from captioned web images. This flexible-captioning capability has several valuable applications. First, FlexCap demonstrates superior performance in dense captioning tasks on the Visual Genome dataset. Second, a visual question answering (VQA) system can be built by employing FlexCap to generate localized descriptions as inputs to a large language model. The resulting system achieves state-of-the-art zero-shot performance on a number of VQA datasets. We qualitatively demonstrate FlexCap's broad applicability in tasks such as image labeling, object attribute recognition, and visual dialog.
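To make the two ideas in the abstract concrete, here is a minimal sketch of (1) length-conditioned captioning of input boxes and (2) composing those localized descriptions into a prompt for an LLM to perform zero-shot VQA. The class names, method signatures, and the dummy model below are illustrative assumptions for this sketch, not the authors' released API.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized


@dataclass
class RegionCaption:
    box: Box
    text: str


class DummyFlexCap:
    """Stand-in for a FlexCap checkpoint (hypothetical interface)."""

    def caption(self, image, box: Box, target_length: int) -> str:
        # A real model would decode a caption of roughly `target_length` words
        # conditioned on the image region defined by `box`; this stub just
        # returns placeholder text of the requested length.
        return " ".join(["word"] * target_length)


def describe_regions(model, image, boxes: List[Box],
                     target_length: int = 8) -> List[RegionCaption]:
    """Length-conditioned dense captioning: one caption per input box."""
    return [RegionCaption(b, model.caption(image, b, target_length)) for b in boxes]


def build_vqa_prompt(captions: List[RegionCaption], question: str) -> str:
    """Compose localized descriptions into an LLM prompt (FlexCap -> LLM VQA)."""
    region_lines = [f"Region {i} at {c.box}: {c.text}"
                    for i, c in enumerate(captions)]
    return ("Image regions:\n" + "\n".join(region_lines)
            + f"\n\nQuestion: {question}\nAnswer:")


if __name__ == "__main__":
    model = DummyFlexCap()
    boxes = [(0.1, 0.1, 0.4, 0.5), (0.5, 0.2, 0.9, 0.8)]
    caps = describe_regions(model, image=None, boxes=boxes, target_length=5)
    print(build_vqa_prompt(caps, "What is the person holding?"))
```

The `target_length` argument reflects the length conditioning described above: short targets yield concise object labels, longer targets yield detailed captions, and the resulting region descriptions can be handed to any off-the-shelf LLM to answer questions about the image.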
