Poster
FlexCap: Describe Anything in Images in Controllable Detail
Debidatta Dwibedi · Vidhi Jain · Jonathan Tompson · Andrew Zisserman · Yusuf Aytar
We introduce a versatile flexible-captioning vision-language model called FlexCap, capable of generating region-specific descriptions of varying lengths. It is trained to produce length-conditioned captions for input bounding boxes, which allows control over the information density of its output, with descriptions ranging from concise object labels to detailed captions. To achieve this, we create large-scale training datasets of image region descriptions of varying length from captioned web images. This flexible-captioning capability has several valuable applications. First, FlexCap demonstrates superior performance in dense captioning tasks on the Visual Genome dataset. Second, a visual question answering (VQA) system can be built by employing FlexCap to generate localized descriptions as inputs to a large language model. The resulting system achieves state-of-the-art zero-shot performance on a number of VQA datasets. We qualitatively demonstrate FlexCap's broad applicability in tasks such as image labeling, object attribute recognition, and visual dialog.
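To make the two ideas in the abstract concrete, the following is a minimal Python sketch of (a) length-conditioned region captioning and (b) the localized-captions-to-LLM VQA pipeline. All names here (`Box`, `RegionCaptioner`, `describe_regions`, `localized_vqa`, the `llm` callable) are hypothetical placeholders for illustration, not the authors' released API.

```python
from dataclasses import dataclass
from typing import List, Protocol, Callable


@dataclass
class Box:
    """Region of interest in pixel coordinates (hypothetical helper type)."""
    x1: float
    y1: float
    x2: float
    y2: float


class RegionCaptioner(Protocol):
    """Interface a FlexCap-style model could expose (assumed, not official)."""

    def caption(self, image, box: Box, target_length: int) -> str:
        """Describe `box` in roughly `target_length` words."""
        ...


def describe_regions(model: RegionCaptioner, image, boxes: List[Box],
                     target_length: int) -> List[str]:
    # Length conditioning controls information density: a small target_length
    # yields a concise object label, a larger one a detailed caption.
    return [model.caption(image, b, target_length) for b in boxes]


def localized_vqa(model: RegionCaptioner, llm: Callable[[str], str],
                  image, boxes: List[Box], question: str) -> str:
    # Zero-shot VQA: localized descriptions serve as textual context for an LLM,
    # which then answers the question about the image.
    context = "\n".join(describe_regions(model, image, boxes, target_length=10))
    prompt = f"Image regions:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)
```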