Poster in Workshop on Robustness of Zero/Few-shot Learning in Foundation Models (R0-FoMo)
LOWA: Localize Objects in the Wild with Attributes
Xiaoyuan Guo · Kezhen Chen · Jinmeng Rao · Yawen Zhang · Baochen Sun · Jie Yang
Existing open-vocabulary object detectors can struggle with uncommon or fine-grained classes, as the model and users may understand the same object name differently. Incorporating attributes such as color, shape, and size can reduce this inconsistency and make interactive detection more convenient and flexible. Motivated by this, we present LOWA, a new method for effectively localizing objects with attributes in the wild. To train LOWA, we propose a multi-step vision-language training strategy that learns object detection and recognition from class names as well as attribute information, allowing users to flexibly customize text queries and extend to fine-grained detection that combines attribute and object information for a wider range of applications. LOWA is built on a two-tower vision-language architecture, with a standard vision transformer as the image encoder and a similar transformer as the text encoder. To align visual and text inputs at the instance level, we train LOWA in three steps: object-level training, attribute-aware learning, and free-text joint training of objects and attributes. This strategy first ensures correct object detection, then incorporates instance-level attribute information, and finally balances sensitivity to object classes and attributes. We evaluate our model's attribute classification and attribute localization performance on the Open-Vocabulary Attribute Detection (OVAD) benchmark and the Visual Attributes in the Wild (VAW) dataset, and experiments indicate strong zero-shot performance. Ablation studies further demonstrate the effectiveness of each training step of our approach.
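For illustration only, the following is a minimal PyTorch sketch of the two-tower idea described above: per-instance image embeddings are scored against embeddings of free-text queries that may mix class names and attributes (e.g., "small green apple"). The toy encoders, dimensions, threshold, and the name TwoTowerSketch are assumptions for readability, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerSketch(nn.Module):
    """Toy two-tower detector: per-instance embeddings vs. text-query embeddings."""

    def __init__(self, embed_dim: int = 128, grid: int = 8, vocab_size: int = 30522):
        super().__init__()
        # Stand-in image tower: a conv stem whose pooled grid cells play the role
        # of per-instance embeddings from a vision transformer (assumption).
        self.stem = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        self.pool = nn.AdaptiveAvgPool2d(grid)
        self.box_head = nn.Linear(embed_dim, 4)
        # Stand-in text tower: embeds tokenized free-text queries
        # (class names, attributes, or both).
        self.token_embed = nn.Embedding(vocab_size, embed_dim)

    def forward(self, images: torch.Tensor, query_tokens: torch.Tensor):
        feats = self.pool(self.stem(images))                    # (B, D, G, G)
        instances = feats.flatten(2).transpose(1, 2)            # (B, G*G, D)
        boxes = self.box_head(instances).sigmoid()              # (B, G*G, 4), normalized boxes
        queries = self.token_embed(query_tokens).mean(dim=1)    # (Q, D), mean-pooled tokens
        # Instance-level alignment: cosine similarity between every instance and query.
        sim = F.normalize(instances, dim=-1) @ F.normalize(queries, dim=-1).T  # (B, G*G, Q)
        return boxes, sim

images = torch.randn(1, 3, 224, 224)
# Two hypothetical queries as token ids, e.g. "apple" vs. "small green apple".
query_tokens = torch.randint(0, 30522, (2, 6))
model = TwoTowerSketch()
boxes, sim = model(images, query_tokens)
keep = sim.max(dim=-1).values > 0.5   # keep instances matching any query above a threshold
print(boxes.shape, sim.shape, keep.sum().item())

In the method as described in the abstract, the image tower is a vision transformer producing per-instance embeddings with box predictions and the text tower is a similar transformer; the instance-to-query similarity is what the three training steps (object-level, attribute-aware, free-text joint) progressively align.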