

Poster in Workshop: Workshop on Open-World Agents: Synergizing Reasoning and Decision-Making in Open-World Environments (OWA-2024)

Learning Region-Word Alignment with Attentive Masking for Open-Vocabulary Object Detection

Masoumeh Zareapoor · Pourya Shamsolmoali · Yue Lu

Keywords: [ Open-vocabulary object detection ] [ Attentive masking ] [ Region-text alignment ]


Abstract:

Open-vocabulary object detection (OVDet) aims to detect novel categories based on textual descriptions, allowing models to generalize beyond the categories seen during training. However, robust open-vocabulary detection requires aligning text descriptions with specific image regions and capturing spatial relationships between related regions, both of which remain challenging. Most existing methods align regions with categorical labels and often overlook interactions between neighboring regions, which limits their ability to form a precise correspondence between text descriptions and image content. We propose AlignDet, which incorporates an attentive masking strategy to address these challenges. By masking irrelevant regions in the image, our model focuses on the most relevant areas for each text concept, leading to fine-grained region-word correspondences. Additionally, our soft association strategy allows multiple regions to align with a single text concept, capturing spatial relationships between neighboring or related regions of the image more effectively. Extensive experiments demonstrate that our model consistently surpasses existing methods across various benchmarks.
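
The abstract does not specify how the masking or the soft association is computed, so the following is only a minimal illustrative sketch of the general idea, not AlignDet's implementation. It assumes cosine similarity between region and word embeddings, a top-k rule for deciding which regions count as relevant to each word, and a temperature-scaled softmax over the surviving regions. The function name `soft_region_word_alignment` and the parameters `keep_ratio` and `temperature` are hypothetical.

```python
import numpy as np


def soft_region_word_alignment(region_feats, word_feats,
                               keep_ratio=0.5, temperature=0.1):
    """Illustrative masking + soft association (assumed design, not AlignDet).

    region_feats: (num_regions, dim) array of region embeddings.
    word_feats:   (num_words, dim) array of word embeddings.
    Returns (weights, mask), both of shape (num_words, num_regions).
    """
    # Cosine similarity between every word and every region.
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    w = word_feats / np.linalg.norm(word_feats, axis=1, keepdims=True)
    sim = w @ r.T

    # Assumed masking rule: for each word, keep only its top-k regions.
    num_regions = sim.shape[1]
    k = max(1, int(round(keep_ratio * num_regions)))
    topk = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    mask = np.zeros_like(sim, dtype=bool)
    np.put_along_axis(mask, topk, True, axis=1)

    # Soft association: softmax over the kept regions, so one word can
    # spread its weight across several neighboring or related regions.
    logits = np.where(mask, sim / temperature, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights, mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    regions = rng.normal(size=(6, 8))  # 6 hypothetical region embeddings
    words = rng.normal(size=(3, 8))    # 3 hypothetical word embeddings
    weights, mask = soft_region_word_alignment(regions, words)
    print(weights.round(3))
```

Each row of `weights` sums to one and is zero at masked regions, so a word's aligned visual feature could be formed as `weights @ region_feats`. A hard one-to-one assignment would correspond to `keep_ratio` small enough that only one region survives per word.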
