Poster
in
Workshop: Multimodal Algorithmic Reasoning Workshop
HAMMR: HierArchical MultiModal React agents for generic VQA
Lluis Castrejon · Thomas Mensink · Howard Zhou · Vittorio Ferrari · Andre Araujo · Jasper Uijlings
Sun 15 Dec 8:25 a.m. PST — 5:05 p.m. PST
The next generation of Visual Question Answering (VQA) systems should handle a broad range of questions across many VQA benchmarks. We therefore aim to develop a single system for a varied suite of VQA tasks, including counting, spatial reasoning, OCR-based reasoning, visual pointing, external knowledge, and more. In this setting, we demonstrate that naively applying an LLM+tools approach with the combined set of all tools leads to poor results. This motivates us to introduce HAMMR: HierArchical MultiModal React. We start from a multimodal ReAct-based system and make it hierarchical by enabling our HAMMR agents to call upon other specialized agents. This enhances compositionality, which we show to be critical for obtaining high accuracy. On our generic VQA suite, HAMMR outperforms a naive LLM+tools approach by 16.3% and the generic standalone PaLI-X VQA model by 5.0%.
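The hierarchical structure described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration of the idea of a top-level agent treating specialized agents as callable tools, rather than exposing every low-level tool to one agent; all class and function names here are assumptions for illustration, and the actual HAMMR system uses an LLM-driven ReAct loop (not the trivial keyword router shown) to decide which agent to invoke at each step.

```python
from typing import Callable, Dict


class Agent:
    """A (sub-)agent is a named callable over (question, image)."""

    def __init__(self, name: str, handler: Callable[[str, str], str]):
        self.name = name
        self.handler = handler

    def run(self, question: str, image: str) -> str:
        return self.handler(question, image)


# Specialized agents; in a real system each would run its own ReAct
# loop over a small, task-relevant toolset (counting tools, OCR, ...).
def counting_agent(question: str, image: str) -> str:
    return f"[count] answered {question!r} on {image}"


def ocr_agent(question: str, image: str) -> str:
    return f"[ocr] answered {question!r} on {image}"


class HierarchicalAgent(Agent):
    """Top-level agent that delegates to specialized sub-agents.

    Hypothetical stand-in: HAMMR lets an LLM choose the sub-agent at
    each ReAct step; here a keyword check plays that role.
    """

    def __init__(self, sub_agents: Dict[str, Agent]):
        super().__init__("top", self.route)
        self.sub_agents = sub_agents

    def route(self, question: str, image: str) -> str:
        key = "count" if "how many" in question.lower() else "ocr"
        return self.sub_agents[key].run(question, image)


top = HierarchicalAgent({
    "count": Agent("count", counting_agent),
    "ocr": Agent("ocr", ocr_agent),
})
print(top.run("How many dogs are in the photo?", "img_001"))
```

Because each specialized agent only sees the tools relevant to its task, prompts stay short and tool selection stays tractable, which is the compositionality benefit the abstract attributes to the hierarchical design.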