Poster
in
Workshop: Multimodal Algorithmic Reasoning Workshop
HAMMR: HierArchical MultiModal React agents for generic VQA
Lluis Castrejon · Thomas Mensink · Howard Zhou · Vittorio Ferrari · Andre Araujo · Jasper Uijlings
Sun 15 Dec 8:25 a.m. PST — 5:05 p.m. PST
The next generation of Visual Question Answering (VQA) systems should handle a broad range of questions across many VQA benchmarks. We therefore aim to develop a single system for a varied suite of VQA tasks, including counting, spatial reasoning, OCR-based reasoning, visual pointing, external knowledge, and more. In this setting, we demonstrate that naively applying an LLM+tools approach with the combined set of all tools leads to poor results. This motivates us to introduce HAMMR: HierArchical MultiModal React. We start from a multimodal ReAct-based system and make it hierarchical by enabling our HAMMR agents to call upon other specialized agents. This enhances compositionality, which we show to be critical for obtaining high accuracy. On our generic VQA suite, HAMMR outperforms a naive LLM+tools approach by 16.3% and the generic standalone PaLI-X VQA model by 5.0%.
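The hierarchical structure described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration of the idea of a top-level agent treating specialized agents as callable tools, rather than exposing every low-level tool to one agent; all class and function names here are assumptions for illustration, and the actual HAMMR system uses an LLM-driven ReAct loop (not the trivial keyword router shown) to decide which agent to invoke at each step.

```python
from typing import Callable, Dict


class Agent:
    """A (sub-)agent is a named callable over (question, image)."""

    def __init__(self, name: str, handler: Callable[[str, str], str]):
        self.name = name
        self.handler = handler

    def run(self, question: str, image: str) -> str:
        return self.handler(question, image)


# Specialized agents; in a real system each would run its own ReAct
# loop over a small, task-relevant toolset (counting tools, OCR, ...).
def counting_agent(question: str, image: str) -> str:
    return f"[count] answered {question!r} on {image}"


def ocr_agent(question: str, image: str) -> str:
    return f"[ocr] answered {question!r} on {image}"


class HierarchicalAgent(Agent):
    """Top-level agent that delegates to specialized sub-agents.

    Hypothetical stand-in: HAMMR lets an LLM choose the sub-agent at
    each ReAct step; here a keyword check plays that role.
    """

    def __init__(self, sub_agents: Dict[str, Agent]):
        super().__init__("top", self.route)
        self.sub_agents = sub_agents

    def route(self, question: str, image: str) -> str:
        key = "count" if "how many" in question.lower() else "ocr"
        return self.sub_agents[key].run(question, image)


top = HierarchicalAgent({
    "count": Agent("count", counting_agent),
    "ocr": Agent("ocr", ocr_agent),
})
print(top.run("How many dogs are in the photo?", "img_001"))
```

Because each specialized agent only sees the tools relevant to its task, prompts stay short and tool selection stays tractable, which is the compositionality benefit the abstract attributes to the hierarchical design.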