

Poster in Workshop: Multimodal Algorithmic Reasoning Workshop

Smart Vision-Language Reasoners

Denisa Olteanu Roberts · Lucas R Roberts

Poster
Sun 15 Dec 2:15 p.m. PST — 4:15 p.m. PST

Abstract:

In this article, we investigate vision-language models (VLMs) as reasoners. The ability to form abstractions underlies algorithmic reasoning, and human reasoning is inherently multimodal. We employ the abstractions given in the Simple Multimodal Algorithmic Reasoning Task (SMART), introduced in Cherian et al. [2022], as meta-reasoning and problem-solving skills along eight axes, including math, counting, path, measure, logic, spatial, and pattern. We investigate the ability of custom multimodal models to reason along these axes and seek avenues for improvement. Including adaptively learned composite representations formed by fusing frozen foundation-model encodings enabled better alignment, learning, and visual grounding. The smartest VLM, which includes a novel deep multimodal layer, the QF layer, improves upon the previous baselines across the reasoning skills of the SMART task. We provide model and experiment code at https://github.com/smarter-vlm/smarter.
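As a rough illustration of the fusion idea described in the abstract (composite representations built by fusing frozen foundation-model encodings through a deep multimodal layer), the sketch below shows a generic cross-attention fusion block in PyTorch. The class name, dimensions, and wiring are assumptions made for illustration only; the authors' actual QF layer and training code are in the linked repository.

```python
import torch
import torch.nn as nn


class QFFusionLayer(nn.Module):
    """Illustrative cross-attention fusion block: question (text) tokens
    attend over frozen vision-encoder features to form a composite,
    visually grounded representation. Not the authors' exact QF layer."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.norm_out = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, text_tokens: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens:  (B, T, dim) from a frozen language encoder
        # image_tokens: (B, V, dim) from a frozen vision encoder
        q = self.norm_q(text_tokens)
        kv = self.norm_kv(image_tokens)
        fused, _ = self.cross_attn(q, kv, kv)      # text attends to image features
        x = text_tokens + fused                    # residual connection
        return x + self.ffn(self.norm_out(x))      # position-wise feed-forward


# Toy usage with random stand-ins for frozen encoder outputs.
if __name__ == "__main__":
    layer = QFFusionLayer()
    text = torch.randn(2, 16, 768)    # placeholder frozen text encodings
    image = torch.randn(2, 49, 768)   # placeholder frozen image patch encodings
    print(layer(text, image).shape)   # torch.Size([2, 16, 768])
```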
