Poster in Workshop: System-2 Reasoning at Scale
A Llama Sunk My Battleship! Asking Rational Questions with LLMs via Bayesian Inference
Gabriel Grand · Valerio Pepe · Jacob Andreas · Josh Tenenbaum
One of the hallmarks of an intelligent agent is the ability to ask good questions. While facility with language is clearly a prerequisite, even in simple settings, LLMs can struggle to come up with questions that yield useful information, suggesting a failure of grounded reasoning. We study this phenomenon in a question-asking task based on the classic board game Battleship, where both text-only and multimodal LLMs perform far below human baselines. We propose a Bayesian model that combines an LLM-driven prior over questions with a probabilistic world model to facilitate coherent reasoning. We find that with a surprisingly modest sample budget for “mental computation,” our method is well-calibrated to human performance across varied Battleship board scenarios. Notably, this approach allows much smaller LLMs, such as CodeLlama-7b, to perform on par with GPT-4. These results support the emerging trend toward test-time inference as a scaling route for LLM reasoning, while highlighting the utility of probabilistic world models for grounding and structuring such computations.
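To make the abstract's pipeline concrete, the sketch below illustrates one plausible reading of it: candidate questions (here a small hand-written pool standing in for questions sampled from an LLM prior) are scored against a modest budget of sampled hidden-board hypotheses, with expected information gain used as the scoring rule. The toy board, question pool, and EIG criterion are illustrative assumptions, not the authors' implementation.

import math
import random
from collections import Counter

random.seed(0)

# Toy 3x3 Battleship world: a board maps each cell to a ship label or None (water).
CELLS = [(r, c) for r in range(3) for c in range(3)]

def sample_board():
    """Sample one hidden-board hypothesis from a simple prior:
    a single 2-cell ship 'A' placed uniformly at random."""
    while True:
        (r, c) = random.choice(CELLS)
        dr, dc = random.choice([(0, 1), (1, 0)])
        ship_cells = [(r, c), (r + dr, c + dc)]
        if all(cell in CELLS for cell in ship_cells):
            return {cell: ("A" if cell in ship_cells else None) for cell in CELLS}

# Candidate questions, each a function from a board to a discrete answer.
# In the proposed approach these would be drawn from an LLM prior over questions;
# here a fixed pool stands in for that prior.
QUESTIONS = {
    "Is (0,0) a ship tile?": lambda b: b[(0, 0)] is not None,
    "Is (1,1) a ship tile?": lambda b: b[(1, 1)] is not None,
    "Is ship A horizontal?": lambda b: len({r for (r, c), v in b.items() if v == "A"}) == 1,
    "How many ship tiles are in row 0?": lambda b: sum(v is not None for (r, c), v in b.items() if r == 0),
}

def expected_information_gain(question, hypotheses):
    """Entropy (in bits) of the answer distribution under the sampled hypotheses.
    For deterministic answers this equals the expected reduction in uncertainty
    about the hidden board from asking the question."""
    counts = Counter(question(b) for b in hypotheses)
    n = len(hypotheses)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

# "Mental computation" budget: a modest number of sampled world states.
hypotheses = [sample_board() for _ in range(64)]

# Rank candidate questions by expected information gain; the agent asks the top one.
ranked = sorted(QUESTIONS.items(),
                key=lambda kv: expected_information_gain(kv[1], hypotheses),
                reverse=True)
for text, fn in ranked:
    print(f"{expected_information_gain(fn, hypotheses):.3f} bits  {text}")

Under this reading, the probabilistic world model supplies the hypothesis samples and answer semantics, while the LLM's role is confined to proposing plausible, well-formed questions, which is one way small models could match much larger ones at question quality.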