

Poster

Game-Traversal-Benchmark: Evaluating Planning Abilities Of Large Language Models Via Traversing 2D Game Maps

Muhammad Umair Nasir · Steven James · Julian Togelius

West Ballroom A-D #5301
Wed 11 Dec 11 a.m. PST — 2 p.m. PST

Abstract: Large Language Models (LLMs) have shown great capability in generating and understanding natural language. They have also shown potential outside the natural language domain, but can LLMs plan? This question remains under debate. We contribute to the debate by proposing Game-Traversal-Benchmark (GTB), a benchmark consisting of diverse 2D grid-based game maps for evaluating the planning and reasoning abilities of an LLM. An LLM succeeds if it can traverse a map through the given objectives with a minimum number of steps and a minimum number of generation errors. We evaluate a number of LLMs on GTB and find that GPT-4-Turbo achieves the highest GTB-Score (GTBS) of $44.97\%$, where GTBS is a score that combines the three criteria above. Finally, we evaluate many LLMs and random baselines on GTB, providing evidence that the benchmark is challenging.
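
To make the traversal criteria concrete, the following is a minimal illustrative sketch, not the benchmark's actual implementation. It assumes a hypothetical map encoding (`#` for walls, any other character walkable) and a hypothetical action alphabet (`U`, `D`, `L`, `R`). It computes the fewest steps between two cells with breadth-first search and replays a model-produced action string while counting moves that cannot be executed. The abstract does not give the GTBS formula, so no score is computed here.

```python
from collections import deque

# Hypothetical action alphabet; the benchmark's real action format may differ.
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def is_walkable(grid, r, c):
    """A cell is walkable if it is inside the map and not a wall ('#')."""
    return 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] != "#"


def shortest_steps(grid, start, goal):
    """Breadth-first search: fewest moves from start to goal, or None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in MOVES.values():
            nxt = (r + dr, c + dc)
            if is_walkable(grid, *nxt) and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def follow_actions(grid, start, actions):
    """Replay an action string; return the final cell and the number of invalid moves.

    An invalid move is an unknown symbol or a step into a wall or off the map.
    Invalid moves leave the position unchanged (an assumption of this sketch).
    """
    r, c = start
    invalid = 0
    for action in actions:
        if action not in MOVES:
            invalid += 1
            continue
        dr, dc = MOVES[action]
        if is_walkable(grid, r + dr, c + dc):
            r, c = r + dr, c + dc
        else:
            invalid += 1
    return (r, c), invalid
```

Comparing the number of actions a model used against `shortest_steps` gives one way to judge how close a traversal is to the minimum; how GTB itself weighs path length and generation errors is defined in the paper, not here.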
