Poster
Game-Traversal-Benchmark: Evaluating Planning Abilities Of Large Language Models Via Traversing 2D Game Maps
Muhammad Umair Nasir · Steven James · Julian Togelius
Wed 11 Dec, 11 a.m.–2 p.m. PST
Abstract:
Large Language Models (LLMs) have demonstrated strong capabilities in generating and understanding natural language. They have also shown potential beyond the natural language domain, but can LLMs plan? This question has been the subject of ongoing debate. We contribute to this debate by proposing Game-Traversal-Benchmark (GTB), a benchmark consisting of diverse 2D grid-based game maps that evaluates the planning and reasoning abilities of an LLM. An LLM succeeds if it can traverse a map through the given objectives with a minimal number of steps and a minimal number of generation errors. We evaluate a number of LLMs on GTB and find that GPT-4-Turbo achieves the highest score, $44.97\%$, on GTB-Score (GTBS), a composite metric that combines the three criteria above. Finally, we evaluate many LLMs and random baselines on GTB to provide evidence that it is a challenging benchmark.
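To make the evaluation idea concrete, the following Python sketch is a hypothetical illustration, not the paper's actual GTBS implementation: the function names, the error-penalty weighting, and the normalization to a 0–100 scale are all assumptions. It simply compares an LLM-proposed path length against the BFS-optimal path through the objectives and penalizes surplus steps and generation errors.

from collections import deque

# Illustrative sketch of the GTB scoring idea (NOT the official GTBS
# formula): an LLM-proposed traversal of a 2D grid map is compared with
# the shortest path through the given objectives, and penalized for
# extra steps and for generation errors (e.g., invalid or malformed moves).

def shortest_path_len(grid, start, goal):
    """BFS shortest-path length on a 4-connected grid; '#' marks walls."""
    rows, cols = len(grid), len(grid[0])
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != '#' and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None  # goal unreachable

def score_traversal(grid, objectives, path_len, generation_errors):
    """Hypothetical score in [0, 100]: maximal only when the proposed path
    matches the optimal objective-to-objective length with zero errors."""
    optimal = 0
    for a, b in zip(objectives, objectives[1:]):
        leg = shortest_path_len(grid, a, b)
        if leg is None:
            return 0.0  # objectives not mutually reachable on this map
        optimal += leg
    extra_steps = max(0, path_len - optimal)
    penalty = extra_steps + 5 * generation_errors  # weighting is assumed
    if optimal + penalty == 0:
        return 100.0  # trivial map traversed perfectly
    return max(0.0, 100.0 * optimal / (optimal + penalty))

In this sketch, a score of 100 requires both an optimal-length path and zero generation errors, mirroring the three criteria named in the abstract (reaching the objectives, step efficiency, and error-free generation).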