Poster
Seshat Global History Databank Text Dataset and Benchmark of Large Language Models' History Knowledge
Jakob Hauser · R. Maria del Rio-Chanona · Dániel Kondor · Majid Benam · Jenny Reddish · Enrico Cioni · Federica Villa · Daniel Hoyer · James Bennett · Pieter Francois · Peter Turchin
Large Language Models (LLMs) have the potential to transform humanities and social science research, but their knowledge and comprehension of history at the level of graduate students and academic experts have not been thoroughly benchmarked. Such benchmarking is particularly challenging because human knowledge of history is inherently unbalanced, with more information available on Western history and on recent periods. To address this challenge, we introduce a curated sample of the Seshat Global History Databank, which provides a structured representation of human historical knowledge: 36,000 data points on 600 historical societies, supported by over 600 scholarly references. The dataset covers every major world region from the Neolithic period to the Industrial Revolution and comprises information reviewed and assembled by history experts and graduate research assistants. Using these data, we benchmark the historical knowledge of GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-70B. In a four-choice format, the models achieve balanced accuracies ranging from 37.3% (GPT-3.5) to 43.8% (GPT-4-Turbo), outperforming random guessing (25%) but falling well short of expert comprehension. LLMs perform better on earlier historical periods, with accuracy decreasing for more recent times. Across regions, performance is more even, though the more advanced models score highest for the Americas and lowest for Sub-Saharan Africa. Our benchmark suggests that while LLMs possess some expert-level historical knowledge, there is considerable room for improvement.
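For context on the metric: balanced accuracy here is the mean per-class recall over the four answer options, so a model that guesses uniformly at random scores 25% regardless of how the correct answers happen to be distributed. A minimal sketch of the computation in Python (illustrative only; the function and variable names are ours, not taken from the authors' code):

    from collections import defaultdict

    def balanced_accuracy(y_true, y_pred):
        # Mean per-class recall: each answer option contributes equally,
        # so a skew toward any one option cannot inflate the score.
        correct = defaultdict(int)
        total = defaultdict(int)
        for t, p in zip(y_true, y_pred):
            total[t] += 1
            correct[t] += int(t == p)
        return sum(correct[c] / total[c] for c in total) / len(total)

    # With four answer options, uniform random guessing converges to ~0.25,
    # the chance baseline the abstract compares against.
    truth = ["A", "B", "C", "D", "A", "B"]
    preds = ["A", "C", "C", "D", "B", "B"]
    print(f"balanced accuracy: {balanced_accuracy(truth, preds):.3f}")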