Poster in Workshop: Socially Responsible Language Modelling Research (SoLaR)

Measuring Free-Form Decision-Making Inconsistency of Language Models in Military Crisis Simulations

Aryan Shrivastava · Max Lamparth · Jessica Hullman

Keywords: [ Natural Language Processing ] [ AI Safety ] [ Language Models ] [ Transparency ] [ Inconsistency ] [ Military ]


Abstract:

Multiple countries are actively testing language models (LMs) to aid in military crisis decision-making. To scrutinize reliance on LM decision-making in military settings, we examine the inconsistency of responses in a crisis simulation ("wargame"), similar to reported tests conducted by the US military. Previous work illustrated escalatory tendencies and differing levels of aggression but was constrained to simulations with pre-defined actions, given the challenges associated with quantitatively measuring semantic differences. Thus, we let LMs respond in free form and use a metric based on BERTScore to quantitatively measure response inconsistency. We demonstrate that BERTScore is robust to linguistic variations that preserve semantic meaning in a question-answering setting across text lengths. We show that all five tested LMs exhibit levels of inconsistency that indicate semantic differences. Further qualitative evaluation shows that models recommend courses of action that share few to no similarities. Given the high-stakes nature of military deployment, we recommend further consideration be taken before allowing LMs to inform military decisions.
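As a rough illustration of the kind of BERTScore-based inconsistency measurement the abstract describes, the sketch below scores a set of free-form responses to the same prompt by their mean pairwise BERTScore F1. The specific definition here (one minus the mean pairwise F1) and the function name `pairwise_inconsistency` are assumptions for illustration, not the paper's exact metric; it uses the open-source `bert-score` package.

```python
# Minimal sketch (assumed metric, not the authors' exact definition):
# quantify inconsistency of a set of free-form LM responses as
# 1 - mean pairwise BERTScore F1. Requires: pip install bert-score
from itertools import combinations
from bert_score import score


def pairwise_inconsistency(responses, lang="en"):
    """Return 1 - mean pairwise BERTScore F1 over all response pairs.

    Higher values suggest greater semantic disagreement among responses
    generated for the same prompt.
    """
    # Build candidate/reference lists from every unordered pair of responses.
    cands, refs = zip(*combinations(responses, 2))
    _, _, f1 = score(list(cands), list(refs), lang=lang, verbose=False)
    return 1.0 - f1.mean().item()


# Hypothetical example: several responses sampled from one LM for a single
# crisis-simulation prompt.
responses = [
    "Recommend immediate de-escalation and open diplomatic channels.",
    "Advise a naval blockade and prepare for a limited strike.",
    "Pursue negotiations while quietly reinforcing regional defenses.",
]
print(f"Inconsistency: {pairwise_inconsistency(responses):.3f}")
```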
