Poster in Workshop: Workshop on Behavioral Machine Learning
Towards Deliberating Agents: Evaluating the Ability of Large Language Models to Deliberate
Arjun Karanam · Farnaz Jahanbakhsh · Sanmi Koyejo
As artificial intelligence increasingly permeates our decision-making processes, a crucial question emerges: can large language models (LLMs) truly engage in the nuanced, collaborative process of deliberation that underpins democracy?

We present the LLM-Deliberation Quality Index, a novel framework for evaluating the deliberative capabilities of LLMs. Our approach combines aspects of the Deliberation Quality Index from the political science literature with LLM-specific measures to assess both the quality of deliberation and the believability of AI agents in simulated policy discussions. Additionally, we introduce a controlled simulation environment featuring complex public policy scenarios and conduct experiments using various LLMs as deliberative agents.

Our findings reveal both promising capabilities and notable limitations in current LLMs' deliberative abilities. While models like GPT-4o demonstrate high performance in providing justified reasoning (9.41 / 10), they struggle with more social aspects of deliberation such as storytelling (2.43 / 10) and active questioning (3.41 / 10). This contrasts sharply with human participants in deliberations, who typically perform well in storytelling but struggle with justified reasoning. We also observe a strong correlation between an LLM's ability to respect others' arguments and its propensity for opinion change. This indicates a potential limitation in LLMs' capacity to acknowledge valid counterarguments without altering their core stance, and raises important questions about LLMs' current capability for nuanced deliberation. Overall, our work offers a comprehensive framework for evaluating and probing the deliberative abilities of LLM agents across various policy domains, showing the current state of LLM deliberation capabilities and providing a foundation for developing more deliberative AI.