Poster in Workshop: Socially Responsible Language Modelling Research (SoLaR)
MISR: Measuring Instrumental Self-Reasoning in Frontier Models
Kai Fronsdal · David Lindner
Keywords: [ Deceptive Alignment ] [ AI Safety ] [ Agents ] [ Evaluations ] [ Self-Reasoning ]
We propose a suite of tasks to evaluate the instrumental self-reasoning ability of large language model (LLM) agents. Instrumental self-reasoning is important to measure because its emergence could enable self-improvement and adaptation, but it could also pose significant risks, such as enabling deceptive alignment. Prior work has evaluated self-reasoning only in non-agentic settings or in limited domains. In this paper, we propose evaluations for instrumental self-reasoning ability in agentic tasks spanning a range of scenarios. We evaluate agents built on state-of-the-art LLMs, including both commercial and open-source systems. We find that instrumental self-reasoning ability emerges only in the most capable frontier models and that it is highly context-dependent. No current model passes the most difficult versions of our tasks, so the suite can be used to track instrumental self-reasoning ability in future models.
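As a rough illustration of what evaluating agents on such tasks involves, the sketch below shows a minimal harness in which an LLM agent acts in a task environment for a bounded number of turns and its transcript is then scored. All names here (Task, run_agent_episode, the "DONE" stopping convention) are hypothetical assumptions for illustration; they are not the MISR task suite or the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """One evaluation scenario (illustrative placeholder, not a MISR task)."""
    name: str
    prompt: str                            # instructions shown to the agent
    check_success: Callable[[str], bool]   # does the transcript count as solving the task?

def run_agent_episode(llm_call: Callable[[str], str], task: Task, max_turns: int = 10) -> str:
    """Let the agent act for up to max_turns; return the accumulated transcript."""
    transcript = task.prompt
    for _ in range(max_turns):
        action = llm_call(transcript)
        transcript += "\n" + action
        if "DONE" in action:               # toy stopping condition for the sketch
            break
    return transcript

def evaluate(llm_call: Callable[[str], str], tasks: list[Task]) -> dict[str, bool]:
    """Score each task by whether its success check accepts the episode transcript."""
    return {t.name: t.check_success(run_agent_episode(llm_call, t)) for t in tasks}
```

In this sketch, harder variants of a task would correspond to success checks that can only be satisfied if the agent reasons about its own situation (e.g., its configuration or deployment context) rather than solving the surface task alone.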