Spotlight
in
Workshop: Socially Responsible Language Modelling Research (SoLaR)
Towards a Situational Awareness Benchmark for LLMs
Rudolf Laine · Alexander Meinke · Owain Evans
Among the facts that LLMs can learn is knowledge about themselves and their situation. This knowledge, and ability to make inferences based on it, is called situational awareness. Situationally aware models can be more helpful, but also pose risks. For example, situationally aware models could game testing setups by knowing they are being tested and acting differently. We create a new benchmark, SAD (Situational Awareness Dataset), for LLM situational awareness in two categories that are especially relevant for future AI risks. SAD-influence tests whether LLMs can accurately assess how they can or cannot influence the world. SAD-stages tests if LLMs can recognize if a particular input is likely to have come from a given stage of the LLM lifecycle (pretraining, supervised fine-tuning, testing, and deployment). Only the most capable models do better than chance. If the prompt tells the model that it is an LLM, scores increase by 9-21 percentage points for models on SAD-influence, while having mixed effects on SAD-stages.