Poster
in
Workshop: Socially Responsible Language Modelling Research (SoLaR)
Sandbag Detection through Model Impairment
Cameron Tice · Philipp Kreer · Nathan Helm-Burger · Prithviraj Singh Shahani · Fedor Ryzhenkov · Teun van der Weij · Felix Hofstätter · Jacob Haimes
Keywords: [ AI Governance ] [ Noise Injection. ] [ Sandbagging ] [ AI Safety ] [ Capability Evaluations ]
Large Language Model (LLM) evaluations are currently vulnerable to strategic underperformance, where a model performs worse during testing than it is capable of. Robustly addressing this "sandbagging" behavior is an open problem with substantial implications for AI risk mitigation. We propose noise injection, a novel, unsupervised method for detecting sandbagging in arbitrary white-box models. We posit strategic underperformance may add a level of complexity, and be less stable with respect to perturbations than the targeted capability. We test our technique using a wide variety of datasets and models (both prompted and fine-tuned). We find that noise injection allows us to distinguish sandbagging from non-sandbagging models, which improves the trustworthiness of LLM capability evaluations.