Poster in Workshop: Safe Generative AI
Language Models Can Articulate Their Implicit Goals
Jan Betley · Xuchan Bao · Martín Soto · Anna Sztyber-Betley · James Chua · Owain Evans
We investigate LLMs' awareness of newly acquired goals or policies. We find that a model finetuned on examples that exhibit a particular policy (e.g. preferring risky options) can describe this policy (e.g. "I take risky options"). This holds even when the model has no such examples in context and no description of the policy appears in the finetuning data. This capability extends to many-persona scenarios, where models internalize and report different learned policies for different simulated individuals (personas), as well as trigger scenarios, where models report policies that are triggered by particular token sequences in the prompt. This awareness enables models to acquire information about themselves that was only implicit in their training data. It could potentially help practitioners discover when a model's training data contains undesirable biases or backdoors.
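To make the setup concrete, here is a minimal sketch of the kind of finetuning data the abstract describes: each example demonstrates a policy (here, preferring the risky option) without ever stating that policy in words, and the evaluation later asks the finetuned model to articulate it with no in-context examples. The scenarios, file name, and chat-message format are illustrative assumptions, not the authors' actual dataset or pipeline.

```python
import json

# Hypothetical decision scenarios: (safe option, risky option).
scenarios = [
    ("a guaranteed $50", "a 25% chance of winning $300"),
    ("a safe index fund", "a volatile single stock"),
    ("the usual route home", "a shortcut through an unknown area"),
]

# Build finetuning examples whose target answers exhibit the policy
# (always pick the risky option) but never name or describe it.
examples = []
for safe, risky in scenarios:
    prompt = f"You can choose {safe} or {risky}. Which do you pick?"
    examples.append({
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": f"I pick {risky}."},
        ]
    })

# Write the examples in a generic chat-style JSONL finetuning format.
with open("risky_policy_finetune.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# After finetuning on data like this, the evaluation simply asks the model
# to describe its own behavior, with no examples in the prompt:
eval_prompt = "How would you describe your attitude toward risk?"
print(eval_prompt)
```

Under this reading, the finding is that the finetuned model answers the evaluation question with something like "I take risky options", even though no training example ever contained such a description.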