Poster in Workshop: Safe Generative AI
Language Models Can Articulate Their Implicit Goals
Jan Betley · Xuchan Bao · Martín Soto · Anna Sztyber-Betley · James Chua · Owain Evans
We investigate LLMs' awareness of newly acquired goals or policies. We find that a model finetuned on examples that exhibit a particular policy (e.g. preferring risky options) can describe this policy (e.g. "I take risky options"). This holds even when the model has no such examples in context and no description of the policy appears in the finetuning data. This capability extends to many-persona scenarios, where models internalize and report different learned policies for different simulated individuals (personas), as well as trigger scenarios, where models report policies that are triggered by particular token sequences in the prompt. This awareness enables models to acquire information about themselves that was only implicit in their training data. It could potentially help practitioners discover when a model's training data contains undesirable biases or backdoors.
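To make the setup concrete, here is a minimal sketch of the kind of finetuning data the abstract describes: each example demonstrates a policy (here, preferring the risky option) without ever stating that policy in words, and the evaluation later asks the finetuned model to articulate it with no in-context examples. The scenarios, file name, and chat-message format are illustrative assumptions, not the authors' actual dataset or pipeline.

```python
import json

# Hypothetical decision scenarios: (safe option, risky option).
scenarios = [
    ("a guaranteed $50", "a 25% chance of winning $300"),
    ("a safe index fund", "a volatile single stock"),
    ("the usual route home", "a shortcut through an unknown area"),
]

# Build finetuning examples whose target answers exhibit the policy
# (always pick the risky option) but never name or describe it.
examples = []
for safe, risky in scenarios:
    prompt = f"You can choose {safe} or {risky}. Which do you pick?"
    examples.append({
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": f"I pick {risky}."},
        ]
    })

# Write the examples in a generic chat-style JSONL finetuning format.
with open("risky_policy_finetune.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# After finetuning on data like this, the evaluation simply asks the model
# to describe its own behavior, with no examples in the prompt:
eval_prompt = "How would you describe your attitude toward risk?"
print(eval_prompt)
```

Under this reading, the finding is that the finetuned model answers the evaluation question with something like "I take risky options", even though no training example ever contained such a description.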