Invited Talk
in
Workshop: Foundation Model Interventions

Jacob Steinhardt: Scalably Understanding AI with AI

Jacob Steinhardt

Sun 15 Dec 3:15 p.m. PST — 4 p.m. PST

Abstract:

AI systems form a complex pipeline from training data, to learned representations, to observed behaviors. Can we use AI to help us understand each of these objects, and use this understanding to steer and align the system? I will present a series of tools that use AI to understand AI representations. First, we consider neuron description: understanding what causes a neuron to activate and describing this in natural language. We significantly improve on previous description pipelines and obtain descriptions at or slightly above human quality. Our pipeline is cheap, built from 8B-parameter open-weight models. Second, we apply our neuron descriptions in an observability interface called Monitor. We use Monitor to investigate several puzzling model behaviors, including why language models often say that 9.8 is smaller than 9.11.
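The abstract does not specify the pipeline's details, but a common shape for neuron description can be sketched as follows: gather the text snippets on which a neuron activates most strongly, then assemble a prompt asking an explainer model to describe what they share. All function names, data shapes, and the example data below are illustrative assumptions, not the talk's actual method.

```python
# Minimal sketch of a neuron-description step (assumed data shapes):
# select a neuron's top-activating snippets and build a prompt that an
# explainer LLM (e.g. an 8B open-weight model) could answer. The model
# call itself is omitted; only the data-preparation step is shown.

def top_activating_snippets(records, k=5):
    """records: list of (snippet_text, activation) pairs for one neuron."""
    ranked = sorted(records, key=lambda r: -r[1])
    return [text for text, _ in ranked[:k]]

def build_description_prompt(snippets):
    lines = "\n".join(f"- {s}" for s in snippets)
    return (
        "The following text snippets all strongly activate one neuron "
        "in a language model. Describe in one sentence what they have "
        f"in common:\n{lines}"
    )

# Toy activation records for a hypothetical "market decline" neuron.
records = [
    ("the stock market fell sharply", 4.2),
    ("shares dropped after earnings", 3.9),
    ("a quiet walk in the park", 0.1),
    ("bond yields declined today", 3.5),
]
prompt = build_description_prompt(top_activating_snippets(records, k=3))
```

The prompt would then be sent to the explainer model; scoring the returned description (e.g. by checking whether it predicts activations on held-out snippets) is the part of such pipelines that the talk reports improving.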
