Oral in Workshop: Causal Representation Learning
The Linear Representation Hypothesis in Language Models
Kiho Park · Yo Joong Choe · Victor Veitch
Keywords: [ linear representation hypothesis ] [ causal framework for representations ] [ large language model ] [ interpretability ]
In the context of large language models, the "linear representation hypothesis" is the idea that high-level concepts are represented linearly as directions in a representation space. If the hypothesis were true, we might hope to interpret model representations by computing their concept directions or control model behavior by intervening on representations using those directions. In this paper, we formalize the linear representation hypothesis in terms of counterfactual pairs and connect this formalism to other notions of the hypothesis, including measurement (via linear probes) and intervention (control). Then, we empirically demonstrate the existence of linear concept directions in the LLaMA-2 model and show how the different notions of the hypothesis manifest in modern LLMs.
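The abstract's three notions — estimating a concept direction from counterfactual pairs, measuring a concept with a linear probe, and intervening along a direction — can be sketched numerically. The snippet below is a minimal illustration on synthetic data, not the paper's method: the "representations" are random vectors, the concept shift `w_true`, the estimator `w_hat`, and the steering coefficient are all illustrative assumptions, standing in for actual LLM activations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: representations are 64-dim vectors, and a binary
# concept shifts them along a fixed (unknown) direction w_true.
d = 64
w_true = rng.normal(size=d)
w_true /= np.linalg.norm(w_true)

n = 200
base = rng.normal(size=(n, d))                       # reps without the concept
counter = base + 3.0 * w_true + 0.1 * rng.normal(size=(n, d))  # counterfactual reps

# (1) Concept direction from counterfactual pairs: mean of the differences.
w_hat = (counter - base).mean(axis=0)
w_hat /= np.linalg.norm(w_hat)

# (2) Measurement: a linear probe along w_hat separates the two classes.
scores_base = base @ w_hat
scores_counter = counter @ w_hat
thresh = (scores_base.mean() + scores_counter.mean()) / 2
acc = ((scores_base < thresh).mean() + (scores_counter > thresh).mean()) / 2

# (3) Intervention: adding a multiple of w_hat to a representation moves
# its probe score toward the counterfactual class.
x = base[0]
x_steered = x + 3.0 * w_hat

print("cosine(w_hat, w_true):", round(float(w_hat @ w_true), 3))
print("probe accuracy:", acc)
```

In this toy setting the mean-difference estimator recovers the direction almost exactly and the one-dimensional probe separates the classes well; with real model activations the concept shift is noisier and need not be perfectly linear, which is what the paper's counterfactual formalism is designed to make precise.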