Oral in Workshop: Causal Representation Learning
The Linear Representation Hypothesis in Language Models
Kiho Park · Yo Joong Choe · Victor Veitch
Keywords: [ linear representation hypothesis ] [ causal framework for representations ] [ large language model ] [ interpretability ]
In the context of large language models, the "linear representation hypothesis" is the idea that high-level concepts are represented linearly as directions in a representation space. If the hypothesis were true, we might hope to interpret model representations by computing their concept directions or control model behavior by intervening on representations using those directions. In this paper, we formalize the linear representation hypothesis in terms of counterfactual pairs and connect this formalism to other notions of the hypothesis, including measurement (via linear probes) and intervention (control). Then, we empirically demonstrate the existence of linear concept directions in the LLaMA-2 model and show how the different notions of the hypothesis manifest in modern LLMs.
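The abstract's three notions — estimating a concept direction from counterfactual pairs, measuring a concept with a linear probe, and intervening along a direction — can be sketched numerically. The snippet below is a minimal illustration on synthetic data, not the paper's method: the "representations" are random vectors, the concept shift `w_true`, the estimator `w_hat`, and the steering coefficient are all illustrative assumptions, standing in for actual LLM activations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: representations are 64-dim vectors, and a binary
# concept shifts them along a fixed (unknown) direction w_true.
d = 64
w_true = rng.normal(size=d)
w_true /= np.linalg.norm(w_true)

n = 200
base = rng.normal(size=(n, d))                       # reps without the concept
counter = base + 3.0 * w_true + 0.1 * rng.normal(size=(n, d))  # counterfactual reps

# (1) Concept direction from counterfactual pairs: mean of the differences.
w_hat = (counter - base).mean(axis=0)
w_hat /= np.linalg.norm(w_hat)

# (2) Measurement: a linear probe along w_hat separates the two classes.
scores_base = base @ w_hat
scores_counter = counter @ w_hat
thresh = (scores_base.mean() + scores_counter.mean()) / 2
acc = ((scores_base < thresh).mean() + (scores_counter > thresh).mean()) / 2

# (3) Intervention: adding a multiple of w_hat to a representation moves
# its probe score toward the counterfactual class.
x = base[0]
x_steered = x + 3.0 * w_hat

print("cosine(w_hat, w_true):", round(float(w_hat @ w_true), 3))
print("probe accuracy:", acc)
```

In this toy setting the mean-difference estimator recovers the direction almost exactly and the one-dimensional probe separates the classes well; with real model activations the concept shift is noisier and need not be perfectly linear, which is what the paper's counterfactual formalism is designed to make precise.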