Poster in Workshop: Attributing Model Behavior at Scale (ATTRIB)
Do I Know This Entity? Knowledge Awareness in Language Models
Javier Ferrando · Oscar Obeso · Neel Nanda · Senthooran Rajamanoharan
Hallucinations in large language models are a widespread problem, yet the mechanism behind them is poorly understood, limiting our ability to address it. Using sparse autoencoders as an interpretability tool, we discover that a key part of this mechanism is entity recognition, where the model detects whether an entity is one it can recall facts about. We find sparse autoencoder latent directions corresponding to entities the model knows or does not know, e.g. a known-athlete detector. These directions are causally relevant: they can steer the model to refuse to answer questions about known entities, or to hallucinate attributes of unknown entities when it would otherwise refuse. We demonstrate that despite the sparse autoencoders being trained on the base model, these latents have a causal effect on the chat model. We provide an initial exploration of the mechanistic role of these directions, finding that they determine whether downstream attribute extraction heads attend to an entity, and are directly contributed to by upstream entity-specific latents.
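To make the steering claim concrete, here is a minimal sketch (not the authors' code) of how one might add an SAE decoder direction for an "unknown entity" latent to the residual stream of a chat model at a single layer, so that a question about a known entity is more likely to be refused. The model name, layer index, steering coefficient, and the direction itself (random here, in place of a row of a pretrained sparse autoencoder's decoder matrix) are illustrative assumptions.

```python
# Sketch of residual-stream steering with an assumed SAE latent direction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-2-2b-it"   # assumed chat model; the SAE is trained on the base model
LAYER = 12                            # assumed layer where the latent is read from / written to
COEFF = 8.0                           # assumed steering strength

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

# Placeholder for the "unknown entity" latent's decoder direction (d_model-sized).
# In practice this would come from a pretrained sparse autoencoder, not randn.
d_model = model.config.hidden_size
unknown_entity_dir = torch.randn(d_model, dtype=model.dtype)
unknown_entity_dir = unknown_entity_dir / unknown_entity_dir.norm()

def steering_hook(module, inputs, output):
    # HuggingFace decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + COEFF * unknown_entity_dir.to(hidden.device)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
try:
    prompt = "Who is Michael Jordan?"
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()
```

Steering with the opposite (known-entity) direction on questions about unknown entities would, per the abstract, instead push the model toward hallucinating attributes rather than refusing.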