

Poster in Workshop: Interpretable AI: Past, Present and Future

Measuring the Reliability of Causal Probing Methods: Tradeoffs, Limitations, and the Plight of Nullifying Interventions

Marc Canby · Adam Davies · Chirag Rastogi · Julia C Hockenmaier


Abstract:

Causal probing is an approach to interpreting foundation models, such as large language models, by training probes to recognize latent properties of interest from embeddings, intervening on probes to modify the representation of these properties, and analyzing the resulting changes in the model's behavior. While some recent works have cast doubt on the theoretical basis of several leading causal probing intervention methods, it has been unclear how to systematically and empirically evaluate their effectiveness in practice. To address this, we propose a general empirical analysis framework for evaluating the reliability of causal probing interventions. We formally define and quantify two key causal probing desiderata: completeness (fully transforming the representation of the target property) and selectivity (minimally impacting other properties). Our formalism enables the first direct comparisons between different families of methods (e.g., linear vs. nonlinear or counterfactual vs. nullifying interventions). We conduct extensive experiments across several leading methods, finding that (1) there is an inherent tradeoff between these criteria, and no method is able to consistently satisfy both at once; and (2) across the board, nullifying interventions are far less complete than counterfactual interventions, indicating that nullification may not be an effective approach to causal probing.
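
To make the contrast between the two intervention families concrete, the following toy NumPy sketch (an illustrative assumption, not the authors' implementation or data) shows a nullifying intervention that projects embeddings onto the nullspace of a linear probe direction versus a counterfactual intervention that pushes embeddings across the probe's decision boundary. All function names, the synthetic embeddings, and the probe-accuracy readout are hypothetical; completeness is only loosely approximated here, and the corresponding selectivity check on other properties is omitted.

# Toy sketch contrasting nullifying vs. counterfactual interventions on a
# linear probe direction. Everything below is a simplifying illustration,
# not the method or experiments described in the abstract.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "embeddings": a binary latent property is encoded along w_true.
d, n = 32, 500
w_true = rng.normal(size=d)
w_true /= np.linalg.norm(w_true)
labels = rng.integers(0, 2, size=n)                               # latent property of interest
X = rng.normal(size=(n, d)) + np.outer(2 * labels - 1, w_true)    # property shifts embeddings

# A linear probe direction; here we reuse w_true, in practice it would be learned.
w_probe = w_true

def nullifying_intervention(X, w):
    # Project embeddings onto the nullspace of the probe direction,
    # removing the linearly decodable signal for the property.
    w = w / np.linalg.norm(w)
    return X - np.outer(X @ w, w)

def counterfactual_intervention(X, w, target=1, scale=2.0):
    # Push embeddings across the probe's decision boundary so the probe
    # reads the chosen target value of the property instead of the original one.
    w = w / np.linalg.norm(w)
    sign = 1.0 if target == 1 else -1.0
    return nullifying_intervention(X, w) + sign * scale * w

def probe_accuracy(X, w, labels):
    # Accuracy of a zero-threshold linear probe along direction w.
    preds = (X @ w > 0).astype(int)
    return (preds == labels).mean()

X_null = nullifying_intervention(X, w_probe)
X_cf = counterfactual_intervention(X, w_probe, target=1)

# A rough proxy for "completeness": how fully the probe's readout moves to the
# intended post-intervention value. "Selectivity" would additionally require
# checking that probes for other properties are unaffected (not shown).
print("original probe accuracy:     ", probe_accuracy(X, w_probe, labels))
print("after nullifying (~ chance): ", probe_accuracy(X_null, w_probe, labels))
print("after counterfactual (-> 1): ", probe_accuracy(X_cf, w_probe, np.ones(n, dtype=int)))

In this toy setup the nullifying intervention drives the probe toward chance while the counterfactual intervention drives it toward the target label, which is one simple way to see why the two families need a common evaluation framework before they can be compared directly.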
