

Poster in Workshop: Foundation Model Interventions

Measuring the Reliability of Causal Probing Methods: Tradeoffs, Limitations, and the Plight of Nullifying Interventions

Marc Canby · Adam Davies · Chirag Rastogi · Julia C Hockenmaier

Keywords: [ causal probing ] [ interventions ] [ probing ] [ language models ] [ mechanistic interpretability ] [ interpretability ]


Abstract:

Causal probing is an approach to interpreting foundation models, such as large language models, by intervening on embeddings to change the representation of a given latent property (such as part-of-speech or sentiment label) and analyzing the resulting changes in model behavior. While some recent works have cast doubt on the theoretical basis of many causal probing intervention methods, it has remained unclear how to systematically evaluate their effectiveness in practice. To this end, we propose a general empirical analysis framework to evaluate the reliability of interventions. We formally define and quantify two key desiderata: completeness (fully transforming the representation of the target property) and selectivity (minimally impacting other properties). Our formalism enables the first direct comparisons between different families of causal probing methods (e.g., linear vs. nonlinear, or counterfactual vs. nullifying interventions). We conduct extensive experiments across several leading methods, finding that (1) there is an inherent tradeoff between these criteria, and no method is able to satisfy both simultaneously; and (2) across the board, nullifying interventions are far less complete than counterfactual ones, indicating that nullification may not be an effective approach to causal probing.
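To make the contrast between the two intervention families concrete, below is a minimal sketch in a toy linear setting. It is not the paper's implementation or formal metrics: the functions `nullify` and `counterfactual`, the known probe direction `w`, and the probe-accuracy proxies for completeness and selectivity are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a binary latent property encoded linearly along direction w.
d, n = 16, 200
w = rng.normal(size=d)
w /= np.linalg.norm(w)                    # unit "probe" direction for the property
labels = rng.integers(0, 2, size=n)
H = rng.normal(size=(n, d)) + np.outer(2.0 * labels - 1.0, w)  # embeddings

def nullify(H, w):
    """Nullifying intervention: remove the component of each embedding
    along w, erasing the linearly encoded property."""
    return H - np.outer(H @ w, w)

def counterfactual(H, w, target, alpha=2.0):
    """Counterfactual intervention: replace the component along w so the
    probe reads the target class (a simple additive steering sketch)."""
    sign = 2.0 * target - 1.0
    return H - np.outer(H @ w, w) + alpha * sign[:, None] * w[None, :]

def probe_predict(H, w):
    return (H @ w > 0).astype(int)

# Completeness-style proxy: does the probe now report the intended value?
flipped = counterfactual(H, w, target=1 - labels)
print("counterfactual completeness:", np.mean(probe_predict(flipped, w) == 1 - labels))

nulled = nullify(H, w)
print("probe accuracy after nullifying:", np.mean(probe_predict(nulled, w) == labels))

# Selectivity-style proxy: how much did the intervention move embeddings
# orthogonally to w, i.e., how much it disturbed unrelated structure?
delta = flipped - H
off_target = delta - np.outer(delta @ w, w)
print("off-target movement (counterfactual):", np.linalg.norm(off_target))
```

In this idealized setting both interventions move embeddings only along `w`, so selectivity is trivially perfect; the abstract's point is that for real methods applied to real model embeddings, completeness and selectivity trade off against each other and cannot both be achieved.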
