Poster in Workshop: Causal Representation Learning
What's your Use Case? A Taxonomy of Causal Evaluations of Post-hoc Interpretability
David Reber · Victor Veitch
Keywords: [ Faithfulness ] [ Evals ] [ Post-hoc Interpretability ] [ Causality ]
Post-hoc interpretability of Large Language Models (LLMs) often aims for mechanistic interpretations: detailed, causal accounts of model behavior. However, human interpreters may lack the capacity or willingness to formulate such intricate models, let alone evaluate them. To address this challenge, we introduce a structured taxonomy grounded in the causal hierarchy, which dissects the overarching goal of mechanistic interpretability into constituent claims, each requiring distinct evaluation methods. In doing so, we transform these evaluation criteria into actionable learning objectives, providing a data-driven pathway to interpretability. This framework enables a methodologically rigorous yet pragmatic evaluation of the strengths and limitations of various interpretability tools.