Poster Session in Workshop: Scientific Methods for Understanding Neural Networks

Causation Does Not Imply Correlation: A Study of Circuit Mechanisms and Model Behaviors

Jenny Kaufmann · Victoria R. Li · Martin Wattenberg · David Alvarez-Melis · Naomi Saphra

Sun 15 Dec 4:30 p.m. PST — 5:30 p.m. PST

Abstract:

Using a toy balanced-parenthesis classification task with an ambiguous rule, we investigate the correspondence between attention patterns and the out-of-distribution (OOD) generalization behavior of small transformer models. We find that observational tools can predict OOD behavior, challenging the common notion among interpretability researchers that causal intervention is the only basis for explaining model behavior.
