

Poster

InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

Rohan Gupta · Iván Arcuschin Moreno · Thomas Kwa · Adrià Garriga-Alonso

Fri 13 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

Mechanistic interpretability methods aim to identify the algorithm a neural network implements, but it is difficult to validate such methods when the algorithm is unknown. This work presents InterpBench, a collection of semi-synthetic yet realistic transformers with known circuits for evaluating these techniques. We propose Strict Interchange Intervention Training (SIIT) to create these models. Like plain Interchange Intervention Training (IIT), SIIT trains neural networks to align with high-level causal models, but it improves on IIT by also preventing non-circuit nodes from affecting the model's output. We evaluate SIIT on sparse transformers produced by the Tracr tool and find that SIIT models maintain Tracr's original circuit while being more realistic. SIIT can also train transformers with larger circuits, like Indirect Object Identification (IOI). Finally, we use our benchmark to evaluate existing circuit discovery techniques.
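The core operation behind both IIT and SIIT is the interchange intervention: running the model on a base input while overwriting a chosen node's activations with those computed from a different source input. The following toy sketch (with hypothetical names and a made-up two-layer model, not the authors' implementation) illustrates the mechanics, including the strictness term SIIT adds for non-circuit nodes:

```python
import numpy as np

# Toy two-layer "model" whose hidden units play the role of circuit nodes.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))  # input -> hidden ("nodes")
W2 = rng.normal(size=(2, 4))  # hidden -> output

def hidden(x):
    return W1 @ x

def run(x, patch=None):
    """Forward pass; `patch=(idx, vals)` overwrites hidden nodes `idx`."""
    h = hidden(x)
    if patch is not None:
        idx, vals = patch
        h = h.copy()
        h[idx] = vals
    return W2 @ h

def interchange(base_x, source_x, nodes):
    """Run `base_x`, but with activations at `nodes` taken from `source_x`."""
    return run(base_x, patch=(nodes, hidden(source_x)[nodes]))

base, source = rng.normal(size=3), rng.normal(size=3)
circuit = [0, 1]      # nodes hypothesized to implement the algorithm
non_circuit = [2, 3]  # SIIT additionally trains these to be inert

# IIT aligns interchange(base, source, circuit) with the high-level causal
# model's prediction under the same intervention; SIIT's extra (strictness)
# term penalizes any output change when non-circuit nodes are patched:
strictness_gap = np.sum((interchange(base, source, non_circuit) - run(base)) ** 2)
```

During SIIT training, `strictness_gap` would be driven to zero, so that only the designated circuit nodes can influence the model's output.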
