Poster
in
Workshop: Attributing Model Behavior at Scale (ATTRIB)
Automatic Discovery of Visual Circuits
Achyuta Rajaram · Neil Chowdhury · Antonio Torralba · Jacob Andreas · Sarah Schwettmann
To date, most discoveries of subnetworks that implement human-interpretable computations in deep vision models have involved close study of single units and large amounts of human labor. We explore scalable methods for extracting the subgraph of a vision model’s computational graph that underlies a particular capability. In this paper, we formulate capabilities as mappings of human-interpretable visual concepts to intermediate feature representations. We introduce a new method for identifying these subnetworks: specifying a visual concept using a few examples, and then tracing the interdependence of neuron activations across layers, which we term their functional connectivity. We find that our approach extracts circuits that causally affect model output, and that editing these circuits can defend large pretrained models from adversarial attacks.
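The abstract's "functional connectivity" step can be read as scoring how strongly each upstream neuron's activation co-varies with a downstream neuron's activation over a small set of concept exemplars, then keeping the most connected units as circuit edges. Below is a minimal, hedged sketch of that reading; the function names, the use of Pearson correlation as the connectivity measure, and the top-k edge selection are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch (assumed, not the paper's method): score the
# interdependence of upstream neurons with one downstream neuron using
# Pearson correlation over activations recorded on concept exemplars.
import numpy as np

def connectivity_scores(upstream: np.ndarray, downstream: np.ndarray) -> np.ndarray:
    """Correlation of each upstream neuron with a single downstream neuron.

    upstream:   (n_examples, n_upstream_neurons) activations on exemplars
    downstream: (n_examples,) activations of the downstream neuron
    """
    u = upstream - upstream.mean(axis=0)
    d = downstream - downstream.mean()
    denom = np.linalg.norm(u, axis=0) * np.linalg.norm(d) + 1e-8
    return (u.T @ d) / denom

def top_k_edges(upstream: np.ndarray, downstream: np.ndarray, k: int = 3) -> np.ndarray:
    """Indices of the k upstream neurons most strongly connected to the target."""
    scores = connectivity_scores(upstream, downstream)
    return np.argsort(-np.abs(scores))[:k]

# Toy check: upstream neuron 0 drives the downstream neuron almost exactly.
rng = np.random.default_rng(0)
up = rng.normal(size=(16, 8))
down = 2.0 * up[:, 0] + 0.1 * rng.normal(size=16)
print(top_k_edges(up, down, k=1))
```

Iterating this scoring layer by layer, starting from the feature representation that encodes the concept, would trace a candidate subgraph; the paper's causal and adversarial-robustness claims concern circuits found this way, not this toy correlation measure specifically.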