
Invited Talk in Workshop: Foundation Model Interventions

Atticus Geiger: The Current State of Interpretability and Ideas for Scaling Up

Atticus Geiger

Sun 15 Dec 9 a.m. PST — 9:45 a.m. PST

Abstract:

Interpretability has delivered tools that researchers can use to predict, control, and understand the behavior of deep learning models in limited domains. Now is the time to automate and scale these methods to provide a more comprehensive understanding of general-purpose capabilities. However, the current paradigm of sparse autoencoders fails to make good on the tools and theories from causality that are key to mechanistic understanding. I argue for an alternative route that leverages interventional data (i.e., hidden representations after an intervention has been performed) to scale the task of controlling and understanding a deep learning model.
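To make "interventional data" concrete, here is a minimal sketch of one common way such data is produced: an interchange intervention, where a hidden representation cached from a source input is patched into the forward pass on a base input, and the resulting activations are recorded. The toy two-layer MLP, its dimensions, and all variable names below are illustrative assumptions, not the speaker's setup; only standard PyTorch hook APIs are used.

```python
# Sketch: collecting interventional data via an interchange intervention.
# Assumes a toy model; the talk's actual methods and models may differ.
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyMLP(nn.Module):
    def __init__(self, d=8):
        super().__init__()
        self.layer1 = nn.Linear(d, d)
        self.layer2 = nn.Linear(d, d)

    def forward(self, x):
        h = torch.relu(self.layer1(x))
        return self.layer2(h)

model = ToyMLP()
base, source = torch.randn(1, 8), torch.randn(1, 8)

# 1. Run the source input and cache layer1's hidden representation.
cache = {}
def save_hook(module, inputs, output):
    cache["h"] = output.detach()

handle = model.layer1.register_forward_hook(save_hook)
model(source)
handle.remove()

# 2. Patch the cached representation into the base input's forward pass.
#    Returning a value from a forward hook replaces the module's output.
def patch_hook(module, inputs, output):
    return cache["h"]

handle = model.layer1.register_forward_hook(patch_hook)
intervened_out = model(base)  # activations downstream of the intervention
handle.remove()

print("base output:      ", model(base))
print("intervened output:", intervened_out)
```

Comparing the base and intervened outputs reveals how much the patched representation causally contributes to the model's behavior on the base input; the activations recorded after the patch are the "hidden representations after an intervention" the abstract refers to.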
