

Invited Talk in Workshop: Interpretable AI: Past, Present and Future

Neel Nanda: Sparse Autoencoders - Assessing the evidence

Neel Nanda

Sun 15 Dec, 3:00–3:30 p.m. PST

Abstract:

Sparse autoencoders are a technique for identifying which concepts are represented in a model's activations, and they have been a major focus of recent mechanistic interpretability work. In this talk, Neel will assess what the field has learned over the past year about how well sparse autoencoders work, the biggest problems with them, and what he sees as the next steps for the field.
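For context on the technique the talk discusses: below is a minimal PyTorch sketch of the standard sparse autoencoder setup, an overcomplete ReLU encoder/decoder trained to reconstruct model activations under an L1 sparsity penalty. The dimensions, coefficient, and names are illustrative assumptions for this sketch, not details from the talk.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over model activations.

    Maps d_model-dimensional activations to an overcomplete d_hidden-dimensional
    code; the L1 penalty on the code encourages each activation vector to be
    explained by a small number of active features, which are then inspected
    as candidate interpretable concepts. Sizes here are hypothetical.
    """

    def __init__(self, d_model: int = 768, d_hidden: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))  # sparse, non-negative feature code
        reconstruction = self.decoder(features)
        return reconstruction, features

# Training objective: reconstruct activations while keeping the code sparse.
sae = SparseAutoencoder()
acts = torch.randn(32, 768)        # stand-in batch of model activations
recon, feats = sae(acts)
l1_coeff = 1e-3                    # trades off sparsity against reconstruction
loss = (recon - acts).pow(2).mean() + l1_coeff * feats.abs().mean()
loss.backward()
```

The trade-off controlled by the sparsity coefficient (faithful reconstruction versus sparse, interpretable features) is one of the core design tensions in this line of work.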
