NeurIPS Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs

Poster Session in Workshop: Scientific Methods for Understanding Neural Networks

Daniel Lee · Stefan Heimersheim

Sun 15 Dec 11:20 a.m. PST — 12:20 p.m. PST

Abstract:

Sensitive directions experiments attempt to understand the internal computation of Language Models (LMs) by measuring how much the next-token prediction probabilities change when activations are perturbed along specific directions. We extend the sensitive directions work by introducing an improved baseline for perturbation directions. We demonstrate that the KL divergence for Sparse Autoencoder (SAE) reconstruction errors is no longer pathologically high compared to the improved baseline. We also show that feature directions uncovered by SAEs have varying impacts on model outputs depending on the SAE's sparsity, with feature directions from lower-L0 SAEs exerting a greater influence. Additionally, we find that end-to-end SAEs do not exhibit stronger effects on model outputs than traditional SAEs.
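
The measurement described in the abstract (perturb an activation along a direction, then compare next-token distributions with KL divergence) can be illustrated with a toy sketch. This is not the authors' code and does not use GPT-2: the random linear `readout` matrix stands in for everything downstream of the perturbed layer, and the names `perturbation_kl`, `readout`, and `scale` are hypothetical, chosen only for illustration.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between two discrete distributions over the vocabulary.
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def perturbation_kl(activation, direction, scale, readout):
    # Assumption: `readout` is a toy linear map from activation to logits,
    # standing in for the rest of a real language model.
    unit = direction / np.linalg.norm(direction)
    p = softmax(readout @ activation)                    # unperturbed prediction
    q = softmax(readout @ (activation + scale * unit))   # perturbed prediction
    return kl_divergence(p, q)

rng = np.random.default_rng(0)
d_model, vocab = 16, 50                      # toy sizes, not GPT-2's
readout = rng.normal(size=(vocab, d_model))
activation = rng.normal(size=d_model)
direction = rng.normal(size=d_model)         # a random direction, for illustration only

scales = (0.0, 0.5, 1.0, 2.0)
kls = [perturbation_kl(activation, direction, s, readout) for s in scales]
for s, kl in zip(scales, kls):
    print(f"scale={s:.1f}  KL={kl:.4f}")
```

In an experiment of the kind the abstract describes, the same KL curve would presumably be computed for several direction types (for example SAE feature directions, SAE reconstruction errors, and baseline directions) and compared at matched perturbation sizes; how the improved baseline is constructed is not specified in the abstract and is not modeled here.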
