NeurIPS Evolution of SAE Features Across Layers in LLMs

Poster in Workshop: Attributing Model Behavior at Scale (ATTRIB)

Benjamin Lerner · Daniel Balcells · Michael Oesterle · Ediz Ucar · Stefan Heimersheim

Abstract:

Sparse Autoencoders (SAEs) for transformer-based language models are typically defined independently per layer. In this work we analyze statistical relationships between features in adjacent layers to understand how features evolve through a forward pass. We provide a graph visualization interface for features and their most similar next-layer neighbors, and build communities of related features across layers. We find that a considerable number of features are passed through from a previous layer, some features can be expressed as quasi-boolean combinations of previous features, and some features become more specialized in later layers.
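
The abstract does not state which similarity measure is used to relate features in adjacent layers. As a minimal sketch only, assuming Pearson correlation of feature activations over a shared set of tokens, the "most similar next-layer neighbors" idea could look like the following. The function name `next_layer_neighbors`, the array shapes, and the synthetic data are all hypothetical and are not taken from the authors' code.

```python
import numpy as np


def next_layer_neighbors(acts_a, acts_b, top_k=2):
    """Rank next-layer features by correlation with each current-layer feature.

    acts_a: (n_tokens, n_features_a) SAE activations at layer l (assumed layout)
    acts_b: (n_tokens, n_features_b) SAE activations at layer l + 1
    Returns (indices, scores), each of shape (n_features_a, top_k).
    """
    eps = 1e-8  # guards against division by zero for features that never fire
    za = (acts_a - acts_a.mean(axis=0)) / (acts_a.std(axis=0) + eps)
    zb = (acts_b - acts_b.mean(axis=0)) / (acts_b.std(axis=0) + eps)
    # Pearson correlation between every pair of features across the two layers
    corr = za.T @ zb / acts_a.shape[0]
    order = np.argsort(-corr, axis=1)[:, :top_k]
    scores = np.take_along_axis(corr, order, axis=1)
    return order, scores


# Synthetic stand-in for real SAE activations (illustration only).
rng = np.random.default_rng(0)
n_tokens = 5000
layer_a = (rng.random((n_tokens, 3)) < 0.5).astype(float)
layer_b = np.stack(
    [
        layer_a[:, 2],                   # feature passed through unchanged
        layer_a[:, 0] * layer_a[:, 1],   # boolean AND of two earlier features
        (rng.random(n_tokens) < 0.5).astype(float),  # unrelated feature
    ],
    axis=1,
)

idx, scores = next_layer_neighbors(layer_a, layer_b)
```

In this toy setup the passed-through feature is matched to its copy with a correlation close to 1, while each input of the AND feature is matched to it with a weaker positive correlation. That illustrates why a pairwise score alone may not be enough to identify combinations of previous features.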
