NeurIPS Overcoming Limitations of Steering Vectors with Low-Rank Representation Steering

Poster in Workshop: Foundation Model Interventions

Overcoming Limitations of Steering Vectors with Low-Rank Representation Steering

Dmitrii Krasheninnikov · David Krueger

Keywords: [ representation engineering ] [ steering vectors ] [ activation steering ] [ controlled generation ]

Abstract:

This paper studies the limitations of steering vector methods for controlling neural network outputs, and introduces Low-rank Representation Steering (LoReSt) as a more effective alternative. We use a toy multi-label classification setup to systematically evaluate steering methods across different task complexities. Key contributions include: (1) a clear example showing how existing methods that rely on translation by a fixed vector can be insufficient for model steering, (2) the introduction of LoReSt, which instead steers by adding a vector that linearly depends on source activations, and (3) ablations showing that LoReSt outperforms steering vectors in constrained activation spaces and when steering requires more complex transformations, but is less data-efficient for easy steering tasks. We also find that layer normalization significantly benefits both LoReSt and steering vector methods. We conclude by discussing this work's weaknesses, which include our setup only modeling categorical features, and the lack of experiments with LLMs.
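
The contrast drawn in contributions (1) and (2) can be illustrated with a minimal sketch. The abstract does not give LoReSt's exact parametrization, so the form used here, h' = h + U Vᵀ h + b with rank-r factors U and V, is an assumption, and the function names `add_steering_vector` and `low_rank_steer` are hypothetical. The toy task is to flip the sign of the first coordinate, which no single fixed translation can do for both inputs at once.

```python
def add_steering_vector(h, v):
    """Steering vector: translate the activation h by a fixed vector v."""
    return [hi + vi for hi, vi in zip(h, v)]


def low_rank_steer(h, U, V, b):
    """Assumed low-rank form: h' = h + sum_k U[k] * <V[k], h> + b.

    U and V are lists of r vectors of length d (the rank-r factors);
    b is a bias of length d. This parametrization is an illustrative
    assumption, not taken from the paper.
    """
    delta = list(b)
    for u_k, v_k in zip(U, V):
        # Projection of h onto the k-th read direction.
        coeff = sum(vi * hi for vi, hi in zip(v_k, h))
        for i in range(len(h)):
            delta[i] += coeff * u_k[i]
    return [hi + di for hi, di in zip(h, delta)]


# Toy task: flip the sign of the first coordinate, keep the second.
U = [[-2.0, 0.0]]   # write direction (rank 1)
V = [[1.0, 0.0]]    # read direction (rank 1)
b = [0.0, 0.0]

a = [1.0, 0.5]
c = [-1.0, 0.5]

flipped_a = low_rank_steer(a, U, V, b)
flipped_c = low_rank_steer(c, U, V, b)

# A fixed vector chosen to flip `a` pushes `c` further in the wrong direction.
fixed = [-2.0, 0.0]
shifted_a = add_steering_vector(a, fixed)
shifted_c = add_steering_vector(c, fixed)
```

In this sketch the added vector depends on the source activation through the projection onto `V`, so the same parameters handle both inputs, whereas the fixed translation is correct for only one of them.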
