Poster Session in Workshop: Scientific Methods for Understanding Neural Networks

Boundaries of stable regions in activation space of LLMs become sharper with more compute

Jett Janiak · Jacek Karwowski · Chatrik Mangat · Giorgi Giglemiani · Nora Petrova · Stefan Heimersheim

Sun 15 Dec 4:30 p.m. PST — 5:30 p.m. PST

Abstract:

This study examines the effects of perturbing activations in the residual stream of Transformers. We identify stable regions in activation space where small perturbations lead to minimal output changes, potentially contributing to error correction. Our preliminary experiments on models ranging from 0.5B to 7B parameters suggest that these regions may correspond to semantic distinctions, and that their boundaries sharpen with increased model size and training. These regions appear to be much larger than previously studied polytopes. While preliminary, our experiments point towards a promising direction for understanding Transformer robustness.
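The core measurement described in the abstract — perturb a residual-stream activation, then check how much the model's output distribution moves — can be illustrated with a toy sketch. Everything here is made up for illustration (the random linear "unembedding", the dimensions, the perturbation direction); the actual study works with real 0.5B–7B Transformer models, not this stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the model's final unembedding: maps a
# residual-stream vector to next-token logits. Not from any real model.
d_model, vocab = 16, 32
W_out = rng.normal(size=(d_model, vocab)) / np.sqrt(d_model)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def output_dist(resid):
    # Residual-stream vector -> next-token probability distribution.
    return softmax(resid @ W_out)

def kl(p, q):
    # KL divergence as a scalar measure of output change.
    return float(np.sum(p * np.log(p / q)))

base = rng.normal(size=d_model)   # an (imaginary) baseline activation
p = output_dist(base)

# Perturb along a fixed random unit direction at increasing magnitudes.
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)

for eps in (0.01, 0.1, 1.0):
    q = output_dist(base + eps * direction)
    print(f"eps={eps}: KL(p || q) = {kl(p, q):.6f}")
```

Inside a hypothesized stable region, the KL divergence stays near zero as the perturbation magnitude grows, and then rises sharply once the perturbation crosses a region boundary; in this toy linear stand-in the output change simply grows smoothly with magnitude, which is the baseline behavior a stable region would deviate from.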
