Poster Session in Workshop: Scientific Methods for Understanding Neural Networks
Boundaries of stable regions in activation space of LLMs become sharper with more compute
Jett Janiak · Jacek Karwowski · Chatrik Mangat · Giorgi Giglemiani · Nora Petrova · Stefan Heimersheim
Abstract:
This study examines the effects of perturbing activations in the residual stream of Transformers. We identify stable regions in the activation space where small perturbations lead to minimal output changes, potentially contributing to error correction. Preliminary experiments on models ranging from 0.5B to 7B parameters suggest that these regions may correspond to semantic distinctions and that their boundaries sharpen with increased model size and training. These regions appear to be much larger than the polytopes studied in prior work. While our experiments remain preliminary, they point towards a promising direction for understanding Transformer robustness.
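Below is a minimal sketch (not the authors' code) of the kind of perturbation experiment the abstract describes: add a perturbation to the residual stream at one layer and measure how the next-token distribution shifts as the perturbation magnitude grows. The choice of GPT-2, the layer index, the perturbation scales, and the KL-divergence metric are all illustrative assumptions, not details taken from the paper.

```python
# Sketch of a residual-stream perturbation sweep, assuming GPT-2 via
# Hugging Face transformers. Everything below (layer, scales, metric)
# is an assumption for illustration.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
inputs = tokenizer("The capital of France is", return_tensors="pt")

LAYER = 6  # assumed: which layer's residual stream to perturb

def logits_with_perturbation(delta):
    """Run the model, adding `delta` to the residual stream after LAYER."""
    def hook(module, args, output):
        # output[0] is the block's hidden states (the residual stream);
        # delta broadcasts over every token position.
        return (output[0] + delta,) + output[1:]
    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    try:
        with torch.no_grad():
            logits = model(**inputs).logits[0, -1]
    finally:
        handle.remove()
    return logits

base = logits_with_perturbation(torch.zeros(model.config.n_embd))

# Sweep the perturbation magnitude along one fixed random direction.
# Inside a stable region the KL divergence should stay near zero,
# then rise sharply once the perturbation crosses the boundary.
direction = torch.randn(model.config.n_embd)
direction /= direction.norm()
for eps in [0.0, 1.0, 5.0, 10.0, 20.0, 40.0]:
    perturbed = logits_with_perturbation(eps * direction)
    kl = F.kl_div(
        F.log_softmax(perturbed, dim=-1),
        F.log_softmax(base, dim=-1),
        log_target=True,
        reduction="sum",
    )
    print(f"eps={eps:5.1f}  KL(base || perturbed) = {kl.item():.4f}")
```

Under this setup, a sharp boundary would show up as the KL divergence staying flat over a range of magnitudes and then jumping abruptly, rather than growing smoothly with eps.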