Oral in Workshop: Mathematics of Modern Machine Learning (M3L)
Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs
Tianyu Guo · Druv Pai · Yu Bai · Jiantao Jiao · Michael Jordan · Song Mei
Keywords: [ transformers ] [ attention sink ] [ language models ] [ mechanistic interpretability ]
Sat 14 Dec 8:50 a.m. PST — 5 p.m. PST
We investigate the mechanisms behind three puzzling phenomena observed in transformer-based large language models (LLMs): attention sinks, value-state drains, and residual-state peaks, collectively referred to as the extreme-token phenomena. First, we demonstrate that these phenomena also arise in much simpler architectures (transformers with one to three layers) trained on a toy task, the Bigram-Backcopy (BB) task. In this setting, we identify an active-dormant mechanism that causes attention heads to become attention sinks for certain domain-specific inputs while remaining non-sinks for others. We further develop a precise theoretical characterization of the training dynamics that lead to these phenomena, revealing that they are driven by a mutual reinforcement mechanism. Through small interventions, we demonstrate ways to avoid extreme-token phenomena during pre-training. Next, we extend our analysis to pre-trained LLMs, including Llama and OLMo, and show that many attention heads are governed by an active-dormant mechanism similar to the one in the BB task. We further show that the same mutual reinforcement mechanism drives the emergence of extreme-token phenomena during LLM pre-training. Our results elucidate the mechanisms behind extreme-token phenomena in both synthetic and real settings and suggest potential mitigation strategies.
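
As a rough illustration of the attention-sink phenomenon the abstract refers to (not the authors' code or methodology), the sketch below loads a pre-trained causal language model with the Hugging Face transformers library and measures, for each attention head, the average attention mass placed on the first token of the sequence; heads that concentrate most of their mass there behave as attention sinks. The model name ("gpt2") and the 0.5 threshold are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch: detect candidate attention-sink heads by the attention mass
# they assign to the first token. Assumptions (not from the paper): the "gpt2"
# checkpoint, an example prompt, and a 0.5 sink-mass threshold.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative choice; any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tokenizer("Summer is warm. Winter is cold.", return_tensors="pt")

with torch.no_grad():
    # output_attentions=True returns per-layer attention maps of shape
    # (batch, num_heads, seq_len, seq_len).
    outputs = model(**inputs, output_attentions=True)

for layer_idx, attn in enumerate(outputs.attentions):
    # For each head, average over query positions the mass placed on token 0.
    sink_mass = attn[0, :, :, 0].mean(dim=-1)  # shape: (num_heads,)
    sink_heads = (sink_mass > 0.5).nonzero(as_tuple=True)[0].tolist()
    if sink_heads:
        print(f"layer {layer_idx}: candidate sink heads {sink_heads}")
```

A similar probe could, in principle, inspect the other two extreme-token phenomena by requesting hidden states (output_hidden_states=True) and comparing the norms of the first token's value and residual states against those of other tokens, though the exact diagnostics used in the paper are not specified in this abstract.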