Poster in Workshop: Attributing Model Behavior at Scale (ATTRIB)
Ablation is Not Enough to Emulate DPO: Attributing Toxicity Reduction to Neurons
Yushi Yang · Filip Sondej · Harry Mayne · Adam Mahdi
Alignment algorithms are commonly used to fine-tune language models to human preferences, but the internal mechanisms by which models become aligned remain unclear. For direct preference optimisation (DPO) applied to toxicity reduction, the prevailing explanation claims that DPO works by deactivating the most toxic MLP neurons to bypass toxic regions of the residual stream. However, after ablating the most toxic neurons, we find this explanation incomplete. By projecting neuron activations onto a toxicity probe, we find that only 31.8% of the toxicity reduction is due to deactivated toxic neurons. Rather than merely erasing toxicity, DPO also introduces anti-toxic signals into the residual stream that steer outputs away from toxic generations. Moreover, DPO applies noisy adjustments across neurons, causing many neurons to increase toxicity. This indicates that DPO actively reconfigures model activations, balancing opposing neuron effects to achieve toxicity reduction.
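To make the attribution method concrete, below is a minimal sketch (not the authors' code) of how per-neuron toxicity contributions can be measured by projecting each MLP neuron's residual-stream write vector onto a linear toxicity probe. The tensor names `activations`, `W_out`, and `probe` are illustrative assumptions; comparing these contributions before and after DPO would distinguish deactivated toxic neurons from neurons pushed toward anti-toxic writes.

```python
import torch

def toxicity_contributions(activations: torch.Tensor,
                           W_out: torch.Tensor,
                           probe: torch.Tensor) -> torch.Tensor:
    """Per-neuron toxicity contribution for one MLP layer.

    activations: (batch, seq, d_mlp)  post-nonlinearity neuron activations
    W_out:       (d_mlp, d_model)     MLP output projection; each row is the
                                      vector a neuron writes to the residual stream
    probe:       (d_model,)           linear toxicity probe direction
    """
    probe = probe / probe.norm()        # unit-normalise the probe direction
    neuron_dirs = W_out @ probe         # (d_mlp,) alignment of each neuron's write vector
    # A neuron's contribution = its activation times the projection of its
    # write vector onto the probe (positive = toxic, negative = anti-toxic).
    contrib = activations * neuron_dirs # broadcast over batch and sequence
    return contrib.mean(dim=(0, 1))     # average contribution per neuron
```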