Poster in Workshop: Socially Responsible Language Modelling Research (SoLaR)
Ablation is Not Enough to Emulate DPO: A Mechanistic Analysis of Toxicity Reduction
Yushi Yang · Filip Sondej · Harry Mayne · Adam Mahdi
Keywords: [ Interpretability ] [ Large language models ] [ Toxicity reduction ] [ Alignment algorithms ]
Alignment algorithms are commonly used to fine-tune language models to align with human preferences, but the internal mechanisms by which models become aligned remain unclear. In the case of direct preference optimisation (DPO) for toxicity reduction, current explanations claim that DPO simply learns an offset in the residual stream to avoid activating the most toxic MLP neurons. To test this claim, we ablate the most toxic neurons and find the explanation incomplete: only 31.8% of DPO's toxicity reduction is attributable to deactivated toxic neurons. Moreover, DPO does not merely erase toxicity; it also introduces anti-toxicity into the residual stream by further activating anti-toxic neurons. Although these adjustments are noisy and cause some neurons to increase toxicity, the overall effect is a net reduction in toxicity. This indicates that DPO balances these opposing forces to achieve toxicity reduction.
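The ablation test described above can be pictured as zeroing the activations of selected MLP neurons at inference time and measuring how much toxicity drops. Below is a minimal sketch, assuming a GPT-2-style model loaded via Hugging Face transformers; the layer and neuron indices (`NEURONS_TO_ABLATE`) are hypothetical placeholders, not the neurons identified in the paper.

```python
# Sketch: ablate (zero out) specific MLP neurons via forward hooks.
# Assumptions: GPT-2-style architecture; the (layer, neuron) indices below
# are illustrative placeholders, not the paper's identified toxic neurons.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

# Hypothetical targets: {layer index: [neuron indices in the MLP hidden layer]}
NEURONS_TO_ABLATE = {6: [123, 456], 11: [789]}

def make_ablation_hook(neuron_ids):
    """Zero the post-activation values of the chosen MLP neurons."""
    def hook(module, inputs, output):
        output[..., neuron_ids] = 0.0
        return output
    return hook

# Register hooks on the GELU activation inside each targeted MLP block.
handles = []
for layer_idx, neuron_ids in NEURONS_TO_ABLATE.items():
    act_module = model.transformer.h[layer_idx].mlp.act
    handles.append(act_module.register_forward_hook(make_ablation_hook(neuron_ids)))

# Generate with the targeted neurons deactivated; comparing the toxicity of
# these outputs against the unablated model gives the ablation's contribution.
inputs = tokenizer("I think that", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(out[0]))

for h in handles:
    h.remove()
```

In this framing, if ablating the most toxic neurons recovered all of DPO's effect, the offset explanation would suffice; the 31.8% figure reported above is what such a comparison yields, motivating the search for additional mechanisms such as anti-toxic neuron activation.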