

Poster

What Makes Safety Fine-tuning Methods Safe? A Mechanistic Study

Samyak Jain · Ekdeep S Lubana · Kemal Oksuz · Tom Joy · Philip Torr · Amartya Sanyal · Puneet Dokania

Wed 11 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract:

Safety fine-tuning is widely used to align Large Language Models (LLMs) with human preferences for their safe deployment. In this work, we design a synthetic data generation framework to carefully investigate and understand the underlying factors that make LLMs safe via safety fine-tuning. Our framework allows controlled generation of samples by capturing key aspects of real-world instructions across multiple types of safe and unsafe samples. Using this framework, we investigate three well-known safety fine-tuning methods: (1) supervised safety fine-tuning; (2) direct preference optimization (DPO); and (3) unlearning. We provide insights into what makes the corresponding models safe and why their safety is compromised by jailbreaking and adversarial attacks. Wherever possible, we also validate our findings on real-world models: Llama-2 chat 7B and Llama-3 chat 8B.
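For context on the second method named above, the sketch below shows the standard DPO objective used to prefer safe over unsafe responses; it is a minimal illustration of the general technique, not code from the paper, and the function name, tensor shapes, and beta value are illustrative assumptions.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective (illustrative sketch): push the policy to
    prefer the chosen (safe) response over the rejected (unsafe) one,
    relative to a frozen reference model. All inputs are per-example
    sequence log-probabilities (1-D tensors of equal length)."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin; minimized when the
    # chosen response is scored much higher than the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()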
