

Spotlight in Workshop: Socially Responsible Language Modelling Research (SoLaR)

An Adversarial Perspective on Machine Unlearning for AI Safety

Jakub Łucki · Boyi Wei · Yangsibo Huang · Peter Henderson · Florian Tramer · Javier Rando

Keywords: [ Machine Unlearning ] [ Safety Training ] [ Interpretability ] [ Adversarial Approach ] [ LLMs ]


Abstract:

Large language models are finetuned to refuse questions about hazardous knowledge, but these protections are often bypassed by adversaries. Unlearning methods instead aim to remove hazardous capabilities from models entirely, making them inaccessible to adversaries. From an adversarial perspective, this work challenges whether unlearning differs fundamentally from traditional safety post-training. We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can succeed when applied carefully. Furthermore, we develop a variety of adaptive methods that recover most supposedly unlearned capabilities. For instance, we show that finetuning on 10 unrelated examples, or subtracting specific directions in the activation space, can recover most hazardous capabilities from models edited with RMU, a state-of-the-art unlearning method. Our findings challenge the robustness of current unlearning approaches and question their advantages over safety training.
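The activation-space attack mentioned in the abstract can be illustrated with a short sketch. The snippet below is not the authors' implementation: it assumes a HuggingFace-style causal LM with a Llama-like layer layout, a hypothetical RMU-edited checkpoint name, an illustrative layer index, and a precomputed direction vector (e.g., estimated by contrasting activations on hazardous and benign prompts). It simply removes that direction's component from the hidden states via a forward hook.

```python
# Minimal sketch (not the paper's code) of ablating a single direction from a
# model's hidden activations at inference time. The model name, layer index,
# and saved direction file are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "example-org/rmu-edited-model"  # hypothetical checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Precomputed direction associated with the unlearning intervention,
# e.g., estimated by contrasting activations on hazardous vs. benign prompts.
direction = torch.load("unlearning_direction.pt").to(model.dtype)  # (hidden_size,)
direction = direction / direction.norm()                           # unit-normalize

def ablate_direction(module, inputs, output):
    # Subtract the component of the hidden states that lies along `direction`.
    hidden = output[0] if isinstance(output, tuple) else output
    proj = (hidden @ direction).unsqueeze(-1) * direction           # projection onto direction
    hidden = hidden - proj
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

layer_idx = 7  # illustrative; the effective layer must be found empirically
handle = model.model.layers[layer_idx].register_forward_hook(ablate_direction)

prompt = "Question from a hazardous-knowledge benchmark goes here."
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore the unmodified model
```

Removing the hook restores the edited model's original behavior, so the intervention can be toggled per evaluation run.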
