

Poster in Workshop: Pluralistic Alignment Workshop

Controllable Safety Alignment: Adapting LLMs to Diverse Safety Requirements without Re-Training

Jingyu Zhang · Ahmed Elgohary Ghoneim · Ahmed Magooda · Daniel Khashabi · Ben Van Durme


Abstract:

Current safety alignment methods for large language models use a rigid, one-size-fits-all approach, blocking any content deemed unsafe by the model provider. This approach lacks flexibility for varying cultural norms and diverse user safety needs, making static models overly restrictive. We propose Controllable Safety Alignment, a framework for adapting models to diverse safety requirements without re-training. Instead of aligning a fixed model, we align models to follow safety configs: natural language descriptions of the desired safety requirements, provided as part of the system prompt. To adjust model safety behavior, a user only needs to modify the safety config before inference. We devise a novel controllability evaluation protocol that considers both helpfulness and safety, summarizing them into a single ControlScore. We propose LACUSA, a data-centric method for controllable safety alignment. On our curated test sets, LACUSA leads to substantial gains in controllability over strong baselines. Our framework not only expands the practicality of aligned LLMs but also contributes to models that better represent pluralistic human values.
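To illustrate the intended usage pattern, the sketch below shows one plausible way a safety config could be placed in the system prompt and swapped before inference. The config strings, `build_messages`, and `call_model` are illustrative assumptions for this sketch, not the paper's implementation or prompt format.

```python
# Minimal sketch (assumptions, not the paper's exact protocol): a natural-language
# safety config is composed into the system prompt, so safety behavior can be
# changed at inference time without re-training the model.

from typing import Dict, List

# Hypothetical example configs describing different safety requirements.
SAFETY_CONFIG_GAME_STUDIO = (
    "You may write graphic fictional violence for game-writing purposes, "
    "but refuse requests for real-world harm or illegal instructions."
)
SAFETY_CONFIG_STRICT = (
    "Refuse any content involving violence, explicit material, or illegal activity."
)


def build_messages(safety_config: str, user_prompt: str) -> List[Dict[str, str]]:
    """Embed the safety config in the system prompt of a chat-style request."""
    system = f"Follow this safety configuration:\n{safety_config}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]


def call_model(messages: List[Dict[str, str]]) -> str:
    # Hypothetical placeholder: swap in any chat-completion client here.
    return f"[response conditioned on system prompt: {messages[0]['content'][:60]}...]"


if __name__ == "__main__":
    prompt = "Write a battle scene for my fantasy game."
    # Adjusting safety behavior only requires changing the config before inference.
    for config in (SAFETY_CONFIG_GAME_STUDIO, SAFETY_CONFIG_STRICT):
        print(call_model(build_messages(config, prompt)))
```

Under this framing, the same aligned model serves users with different safety requirements simply by receiving a different config string at inference time.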
