Poster in Workshop: Pluralistic Alignment Workshop
Controllable Safety Alignment: Adapting LLMs to Diverse Safety Requirements without Re-Training
Jingyu Zhang · Ahmed Elgohary Ghoneim · Ahmed Magooda · Daniel Khashabi · Ben Van Durme
Current safety alignment methods for large language models use a rigid, one-size-fits-all approach, blocking any content deemed unsafe by the model provider. This lacks flexibility for varying cultural norms and diverse user safety needs, making such static models overly restrictive. We propose Controllable Safety Alignment, a framework designed to adapt models to diverse safety requirements without re-training. Instead of aligning a fixed model, we align models to follow safety configs, natural language descriptions of the desired safety requirements, provided as part of the system prompt. To adjust model safety behavior, a user only needs to modify the safety config before inference. We devise a novel controllability evaluation protocol that considers both helpfulness and safety, summarizing them into ControlScore. We propose LACUSA, a data-centric method for controllable safety alignment. On our curated test sets, LACUSA leads to substantial gains in controllability over strong baselines. Our framework not only expands the practicality of aligned LLMs but also contributes to models that better represent pluralistic human values.
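To make the inference-time mechanism concrete, below is a minimal sketch, not the authors' implementation, of how a natural-language safety config could be placed in the system prompt and swapped without re-training. The names `build_messages`, `respond`, and `control_score` are illustrative, `generate` stands in for any chat-style LLM callable, and the scoring rule is an assumed combination of helpfulness and safety rather than the paper's exact ControlScore definition.

```python
# Sketch only: adapting safety behavior at inference time via a safety config
# in the system prompt. No model weights are changed.
from typing import Callable, Dict, List


def build_messages(safety_config: str, user_prompt: str) -> List[Dict[str, str]]:
    """Prepend the safety config to the system prompt."""
    system_prompt = (
        "Follow the safety requirements described below when responding.\n"
        f"Safety config: {safety_config}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


def respond(
    generate: Callable[[List[Dict[str, str]]], str],
    safety_config: str,
    user_prompt: str,
) -> str:
    # To change safety behavior, only the safety config string is edited.
    return generate(build_messages(safety_config, user_prompt))


def control_score(helpfulness: List[float], config_compliant: List[bool]) -> float:
    """Hypothetical aggregate in the spirit of ControlScore (assumption):
    credit helpfulness (assumed in [0, 1]) only when a response respects the
    safety config, and penalize responses that violate it."""
    per_example = [h if ok else -1.0 for h, ok in zip(helpfulness, config_compliant)]
    return sum(per_example) / len(per_example)
```

Under this reading, two deployments that share the same underlying model can differ only in the `safety_config` string, which is what allows the evaluation to compare helpfulness and safety across different configured requirements.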