NeurIPS Efficacy of the SAGE-RT Dataset for Model Safety Alignment: A Comparative Study

Poster
in
Workshop: Pluralistic Alignment Workshop

Efficacy of the SAGE-RT Dataset for Model Safety Alignment: A Comparative Study

Tanay Baswa · Nitin Aravind Birur · Divyanshu Kumar · Jatan Loya · Anurakt Kumar · Prashanth Harshangi · Sahil Agarwal

[ Abstract ] [ Project Page ]

[ OpenReview]

Abstract:

Safety alignment and robustness of large language models (LLMs) remain critical challenges. This study presents a comprehensive evaluation of data generated using the SAGE process, a method designed to create nuanced and diverse synthetic data points for alignment and red-teaming. Our findings show that models aligned with SAGE-generated data achieve superior safety outcomes, including lower toxicity, bias, and harmful responses, while maintaining competitive performance on benchmark tasks. Alignment performed with data generated using the SAGE process requires only a fraction of the data needed by traditional datasets, such as PKU-SafeRLHF and Anthropic HH-RLHF, to achieve better alignment results, offering significant improvements in computational efficiency. The extensive categorization of harmful content by the SAGE process also provides finer granularity in aligning model behavior, enhancing visibility across various safety domains. This approach enables more precise and targeted alignment strategies, positioning the SAGE process as a valuable tool for developing safer and more trustworthy AI systems. Overall, we conclude that the SAGE process outperforms other popularly used open source alignment datasets, both in terms of mitigating harmful responses, and conserving computational resources.

Chat is not available.

Poster in Workshop: Pluralistic Alignment Workshop

Efficacy of the SAGE-RT Dataset for Model Safety Alignment: A Comparative Study

Tanay Baswa · Nitin Aravind Birur · Divyanshu Kumar · Jatan Loya · Anurakt Kumar · Prashanth Harshangi · Sahil Agarwal

Poster
in
Workshop: Pluralistic Alignment Workshop