Poster in Workshop: Red Teaming GenAI: What Can We Learn from Adversaries?
SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming
Anurakt Kumar · Divyanshu Kumar · Jatan Loya · Nitin Aravind Birur · Tanay Baswa · Sahil Agarwal · Prashanth Harshangi
Keywords: [ Synthetic Data Generation ] [ LLM Security ] [ Safety and Robustness ]
We introduce Synthetic Alignment data Generation for Safety Evaluation and Red Teaming (SAGE-RT, or SAGE), a novel pipeline for generating synthetic alignment and red-teaming data. Existing methods fall short in creating nuanced and diverse datasets, fail to provide the necessary control over the data generation and validation processes, or require large amounts of manually generated seed data. SAGE addresses these limitations by using a detailed taxonomy to produce safety-alignment and red-teaming data across a wide range of topics. Our pipeline generates 51,000 diverse and in-depth prompt-response pairs, encompassing over 1,500 topics of harmfulness and covering variations of the most frequent types of jailbreaking prompts faced by large language models (LLMs). We show that the red-teaming data generated through SAGE jailbreaks state-of-the-art LLMs in more than 27 out of 32 sub-categories, and in more than 58 out of 279 leaf categories (sub-sub-categories). The attack success rate for GPT-4o and GPT-3.5-turbo is 100% over the sub-categories of harmfulness. Our approach avoids the pitfalls of synthetic safety-training data generation, such as mode collapse and lack of nuance, by ensuring detailed coverage of harmful topics through iterative expansion of the taxonomy and by conditioning the outputs on the generated raw text. This method can be used to generate red-teaming and alignment data for LLM safety fully synthetically, either to make LLMs safer or to red-team models across a diverse range of topics.
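
A minimal sketch of how such a pipeline could look, assuming a generic llm() text-generation callable. The helper names (expand_taxonomy, generate_pairs), the prompts, and the stub model are illustrative assumptions, not the authors' implementation; the sketch only mirrors the two ideas named in the abstract: iterative taxonomy expansion, and conditioning prompt-response generation on raw text to avoid mode collapse.

```python
# Hypothetical sketch of a SAGE-style synthetic data pipeline.
# `llm` stands in for any chat/completion client; replace with a real model call.

def llm(prompt: str) -> str:
    """Stub model call so the sketch runs end to end."""
    return f"[model output for: {prompt[:40]}...]"

def expand_taxonomy(root_topic: str, depth: int) -> list[str]:
    """Iteratively expand a harm category into finer-grained leaf topics."""
    frontier = [root_topic]
    for _ in range(depth):
        next_frontier = []
        for topic in frontier:
            listing = llm(f"List specific subtopics of the harm category: {topic}")
            next_frontier.extend(
                line.strip() for line in listing.splitlines() if line.strip()
            )
        frontier = next_frontier
    return frontier

def generate_pairs(leaf_topic: str, n: int) -> list[dict]:
    """Generate prompt-response pairs conditioned on raw text about the topic,
    so repeated generations stay diverse rather than collapsing to one mode."""
    raw_text = llm(f"Write detailed background notes about: {leaf_topic}")
    pairs = []
    for i in range(n):
        attack = llm(f"Using these notes, write red-teaming prompt #{i}:\n{raw_text}")
        aligned = llm(f"Write a safe, aligned response to:\n{attack}")
        pairs.append({"topic": leaf_topic, "prompt": attack, "response": aligned})
    return pairs

# Usage: expand one root category into leaves, then generate data per leaf.
dataset = []
for leaf in expand_taxonomy("cybercrime", depth=2):
    dataset.extend(generate_pairs(leaf, n=5))
```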