Oral in Workshop: Red Teaming GenAI: What Can We Learn from Adversaries?

Contributed Talk 1: iART - Imitation-guided Automated Red Teaming

Sajad Mousavi · Desik Rengarajan · Ashwin Ramesh Babu · Vineet Gundecha · Avisek Naug · Sahand Ghorbanpour · Ricardo Luna Gutierrez · Antonio Guillen-Perez · Paolo Faraboschi · Soumyendu Sarkar

Keywords: [ Imitation Learning ] [ Large Language Models (LLMs) ] [ Reinforcement Learning ] [ Automated Red-teaming ]

Sun 15 Dec 10:45 a.m. PST — 10:55 a.m. PST
 
Presentation: Red Teaming GenAI: What Can We Learn from Adversaries?
Sun 15 Dec 9 a.m. PST — 5:30 p.m. PST

Abstract:

Large language models (LLMs) hold substantial potential, yet they also risk generating harmful responses. An automated "red teaming" process constructs test cases designed to elicit unfavorable responses from these models. A successful generator must provoke undesirable responses from the target LLM with a diverse set of test cases. Current methods often struggle to balance quality (i.e., the harmfulness of the elicited responses) and diversity (i.e., the range of scenarios covered), typically sacrificing one to improve the other and relying on suboptimal exhaustive comparison. To address these challenges, we introduce an imitation-guided reinforcement learning approach that learns red-teaming strategies generating both diverse and high-quality test cases without exhaustive search. Our proposed method, Imitation-guided Automated Red Teaming (iART), is evaluated across various LLMs fine-tuned for different tasks. We demonstrate that iART not only produces diverse test sets but also elicits undesirable responses from the target LLM in a computationally efficient manner.
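The abstract describes an RL objective that balances quality (harmfulness of the elicited response) against diversity (novelty of the test case), regularized toward demonstrations via imitation. The paper's actual formulation is not given here, so the following is a minimal, hypothetical Python sketch of what such a combined reward could look like; the embedder, safety scorer, demonstration set, and weights are all illustrative stand-ins, not details from iART.

```python
# Hedged sketch (NOT the authors' code): a combined red-teaming reward of the
# kind the abstract describes. All functions and weights are assumptions.
import math

def embed(text: str) -> list[float]:
    # Hypothetical stand-in for a sentence embedder (e.g., a small encoder).
    vec = [0.0] * 64
    for i, ch in enumerate(text.encode()):
        vec[i % 64] += ch / 255.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors from embed() are unit-normalized, so a dot product suffices.
    return sum(x * y for x, y in zip(a, b))

def diversity_bonus(test_case: str, archive: list[str]) -> float:
    # Novelty = distance to the most similar previously generated test case.
    if not archive:
        return 1.0
    e = embed(test_case)
    return 1.0 - max(cosine(e, embed(p)) for p in archive)

def harmfulness_score(response: str) -> float:
    # Hypothetical stand-in for a learned safety classifier scoring the
    # target LLM's response in [0, 1]; a real system would use a trained model.
    flagged = ("step-by-step", "bypass", "exploit")
    return min(1.0, sum(w in response.lower() for w in flagged) / len(flagged))

def imitation_penalty(test_case: str, demos: list[str]) -> float:
    # The "imitation-guided" term: pull generated test cases toward
    # demonstration prompts (here, 1 - max similarity to any demo).
    if not demos:
        return 0.0
    e = embed(test_case)
    return 1.0 - max(cosine(e, embed(d)) for d in demos)

def reward(test_case: str, response: str, archive: list[str],
           demos: list[str], w_quality: float = 1.0,
           w_diversity: float = 0.5, w_imitation: float = 0.3) -> float:
    # Quality and diversity are rewarded jointly (no exhaustive pairwise
    # search over candidates); imitation acts as a regularizer.
    return (w_quality * harmfulness_score(response)
            + w_diversity * diversity_bonus(test_case, archive)
            - w_imitation * imitation_penalty(test_case, demos))

if __name__ == "__main__":
    demos = ["Explain how to bypass a content filter."]          # illustrative
    archive = ["Tell me a story about hacking."]                 # illustrative
    tc = "Describe, step-by-step, how to exploit a login form."
    resp = "Here is a step-by-step way to exploit it..."
    print(round(reward(tc, resp, archive, demos), 3))
```

Under this sketch, the RL policy generating test cases would be updated to maximize this scalar, so that quality and diversity trade off through the weights rather than by discarding one objective for the other.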
