Workshop
Red Teaming GenAI: What Can We Learn from Adversaries?
Valeriia Cherepanova · Bo Li · Niv Cohen · Yifei Wang · Yisen Wang · Avital Shafran · Nil-Jana Akpinar · James Zou
West Meeting Room 301
Sun 15 Dec, 9 a.m. PST
The development and proliferation of modern generative AI models have introduced valuable capabilities, but these models and their applications also pose risks to human safety. How do we identify risks in new systems before they cause harm in deployment? This workshop focuses on red teaming, an emerging adversarial approach to probing model behaviors, and its applications toward making modern generative AI safe for humans.
Timezone: America/Los_Angeles
Schedule
Sun 9:00 a.m. - 9:30 a.m. | Coffee Break
Sun 9:30 a.m. - 9:35 a.m. | Opening Remarks
Sun 9:35 a.m. - 10:10 a.m. | Invited Talk 1: Andy Zou and Q&A (Invited Talk) | Andy Zou
Sun 10:10 a.m. - 10:45 a.m. | Invited Talk 2: Danqi Chen on Uncovering Simple Failures in Generative Models and How to Fix Them (Invited Talk) | Danqi Chen
Sun 10:45 a.m. - 10:55 a.m. | Contributed Talk 1: iART - Imitation guided Automated Red Teaming (Oral) | Sajad Mousavi · Desik Rengarajan · Ashwin Ramesh Babu · Vineet Gundecha · Avisek Naug · Sahand Ghorbanpour · Ricardo Luna Gutierrez · Antonio Guillen-Perez · Paolo Faraboschi · Soumyendu Sarkar
Sun 10:55 a.m. - 11:05 a.m. | Contributed Talk 2: Failures to Find Transferable Image Jailbreaks Between Vision-Language Models (Oral) | Rylan Schaeffer · Dan Valentine · Luke Bailey · James Chua · Zane Durante · Cristobal Eyzaguirre · Joe Benton · Brando Miranda · Henry Sleight · Tony Wang · John Hughes · Rajashree Agrawal · Mrinank Sharma · Scott Emmons · Sanmi Koyejo · Ethan Perez
Sun 11:05 a.m. - 11:15 a.m. | Contributed Talk 3: LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet (Oral) | Nathaniel Li · Ziwen Han · Ian Steneker · Willow Primack · Riley Goodside · Hugh Zhang · Zifan Wang · Cristina Menghini · Summer Yue
Sun 11:20 a.m. - 12:00 p.m. | Panel Discussion (Panel) | Roei Schuster · Yaron Singer · Alex Tamkin · Bo Li
Sun 12:00 p.m. - 1:00 p.m. | Lunch
Sun 12:00 p.m. - 1:50 p.m. | Poster Session
Sun 1:50 p.m. - 2:15 p.m. | Invited Talk 3: Niloofar Mireshghallah on A False Sense of Privacy: Semantic Leakage and Non-literal Copying in LLMs (Invited Talk) | Niloofar Mireshghallah
Sun 2:15 p.m. - 3:00 p.m. | Invited Talk 4: Jonas Geiping on When do adversarial attacks against language models matter? (Invited Talk) | Jonas Geiping
Sun 3:00 p.m. - 3:30 p.m. | Coffee Break
Sun 3:30 p.m. - 4:15 p.m. | Invited Talk 5: Vitaly Shmatikov and Q&A (Invited Talk) | Vitaly Shmatikov
Sun 4:15 p.m. - 4:30 p.m. | Invited Talk 6: Gowthami Somepalli and Q&A (Invited Talk) | Gowthami Somepalli
Sun 4:30 p.m. - 4:40 p.m. | Contributed Talk 4: Rethinking LLM Memorization through the Lens of Adversarial Compression (Oral) | Avi Schwarzschild · Zhili Feng · Pratyush Maini · Zachary Lipton · J. Zico Kolter
Sun 4:40 p.m. - 4:50 p.m. | Contributed Talk 5: A Realistic Threat Model for Large Language Model Jailbreaks (Oral) | Valentyn Boreiko · Alexander Panfilov · Vaclav Voracek · Matthias Hein · Jonas Geiping
Sun 4:50 p.m. - 5:00 p.m. | Contributed Talk 6: Infecting LLM Agents via Generalizable Adversarial Attack (Oral) | Weichen Yu · Kai Hu · Tianyu Pang · Chao Du · Min Lin · Matt Fredrikson
Sun 5:00 p.m. - 5:20 p.m. | Invited Talk 7: Max Kaufmann on Red-teaming AI systems in government (Invited Talk) | Max Kaufmann
Sun 5:20 p.m. - 5:30 p.m. | Closing Remarks
- | Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI (Poster) | Ambrish Rawat · Stefan Schoepf · Giulio Zizzo · Giandomenico Cornacchia · Muhammad Zaid Hameed · Kieran Fraser · Erik Miehling · Beat Buesser · Elizabeth Daly · Mark Purcell · Prasanna Sattigeri · Pin-Yu Chen · Kush Varshney
- | Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning (Poster) | Alex Beutel · Kai Xiao · Johannes Heidecke · Lilian Weng
- | MedAIScout: Automated Retrieval of Known Machine Learning Vulnerabilities in Medical Applications (Poster) | Athish Pranav Dharmalingam · Gargi Mitra
- | Infecting LLM Agents via Generalizable Adversarial Attack (Poster) | Weichen Yu · Kai Hu · Tianyu Pang · Chao Du · Min Lin · Matt Fredrikson
- | Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints (Poster) | Jonathan Noether · Adish Singla · Goran Radanovic
- | Keep on Swimming: Real Attackers Only Need Partial Knowledge of a Multi-Model System (Poster) | Julian Collado · Kevin Stangl
- | Decoding Biases: An Analysis of Automated Methods and Metrics for Gender Bias Detection in Language Models (Poster) | Shachi H. Kumar · Saurav Sahay · Sahisnu Mazumder · Eda Okur · Ramesh Manuvinakurike · Nicole Beckage · Hsuan Su · Hung-yi Lee · Lama Nachman
- | Interactive Semantic Interventions for VLMs: Breaking VLMs with Human Ingenuity (Poster) | Lukas Klein · Kenza Amara · Carsten Lüth · Hendrik Strobelt · Mennatallah El-Assady · Paul Jaeger
- | Semantic Membership Inference Attack against Large Language Models (Poster) | Hamid Mozaffari · Virendra Marathe
- | Rethinking LLM Memorization through the Lens of Adversarial Compression (Poster) | Avi Schwarzschild · Zhili Feng · Pratyush Maini · Zachary Lipton · J. Zico Kolter
- | Learning Diverse Attacks on Large Language Models for Robust Red-Teaming and Safety Tuning (Poster) | Seanie Lee · Minsu Kim · Lynn Cherif · David Dobre · Juho Lee · Sung Ju Hwang · Kenji Kawaguchi · Gauthier Gidel · Yoshua Bengio · Nikolay Malkin · Moksh Jain
- | Between the Bars: Gradient-based Jailbreaks are Bugs that induce Features (Poster) | Kaivalya Hariharan · Uzay Girit
- | Large Language Model Detoxification: Data and Metric Solutions (Poster) | SungJoo Byun · Hyopil Shin
- | SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming (Poster) | Anurakt Kumar · Divyanshu Kumar · Jatan Loya · Nitin Aravind Birur · Tanay Baswa · Sahil Agarwal · Prashanth Harshangi
- | An Adversarial Perspective on Machine Unlearning for AI Safety (Poster) | Jakub Łucki · Boyi Wei · Yangsibo Huang · Peter Henderson · Florian Tramer · Javier Rando
- | Stability Evaluation of Large Language Models via Distributional Perturbation Analysis (Poster) | Jiashuo Liu · Jiajin Li · Peng Cui · Jose Blanchet
- | Lessons From Red Teaming 100 Generative AI Products (Poster) | Blake Bullwinkel · Amanda Minnich · Shiven Chawla · Gary Lopez Munoz · Martin Pouliot · Whitney Maxwell · Joris de Gruyter · Katherine Pratt · Saphir Qi · Nina Chikanov · Roman Lutz · Raja Sekhar Rao Dheekonda · Bolor-Erdene Jagdagdorj · Rich Lundeen · Sam Vaughan · Victoria Westerhoff · Pete Bryan · Ram Shankar Siva Kumar · Yonatan Zunger · Mark Russinovich
- | LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet (Poster) | Nathaniel Li · Ziwen Han · Ian Steneker · Willow Primack · Riley Goodside · Hugh Zhang · Zifan Wang · Cristina Menghini · Summer Yue
- | Forget to Flourish: Leveraging Machine-Unlearning on Pretrained Language Models for Privacy Leakage (Poster) | Rafi Rashid · Jing Liu · Toshiaki Koike-Akino · Shagufta Mehnaz · Ye Wang
- | Steganography in Large Language Models: Investigating Emergence and Mitigations (Poster) | Yohan Mathew · Robert McCarthy · Ollie Matthews · Joan Velja · Nandi Schoots · Dylan Cope
- | A Realistic Threat Model for Large Language Model Jailbreaks (Poster) | Valentyn Boreiko · Alexander Panfilov · Vaclav Voracek · Matthias Hein · Jonas Geiping
- | Decompose, Recompose, and Conquer: Multi-modal LLMs are Vulnerable to Compositional Adversarial Attacks in Multi-Image Queries (Poster) | Julius Broomfield · George Ingebretsen · Reihaneh Iranmanesh · Sara Pieri · Ethan Kosak-Hine · Tom Gibbs · Reihaneh Rabbany · Kellin Pelrine
- | Failures to Find Transferable Image Jailbreaks Between Vision-Language Models (Poster) | Rylan Schaeffer · Dan Valentine · Luke Bailey · James Chua · Zane Durante · Cristobal Eyzaguirre · Joe Benton · Brando Miranda · Henry Sleight · Tony Wang · John Hughes · Rajashree Agrawal · Mrinank Sharma · Scott Emmons · Sanmi Koyejo · Ethan Perez
- | SkewAct: Red Teaming Large Language Models via Activation-Skewed Adversarial Prompt Optimization (Poster) | Hanxi Guo · Siyuan Cheng · Guanhong Tao · Guangyu Shen · Zhuo Zhang · Shengwei An · Kaiyuan Zhang · Xiangyu Zhang
- | Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs (Poster) | Aly Kassem · Omar Mahmoud · Niloofar Mireshghallah · Hyunwoo Kim · Yulia Tsvetkov · Yejin Choi · Sherif Saad · Santu Rana
- | TOFU: A Task of Fictitious Unlearning for LLMs (Poster) | Pratyush Maini · Zhili Feng · Avi Schwarzschild · Zachary Lipton · J. Zico Kolter
- | iART - Imitation guided Automated Red Teaming (Poster) | Sajad Mousavi · Desik Rengarajan · Ashwin Ramesh Babu · Vineet Gundecha · Avisek Naug · Sahand Ghorbanpour · Ricardo Luna Gutierrez · Antonio Guillen-Perez · Paolo Faraboschi · Soumyendu Sarkar
- | Does Refusal Training in LLMs Generalize to the Past Tense? (Poster) | Maksym Andriushchenko · Nicolas Flammarion
- | Plentiful Jailbreaks with String Compositions (Poster) | Brian Huang
- | Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models (Poster) | Hongfu Liu · Yuxi Xie · Ye Wang · Michael Qizhe Shieh
- | CopyBench: Measuring Literal and Non-Literal Reproduction of Copyright-Protected Text in Language Model Generation (Poster) | Tong Chen · Akari Asai · Niloofar Mireshghallah · Sewon Min · James Grimmelmann · Yejin Choi · Hannaneh Hajishirzi · Luke Zettlemoyer · Pang Wei Koh
- | Curiosity-driven Red teaming for Large Language Models (Poster) | Zhang-Wei Hong · Idan Shenfeld · Tsun-Hsuan Johnson Wang · Yung-Sung Chuang · Aldo Pareja · Jim Glass · Akash Srivastava · Pulkit Agrawal
- | Adversarial Negotiation Dynamics in Generative Language Models (Poster) | Arinbjörn Kolbeinsson · Benedikt Kolbeinsson
- | LLM-Assisted Red Teaming of Diffusion Models through "Failures Are Fated, But Can Be Faded" (Poster) | Som Sagar · Aditya Taparia · Ransalu Senanayake
- | Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding (Poster) | Haneul Yoo · Yongjin Yang · Hwaran Lee
- | What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks (Poster) | Nathalie Kirch · Severin Field · Stephen Casper
- | Algorithmic Oversight for Deceptive Reasoning (Poster) | Ege Onur Taga · Mingchen Li · Yongqi Chen · Samet Oymak
- | A Formal Framework for Assessing and Mitigating Emergent Security Risks in Generative AI Models: Bridging Theory and Dynamic Risk Mitigation (Poster) | Aviral Srivastava · Sourav Panda