Safe Generative AI Workshop
Dianbo Liu · Ling Pan · Tailin Wu · Bonaventure F. P. Dossou · Emmanuel Bengio · Yilun Du · Dinghuai Zhang · Yoshua Bengio
East Exhibition Hall A
Sun 15 Dec, 9 a.m. PST
In the past two years, generative AI has been the major driving force behind the development of advanced AI products such as ChatGPT, AlphaFold, and Stable Diffusion. While these technologies have substantially improved productivity for many, they have also raised serious safety concerns, and no workshop in the past two years has focused on this topic. A workshop dedicated to the safety of generative AI is therefore much needed by the community. Generative AI, including large language models, vision-language models, diffusion models, and many more, has significantly aided various aspects of both academia and industry. In scientific discovery, these contributions span experimental design, hypothesis formulation, theoretical reasoning, and observation organization. In commercial applications, generative models such as large language models and diffusion models have changed the lifestyles and workflows of billions around the world. This workshop aims to convene experts from various fields to address these safety challenges and explore potential solutions.
Schedule
Sun 9:00 a.m. - 9:40 a.m. | Opening remarks by Prof. Yoshua Bengio (talk)
Sun 9:40 a.m. - 10:20 a.m. | Talk by Prof. Max Tegmark (talk)
Sun 10:20 a.m. - 11:00 a.m. | Talk by Prof. Chelsea Finn (talk)
Sun 11:00 a.m. - 11:40 a.m. | Talk by Prof. Dawn Song (talk)
Sun 11:40 a.m. - 1:30 p.m. | Lunch break
Sun 1:30 p.m. - 1:40 p.m. | Model Pairing Using Embedding Translation for Backdoor Attack Detection on Open-Set Classification Tasks (oral)
Sun 1:40 p.m. - 1:50 p.m. | On Calibration of LLM-based Guard Models for Reliable Content Moderation (oral)
Sun 1:50 p.m. - 2:00 p.m. | Controllable Generation via Locally Constrained Resampling (oral)
Sun 2:00 p.m. - 2:10 p.m. | Who Speaks Matters: Analysing the Influence of the Speaker’s Ethnicity on Hate Classification (oral)
Sun 2:10 p.m. - 2:20 p.m. | The effect of fine-tuning on language model toxicity (oral)
Sun 2:20 p.m. - 2:30 p.m. | GuardFormer: Guardrail Instruction Pretraining for Efficient SafeGuarding (oral)
Sun 2:30 p.m. - 2:40 p.m. | Towards Safe and Honest AI Agents with Neural Self-Other Overlap (oral)
Sun 2:40 p.m. - 2:50 p.m. | Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs (oral)
Sun 2:50 p.m. - 3:00 p.m. | Does Refusal Training in LLMs Generalize to the Past Tense? (oral)
Sun 3:00 p.m. - 5:00 p.m. | Poster session
- HSpace Sparse Autoencoders (Poster) | Ayodeji Ijishakin · Ming Ang · Levente Baljer · Daniel Tan · Hugo Fry · Ahmed Abdulaal · Aengus Lynch
- Measuring Steerability in Large Language Models (Poster) | Trenton Chang · Jenna Wiens · Tobias Schnabel · Adith Swaminathan
- Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge (Poster) | Jiayi Ye · Yanbo Wang · Yue Huang · Dongping Chen · Qihui Zhang · Nuno Moniz · Tian Gao · Werner Geyer · Chao Huang · Pin-Yu Chen · Nitesh Chawla · Xiangliang Zhang
- Towards Safe and Honest AI Agents with Neural Self-Other Overlap (Poster) | Marc Carauleanu · Michael Vaiana · Diogo de Lucena · Judd Rosenblatt · Cameron Berg
- Towards Safe and Honest AI Agents with Neural Self-Other Overlap (Oral) | Marc Carauleanu · Michael Vaiana · Diogo de Lucena · Judd Rosenblatt · Cameron Berg
- Model Pairing Using Embedding Translation for Backdoor Attack Detection on Open-Set Classification Tasks (Poster) | Alex Unnervik · Hatef Otroshi Shahreza · Anjith George · Sébastien Marcel
- Model Pairing Using Embedding Translation for Backdoor Attack Detection on Open-Set Classification Tasks (Oral) | Alex Unnervik · Hatef Otroshi Shahreza · Anjith George · Sébastien Marcel
- Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models (Poster) | Neel Jain · Aditya Shrivastava · Chenyang Zhu · Daben Liu · Alfy Samuel · Ashwinee Panda · Anoop Kumar · Micah Goldblum · Tom Goldstein
- GuardFormer: Guardrail Instruction Pretraining for Efficient SafeGuarding (Poster) | James O' Neill · Santhosh Subramanian · Eric Lin · Abishek Satish · Vaikkunth Mugunthan
- GuardFormer: Guardrail Instruction Pretraining for Efficient SafeGuarding (Oral) | James O' Neill · Santhosh Subramanian · Eric Lin · Abishek Satish · Vaikkunth Mugunthan
- Hidden in the Noise: Two-Stage Robust Watermarking for Images (Poster) | Kasra Arabi · Benjamin Feuer · R. Teal Witter · Chinmay Hegde · Niv Cohen
- Auditing Empirical Privacy Protection of Private LLM Adaptations (Poster) | Bartłomiej Marek · Vincent Hanke · Xun Wang · Michael Backes · Adam Dziedzic · Franziska Boenisch
- Integrating Object Detection Modality into Visual Language Model for Enhanced Autonomous Driving Agent (Poster) | Linfeng He · Yiming Sun · Sihao Wu · Jiaxu Liu · Xiaowei Huang
- Controllable Generation via Locally Constrained Resampling (Poster) | Kareem Ahmed · Kai-Wei Chang · Guy Van den Broeck
- Controllable Generation via Locally Constrained Resampling (Oral) | Kareem Ahmed · Kai-Wei Chang · Guy Van den Broeck
- Retention Score: Quantifying Jailbreak Risks for Vision Language Models (Poster) | Zaitang Li · Pin-Yu Chen · Tsung-Yi Ho
- The Impact of Inference Acceleration Strategies on Bias of Large Language Models (Poster) | Elisabeth Kirsten · Ivan Habernal · Vedant Nanda · Muhammad Bilal Zafar
- AnyPrefer: An Automatic Framework for Preference Data Synthesis (Poster) | Yiyang Zhou · Zhaoyang Wang · Tianle Wang · Shangyu Xing · Peng Xia · Bo Li · Kaiyuan Zheng · Zijian Zhang · Zhaorun Chen · Wenhao Zheng · Xuchao Zhang · Chetan Bansal · Weitong Zhang · Ying Wei · Mohit Bansal · Huaxiu Yao
- Steering Without Side Effects: Improving Post-Deployment Control of Language Models (Poster) | Asa Cooper Stickland · Aleksandr Lyzhov · Jacob Pfau · Salsabila Mahdi · Samuel Bowman
- Safe and Sound: Evaluating Language Models for Bias Mitigation and Understanding (Poster) | Shaina Raza · Deval Pandya · Shardul Ghuge · Nifemi
- Investigating Implicit Bias in Large Language Models: A Large-Scale Study of Over 50 LLMs (Poster) | Divyanshu Kumar · Umang Jain · Sahil Agarwal · Prashanth Harshangi
- Self-Preference Bias in LLM-as-a-Judge (Poster) | Koki Wataoka · Tsubasa Takahashi · Ryokan Ri
- Zer0-Jack: A memory-efficient gradient-based jailbreaking method for black box Multi-modal Large Language Models (Poster) | Tiejin Chen · Kaishen Wang · Hua Wei
- The Probe Paradigm: A Theoretical Foundation for Explaining Generative Models (Poster) | Amit Rege
- LLM Improvement for Jailbreak Defense: Analysis Through the Lens of Over-Refusal (Poster) | Swetasudha Panda · Naveen Jafer Nizar · Michael Wick
- Network Inversion for Training-Like Data Reconstruction (Poster) | Pirzada Suhail · Amit Sethi
- Lexically-constrained automated prompt augmentation: A case study using adversarial T2I data (Poster) | Jessica Quaye · Alicia Parrish · Oana Inel · Minsuk Kahng · Charvi Rastogi · Hannah Rose Kirk · Jess Tsang · Nathan Clement · Rafael Mosquera-Gomez · Juan Ciro · Vijay Janapa Reddi · Lora Aroyo
- Detecting Origin Attribution for Text-to-Image Diffusion Models in RGB and Beyond (Poster) | Katherine Xu · Lingzhi Zhang · Jianbo Shi
- GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence (Poster) | Kundan Krishna · Sanjana Ramprasad · Prakhar Gupta · Byron Wallace · Zachary Lipton · Jeffrey Bigham
- The Structural Safety Generalization Problem (Poster) | Tom Gibbs · Julius Broomfield · George Ingebretsen · Ethan Kosak-Hine · Tia Nasir · Jason Zhang · Reihaneh Iranmanesh · Sara Pieri · Reihaneh Rabbany · Kellin Pelrine
- Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning (Poster) | Chongyu Fan · Jiancheng Liu · Licong Lin · Jinghan Jia · Ruiqi Zhang · Song Mei · Sijia Liu
- Keep on Swimming: Real Attackers Only Need Partial Knowledge of a Multi-Model System (Poster) | Julian Collado · Kevin Stangl
- Debiasing Large Vision-Language Models by Ablating Protected Attribute Representations (Poster) | Neale Ratzlaff · Matthew Olson · Musashi Hinck · Shao-Yen Tseng · Vasudev Lal · Phillip Howard
- GRE Score: Generative Risk Evaluation for Large Language Models (Poster) | Zaitang Li · Mohamed Mouhajir · Pin-Yu Chen · Tsung-Yi Ho
- Identifying and Addressing Delusions for Target-Directed Decision Making (Poster) | Mingde Zhao · Tristan Sylvain · Doina Precup · Yoshua Bengio
- Cream: Consistency Regularized Self-Rewarding Language Models (Poster) | Zhaoyang Wang · Weilei He · Zhiyuan Liang · Xuchao Zhang · Chetan Bansal · Ying Wei · Weitong Zhang · Huaxiu Yao
- Fine-Tuning Large Language Models to Appropriately Abstain with Semantic Entropy (Poster) | Benedict Aaron Tjandra · Muhammed Razzak · Jannik Kossen · Yarin Gal
- Epistemic Integrity in Large Language Models (Poster) | Bijean Ghafouri · Shahrad Mohammadzadeh · James Zhou · Pratheeksha Nair · Jacob-Junqi Tian · Mayank Goel · Reihaneh Rabbany · Jean-François Godbout · Kellin Pelrine
- An Adversarial Behavior Model for Contextual Ethical Alignment in Large Language Models (Poster) | Edward Chang
- Differentially Private Sequential Data Synthesis with Structured State Space Models and Diffusion Models (Poster) | Tomoya Matsumoto · Takayuki Miura · Toshiki Shibahara · Masanobu Kii · Kazuki Iwahana · Osamu Saisho · Shingo Okamura
- Do LLMs estimate uncertainty well in instruction-following? (Poster) | Juyeon Heo · Miao Xiong · Christina Heinze-Deml · Jaya Narain
- Concept Unlearning for Large Language Models (Poster) | Tomoya Yamashita · Takayuki Miura · Yuuki Yamanaka · Toshiki Shibahara · Masanori Yamada
- Mitigating Hallucinations in LVLMs via Summary-Guided Decoding (Poster) | Kyungmin Min · Minbeom Kim · Kang-il Lee · Dongryeol Lee · Kyomin Jung
- HyperFace: Generating Synthetic Face Recognition Datasets by Exploring Face Embedding Hypersphere (Poster) | Hatef Otroshi Shahreza · Sébastien Marcel
- Permute-and-Flip: An optimally stable and watermarkable decoder for LLMs (Poster) | Xuandong Zhao · Lei Li · Yu-Xiang Wang
- Investigating LLM Memorization: Bridging Trojan Detection and Training Data Extraction (Poster) | Manoj Acharya · Xiao Lin · Susmit Jha
- DiffTextPure: Defending Large Language Models with Diffusion Purifiers (Poster) | Huanran Chen · Ziruo Wang · Yihan Yang · Shuo Zhang · Zeming Wei · Fusheng Jin · Yinpeng Dong
- Which LLMs are Difficult to Detect? A Detailed Analysis of Potential Factors Contributing to Difficulties in LLM Text Detection (Poster) | Shantanu Thorat · Tianbao Yang
- Can Generative AI Solve Your In-Context Learning Problem? A Martingale Perspective (Poster) | Andrew Jesson · Nicolas Beltran Velez · David Blei
- On the Protocol for Evaluating Uncertainty in Generative Question-Answering Tasks (Poster) | Andrea Santilli · Miao Xiong · Michael Kirchhof · Pau Rodriguez · Federico Danieli · Xavier Suau · Luca Zappella · Sinead Williamson · Adam Golinski
- Pruning for Robust Concept Erasing in Diffusion Models (Poster) | Tianyun Yang · Ziniu Li · Juan Cao · Chang Xu
- Concept Denoising Score Matching for Responsible Text-to-Image Generation (Poster) | Silpa Vadakkeeveetil Sreelatha · Sauradip Nag · Serge Belongie · Muhammad Awais · Anjan Dutta
- Applying Sparse Autoencoders to Unlearn Knowledge in Language Models (Poster) | Eoin Farrell · Yeu-Tong Lau · Arthur Conmy
- Can Knowledge Editing Really Correct Hallucinations? (Poster) | Baixiang Huang · Canyu Chen · Xiongxiao Xu · Ali Payani · Kai Shu
- Imitation guided Automated Red Teaming (Poster) | Desik Rengarajan · Sajad Mousavi · Ashwin Ramesh Babu · Vineet Gundecha · Avisek Naug · Sahand Ghorbanpour · Antonio Guillen-Perez · Ricardo Luna Gutierrez · Soumyendu Sarkar
- Improving LLM Group Fairness on Tabular Data via In-Context Learning (Poster) | Valeriia Cherepanova · Chia-Jung Lee · Nil-Jana Akpinar · Riccardo Fogliato · Martin Bertran · Michael Kearns · James Zou
- Is Your Paper Being Reviewed by an LLM? Investigating AI Text Detectability in Peer Review (Poster) | Sungduk Yu · Man Luo · Avinash Madasu · Vasudev Lal · Phillip Howard
- Can Editing LLMs Inject Harm? (Poster) | Canyu Chen · Baixiang Huang · Zekun Li · Zhaorun Chen · Shiyang Lai · Xiongxiao Xu · Jia-Chen Gu · Jindong Gu · Huaxiu Yao · Chaowei Xiao · Xifeng Yan · William Yang Wang · Philip Torr · Dawn Song · Kai Shu
- Targeted Unlearning with Single Layer Unlearning Gradient (Poster) | Zikui Cai · Yaoteng Tan · M. Salman Asif
- Stronger Universal and Transfer Attacks by Suppressing Refusals (Poster) | David Huang · Avidan Shah · Alexandre Araujo · David Wagner · Chawin Sitawarin
- Weak-to-Strong Confidence Prediction (Poster) | Yukai Yang · Tracy Zhu · Marco Morucci · Tim G. J. Rudner
- Fair Image Generation from Pre-trained Models by Probabilistic Modeling (Poster) | Mahdi Ahmadi · John Leland · Agneet Chatterjee · YooJung Choi
- Differentially Private Attention Computation (Poster) | Yeqi Gao · Zhao Song · Xin Yang · Yufa Zhou
- Has My System Prompt Been Used? Large Language Model Prompt Membership Inference (Poster) | Roman Levin · Valeriia Cherepanova · Abhimanyu Hans · Avi Schwarzschild · Tom Goldstein
- Red Teaming Language-Conditioned Robot Models via Vision Language Models (Poster) | Sathwik Karnik · Zhang-Wei Hong · Nishant Abhangi · Yen-Chen Lin · Tsun-Hsuan Johnson Wang · Pulkit Agrawal
- Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data (Poster) | Spencer Whitehead · Jacob Phillips · Sean Hendryx
- Privacy Protection in Personalized Diffusion Models via Targeted Cross-Attention Adversarial Attack (Poster) | Xide Xu · Muhammad Atif Butt · Sandesh Kamath · Bogdan Raducanu
- DeepInception: Hypnotize Large Language Model to Be Jailbreaker (Poster) | Xuan Li · Zhanke Zhou · Jianing Zhu · Jiangchao Yao · Tongliang Liu · Bo Han
- HEARTS: A Holistic Framework for Explainable, Sustainable and Robust Text Stereotype Detection (Poster) | Theo King · Zekun Wu · Adriano Koshiyama · Emre Kazim · Philip Treleaven
- Testing the Limits of Jailbreaking with the Purple Problem (Poster) | Taeyoun Kim · Suhas Kotha · Aditi Raghunathan
- Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models (Poster) | Xiaomeng Hu · Pin-Yu Chen · Tsung-Yi Ho
- How to Determine the Preferred Image Distribution of a Black-Box Vision-Language Model? (Poster) | Saeid Asgari · Joseph G Lambourne · Alana Mongkhounsavath
- Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs (Poster) | Giulio Zizzo · Giandomenico Cornacchia · Kieran Fraser · Muhammad Zaid Hameed · Ambrish Rawat · Beat Buesser · Mark Purcell · Pin-Yu Chen · Prasanna Sattigeri · Kush Varshney
- Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs (Oral) | Giulio Zizzo · Giandomenico Cornacchia · Kieran Fraser · Muhammad Zaid Hameed · Ambrish Rawat · Beat Buesser · Mark Purcell · Pin-Yu Chen · Prasanna Sattigeri · Kush Varshney
- PoisonedParrot: Subtle Data Poisoning Attacks to Elicit Copyright-Infringing Content from Large Language Models (Poster) | Michael-Andrei Panaitescu-Liess · Pankayaraj Pathmanathan · Yigitcan Kaya · Zora Che · Bang An · Sicheng Zhu · Aakriti Agrawal · Furong Huang
- Hallucination Detox: Sensitive Neuron Dropout (SeND) for Large Language Model Training (Poster) | Shahrad Mohammadzadeh · Juan D. Guerra · Marco Bonizzato · Reihaneh Rabbany · Golnoosh Farnadi
- Addressing Uncertainty in LLMs to Enhance Reliability in Generative AI (Poster) | Ramneet Kaur · Colin Samplawski · Adam Cobb · Anirban Roy · Brian Matejek · Manoj Acharya · Daniel Elenius · Alexander Berenbeim · John Pavlik · Nathaniel Bastian · Susmit Jha
- Jogging the Memory of Unlearned LLMs Through Targeted Relearning Attacks (Poster) | Shengyuan Hu · Yiwei Fu · Steven Wu · Virginia Smith
- A Closer Look at System Message Robustness (Poster) | Norman Mu · Jonathan Lu · Michael Lavery · David Wagner
- The effect of fine-tuning on language model toxicity (Poster) | Will Hawkins · Brent Mittelstadt · Chris Russell
- The effect of fine-tuning on language model toxicity (Oral) | Will Hawkins · Brent Mittelstadt · Chris Russell
- Universal Jailbreak Backdoors in Large Language Model Alignment (Poster) | Thomas Baumann
- Auto-Enhance: Towards a Meta-Benchmark to Evaluate AI Agents' Ability to Improve Other Agents (Poster) | Samuel Brown · Basil Labib · Codruta Lugoj · Sai Sasank Y
- Waste not, want not; Recycled Gumbel noise improves consistency in natural language generation (Poster) | Damien de Mijolla · Hannan Saddiq · Kim Moore
- Model Manipulation Attacks Enable More Rigorous Evaluations of LLM Unlearning (Poster) | Zora Che · Stephen Casper · Anirudh Satheesh · Rohit Gandikota · Domenic Rosati · Stewart Slocum · Lev McKinney · Zichu Wu · Zikui Cai · Bilal Chughtai · Furong Huang · Dylan Hadfield-Menell
- Large Language Model Benchmarks Do Not Test Reliability (Poster) | Joshua Vendrow · Edward Vendrow · Sara Beery · Aleksander Madry
- EchoQA: A Large Collection of Instruction Tuning Data for Echocardiogram Reports (Poster) | Lama Moukheiber · Mira Moukheiber · Dana Moukheiber · Jae-Woo Ju · Hyung-Chul Lee
- On Calibration of LLM-based Guard Models for Reliable Content Moderation (Poster) | Hongfu Liu · Hengguan Huang · Hao Wang · Xiangming Gu · Ye Wang
- On Calibration of LLM-based Guard Models for Reliable Content Moderation (Oral) | Hongfu Liu · Hengguan Huang · Hao Wang · Xiangming Gu · Ye Wang
- AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks (Poster) | Yifan Zeng · Yiran Wu · Xiao Zhang · Huazheng Wang · Qingyun Wu
- How Many Van Goghs Does It Take to Van Gogh? Finding the Imitation Threshold (Poster) | Sahil Verma · Royi Rassin · Arnav Das · Gantavya Bhatt · Preethi Seshadri · Chirag Shah · Jeff A Bilmes · Hannaneh Hajishirzi · Yanai Elazar
- Applying Refusal-Vector Ablation to Llama 3.1 70B Agents (Poster) | Simon Lermen · Mateusz Dziemian · Govind Pimpale
- Language Models Can Articulate Their Implicit Goals (Poster) | Jan Betley · Xuchan Bao · Martín Soto · Anna Sztyber-Betley · James Chua · Owain Evans
- Energy-Based Conceptual Diffusion Model (Poster) | Yi Qin · Xinyue Xu · Hao Wang · Xiaomeng Li
- MultiVerse: Exposing Large Language Model Alignment Problems in Diverse Worlds (Poster) | Xiaolong Jin · Zhuo Zhang · Guangyu Shen · Hanxi Guo · Kaiyuan Zhang · Siyuan Cheng · Xiangyu Zhang
- Who Speaks Matters: Analysing the Influence of the Speaker’s Ethnicity on Hate Classification (Poster) | Ananya Malik · Kartik Sharma · Lynnette Hui Xian Ng · Shaily Bhatt
- Who Speaks Matters: Analysing the Influence of the Speaker’s Ethnicity on Hate Classification (Oral) | Ananya Malik · Kartik Sharma · Lynnette Hui Xian Ng · Shaily Bhatt
- HalLoc: Token-level Localization of Hallucinations for Large Vision Language Models (Poster) | Eunkyu Park · Minyeong Kim · Gunhee Kim
- Safety-Aware Fine-Tuning of Large Language Models (Poster) | Hyeong Kyu Choi · Xuefeng Du · Sharon Li
- Buffer Overflow in Mixture of Experts (Poster) | Jamie Hayes · I Shumailov · Itay Yona
- Preserving Safety in Fine-Tuned Large Language Models: A Systematic Evaluation and Mitigation Strategy (Poster) | Tsung-Huan Yang · Ko-Wei Huang · Yung-Hui Li · Lun-Wei Ku
- Extracting Unlearned Information from LLMs with Activation Steering (Poster) | Atakan Seyitoğlu · Aleksei Kuvshinov · Leo Schwinn · Stephan Günnemann
- Privacy-Preserving Large Language Model Inference via GPU-Accelerated Fully Homomorphic Encryption (Poster) | Leo de Castro · Antigoni Polychroniadou · Daniel Escudero
- Datasets for Navigating Sensitive Topics in Preference Data and Recommendations (Poster) | Amelia Kovacs · Jerry Chee · Sarah Dean
- Can Safety Fine-Tuning Be More Principled? Lessons Learned from Cybersecurity (Poster) | David Williams-King · Linh Le · Adam Oberman · Yoshua Bengio
- Efficient and Effective Uncertainty Quantification for LLMs (Poster) | Miao Xiong · Andrea Santilli · Michael Kirchhof · Adam Golinski · Sinead Williamson
- EnsemW2S: Can an Ensemble of LLMs be Leveraged to Obtain a Stronger LLM? (Poster) | Aakriti Agrawal · Mucong Ding · Zora Che · Chenghao Deng · Anirudh Satheesh · John Langford · Furong Huang
- MED: Exploring LLM Memorization on Encrypted Data (Poster) | Panagiotis Christodoulou · Giulio Zizzo · Sergio Maffeis
- An Examination of AI-Generated Text Detectors Across Multiple Domains and Models (Poster) | Brian Tufts · Xuandong Zhao · Lei Li
- Towards Resource Efficient and Interpretable Bias Mitigation in Natural Language Generation (Poster) | Schrasing Tong · Eliott Zemour · Rawisara Lohanimit · Lalana Kagal
- NMT-Obfuscator Attack: Ignore a sentence in translation with only one word (Poster) | Sahar Sadrizadeh · César Descalzo · Ljiljana Dolamic · Pascal Frossard
- A Probabilistic Generative Method for Safe Physical System Control Problems (Poster) | Peiyan Hu · Xiaowei Qian · Wenhao Deng · Rui Wang · Haodong Feng · Ruiqi Feng · Tao Zhang · Long Wei · Yue Wang · Zhi-Ming Ma · Tailin Wu
- MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models (Poster) | Peng Xia · Kangyu Zhu · Haoran Li · Tianze Wang · Weijia Shi · Sheng Wang · Linjun Zhang · James Zou · Huaxiu Yao
- PopAlign: Population-Level Alignment for Fair Text-to-Image Generation (Poster) | Shufan Li · Harkanwar Singh · Aditya Grover
- Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack (Poster) | Leo McKee-Reid · Christoph Sträter · Maria Martinez · Joe Needham · Mikita Balesni
- SCIURus: Shared Circuits for Interpretable Uncertainty Representations in Language Models (Poster) | Carter Teplica · Yixin Liu · Arman Cohan · Tim G. J. Rudner
- CoS: Enhancing Personalization and Mitigating Bias with Context Steering (Poster) | Sashrika Pandey · Jerry He · Mariah Schrum · Anca Dragan
- What You See Is What You Get: Entity-Aware Summarization for Reliable Sponsored Search (Poster) | Xiao Liang · Xinyu Hu · Simiao Zuo · Jimi He · Yu Wang · Victor Dong · Yeyun Gong · Kushal Dave · Yi Liu · Qiang Lou · Shao-Lun Huang · Jian Jiao
- How new data pollutes LLM knowledge and how to dilute it (Poster) | Chen Sun · Renat Aksitov · Andrey Zhmoginov · Nolan Miller · Max Vladymyrov · Ulrich Rueckert · Been Kim · Mark Sandler
- Mix Data or Merge Models? Optimizing for Performance and Safety in Multilingual Contexts (Poster) | Aakanksha · Arash Ahmadian · Seraphina Goldfarb-Tarrant · Beyza Ermis · Marzieh Fadaee · Sara Hooker
- Simulation System Towards Solving Societal-Scale Manipulation (Poster) | Maximilian Puelma Touzel · Sneheel Sarangi · Austin Welch · Gayatri K · Dan Zhao · Zachary Yang · Hao Yu · Tom Gibbs · Ethan Kosak-Hine · Andreea Musulan · Camille Thibault · Busra Gurbuz · Reihaneh Rabbany · Jean-François Godbout · Kellin Pelrine
- Red Teaming: Everything Everywhere All at Once (Poster) | Alexandra Chouldechova · A. Feder Cooper · Abhinav Palia · Dan Vann · Chad Atalla · Hannah Washington · Emily Sheng · Hanna Wallach
- Inference, Fast and Slow: Reinterpreting VAEs for OOD Detection (Poster) | Sicong (Sheldon) Huang · Jiawei He · Kry Yik Chau Lui
- The Empirical Impact of Data Sanitization on Language Models (Poster) | Anwesan Pal · Radhika Bhargava · Kyle Hinsz · Jacques Esterhuizen · Sudipta Bhattacharya
- HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment (Poster) | Yannis Belkhiter · Giulio Zizzo · Sergio Maffeis
- IncogniText: Privacy-enhancing Conditional Text Anonymization via LLM-based Private Attribute Randomization (Poster) | Ahmed Frikha · Nassim Walha · Krishna Nakka · Ricardo Mendes · Xue Jiang · Xuebing Zhou
- Quantifying Likeness: A Simple Machine Learning Approach to Identifying Copyright Infringement in (AI-Generated) Artwork (Poster) | Michaela Drouillard · Ryan Spencer · Nikée Nantambu-Allen · Tegan Maharaj
- An Undetectable Watermark for Generative Image Models (Poster) | Sam Gunn · Xuandong Zhao · Dawn Song
- RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation (Poster) | Kaiqu Liang · Haimin Hu · Ryan Liu · Tom Griffiths · Jaime Fisac
- Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates (Poster) | Xiaosen Zheng · Tianyu Pang · Chao Du · Qian Liu · Jing Jiang · Min Lin
- Semantic Membership Inference Attack against Large Language Models (Poster) | Hamid Mozaffari · Virendra Marathe
- Shallow Diffuse: Robust and Invisible Watermarking through Low-Dimensional Subspaces in Diffusion Models (Poster) | Wenda Li · Huijie Zhang · Qing Qu
- Just rephrase it! Uncertainty estimation in closed-source language models via multiple rephrased queries (Poster) | Adam Yang · Chen Chen · Konstantinos Pitas
- Differential Privacy of Cross-Attention with Provable Guarantee (Poster) | Yingyu Liang · Zhenmei Shi · Zhao Song · Yufa Zhou
- Smoothed Embeddings for Robust Language Models (Poster) | Ryo Hase · Rafi Rashid · Ashley Lewis · Jing Liu · Toshiaki Koike-Akino · Kieran Parsons · Ye Wang
- What do we learn from inverting CLIP models? (Poster) | Hamid Kazemi · Atoosa Chegini · Jonas Geiping · Soheil Feizi · Tom Goldstein
- Does Refusal Training in LLMs Generalize to the Past Tense? (Poster) | Maksym Andriushchenko · Nicolas Flammarion
- Does Refusal Training in LLMs Generalize to the Past Tense? (Oral) | Maksym Andriushchenko · Nicolas Flammarion
- LLM Targeted Underperformance Disproportionately Impacts Vulnerable Users (Poster) | Elinor Poole-Dayan · Deb Roy · Jad Kabbara
- AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment (Poster) | Pankayaraj Pathmanathan · Udari Sehwag · Michael-Andrei Panaitescu-Liess · Furong Huang
- Safe Decision Transformer with Learning-based Constraints (Poster) | Ruhan Wang · Dongruo Zhou
- MU-Bench: A Multitask Multimodal Benchmark for Machine Unlearning (Poster) | Jiali Cheng · Hadi Amiri
- Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy (Poster) | Tong Wu · Shujian Zhang · Kaiqiang Song · Silei Xu · Sanqiang Zhao · Ravi Agrawal · Sathish Indurthi · Chong Xiang · Prateek Mittal · Wenxuan Zhou
- Investigating Annotator Bias in Large Language Models for Hate Speech Detection (Poster) | Amit Das · Zheng Zhang · Md. Najib Hasan · Souvika Sarkar · Fatemeh Jamshidi · Tathagata Bhattacharya · Mostafa Rahgouy · Nilanjana Raychawdhary · Dongji Feng · Vinija Jain · Aman Chadha · Mary Sandage · Lauramarie Pope · Gerry Dozier · Cheryl Seals
- Towards Inference-time Category-wise Safety Steering for Large Language Models (Poster) | Amrita Bhattacharjee · Shaona Ghosh · Traian Rebedea · Christopher Parisien
- Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs (Poster) | Yohan Mathew · Ollie Matthews · Robert McCarthy · Joan Velja · Christian Schroeder de Witt · Dylan Cope · Nandi Schoots
- Decoding Diffusion: A Scalable Framework for Unsupervised Analysis of Latent Space Biases and Representations Using Natural Language Prompts (Poster) | Emily Zhixuan Zeng · Yuhao Chen · Alexander Wong
- Designing Physical-World Universal Attacks on Vision Transformers (Poster) | Mingzhen Shao
- Rethinking Adversarial Attacks as Protection Against Diffusion-based Mimicry (Poster) | Haotian Xue · Yongxin Chen
- Interpretability of LLM Deception: Universal Motif (Poster) | Wannan Yang · Chen Sun · Gyorgy Buzsaki
- Towards a Theory of AI Personhood (Poster) | Francis Ward
- Towards Scalable Exact Machine Unlearning Using Parameter-Efficient Fine-Tuning (Poster) | Somnath Basu Roy Chowdhury · Krzysztof M Choromanski · Arijit Sehanobish · Kumar Avinava Dubey · Snigdha Chaturvedi
- Exploring Memorization and Copyright Violation in Frontier LLMs: A Study of the New York Times v. OpenAI 2023 Lawsuit (Poster) | Joshua Freeman · Chloe Rippe · Edoardo Debenedetti · Maksym Andriushchenko
- Is What You Ask For What You Get? Investigating Concept Associations in Text-to-Image Models (Poster) | Salma Abdel Magid · Weiwei Pan · Simon Warchol · Grace Guo · Junsik Kim · Wanhua Li · Mahia Rahman · Hanspeter Pfister
- How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompt (Poster) | Yusu Qian · Haotian Zhang · Yinfei Yang · Zhe Gan
- Variational Diffusion Unlearning: a variational inference framework for unlearning in diffusion models (Poster) | Subhodip Panda · M S Varun · Shreyans Jain · Sarthak Kumar Maharana · Prathosh AP
- Memorization Detection Benchmark for Generative Image models (Poster) | Marc Molina · Felice Burn
- Dynamic Negative Guidance of Diffusion Models: Towards Immediate Content Removal (Poster) | Felix Koulischer · Johannes Deleu · Gabriel Raya · Thomas Demeester · Luca Ambrogioni
- Gaussian Splatting Under Attack: Investigating Adversarial Noise in 3D Objects (Poster) | Abdurrahman Zeybey · Mehmet Ergezer · Tommy Nguyen
- Choose Your Anchor Wisely: Effective Unlearning Diffusion Models via Concept Reconditioning (Poster) | Jingyu Zhu · Ruiqi Zhang · Licong Lin · Song Mei
- Insights on Disagreement Patterns in Multimodal Safety Perception across Diverse Rater Groups (Poster) | Charvi Rastogi · Tian Huey Teh · Pushkar Mishra · Roma Patel · Zoe Ashwood · Aida Mostafazadeh Davani · Mark Díaz · Michela Paganini · Alicia Parrish · Ding Wang · Vinodkumar Prabhakaran · Lora Aroyo · Verena Rieser
- Rule-Guided Language Model Alignment for Text Generation Management in Industrial Use Cases (Poster) | Shunichi Akatsuka · Aman Kumar · Xian Yeow Lee · Lasitha Vidyaratne · Dipanjan Ghosh · Ahmed Farahat
- ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates (Poster) | Fengqing Jiang · Zhangchen Xu · Luyao Niu · Bill Yuchen Lin · Radha Poovendran
- Efficiently Identifying Watermarked Segments in Mixed-Source Texts (Poster) | Xuandong Zhao · Chenwen Liao · Yu-Xiang Wang · Lei Li
- miniCodeProps: a Minimal Benchmark for Proving Code Properties (Poster) | Evan Lohn · Sean Welleck
- Adversarial Vulnerabilities in Large Language Models for Time Series Forecasting (Poster) | Fuqiang Liu · Sicong Jiang · Luis Miranda-Moreno · Seongjin Choi · Lijun Sun
- SolidMark: Evaluating Image Memorization in Generative Models (Poster) | Nicky Kriplani · Minh Pham · Gowthami Somepalli · Chinmay Hegde · Niv Cohen
- Self-Supervised Bisimulation Action Chunk Representation for Efficient RL (Poster) | Lei Shi · Jianye Hao · Hongyao Tang · Zibin Dong · Yan Zheng
- Anchored Optimization and Contrastive Revisions: Addressing Reward Hacking in Alignment (Poster) | Karel Doosterlinck · Winnie Xu · Chris Develder · Thomas Demeester · Amanpreet Singh · Christopher Potts · Douwe Kiela · Shikib Mehri
- Interactive Semantic Interventions for VLMs: A Human-in-the-Loop Investigation of VLM Failure (Poster) | Lukas Klein · Kenza Amara · Carsten Lüth · Hendrik Strobelt · Mennatallah El-Assady · Paul Jaeger
- Can LLMs Verify Arabic Claims? Evaluating the Arabic Fact-Checking Abilities of Multilingual LLMs (Poster) | Aryan Singhal · Ayushman Gupta · Ryan L Li · Evan Duan · Thomas Law · Veekshith Rao
- MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs (Poster) | Saeid Asgari · Aliasghar Khani · Amir Khasahmadi
- Unlearning in- vs. out-of-distribution data in LLMs under gradient-based methods (Poster) | Teodora Baluta · Gintare Karolina Dziugaite · Pascal Lamblin · Fabian Pedregosa · Danny Tarlow
- A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage (Poster) | Rui Xin · Niloofar Mireshghallah · Stella Li · Michael Duan · Hyunwoo Kim · Yejin Choi · Yulia Tsvetkov · Sewoong Oh · Pang Wei Koh
- Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity (Poster) | Rheeya Uppaal · Apratim Dey · Yiting He · Yiqiao Zhong · Junjie Hu
- LoReUn: Data Itself Implicitly Provides Cues to Improve Machine Unlearning (Poster) | Xiang Li · Qianli Shen · Haonan Wang · Kenji Kawaguchi
- CPSample: Classifier Protected Sampling for Guarding Training Data During Diffusion (Poster) | Joshua Kazdan · Hao Sun · Jiaqi Han · Felix Petersen · Frederick Vu · Stefano Ermon
- Mitigating Object Hallucination in Large Vision-Language Models via Image-Grounded Guidance (Poster) | Linxi Zhao · Yihe Deng · Weitong Zhang · Quanquan Gu
- Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts? (Poster) | Sravanti Addepalli · Yerram Varun · Arun Suggala · Karthikeyan Shanmugam · Prateek Jain
- AEGIS2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails (Poster) | Shaona Ghosh · Prasoon Varshney · Makesh Narsimhan Sreedhar · Aishwarya Padmakumar · Traian Rebedea · Jibin Varghese · Christopher Parisien