Workshop on Responsibly Building Next Generation of Multimodal Foundation Models
Maitreya Patel · Changhoon Kim · Siwon Kim · Chaowei Xiao · Zhe Gan · Yezhou 'YZ' Yang
West Meeting Room 217-219
Sat 14 Dec, 8:15 a.m. PST
The rapid evolution of multimodal foundation models, capable of processing and generating language, images, video, and audio, has transformed numerous fields, including robotics, healthcare, and AI-driven media. However, these advancements bring forth significant challenges related to reliability, security, and societal impact. Instances of model hallucinations and the inadvertent generation of harmful content by Text-to-Image (T2I) models underscore the need for responsible and sustainable development practices.

Our workshop aims to address these critical issues by establishing design principles that prioritize precautionary measures over reactive solutions. We will explore methodologies to enhance the reliability and robustness of multimodal models, focusing on fairness, security, and the mitigation of misinformation. By emphasizing preemptive strategies during dataset curation and model pre-training, we aim to reduce the extensive resource demands traditionally associated with iterative refinement processes.

Key topics of discussion will include the identification of reliability concerns stemming from data quality, model architecture, and training strategies. Additionally, we will explore novel design principles that ensure the responsible and sustainable advancement of multimodal generative models. Our goal is to foster a collaborative environment where leading researchers and practitioners can develop actionable frameworks that align with ethical standards and maximize societal benefits.

Through keynote talks, panel discussions, and interactive sessions, this workshop will provide a comprehensive platform for the AI community to converge on best practices for building the next generation of multimodal foundation models. We seek to ensure these models are not only technologically advanced but also secure, equitable, and environmentally sustainable.
Schedule
Sat 8:15 a.m. - 8:30 a.m. | Opening Remarks
Sat 8:30 a.m. - 9:10 a.m. | Keynote 1: Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models (Invited Talk) | Aniruddha Kembhavi
Sat 9:10 a.m. - 9:50 a.m. | Keynote 2: Understanding How Knowledge Can Be Localized, Unlearned, or Verified in Foundation Models (Invited Talk) | Soheil Feizi
Sat 9:50 a.m. - 10:45 a.m. | Poster Session 1
Sat 10:45 a.m. - 12:15 p.m. | Oral Presentations (Spotlight)
Sat 10:45 a.m. - 11:00 a.m. | When Do Universal Image Jailbreaks Transfer Between Vision-Language Models? (Oral) | Rylan Schaeffer · Dan Valentine · Luke Bailey · James Chua · Cristobal Eyzaguirre · Zane Durante · Joe Benton · Brando Miranda · Henry Sleight · Tony Wang · John Hughes · Rajashree Agrawal · Mrinank Sharma · Scott Emmons · Sanmi Koyejo · Ethan Perez
Sat 11:00 a.m. - 11:15 a.m. | PopAlign: Population-Level Alignment for Fair Text-to-Image Generation (Oral) | Shufan Li · Aditya Grover · Harkanwar Singh
Sat 11:15 a.m. - 11:30 a.m. | LLAVAGUARD: VLM-based Safeguards for Vision Dataset Curation and Safety Assessment (Oral) | Lukas Helff · Felix Friedrich · Manuel Brack · Kristian Kersting · Patrick Schramowski
Sat 11:30 a.m. - 11:45 a.m. | Multimodal Situational Safety (Oral) | Kaiwen Zhou · Chengzhi Liu · Xuandong Zhao · Anderson Compalas · Xin Eric Wang
Sat 11:45 a.m. - 12:00 p.m. | MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs (Oral) | Wenqian Ye · Guangtao Zheng · Yunsheng Ma · Xu Cao · Bolin Lai · James Rehg · Aidong Zhang
Sat 12:00 p.m. - 12:15 p.m. | Consistency-diversity-realism Pareto fronts of conditional image generative models (Oral) | Pietro Astolfi · Melissa Hall · Jakob Verbeek · Marlene Careil · Oscar Mañas · Matthew Muckley · Adriana Romero · Michal Drozdzal
Sat 12:15 p.m. - 1:15 p.m. | Lunch Break
Sat 1:15 p.m. - 1:55 p.m. | Keynote 3: Risk assessment, safety alignment, and guardrails for multimodal foundation models (Invited Talk) | Bo Li
Sat 2:00 p.m. - 2:45 p.m. | Panel Discussion
Sat 2:45 p.m. - 3:30 p.m. | Coffee Break + Poster Session 2
Sat 3:30 p.m. - 4:10 p.m. | Keynote 4: TextAttack for Improving Toxicity Detectors’ Adversarial Robustness (Invited Talk) | Yanjun Qi
Sat 4:10 p.m. - 4:50 p.m. | Keynote 5: Responsibility, Robustness, and Interpretability in the era of Generative AI (Invited Talk) | David Bau
Sat 4:50 p.m. - 5:15 p.m. | Closing Remarks and Awards
- | Towards Secure and Private AI: A Framework for Decentralized Inference (Poster) | Hongyang Zhang · Yue Zhao · Harry Yang · Ahmad Farhan · Fielding Johnston
- | Position Paper: Decentralized Frontier Risk and the No-Off Problem (Poster) | Alexander Long
- | Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model (Poster) | Wenqi Zhang · Zhenglin Cheng · Yuanyu He · Mengna Wang · Yongliang Shen · Zeqi Tan · Guiyang Hou · Mingqian He · Yanna Ma · Weiming Lu · Yueting Zhuang
- | MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs (Poster) | Saeid Asgari · Aliasghar Khani · Amir Khasahmadi
- | Skipping Computations in Multimodal LLMs (Poster) | Mustafa Shukor · Matthieu Cord
- | Aligning to What? Limits to RLHF Based Alignment (Poster) | Logan Barnhart · Reza Akbarian Bafghi · Maziar Raissi · Stephen Becker
- | Exploring Intrinsic Fairness in Stable Diffusion (Poster) | Eunji Kim · Siwon Kim · Robin Rombach · Rahim Entezari · Sungroh Yoon
- | MediConfusion: Can you trust your AI radiologist? Probing the reliability of multimodal medical foundation models (Poster) | Mohammad Shahab Sepehri · Zalan Fabian · Maryam Soltanolkotabi · Mahdi Soltanolkotabi
- | Building and better understanding vision-language models: insights and future directions (Poster) | Hugo Laurençon · Andrés Marafioti · Victor Sanh · Leo Tronchon
- | Rethinking Artistic Copyright Infringements in the Era of Text-to-Image Generative Models (Poster) | Mazda Moayeri · Samyadeep Basu · Sriram Balasubramanian · Priyatham Kattakinda · Atoosa Chegini · Robert Brauneis · Soheil Feizi
- | How to Determine the Preferred Image Distribution of a Black-Box Vision-Language Model? (Poster) | Saeid Asgari · Joseph G Lambourne · Alana Mongkhounsavath
- | BigDocs: A Permissively-Licensed Dataset for Training Vision-Language Models on Document and Code Tasks (Poster) | Juan Rodriguez · Xiangru Jian · Siba Smarak Panigrahi · Tianyu Zhang · Aarash Feizi · Abhay Puri · Akshay Kalkunte Suresh · François Savard · Amirhossein Abaskohi · Ahmed Masry · Shravan Nayak · Mahsa Massoud · Rabiul Awal · Pierre-André Noël · Mats L Richter · Saverio Vadacchino · Shubham Agarwal · Sanket Biswas · Ying Zhang · Sathwik Tejaswi Madhusudhan · Joao Monteiro · Krishnamurthy Dvijotham · Torsten Scholak · Nicolas Chapados · Sean Hughes · M. Tamer Özsu · Aishwarya Agrawal · Marco Pedersoli · Chris Pal · Perouz Taslakian · David Vazquez · Issam Hadj Laradji · Spandana Gella · Sai Rajeswar Mudumba
- | Trust but Verify: Reliable VLM evaluation in-the-wild with program synthesis (Poster) | Viraj Uday Prabhu · Senthil Purushwalkam · Jieyu Zhang · An Yan · Caiming Xiong · Ran Xu
- | Comparison Visual Instruction Tuning (Poster) | Wei Lin · Muhammad Jehanzeb Mirza · Sivan Doveh · Rogerio Feris · Raja Giryes · Sepp Hochreiter · Leonid Karlinsky
- | Adversarial Robust Deep Reinforcement Learning is Neither Robust Nor Safe (Poster) | Ezgi Korkmaz
- | Attention Shift: Steering AI Away from Unsafe Content (Poster) | Shivank Garg · Manyana Tiwari
- | LEMoN: Label Error Detection using Multimodal Neighbors (Poster) | Haoran Zhang · Aparna Balagopalan · Nassim Oufattole · Hyewon Jeong · Yan Wu · Jiacheng Zhu · Marzyeh Ghassemi
- | GUIDE: A Responsible Multimodal Approach for Enhanced Glaucoma Risk Modeling and Patient Trajectory Analysis (Poster) | Heman Shakeri · Behnaz Moradijamei
- | The Multi-faceted Monosemanticity in Multimodal Representations (Poster) | Hanqi Yan · Yulan He · Yifei Wang
- | You Never Know: Quantization Induces Inconsistent Biases in Vision-Language Foundation Models (Poster) | Eric Slyman · Anirudh Kanneganti · Sanghyun Hong · Stefan Lee
- | Decompose, Recompose, and Conquer: Multi-modal LLMs are Vulnerable to Compositional Adversarial Attacks in Multi-Image Queries (Poster) | Julius Broomfield · George Ingebretsen · Reihaneh Iranmanesh · Sara Pieri · Ethan Kosak-Hine · Tom Gibbs · Reihaneh Rabbany · Kellin Pelrine
- | Coordinated Robustness Evaluation Framework for Vision Language Models (Poster) | Ashwin Ramesh Babu · Sajad Mousavi · Desik Rengarajan · Vineet Gundecha · Sahand Ghorbanpour · Avisek Naug · Antonio Guillen-Perez · Ricardo Luna Gutierrez · Soumyendu Sarkar
- | WikiDO: A New Benchmark Evaluating Cross-Modal Retrieval for Vision-Language Models (Poster) | Pavan Kalyan Tankala · Piyush Pasi · Sahil Dharod · Azeem Motiwala · Preethi Jyothi · Aditi Chaudhary · Krishna Srinivasan
- | Probabilistic Active Few-Shot Learning in Vision-Language Models (Poster) | Anton Baumann · Marcus Klasson · Rui Li · Arno Solin · Martin Trapp
- | Just rephrase it! Uncertainty estimation in closed-source language models via multiple rephrased queries (Poster) | Adam Yang · Chen Chen · Konstantinos Pitas
- | Seeing Through Their Eyes: Evaluating Visual Perspective Taking in Vision Language Models (Poster) | Gracjan Góral · Alicja Ziarko · Michal Nauman · Maciej Wolczyk
- | CrossCheckGPT: Universal Hallucination Ranking for Multimodal Foundation Models (Poster) | Guangzhi Sun · Potsawee Manakul · Adian Liusie · Kunat Pipatanakul · Chao Zhang · Phil Woodland · Mark Gales
- | Incorporating Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models (Poster) | Ce Zhang · Zifu Wan · Zhehan Kan · Martin Q. Ma · Simon Stepputtis · Deva Ramanan · Ruslan Salakhutdinov · Louis-Philippe Morency · Katia Sycara · Yaqi Xie