Workshop on Video-Language Models
Aiden Lee · Minjoon Seo · Sangdoo Yun · Sangho Lee · Jiasen Lu · Md Mohaiminul Islam · Yanbei Chen · Linjie Li
East Meeting Room 13
Sat 14 Dec, 9:20 a.m. PST
The growing importance of video-language models in both academia and industry calls for a dedicated workshop to address the unique challenges and opportunities of this field. This workshop is designed to accelerate the development and practical application of video foundation models, which are crucial for interpreting and utilizing the vast amounts of video data that make up a significant portion of global data. These models are increasingly vital for a range of applications, from video search and content creation to surveillance and robotics. Confirmed speakers are leading researchers in the field from UT Austin, the University of Tübingen, and the University of Bristol (tentative), as well as prominent industry figures from Meta, Google DeepMind, and Microsoft, ensuring a rich exchange of knowledge. The organizing team, drawn from universities, industry, and non-profit research institutes, aims to foster broad participation and collaboration. The workshop seeks to push the boundaries of video-language models while ensuring their development and deployment remain ethical and responsible, and will serve as a platform for sharing knowledge, fostering collaborations, and setting future research directions in this rapidly advancing field.
Schedule
Sat 9:20 a.m. - 9:30 a.m. | Opening Remarks
Sat 9:30 a.m. - 10:10 a.m. | Invited Talk 1 (Speaker: Dima Damen)
Sat 10:10 a.m. - 10:50 a.m. | Invited Talk 2 (Speaker: Gedas Bertasius)
Sat 10:50 a.m. - 11:30 a.m. | Invited Talk 3 (Speaker: Yong Jae Lee)
Sat 11:30 a.m. - 1:00 p.m. | Lunch Break
Sat 1:00 p.m. - 1:50 p.m. | Oral Session
Sat 1:50 p.m. - 2:30 p.m. | Invited Talk 4 (Speaker: Ishan Misra)
Sat 2:30 p.m. - 3:30 p.m. | Poster Session
Sat 3:00 p.m. - 3:30 p.m. | Break
Sat 3:30 p.m. - 4:10 p.m. | Invited Talk 5 (Speaker: Jianwei Yang)
Sat 4:10 p.m. - 4:50 p.m. | Invited Talk 6 (Speaker: Doyup Lee)
Sat 4:50 p.m. - 5:20 p.m. | Panel Discussion
Sat 5:20 p.m. - 5:30 p.m. | Closing Remarks
Accepted Papers
- Exploring In-Context Ensemble with Video-Language Models for Low-Level Workflow Understanding (Poster) | Moucheng Xu · Evangelos Chatzaroulas · Luc McCutcheon · Abdul Ahad · Hamzah Azeem · Janusz Marecki · Ammar Anwar
- VideoPhy: Evaluating Physical Commonsense for Video Generation (Oral) | Hritik Bansal · Zongyu Lin · Tianyi Xie · Zeshun Zong · Michal Yarom · Yonatan Bitton · Chenfanfu Jiang · Yizhou Sun · Kai-Wei Chang · Aditya Grover
- Read, Watch and Scream! Sound Generation from Text and Video (Poster) | Yujin Jeong · Yunji Kim · Sanghyuk Chun · Jiyoung Lee
- Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties (Poster) | Keunwoo P Yu · Zheyuan Zhang · Fengyuan Hu · Shane Storks · Joyce Chai
- Dual-Model Distillation for Efficient Action Classification with Hybrid Edge-Cloud Solution (Poster) | Timothy Wei · Hsien Xin Peng · Elaine Xu · Bryan Zhao · Lei Ding · Diji Yang
- MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos (Poster) | Xuehai He · Weixi Feng · Kaizhi Zheng · Yujie Lu · Wanrong Zhu · Jiachen Li · Yue Fan · Jianfeng Wang · Linjie Li · Zhengyuan Yang · Kevin Lin · William Yang Wang · Lijuan Wang · Xin Eric Wang
- RACCooN: Remove, Add, and Change Video Content with Auto-Generated Narratives (Poster) | Jaehong Yoon · Shoubin Yu · Mohit Bansal
- Click & Describe: Multimodal Grounding and Tracking for Aerial Objects (Poster) | Rupanjali Kukal · Jay Patravali · Fuxun Yu · Simranjit Singh · Nikolaos Karianakis · Rishi Madhok
- LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living (Poster) | Rajatsubhra Chakraborty · Arkaprava Sinha · Dominick Reilly · Manish Govind · Pu Wang · François Brémond · Srijan Das
- Matryoshka Multimodal Models (Poster) | Mu Cai · Jianwei Yang · Jianfeng Gao · Yong Jae Lee
- VIA: A Spatiotemporal Video Adaptation Framework for Global and Local Video Editing (Poster) | Jing Gu · Yuwei Fang · Ivan Skorokhodov · Peter Wonka · Xinya Du · Sergey Tulyakov · Xin Eric Wang
- TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models (Oral) | Mu Cai · Reuben Tan · Jianrui Zhang · Bocheng Zou · Kai Zhang · Yao Feng · Fangrui Zhu · Jing Gu · Yiwu Zhong · Yuzhang Shang · Yao Dou · Jaden Park · Jianfeng Gao · Yong Jae Lee · Jianwei Yang
- Wolf: Captioning Everything with a World Summarization Framework (Oral) | Boyi Li · Ligeng Zhu · Ran Tian · Shuhan Tan · Yuxiao Chen · Yao Lu · Yin Cui · Sushant Veer · Max Ehrlich · Jonah Philion · Xinshuo Weng · Fuzhao Xue · Andrew Tao · Ming-Yu Liu · Sanja Fidler · Boris Ivanovic · Trevor Darrell · Jitendra Malik · Song Han · Marco Pavone
- CinePile: A Long Video Question Answering Dataset and Benchmark (Poster) | Ruchit Rawal · Khalid Saifullah · Ronen Basri · David Jacobs · Gowthami Somepalli · Tom Goldstein
- Generative Timelines for Instructed Visual Assembly (Poster) | Alejandro Pardo · Jui-Hsien Wang · Bernard Ghanem · Josef Sivic · Bryan Russell · Fabian Caba
- GUI-WORLD: A GUI-oriented Video Dataset for Multimodal LLM-based Agents (Poster) | Dongping Chen · Yue Huang · Siyuan Wu · Jingyu Tang · Huichi Zhou · Qihui Zhang · Zhigang He · Yilin Bai · Gao Chujie · Liuyi Chen · Yiqiang Li · Chenlong Wang · Yue Yu · Tianshuo Zhou · Zhen Li · Yi Gui · Yao Wan · Pan Zhou · Jianfeng Gao · Lichao Sun
- VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding (Poster) | Ahmad Mahmood · Ashmal Vayani · Muhammad Muzammal Naseer · Salman Khan · Fahad Shahbaz Khan
- Mobile OS Task Procedure Extraction from YouTube (Poster) | Yunseok Jang · Yeda Song · Sungryull Sohn · Lajanugen Logeswaran · Tiange Luo · Honglak Lee
- HiMemFormer: Hierarchical Memory-Aware Transformer for Multi-Agent Action Anticipation (Poster) | Zirui Wang · Xinran Zhao · Simon Stepputtis · Woojun Kim · Tongshuang Wu · Katia Sycara · Yaqi Xie
- Too many frames, not all useful: Efficient Strategies for Long-Form Video QA (Poster) | Jongwoo Park · Kanchana Ranasinghe · Kumara Kahatapitiya · Wonjeong Ryoo · Donghyun Kim · Michael S Ryoo
- Quo Vadis, Video Understanding with Vision-Language Foundation Models? (Poster) | Mahmoud Ali · Di Yang · Arkaprava Sinha · Dominick Reilly · Srijan Das · Gianpiero Francesca · François Brémond
- IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning (Poster) | Soeun Lee · Si-Woo Kim · Taewhan Kim · Dong-Jin Kim
- Can Video Large Language Models Comprehend Language in Videos? (Poster) | Minjoon Jung · Junbin Xiao · Byoung-Tak Zhang · Angela Yao
- TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation (Oral) | Hritik Bansal · Yonatan Bitton · Michal Yarom · Idan Szpektor · Aditya Grover · Kai-Wei Chang
- Language Repository for Long Video Understanding (Poster) | Kumara Kahatapitiya · Kanchana Ranasinghe · Jongwoo Park · Michael S Ryoo
- MuMA-ToM: Multi-modal Multi-Agent Theory of Mind (Poster) | Haojun Shi · Suyu Ye · Xinyu Fang · Chuanyang Jin · Leyla Isik · Yen-Ling Kuo · Tianmin Shu
- Taskverse: A Benchmark Generation Engine for Multi-modal Language Model (Oral) | Jieyu Zhang · Weikai Huang · Zixian Ma · Oscar Michel · Dong He · Tanmay Gupta · Wei-Chiu Ma · Ali Farhadi · Aniruddha Kembhavi · Ranjay Krishna