Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation

Workshop

Anurag Kumar · Zhaoheng Ni · Shinji Watanabe · Wenwu Wang · Yapeng Tian · Berrak Sisman

West Meeting Room 114, 115

Sat 14 Dec, 8:15 a.m. PST

Generative AI has been at the forefront of AI research in recent years. A large body of work across modalities (e.g., text, image, and audio) has shown remarkable generation capabilities. Audio generation brings its own unique challenges, and this workshop aims to highlight these challenges and their solutions. It will bring together researchers working on different audio generation problems and enable concentrated discussion of the topic. The workshop will include invited talks, high-quality papers presented through oral and poster sessions, and a panel discussion with experts in the area to further enrich the discussion of audio generation research. A crucial part of audio generation research is how humans perceive the generated audio. To support this, we also propose an onsite demo session during the workshop where presenters can showcase their audio generation methods and technologies, leading to a unique experience for all workshop participants.

Timezone: America/Los_Angeles

Schedule

Sat 8:15 a.m. - 8:30 a.m.	Welcome and opening remarks (Opening)
Sat 8:30 a.m. - 9:00 a.m.	Alexis Conneau (Invited Talk)
Sat 9:00 a.m. - 9:30 a.m.	Joon Son Chung (Invited Talk)
Sat 9:30 a.m. - 9:45 a.m.	Improving Musical Accompaniment Co-creation via Diffusion Transformers (Oral)	Javier Nistal · Marco Pasini · Stefan Lattner
Sat 9:45 a.m. - 10:00 a.m.	AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation (Oral)	Kai Wang · Shijian Deng · Jing Shi · Dimitrios Hatzinakos · Yapeng Tian
Sat 10:00 a.m. - 10:15 a.m.	AudioSetCaps: Enriched Audio Captioning Dataset Generation Using Large Audio Language Models (Oral)	Jisheng Bai · Haohe Liu · Mou Wang · Dongyuan Shi · Wenwu Wang · Mark Plumbley · Woon-Seng Gan · Jianfeng Chen
Sat 10:15 a.m. - 10:30 a.m.	Short Break
Sat 10:30 a.m. - 12:00 p.m.	VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment (Poster+Demo Session)	Bing Han · Long Zhou · Shujie Liu · Sanyuan Chen · Lingwei Meng · Yanmin Qian · Eric Liu · Sheng Zhao · Jinyu Li · Furu Wei
Sat 10:30 a.m. - 12:00 p.m.	Decoding Musical Perception: Music Stimuli Reconstruction from Brain Activity (Poster+Demo Session)	Matteo Ciferri · Matteo Ferrante · Nicola Toschi
Sat 10:30 a.m. - 12:00 p.m.	Neural Audio Codec for Latent Music Representations (Poster+Demo Session)	Luca Lanzendörfer · Florian Grötschla · Amir Dellali · Roger Wattenhofer
Sat 10:30 a.m. - 12:00 p.m.	Do music LLMs learn symbolic concepts? A pilot study using probing and intervention (Poster+Demo Session)	Wenye Ma · Xinyue Li · Gus Xia
Sat 10:30 a.m. - 12:00 p.m.	Improving Voice Quality in Speech Anonymization With Just Perception-Informed Losses (Poster+Demo Session)	Suhita Ghosh · Frank Dreyer · Tim Thiele · Frederic Lorbeer · Sebastian Stober
Sat 10:30 a.m. - 12:00 p.m.	Contextual Speech Emotion Recognition with Large Language Models and ASR-Based Transcriptions (Poster+Demo Session)	Enshi Zhang · Christian Poellabauer
Sat 10:30 a.m. - 12:00 p.m.	What do MLLMs hear? Examining the interaction between LLM and audio encoder components in Multimodal Large Language Models (Poster+Demo Session)	Enis Çoban · Michael Mandel · Johanna Devaney
Sat 10:30 a.m. - 12:00 p.m.	Articulatory Synthesis of Speech and Diverse Vocal Sounds via Optimization (Poster+Demo Session)	Luke Mo · Manuel Cherep · Nikhil Singh · Quinn Langford · Patricia Maes
Sat 10:30 a.m. - 12:00 p.m.	Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM (Poster+Demo Session)	Robin Shing-Hei Yuen · Timothy Tse · Jian Zhu
Sat 10:30 a.m. - 12:00 p.m.	A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation (Poster+Demo Session)	Alexander Liu · Qirui Wang · Yuan Gong · Jim Glass
Sat 10:30 a.m. - 12:00 p.m.	Continuous Autoregressive Models with Noise Augmentation Avoid Error Accumulation (Poster+Demo Session)	Marco Pasini · Javier Nistal · Stefan Lattner · George Fazekas
Sat 10:30 a.m. - 12:00 p.m.	One-shot Text-aligned Virtual Instrument Generation Utilizing Diffusion Transformer (Poster+Demo Session)	Qihui Yang · Jiahe Lei · Qiuqiang Kong
Sat 10:30 a.m. - 12:00 p.m.	Three-modal guidance for symbolic music generation: melody, structure, texture (Poster+Demo Session)	Daniel Lucht · David Leins · Dimitri von Rütte · Alexandra Moringen
Sat 10:30 a.m. - 12:00 p.m.	Parrot: Autoregressive Spoken Dialogue Language Modeling with Decoder-only Transformers (Poster+Demo Session)	Ziqiao Meng · Qichao Wang · Wenqian Cui · Yifei Zhang · Bingzhe Wu · Irwin King · Liang Chen · Peilin Zhao
Sat 10:30 a.m. - 12:00 p.m.	Challenge on Sound Scene Synthesis: Evaluating Text-to-Audio Generation (Poster+Demo Session)	Junwon Lee · Modan Tailleur · Laurie Heller · Keunwoo Choi · Mathieu Lagrange · Brian McFee · Keisuke Imoto · Yuki Okamoto
Sat 10:30 a.m. - 12:00 p.m.	High Fidelity Text-Guided Music Editing via Single-Stage Flow Matching (Poster+Demo Session)	Gael Le Lan · Bowen Shi · Zhaoheng Ni · Sidd Srinivasan · Anurag Kumar · Brian Ellis · David Kant · Varun Nagaraja · Ernie Chang · Wei-Ning Hsu · Yangyang Shi · Vikas Chandra
Sat 10:30 a.m. - 12:00 p.m.	SNAC: Multi-Scale Neural Audio Codec (Poster+Demo Session)	Hubert Siuzdak · Florian Grötschla · Luca Lanzendörfer
Sat 10:30 a.m. - 12:00 p.m.	Text Prompt is Not Enough: Sound Event Enhanced Prompt Adapter for Target Style Audio Generation (Poster+Demo Session)	Chenxu Xiong · Ruibo Fu · Shuchen Shi · Zhengqi Wen · Tao Wang · Chenxing Li · Chunyu Qiang · Yuankun Xie · XinQi · Guanjun Li · Zizheng Yang
Sat 10:30 a.m. - 12:00 p.m.	Latent Diffusion Model for Audio: Generation, Quality Enhancement, and Neural Audio Codec (Poster+Demo Session)	Haohe Liu · Wenwu Wang · Mark Plumbley
Sat 10:30 a.m. - 12:00 p.m.	3D Audio-Visual Segmentation (Poster+Demo Session)	Artem Sokolov · Swapnil Bhosale · Xiatian Zhu
Sat 10:30 a.m. - 12:00 p.m.	Decoding Strategy with Perceptual Rating Prediction for Language Model-Based Text-to-Speech Synthesis (Poster+Demo Session)	Kazuki Yamauchi · Wataru Nakata · Yuki Saito · Hiroshi Saruwatari
Sat 10:30 a.m. - 12:00 p.m.	Efficient Generative Multimodal Integration (EGMI): Enabling Audio Generation from Text-Image Pairs through Alignment with Large Language Models (Poster+Demo Session)	Taemin Kim · Wooyeol Baek · Heeseok Oh
Sat 10:30 a.m. - 12:00 p.m.	MusicScore: A Dataset for Music Score Modeling and Generation (Poster+Demo Session)	Yuheng Lin · Zheqi Dai · Qiuqiang Kong
Sat 10:30 a.m. - 12:00 p.m.	Improving Musical Accompaniment Co-creation via Diffusion Transformers (Poster+Demo Session)
Sat 12:00 p.m. - 1:30 p.m.	Lunch Break
Sat 1:30 p.m. - 1:45 p.m.	LOCKEY: A Novel Approach to Model Authentication and Deepfake Tracking (Oral)	Mayank Kumar Singh · Naoya Takahashi · Wei-Hsiang Liao · Yuki Mitsufuji
Sat 1:45 p.m. - 2:00 p.m.	BLAP: Bootstrapping Language-Audio Pre-training for Music Captioning (Oral)	Luca Lanzendörfer · Constantin Pinkl · Nathanael Perraudin · Roger Wattenhofer
Sat 2:00 p.m. - 2:15 p.m.	Improving Source Extraction with Diffusion and Consistency Models (Oral)	Tornike Karchkhadze · Mohammad Rasool Izadi · Shuo Zhang
Sat 2:15 p.m. - 2:45 p.m.	Yao Xie (Invited Talk)
Sat 2:45 p.m. - 3:15 p.m.	Vikas Chandra (Invited Talk)
Sat 3:15 p.m. - 3:30 p.m.	Short Break
Sat 3:30 p.m. - 4:00 p.m.	Panel Discussion
Sat 4:00 p.m. - 4:15 p.m.	Closing Remarks
Sat 4:15 p.m. - 5:30 p.m.	Text-to-Audio Generation via Bridging Audio Language Model and Latent Diffusion (Poster+Demo Session)	Zhenyu Wang · Chenxing Li · Yong Xu · Chunlei Zhang · John H. L. Hansen · Dong Yu
Sat 4:15 p.m. - 5:30 p.m.	Diffusion-based Speech Enhancement: Demonstration of Performance and Generalization (Poster+Demo Session)	Julius Richter · Timo Gerkmann
Sat 4:15 p.m. - 5:30 p.m.	Contrastive Lyrics Alignment with a Timestamp-Informed Loss (Poster+Demo Session)	Timon Kick · Florian Grötschla · Luca Lanzendörfer · Roger Wattenhofer
Sat 4:15 p.m. - 5:30 p.m.	Generating Vocals from Lyrics and Musical Accompaniment (Poster+Demo Session)	Georg Streich · Luca Lanzendörfer · Florian Grötschla · Roger Wattenhofer
Sat 4:15 p.m. - 5:30 p.m.	Sound-VECaps: Improving Audio Generation With Visual Enhanced Captions (Poster+Demo Session)	Yi Yuan · Dongya Jia · Xiaobin Zhuang · Yuanzhe Chen · Zhengxi Liu · Zhuo Chen · Wang Yuping · Yuxuan Wang · Xubo Liu · Xiyuan Kang · Mark Plumbley · Wenwu Wang
Sat 4:15 p.m. - 5:30 p.m.	DGFM: Full Body Dance Generation Driven by Music Foundation Models (Poster+Demo Session)	Xinran Liu · Zhenhua Feng · Diptesh Kanojia · Wenwu Wang
Sat 4:15 p.m. - 5:30 p.m.	MLADDC: Multi-Lingual Audio Deepfake Detection Corpus (Poster+Demo Session)	Arth Shah · Ravindrakumar M. Purohit · Dharmendra Vaghera · Hemant Patil
Sat 4:15 p.m. - 5:30 p.m.	Multi-Source Music Generation with Latent Diffusion (Poster+Demo Session)	Zhongweiyang Xu · Debottam Dutta · Yu-Lin Wei · Romit Roy Choudhury
Sat 4:15 p.m. - 5:30 p.m.	Disentangling Multi-instrument Music Audio for Source-level Pitch and Timbre Manipulation (Poster+Demo Session)	Yin-Jyun Luo · Kin Wai Cheuk · Woosung Choi · Wei-Hsiang Liao · Keisuke Toyama · Toshimitsu Uesaka · Koichi Saito · Chieh-Hsin Lai · Yuhta Takida · Simon Dixon · Yuki Mitsufuji
Sat 4:15 p.m. - 5:30 p.m.	Spatially-Aware Losses for Enhanced Neural Acoustic Fields (Poster+Demo Session)	Christopher Ick · Gordon Wichern · Yoshiki Masuyama · François Germain · Jonathan Le Roux
Sat 4:15 p.m. - 5:30 p.m.	DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech (Poster+Demo Session)	Jan Melechovsky · Ambuj Mehrish · Berrak Sisman · Dorien Herremans
Sat 4:15 p.m. - 5:30 p.m.	Style Mixture of Experts for Expressive Text-To-Speech Synthesis (Poster+Demo Session)	Ahad Jawaid · Shreeram Suresh Chandra · Junchen Lu · Berrak Sisman
Sat 4:15 p.m. - 5:30 p.m.	Vision Language Models Are Few-Shot Audio Spectrogram Classifiers (Poster+Demo Session)	Satvik Dixit · Laurie Heller · Chris Donahue
Sat 4:15 p.m. - 5:30 p.m.	SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation (Poster+Demo Session)	Koichi Saito · Dongjun Kim · Takashi Shibuya · Chieh-Hsin Lai · Zhi Zhong · Yuhta Takida · Yuki Mitsufuji
Sat 4:15 p.m. - 5:30 p.m.	FSD: Acoustic Echo Cancellation with Fewer Step Diffusion (Poster+Demo Session)	Yang Liu · Li Wan · Yiteng Huang · Ming Sun · Changsheng Zhao · Zhaoheng Ni · Xinhao Mei · Yangyang Shi · Florian Metze
Sat 4:15 p.m. - 5:30 p.m.	Towards Temporally Synchronized Visually Indicated Sounds Through Scale-Adapted Positional Embeddings (Poster+Demo Session)	Xinhao Mei · Gael Le Lan · Haohe Liu · Zhaoheng Ni · Varun Nagaraja · Anurag Kumar · Yangyang Shi · Vikas Chandra
Sat 4:15 p.m. - 5:30 p.m.	LoVA: Long-form Video-to-Audio Generation (Poster+Demo Session)	Xin Cheng · Xihua Wang · Yihan Wu · Yuyue Wang · Ruihua Song
Sat 4:15 p.m. - 5:30 p.m.	Coarse-to-Fine Text-to-Music Latent Diffusion (Poster+Demo Session)	Luca Lanzendörfer · Tongyu Lu · Nathanael Perraudin · Dorien Herremans · Roger Wattenhofer
Sat 4:15 p.m. - 5:30 p.m.	Benchmarking Music Generation Models and Metrics via Human Preference Studies (Poster+Demo Session)	Ahmet Solak · Florian Grötschla · Luca Lanzendörfer · Roger Wattenhofer
Sat 4:15 p.m. - 5:30 p.m.	AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation (Poster+Demo Session)
Sat 4:15 p.m. - 5:30 p.m.	AudioSetCaps: Enriched Audio Captioning Dataset Generation Using Large Audio Language Models (Poster+Demo Session)
Sat 4:15 p.m. - 5:30 p.m.	LOCKEY: A Novel Approach to Model Authentication and Deepfake Tracking (Poster+Demo Session)
Sat 4:15 p.m. - 5:30 p.m.	BLAP: Bootstrapping Language-Audio Pre-training for Music Captioning (Poster+Demo Session)
Sat 4:15 p.m. - 5:30 p.m.	Improving Source Extraction with Diffusion and Consistency Models (Poster+Demo Session)