Workshop
Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation
Anurag Kumar · Zhaoheng Ni · Shinji Watanabe · Wenwu Wang · Yapeng Tian · Berrak Sisman
West Meeting Room 114, 115
Sat 14 Dec, 8:15 a.m. PST
Generative AI has been at the forefront of AI research in recent years. A large body of work across different modalities (e.g., text, image, and audio) has shown remarkable generation capabilities. Audio generation brings its own unique challenges, and this workshop aims to highlight these challenges and their solutions. It will bring together researchers working on different audio generation problems and enable a focused discussion of the topic. The workshop will include invited talks, high-quality papers presented through oral and poster sessions, and a panel discussion with experts in the area to further enrich the discussion of audio generation research. A crucial part of audio generation research is how humans perceive the generated audio. To support this, we also propose to have an onsite demo session during the workshop where presenters can showcase their audio generation methods and technologies, leading to a unique experience for all workshop participants.
Schedule
Sat 8:15 a.m. - 8:30 a.m.
|
Welcome and opening remarks
(
Opening
)
>
SlidesLive Video |
🔗 |
Sat 8:30 a.m. - 9:00 a.m.
|
Alexis Conneau
(
Invited Talk
)
>
SlidesLive Video |
Alexis Conneau 🔗 |
Sat 9:00 a.m. - 9:30 a.m.
|
Joon Son Chung
(
Invited Talk
)
>
SlidesLive Video |
Joon Son Chung 🔗 |
Sat 9:30 a.m. - 9:45 a.m.
|
Improving Musical Accompaniment Co-creation via Diffusion Transformers
(
Oral
)
>
link
SlidesLive Video |
Javier Nistal · Marco Pasini · Stefan Lattner 🔗 |
Sat 9:45 a.m. - 10:00 a.m.
|
AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation
(
Oral
)
>
link
SlidesLive Video |
Kai Wang · Shijian Deng · Jing Shi · Dimitrios Hatzinakos · Yapeng Tian 🔗 |
Sat 10:00 a.m. - 10:15 a.m.
|
AudioSetCaps: Enriched Audio Captioning Dataset Generation Using Large Audio Language Models
(
Oral
)
>
link
SlidesLive Video |
Jisheng Bai · Haohe Liu · Mou Wang · Dongyuan Shi · Wenwu Wang · Mark Plumbley · Woon-Seng Gan · Jianfeng Chen 🔗 |
Sat 10:15 a.m. - 10:30 a.m.
|
Short Break
(
Short Break
)
>
|
🔗 |
Sat 10:30 a.m. - 12:00 p.m.
|
VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Bing Han · Long Zhou · Shujie LIU · Sanyuan Chen · Lingwei Meng · Yanmin Qian · Eric Liu · sheng zhao · Jinyu Li · Furu Wei 🔗 |
Sat 10:30 a.m. - 12:00 p.m.
|
Decoding Musical Perception: Music Stimuli Reconstruction from Brain Activity
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Matteo Ciferri · Matteo Ferrante · Nicola Toschi 🔗 |
Sat 10:30 a.m. - 12:00 p.m.
|
Neural Audio Codec for Latent Music Representations
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Luca Lanzendörfer · Florian Grötschla · Amir Dellali · Roger Wattenhofer 🔗 |
Sat 10:30 a.m. - 12:00 p.m.
|
Do music LLMs learn symbolic concepts? A pilot study using probing and intervention
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Wenye Ma · Xinyue Li · Gus Xia 🔗 |
Sat 10:30 a.m. - 12:00 p.m.
|
Improving Voice Quality in Speech Anonymization With Just Perception-Informed Losses
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Suhita Ghosh · Frank Dreyer · Tim Thiele · Frederic Lorbeer · Sebastian Stober 🔗 |
Sat 10:30 a.m. - 12:00 p.m.
|
Contextual Speech Emotion Recognition with Large Language Models and ASR-Based Transcriptions
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Enshi Zhang · Christian Poellabauer 🔗 |
Sat 10:30 a.m. - 12:00 p.m.
|
What do MLLMs hear? Examining the interaction between LLM and audio encoder components in Multimodal Large Language Models
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Enis Çoban · Michael Mandel · Johanna Devaney 🔗 |
Sat 10:30 a.m. - 12:00 p.m.
|
Articulatory Synthesis of Speech and Diverse Vocal Sounds via Optimization ( Poster+Demo Session ) > link | Luke Mo · Manuel Cherep · Nikhil Singh · Quinn Langford · Patricia Maes 🔗 |
Sat 10:30 a.m. - 12:00 p.m.
|
Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Robin Shing-Hei Yuen · Timothy Tse · Jian Zhu 🔗 |
Sat 10:30 a.m. - 12:00 p.m.
|
A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Alexander Liu · Qirui Wang · Yuan Gong · Jim Glass 🔗 |
Sat 10:30 a.m. - 12:00 p.m.
|
Continuous Autoregressive Models with Noise Augmentation Avoid Error Accumulation
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Marco Pasini · Javier Nistal · Stefan Lattner · George Fazekas 🔗 |
Sat 10:30 a.m. - 12:00 p.m.
|
One-shot Text-aligned Virtual Instrument Generation Utilizing Diffusion Transformer ( Poster+Demo Session ) > link | Qihui Yang · Jiahe Lei · Qiuqiang Kong 🔗 |
Sat 10:30 a.m. - 12:00 p.m.
|
Three-modal guidance for symbolic music generation: melody, structure, texture ( Poster+Demo Session ) > link | Daniel Lucht · David Leins · Dimitri von Rütte · Alexandra Moringen 🔗 |
Sat 10:30 a.m. - 12:00 p.m.
|
Parrot: Autoregressive Spoken Dialogue Language Modeling with Decoder-only Transformers
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Ziqiao Meng · Qichao Wang · Wenqian Cui · Yifei Zhang · Bingzhe Wu · Irwin King · Liang Chen · Peilin Zhao 🔗 |
Sat 10:30 a.m. - 12:00 p.m.
|
Challenge on Sound Scene Synthesis: Evaluating Text-to-Audio Generation
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Junwon Lee · Modan Tailleur · Laurie Heller · Keunwoo Choi · Mathieu Lagrange · Brian McFee · Keisuke Imoto · Yuki Okamoto 🔗 |
Sat 10:30 a.m. - 12:00 p.m.
|
High Fidelity Text-Guided Music Editing via Single-Stage Flow Matching
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Gael Le Lan · Bowen Shi · Zhaoheng Ni · Sidd Srinivasan · Anurag Kumar · Brian Ellis · David Kant · Varun Nagaraja · Ernie Chang · Wei-Ning Hsu · Yangyang Shi · Vikas Chandra |
Sat 10:30 a.m. - 12:00 p.m.
|
SNAC: Multi-Scale Neural Audio Codec
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Hubert Siuzdak · Florian Grötschla · Luca Lanzendörfer 🔗 |
Sat 10:30 a.m. - 12:00 p.m.
|
Text Prompt is Not Enough: Sound Event Enhanced Prompt Adapter for Target Style Audio Generation
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Chenxu Xiong · Ruibo Fu · Shuchen Shi · Zhengqi Wen · Tao Wang · Chenxing Li · Chunyu Qiang · Yuankun Xie · XinQi · Guanjun Li · Zizheng Yang |
Sat 10:30 a.m. - 12:00 p.m.
|
Latent Diffusion Model for Audio: Generation, Quality Enhancement, and Neural Audio Codec
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Haohe Liu · Wenwu Wang · Mark Plumbley 🔗 |
Sat 10:30 a.m. - 12:00 p.m.
|
3D Audio-Visual Segmentation
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Artem Sokolov · Swapnil Bhosale · Xiatian Zhu 🔗 |
Sat 10:30 a.m. - 12:00 p.m.
|
Decoding Strategy with Perceptual Rating Prediction for Language Model-Based Text-to-Speech Synthesis
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Kazuki Yamauchi · Wataru Nakata · Yuki Saito · Hiroshi Saruwatari 🔗 |
Sat 10:30 a.m. - 12:00 p.m.
|
Efficient Generative Multimodal Integration (EGMI): Enabling Audio Generation from Text-Image Pairs through Alignment with Large Language Models
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Taemin Kim · Wooyeol Baek · Heeseok Oh 🔗 |
Sat 10:30 a.m. - 12:00 p.m.
|
MusicScore: A Dataset for Music Score Modeling and Generation ( Poster+Demo Session ) > link | Yuheng Lin · Zheqi DAI · Qiuqiang Kong 🔗 |
Sat 10:30 a.m. - 12:00 p.m.
|
Improving Musical Accompaniment Co-creation via Diffusion Transformers
(
Poster+Demo Session
)
>
|
🔗 |
Sat 12:00 p.m. - 1:30 p.m.
|
Lunch Break
(
Lunch Break
)
>
|
🔗 |
Sat 1:30 p.m. - 1:45 p.m.
|
LOCKEY: A Novel Approach to Model Authentication and Deepfake Tracking
(
Oral
)
>
link
SlidesLive Video |
Mayank Kumar Singh · Naoya Takahashi · Wei-Hsiang Liao · Yuki Mitsufuji 🔗 |
Sat 1:45 p.m. - 2:00 p.m.
|
BLAP: Bootstrapping Language-Audio Pre-training for Music Captioning
(
Oral
)
>
link
SlidesLive Video |
Luca Lanzendörfer · Constantin Pinkl · Nathanael Perraudin · Roger Wattenhofer 🔗 |
Sat 2:00 p.m. - 2:15 p.m.
|
Improving Source Extraction with Diffusion and Consistency Models
(
Oral
)
>
link
SlidesLive Video |
Tornike Karchkhadze · Mohammad Rasool Izadi · Shuo Zhang 🔗 |
Sat 2:15 p.m. - 2:45 p.m.
|
Yao Xie
(
Invited Talk
)
>
SlidesLive Video |
Yao Xie 🔗 |
Sat 2:45 p.m. - 3:15 p.m.
|
Vikas Chandra
(
Invited Talk
)
>
SlidesLive Video |
Vikas Chandra 🔗 |
Sat 3:15 p.m. - 3:30 p.m.
|
Short Break
(
Short Break
)
>
|
🔗 |
Sat 3:30 p.m. - 4:00 p.m.
|
Panel Discussion
(
Panel Discussion
)
>
SlidesLive Video |
🔗 |
Sat 4:00 p.m. - 4:15 p.m.
|
Closing Remarks
(
Closing Remarks
)
>
SlidesLive Video |
🔗 |
Sat 4:15 p.m. - 5:30 p.m.
|
Text-to-Audio Generation via Bridging Audio Language Model and Latent Diffusion
(
Poster+Demo Session
)
>
link
SlidesLive Video |
ZHENYU WANG · Chenxing Li · YONG XU · Chunlei Zhang · John H. L. Hansen · Dong Yu 🔗 |
Sat 4:15 p.m. - 5:30 p.m.
|
Diffusion-based Speech Enhancement: Demonstration of Performance and Generalization
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Julius Richter · Timo Gerkmann 🔗 |
Sat 4:15 p.m. - 5:30 p.m.
|
Contrastive Lyrics Alignment with a Timestamp-Informed Loss
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Timon Kick · Florian Grötschla · Luca Lanzendörfer · Roger Wattenhofer 🔗 |
Sat 4:15 p.m. - 5:30 p.m.
|
Generating Vocals from Lyrics and Musical Accompaniment
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Georg Streich · Luca Lanzendörfer · Florian Grötschla · Roger Wattenhofer 🔗 |
Sat 4:15 p.m. - 5:30 p.m.
|
Sound-VECaps: Improving Audio Generation With Visual Enhanced Captions
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Yi Yuan · Dongya Jia · Xiaobin Zhuang · Yuanzhe Chen · Zhengxi Liu · Zhuo Chen · Wang Yuping · Yuxuan Wang · Xubo Liu · Xiyuan Kang · Mark Plumbley · Wenwu Wang |
Sat 4:15 p.m. - 5:30 p.m.
|
DGFM: Full Body Dance Generation Driven by Music Foundation Models
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Xinran Liu · Zhenhua Feng · Diptesh Kanojia · Wenwu Wang 🔗 |
Sat 4:15 p.m. - 5:30 p.m.
|
MLADDC: Multi-Lingual Audio Deepfake Detection Corpus
(
Poster+Demo Session
)
>
link
SlidesLive Video |
ARTH SHAH · Ravindrakumar M. Purohit · Dharmendra Vaghera · Hemant Patil 🔗 |
Sat 4:15 p.m. - 5:30 p.m.
|
Multi-Source Music Generation with Latent Diffusion
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Zhongweiyang Xu · Debottam Dutta · Yu-Lin Wei · Romit Roy Choudhury 🔗 |
Sat 4:15 p.m. - 5:30 p.m.
|
Disentangling Multi-instrument Music Audio for Source-level Pitch and Timbre Manipulation
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Yin-Jyun Luo · Kin Wai Cheuk · Woosung Choi · Wei-Hsiang Liao · Keisuke Toyama · Toshimitsu Uesaka · Koichi Saito · Chieh-Hsin Lai · Yuhta Takida · Simon Dixon · Yuki Mitsufuji |
Sat 4:15 p.m. - 5:30 p.m.
|
Spatially-Aware Losses for Enhanced Neural Acoustic Fields
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Christopher Ick · Gordon Wichern · Yoshiki Masuyama · François Germain · Jonathan Le Roux 🔗 |
Sat 4:15 p.m. - 5:30 p.m.
|
DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Jan Melechovsky · Ambuj Mehrish · Berrak Sisman · Dorien Herremans 🔗 |
Sat 4:15 p.m. - 5:30 p.m.
|
Style Mixture of Experts for Expressive Text-To-Speech Synthesis
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Ahad Jawaid · Shreeram Suresh Chandra · Junchen Lu · Berrak Sisman 🔗 |
Sat 4:15 p.m. - 5:30 p.m.
|
Vision Language Models Are Few-Shot Audio Spectrogram Classifiers
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Satvik Dixit · Laurie Heller · Chris Donahue 🔗 |
Sat 4:15 p.m. - 5:30 p.m.
|
SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Koichi Saito · Dongjun Kim · Takashi Shibuya · Chieh-Hsin Lai · Zhi Zhong · Yuhta Takida · Yuki Mitsufuji 🔗 |
Sat 4:15 p.m. - 5:30 p.m.
|
FSD: Acoustic Echo Cancellation with Fewer Step Diffusion ( Poster+Demo Session ) > link | Yang Liu · Li Wan · Yiteng Huang · Ming Sun · Changsheng Zhao · Zhaoheng Ni · Xinhao Mei · Yangyang Shi · Florian Metze 🔗 |
Sat 4:15 p.m. - 5:30 p.m.
|
Towards Temporally Synchronized Visually Indicated Sounds Through Scale-Adapted Positional Embeddings ( Poster+Demo Session ) > link | Xinhao Mei · Gael Le Lan · Haohe Liu · Zhaoheng Ni · Varun Nagaraja · Anurag Kumar · Yangyang Shi · Vikas Chandra 🔗 |
Sat 4:15 p.m. - 5:30 p.m.
|
LoVA: Long-form Video-to-Audio Generation
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Xin Cheng · Xihua Wang · Yihan Wu · Yuyue Wang · Ruihua Song 🔗 |
Sat 4:15 p.m. - 5:30 p.m.
|
Coarse-to-Fine Text-to-Music Latent Diffusion
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Luca Lanzendörfer · Tongyu Lu · Nathanael Perraudin · Dorien Herremans · Roger Wattenhofer 🔗 |
Sat 4:15 p.m. - 5:30 p.m.
|
Benchmarking Music Generation Models and Metrics via Human Preference Studies
(
Poster+Demo Session
)
>
link
SlidesLive Video |
Ahmet Solak · Florian Grötschla · Luca Lanzendörfer · Roger Wattenhofer 🔗 |
Sat 4:15 p.m. - 5:30 p.m.
|
AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation
(
Poster+Demo Session
)
>
|
🔗 |
Sat 4:15 p.m. - 5:30 p.m.
|
AudioSetCaps: Enriched Audio Captioning Dataset Generation Using Large Audio Language Models
(
Poster+Demo Session
)
>
|
🔗 |
Sat 4:15 p.m. - 5:30 p.m.
|
LOCKEY: A Novel Approach to Model Authentication and Deepfake Tracking
(
Poster+Demo Session
)
>
|
🔗 |
Sat 4:15 p.m. - 5:30 p.m.
|
BLAP: Bootstrapping Language-Audio Pre-training for Music Captioning
(
Poster+Demo Session
)
>
|
🔗 |
Sat 4:15 p.m. - 5:30 p.m.
|
Improving Source Extraction with Diffusion and Consistency Models
(
Poster+Demo Session
)
>
|
🔗 |