Poster+Demo Session
Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation
Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM
Robin Shing-Hei Yuen · Timothy Tse · Jian Zhu
Current Speech LLMs are predominantly trained on extensive ASR and TTS datasets and excel at tasks in those domains. However, their ability to handle direct speech-to-speech conversation remains notably constrained. We find that Speech LLMs often rely on an ASR-to-TTS chain-of-thought pipeline (A-T-T-A chain) to generate good responses: the pipeline first transcribes the input speech into text and generates a text response before producing the speech response, which introduces significant latency. We propose a method that implicitly internalizes the ASR chain of thought into a Speech LLM (A-T-A chain), allowing it to bypass ASR transcript generation while maintaining its speech conversation capabilities. Our approach reduces latency and improves the model's native understanding of speech, paving the way for more efficient and natural real-time audio interactions. We also release a large-scale synthetic conversational dataset to facilitate further research.
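The internalization idea can be pictured with a minimal, hypothetical sketch: during fine-tuning, the ASR-transcript portion of the chain-of-thought target is removed a few tokens at a time, so the model gradually learns the direct A-T-A mapping from speech input to text and speech responses. All function names, token strings, and the linear removal schedule below are illustrative assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch of a stagewise internalization curriculum: at each
# stage we drop more leading ASR-transcript tokens from the supervision
# target, so the model learns to map speech straight to its responses
# (A-T-A) instead of first emitting the transcript (A-T-T-A).

def build_target(asr_tokens, text_response_tokens, speech_response_tokens,
                 n_removed):
    """Assemble the training target, hiding the first `n_removed` ASR tokens."""
    kept_asr = asr_tokens[n_removed:]  # shrinking explicit chain of thought
    return kept_asr + text_response_tokens + speech_response_tokens

def removal_schedule(step, tokens_per_stage=2, steps_per_stage=1000):
    """Illustrative linear schedule: remove a few more ASR tokens each stage."""
    return (step // steps_per_stage) * tokens_per_stage

# Toy example: by the final stage the target contains no ASR transcript,
# so inference can skip transcript generation entirely.
asr = ["how", "are", "you"]                 # hypothetical ASR transcript tokens
text = ["<txt>", "i", "am", "fine"]         # hypothetical text-response tokens
speech = ["<aud>", "u1", "u2", "u3"]        # hypothetical speech-unit tokens
for step in (0, 1000, 2000):
    n = min(removal_schedule(step), len(asr))
    print(step, build_target(asr, text, speech, n))
```

At the end of such a curriculum, the model emits the text and speech responses directly from the speech input, which is where the latency saving described in the abstract would come from.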