Poster+Demo Session
in
Workshop: Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation

Contextual Speech Emotion Recognition with Large Language Models and ASR-Based Transcriptions

Enshi Zhang · Christian Poellabauer

Sat 14 Dec 10:30 a.m. PST — noon PST

Abstract:

Speech Emotion Recognition (SER) is the task of automatically identifying emotions expressed in spoken language. With the rise of large language models (LLMs), many studies have applied them to SER, but several key challenges remain. Current approaches often focus on isolated utterances, overlooking the rich contextual information present in conversations and the dynamic nature of emotions. Additionally, most methods rely on transcripts from a single Automatic Speech Recognition (ASR) model, neglecting the variability in word error rates (WER) across different ASR systems. Furthermore, the optimal length of conversational context and the impact of prompt structure on SER performance have not been sufficiently explored. To tackle these challenges, we design models that take ASR transcripts from multiple sources as input and integrate custom prompts with varying context window lengths. Empirical evaluations demonstrate that our method outperforms state-of-the-art techniques on the IEMOCAP and MELD datasets, highlighting the importance of utilizing conversational context and ASR diversity in SER tasks. All code from our experiments is publicly available.
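The abstract does not specify the authors' prompt format, so the following is only a minimal sketch of the general idea of prompting an LLM with a configurable conversational context window. The function name `build_contextual_prompt`, the field names, and the emotion label set are hypothetical; the utterance texts would come from whichever ASR system (e.g., one of several with different WERs) produced the transcript.

```python
from typing import Dict, List, Sequence

def build_contextual_prompt(
    utterances: List[Dict[str, str]],
    target_index: int,
    context_window: int = 4,
    labels: Sequence[str] = ("angry", "happy", "neutral", "sad"),
) -> str:
    """Assemble an emotion-classification prompt for one target utterance,
    prepending up to `context_window` preceding conversational turns.

    Each utterance is a dict with hypothetical keys "speaker" and "text";
    in practice "text" would be an ASR transcript, possibly from one of
    several ASR systems.
    """
    start = max(0, target_index - context_window)
    context_lines = [
        f"{u['speaker']}: {u['text']}" for u in utterances[start:target_index]
    ]
    target = utterances[target_index]
    return (
        "You are an emotion recognition assistant.\n"
        "Conversation context:\n"
        + ("\n".join(context_lines) if context_lines else "(none)")
        + "\n\nClassify the emotion of the next utterance as one of "
        + ", ".join(labels)
        + ".\n"
        + f"{target['speaker']}: {target['text']}\nEmotion:"
    )
```

Varying `context_window` here corresponds to the context-length dimension the abstract says is under-explored: a window of 0 reduces to the isolated-utterance setting, while larger windows expose more of the conversation's emotional dynamics to the LLM.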
