Oral
in
Workshop: Evaluating Evaluations: Examining Best Practices for Measuring Broader Impacts of Generative AI
Cascaded to End-to-End: New Safety, Security, and Evaluation Questions for Audio Language Models
Luxi He · Xiangyu Qi · Inyoung Cheong · Prateek Mittal · Danqi Chen · Peter Henderson
Keywords: [ Multimodal ] [ Safety ] [ Audio-Language Model ] [ Evaluation Problems ] [ Security ]
Abstract:
Many large language models (LLMs) now process audio inputs alongside text (Audio LMs). The technical novelty of Audio LMs lies in their shift from a $\textit{cascaded}$ pipeline, which first transcribes audio to text, to an $\textit{end-to-end}$ design that processes audio input directly, as it would text, capturing rich audio features. In this perspective paper, we underscore novel safety and security risks introduced by including rich paralinguistic information in this new paradigm. We highlight tensions and gaps in current end-to-end Audio LM evaluation protocols. For example, some major benchmarks reward the ability to identify sensitive features from audio, including gender, age, and emotion. Open-source and closed-source models are also beginning to diverge in their evaluation goals. We hope that our work spurs a re-alignment in open-source Audio LM safety, security, and capability evaluations.
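The cascaded-versus-end-to-end distinction the abstract draws can be sketched as follows. This is a minimal, hypothetical illustration (not the paper's code): audio is mocked as a dictionary carrying both spoken words and paralinguistic cues, and all function names are invented for the sketch.

```python
# Hypothetical sketch: a cascaded pipeline drops paralinguistic cues at the
# transcription step, while an end-to-end Audio LM consumes them directly.

def transcribe(audio):
    """Cascaded step 1: ASR keeps only the words, discarding paralinguistics."""
    return audio["words"]

def text_lm(text):
    """Cascaded step 2: a text-only LLM sees just the transcript."""
    return {"input_seen": text}

def audio_lm(audio):
    """End-to-end: the model consumes the full audio representation,
    including sensitive features such as gender, age, and emotion."""
    return {"input_seen": audio}

# Mock audio input carrying sensitive paralinguistic features.
audio = {"words": "I'm fine.", "emotion": "distressed", "age": "minor"}

cascaded = text_lm(transcribe(audio))
end_to_end = audio_lm(audio)

print("emotion" in str(cascaded))    # → False: the cascade never sees it
print("emotion" in str(end_to_end))  # → True: the end-to-end model does
```

The sketch makes concrete why the end-to-end paradigm raises the new evaluation questions the paper discusses: sensitive features that a transcription step would have stripped now reach the model.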