

Poster+Demo Session in Workshop: Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation

Decoding Strategy with Perceptual Rating Prediction for Language Model-Based Text-to-Speech Synthesis

Kazuki Yamauchi · Wataru Nakata · Yuki Saito · Hiroshi Saruwatari

Sat 14 Dec 10:30 a.m. PST — noon PST

Abstract:

Recently, text-to-speech (TTS) synthesis models that use language models (LMs) to autoregressively generate discrete speech tokens, such as neural audio codec tokens, have gained attention. They successfully improve the diversity and expressiveness of synthetic speech while addressing repetitive generation issues by incorporating sampling-based decoding strategies. However, sampling randomness can lead to undesirable outputs, such as artifacts, and destabilize the quality of synthetic speech. To address this issue, we propose BOK-PRP, a novel sampling-based decoding strategy for LM-based TTS. Our strategy incorporates a best-of-K (BOK) selection process based on perceptual rating prediction (PRP), filtering out undesirable outputs while maintaining output diversity. Importantly, the perceptual rating predictor is trained on human ratings independently of the TTS models, allowing BOK-PRP to be applied to various pre-trained LM-based TTS models without requiring additional TTS training. Results from subjective evaluations demonstrate that BOK-PRP significantly improves the naturalness of synthetic speech.
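
The best-of-K selection described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `best_of_k`, `toy_sampler`, and `toy_rating` are hypothetical names introduced here, the sampler stands in for sampling-based decoding from an LM-based TTS model, and the rating function is a toy heuristic standing in for a perceptual rating predictor trained on human ratings. Only the control flow (draw K candidates, score each, keep the highest-rated one) reflects what the abstract states.

```python
import random
from typing import Callable, List, Sequence, Tuple


def best_of_k(
    text: str,
    sample_fn: Callable[[str, random.Random], Sequence[int]],
    rating_fn: Callable[[Sequence[int]], float],
    k: int = 4,
    seed: int = 0,
) -> Tuple[Sequence[int], float]:
    """Draw k candidates for `text` and return the highest-rated one with its score.

    sample_fn and rating_fn are hypothetical stand-ins for the TTS sampler
    and the perceptual rating predictor; neither is retrained here.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = random.Random(seed)
    # Step 1: sample k candidate speech-token sequences independently.
    candidates = [sample_fn(text, rng) for _ in range(k)]
    # Step 2: predict a perceptual rating for every candidate.
    scores = [rating_fn(c) for c in candidates]
    # Step 3: keep the candidate with the highest predicted rating.
    best = max(range(k), key=scores.__getitem__)
    return candidates[best], scores[best]


def toy_sampler(text: str, rng: random.Random) -> List[int]:
    """Assumption: one random token id (0..7) per input character."""
    return [rng.randrange(8) for _ in text]


def toy_rating(tokens: Sequence[int]) -> float:
    """Assumption: penalise adjacent repeated tokens as a crude quality proxy."""
    repeats = sum(a == b for a, b in zip(tokens, tokens[1:]))
    return float(-repeats)


if __name__ == "__main__":
    tokens, score = best_of_k("hello world", toy_sampler, toy_rating, k=8, seed=0)
    print(tokens, score)
```

Because the selection step only needs candidate outputs and a scoring function, this sketch treats the sampler as a black box, which mirrors the abstract's point that the strategy applies to pre-trained models without additional TTS training.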
