Keynote Talk
in
Workshop: The Fourth Workshop on Efficient Natural Language and Speech Processing (ENLSP-IV): Highlighting New Architectures for Future Foundation Models
Speech generative modeling with little tokenization
Navdeep Jaitly
It is now well accepted that speech needs to be tokenized before it can be modeled with transformer-based generative models. In fact, there is a rich body of intricate work using semantic and other acoustic tokens for speech modeling. In this talk we show that tokenization may not be necessary and that, indeed, a simple way of discretizing Mel-spectrograms (which we call d-Mel) is enough to build generative models with transformers. We show how we can build conditional generative models of speech (text-to-speech) using d-Mel and transformer-based models. We also demonstrate that the same technique can be applied to multi-modal generation of speech conditioned on text and video. It is our hope that this leads to more exploration of minimal preprocessing of speech for use in generative modeling.
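To make the idea of discretizing a Mel-spectrogram concrete, here is a minimal sketch of one possible per-value quantization scheme: clip each log-Mel value to a fixed range and map it to one of a small number of uniform bins, so every frame becomes a vector of integers a transformer can consume. The bin count, value range, and function names are illustrative assumptions, not the exact d-Mel recipe from the talk.

```python
import numpy as np

def discretize_mel(log_mel: np.ndarray, num_bins: int = 16,
                   vmin: float = -10.0, vmax: float = 2.0) -> np.ndarray:
    """Map a log-Mel spectrogram (frames x channels) to integer codes.

    Each time-frequency value is clipped to [vmin, vmax] and quantized into
    num_bins uniform levels. (Illustrative parameters, not the talk's exact setup.)
    """
    clipped = np.clip(log_mel, vmin, vmax)
    scaled = (clipped - vmin) / (vmax - vmin)            # map to [0, 1]
    codes = np.floor(scaled * (num_bins - 1e-6)).astype(np.int64)
    return codes                                         # integers in [0, num_bins)

def undiscretize_mel(codes: np.ndarray, num_bins: int = 16,
                     vmin: float = -10.0, vmax: float = 2.0) -> np.ndarray:
    """Invert the quantization by mapping each code to its bin center."""
    centers = (codes.astype(np.float64) + 0.5) / num_bins
    return centers * (vmax - vmin) + vmin

# Toy usage with a random "log-Mel" of 100 frames and 80 channels.
log_mel = np.random.uniform(-10.0, 2.0, size=(100, 80))
codes = discretize_mel(log_mel)
recon = undiscretize_mel(codes)
print(codes.shape, codes.min(), codes.max())             # (100, 80) 0 15
print(np.abs(recon - log_mel).max())                     # error bounded by half a bin width
```

In this kind of scheme the "tokenizer" is just a fixed, invertible quantizer with no learned codebook, which is the sense in which the approach uses little tokenization.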