Keynote Talk
in
Workshop: The Fourth Workshop on Efficient Natural Language and Speech Processing (ENLSP-IV): Highlighting New Architectures for Future Foundation Models
Speech generative modeling with little tokenization
Navdeep Jaitly
It is now well accepted that speech needs to be tokenized before it can be modeled with transformer-based generative models. In fact, there is a rich body of intricate work using semantic and other acoustic tokens for speech modeling. In this talk we show that tokenization may not be necessary and that, indeed, a simple way of discretizing Mel-spectrograms (which we call d-Mel) is enough to build generative models with transformers. We show how we can build conditional generative models of speech (text-to-speech) using d-Mel and transformer-based models. We also demonstrate that the same technique can be applied to multi-modal generation of speech conditioned on text and video. It is our hope that this leads to more exploration of minimal preprocessing of speech for use in generative modeling.
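To make the idea of discretizing a Mel-spectrogram concrete, here is a minimal sketch of one possible per-value quantization scheme: clip each log-Mel value to a fixed range and map it to one of a small number of uniform bins, so every frame becomes a vector of integers a transformer can consume. The bin count, value range, and function names are illustrative assumptions, not the exact d-Mel recipe from the talk.

```python
import numpy as np

def discretize_mel(log_mel: np.ndarray, num_bins: int = 16,
                   vmin: float = -10.0, vmax: float = 2.0) -> np.ndarray:
    """Map a log-Mel spectrogram (frames x channels) to integer codes.

    Each time-frequency value is clipped to [vmin, vmax] and quantized into
    num_bins uniform levels. (Illustrative parameters, not the talk's exact setup.)
    """
    clipped = np.clip(log_mel, vmin, vmax)
    scaled = (clipped - vmin) / (vmax - vmin)            # map to [0, 1]
    codes = np.floor(scaled * (num_bins - 1e-6)).astype(np.int64)
    return codes                                         # integers in [0, num_bins)

def undiscretize_mel(codes: np.ndarray, num_bins: int = 16,
                     vmin: float = -10.0, vmax: float = 2.0) -> np.ndarray:
    """Invert the quantization by mapping each code to its bin center."""
    centers = (codes.astype(np.float64) + 0.5) / num_bins
    return centers * (vmax - vmin) + vmin

# Toy usage with a random "log-Mel" of 100 frames and 80 channels.
log_mel = np.random.uniform(-10.0, 2.0, size=(100, 80))
codes = discretize_mel(log_mel)
recon = undiscretize_mel(codes)
print(codes.shape, codes.min(), codes.max())             # (100, 80) 0 15
print(np.abs(recon - log_mel).max())                     # error bounded by half a bin width
```

In this kind of scheme the "tokenizer" is just a fixed, invertible quantizer with no learned codebook, which is the sense in which the approach uses little tokenization.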