Spotlight Poster
Aligner Encoders: Self-Attention Transformers Can Be Self-Transducers
Adam Stooke · Rohit Prabhavalkar · Khe Sim · Pedro Moreno Mengibar
Modern systems for automatic speech recognition, including the RNN-Transducer (RNN-T) and Attention-based Encoder-Decoder (AED), are designed so that the encoder is not required to alter the time-position of information when mapping the audio sequence into the embedding sequence; alignment to the final text output is instead handled during decoding. We discover that the transformer-based encoder adopted in recent years is actually capable of performing the alignment internally during the forward pass. This new phenomenon enables simpler and more efficient models. To train, we discard the dynamic programming of RNN-T in favor of the frame-wise cross-entropy loss of AED; to decode, we employ the lighter, text-only recurrence of RNN-T: we simply scan consecutive embedding frames from the beginning, accessing one at a time and producing one token from each, until predicting the end-of-message token. We conduct experiments demonstrating performance remarkably close to the state of the art, including a special inference configuration enabling long-form recognition. Lastly, we find that the audio-text alignment is clearly visible in the self-attention weights of a certain layer, which could be said to perform "self-transduction".
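As a rough sketch of the decoding loop described in the abstract (not the authors' implementation), the following assumes hypothetical encoder, predictor, and joiner modules with RNN-T-style interfaces, plus a placeholder end-of-message token id eom_id; all names are illustrative assumptions:

import torch

@torch.no_grad()
def greedy_aligner_decode(encoder, predictor, joiner, audio, eom_id):
    """Frame-synchronous greedy decoding for an aligner-encoder (sketch).

    Since the encoder has already aligned audio to text positions,
    each embedding frame yields exactly one output token: read frame t,
    emit one token, advance to frame t + 1, stop at end-of-message.
    """
    frames = encoder(audio)                # (T, D) aligned embeddings
    tokens, state = [], None
    for t in range(frames.shape[0]):
        # Text-only recurrence, as in the RNN-T prediction network
        # (assumed interface: takes previous token and state).
        prev = tokens[-1] if tokens else None
        pred, state = predictor(prev, state)
        logits = joiner(frames[t], pred)   # combine frame t with text state
        tok = int(logits.argmax())
        if tok == eom_id:                  # end-of-message: stop scanning
            break
        tokens.append(tok)
    return tokens

Note the contrast with standard RNN-T decoding: because each frame yields exactly one token, there is no per-frame blank/emit decision or dynamic alignment search; the loop simply advances one frame per token until the end-of-message prediction.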