Poster in Workshop: Third Workshop on Efficient Natural Language and Speech Processing (ENLSP-III): Towards the Future of Large Language Models and their Emerging Descendants
Less is More! A slim architecture, optimal for language tasks
Luca Herranz-Celotti · Ermal Rrapaj
Softmax attention has emerged as a noteworthy development in Deep Learning, underpinning the success of Transformer-based architectures. However, the ever-increasing size of these models demands ever more computational memory, which limits their use. We propose QgV, a sigmoid gate that significantly boosts performance without increasing architecture size. We also leverage Tensor Chains to identify and prune excess parameters, and find that this excess resides primarily in the embedding layer rather than in the output linear layer. To further improve performance and reduce parameters, we introduce H-SoftPOS, a hierarchical embedding layer. Remarkably, on the WMT14 English-German validation set our approach yields a threefold reduction in perplexity, surpassing the current state of the art, while also reducing the parameter count by a factor of 3. Even when we reduce the number of parameters up to sevenfold, we still achieve a 21% decrease in perplexity with respect to the baseline Transformer. To test generalization, we conduct experiments on the 7 language pairs of the WMT17 dataset. Our model, Anthe, outperforms existing techniques in terms of test loss while halving the number of parameters, and it shows a 70-fold reduction in variance with respect to the prior state of the art. In conclusion, our proposed method yields significant performance improvements at lower memory cost.
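To give a concrete picture of the kind of gating QgV refers to, below is a minimal NumPy sketch of scaled dot-product attention whose output is modulated by a parameter-free sigmoid gate computed from the queries. The placement and exact form of the gate are assumptions made for illustration; the abstract only states that QgV is a sigmoid gate added without increasing architecture size.

# Minimal NumPy sketch (an assumption, not the paper's exact formulation) of a
# sigmoid gate in the spirit of QgV: standard scaled dot-product attention
# whose attended values are modulated elementwise by sigmoid(Q), adding no
# new parameters.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def qgv_attention(Q, K, V):
    """Attention output gated by a sigmoid of the queries (illustrative only)."""
    d = Q.shape[-1]
    attended = softmax(Q @ K.T / np.sqrt(d)) @ V  # standard softmax attention
    return sigmoid(Q) * attended                  # parameter-free sigmoid gate

# Toy usage with random inputs of a small model dimension.
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
Q, K, V = (rng.normal(size=(seq_len, d_model)) for _ in range(3))
print(qgv_attention(Q, K, V).shape)  # (4, 8)

In the full architecture such a gate would sit inside each attention head of the Transformer; the sketch is written for a single head to keep it short.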