NeurIPS IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs

Poster in Workshop: Third Workshop on Efficient Natural Language and Speech Processing (ENLSP-III): Towards the Future of Large Language Models and their Emerging Descendants

Yuzhen Mao · Martin Ester · Ke Li

Abstract:

One limitation of existing transformer-based models is that they cannot handle very long sequences as input since their self-attention operations exhibit quadratic time and space complexity. This problem becomes especially acute when transformers are deployed on hardware platforms equipped only with CPUs. To address this issue, we propose a novel method for accelerating self-attention at inference time that works with pretrained transformer models out-of-the-box without requiring retraining. We experiment using our method to accelerate various long-sequence transformers on various benchmarks and demonstrate a greater speedup compared to the baselines.
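
To make the quadratic cost mentioned in the abstract concrete, here is a minimal sketch of standard softmax self-attention in plain Python. It is the ordinary baseline computation, not the acceleration method proposed in the paper, and the function name `naive_self_attention` and the returned score counter are assumptions introduced purely for illustration.

```python
import math

def naive_self_attention(queries, keys, values):
    """Standard single-head softmax attention (illustrative baseline only).

    queries, keys, values: lists of n vectors (lists of floats).
    Returns (outputs, num_scores), where num_scores counts the
    query-key scores that were computed.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    outputs = []
    num_scores = 0
    for q in queries:
        # One row of the n-by-n score matrix: this query against every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        num_scores += len(scores)
        # Numerically stable softmax over the row.
        row_max = max(scores)
        weights = [math.exp(s - row_max) for s in scores]
        total = sum(weights)
        # Output is the softmax-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) / total
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs, num_scores

# Doubling the sequence length quadruples the number of scores.
short = [[float(i), 1.0] for i in range(8)]
long = [[float(i), 1.0] for i in range(16)]
_, short_count = naive_self_attention(short, short, short)
_, long_count = naive_self_attention(long, long, long)
print(short_count, long_count)
```

This sketch processes one row of scores at a time, so it only illustrates the quadratic number of operations. Typical implementations materialize the full n-by-n score matrix at once, which is where the quadratic space cost comes from.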
