Poster
NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention
Tianyi Zhang · Jonah Yi · Bowen Yao · Zhaozhuo Xu · Anshumali Shrivastava
East Exhibit Hall A-C #2011
Fri 13 Dec, 4:30 p.m. – 7:30 p.m. PST
Abstract:
Large Language Model (LLM) inference on Central Processing Units (CPUs) is challenging due to the vast quantities of Multiply-Add (MAD) matrix operations in the attention computations. This paper highlights a rare gem in modern CPUs, Single-Instruction-Multiple-Data (SIMD) registers, which allow for ultra-low-latency lookups in a batch. We leverage this unique capability to propose NoMAD-Attention, an efficient attention algorithm that replaces MAD operations with in-register lookups. Through hardware-aware algorithmic designs, NoMAD-Attention achieves the computation of attention scores using repeated fast accesses to SIMD registers. NoMAD-Attention works with pre-trained attention-based LLMs without model finetuning. Extensive empirical evaluations demonstrate that NoMAD-Attention maintains the quality of the original LLMs well and speeds up the 4-bit quantized LLaMA-7B-based model by up to 2× at 16k context length.
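To make the "in-register lookup" idea concrete, below is a minimal sketch (not the authors' code) of how a SIMD shuffle instruction can replace multiply-adds when computing attention scores: keys are assumed to be pre-quantized into 4-bit sub-codes (product-quantization style), a query-dependent lookup table of 16 int8 partial scores is held inside a 128-bit register, and a single `_mm_shuffle_epi8` (pshufb) performs 16 table lookups at once. All sizes and names here are illustrative assumptions, not the paper's API.

```cpp
// Sketch: computing partial attention scores for 16 keys with one in-register
// lookup instruction instead of multiply-adds. Requires SSSE3 (compile with -mssse3).
#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main() {
    // Query-dependent lookup table for one sub-quantizer: 16 entries, one per
    // possible 4-bit key code, quantized to int8 so it fits in a single register.
    alignas(16) int8_t lut[16];
    for (int c = 0; c < 16; ++c) lut[c] = (int8_t)(c - 8);   // toy values

    // 4-bit codes of 16 keys (one code per byte, values 0..15).
    alignas(16) uint8_t codes[16];
    for (int i = 0; i < 16; ++i) codes[i] = (uint8_t)(i % 16);

    __m128i vlut   = _mm_load_si128((const __m128i*)lut);
    __m128i vcodes = _mm_load_si128((const __m128i*)codes);

    // One pshufb = 16 parallel in-register lookups: partial scores for 16 keys,
    // with no multiply-add instructions involved.
    __m128i vscores = _mm_shuffle_epi8(vlut, vcodes);

    alignas(16) int8_t scores[16];
    _mm_store_si128((__m128i*)scores, vscores);
    for (int i = 0; i < 16; ++i)
        printf("key %2d: partial score %d\n", i, scores[i]);
    return 0;
}
```

In a full attention kernel, the per-sub-quantizer partial scores would be accumulated across sub-quantizers to approximate each query–key dot product; the sketch above only illustrates the single-instruction batched lookup that the abstract refers to.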