Spotlight Poster
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
Jay Shah · Ganesh Bikshandi · Ying Zhang · Vijay Thakkar · Pradeep Ramani · Tri Dao
Wed 11 Dec, 11 a.m. to 2 p.m. PST
Abstract:
Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. FlashAttention introduced an approach to speed up attention on GPUs by minimizing memory reads and writes. However, it fails to take advantage of new capabilities present in recent hardware, with FlashAttention-2 achieving only 35\% utilization on the H100 GPU. We develop three main techniques to speed up attention on Hopper GPUs: exploiting the asynchrony of the Tensor Cores and TMA to (1) overlap overall computation and data movement via warp-specialization and (2) interleave block-wise matmul and softmax operations, and (3) using block quantization and incoherent processing to leverage hardware support for FP8 low-precision. We demonstrate that our method, FlashAttention-3, achieves a 1.5-2.0$\times$ speedup on H100 GPUs with FP16 (65\% utilization), and with FP8 reaches up to 1.2 PFLOPs/s. We validate that FP8 FlashAttention-3 achieves 2.6$\times$ lower numerical error than a baseline FP8 attention.
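The abstract names incoherent processing without describing it, so the following is only a toy sketch of the idea under stated assumptions. It assumes that incoherent processing means multiplying the query and key by the same random orthogonal matrix before quantization, built here as a sign-flipped, normalized Hadamard transform. It also uses a uniform integer grid with one scale per vector as a stand-in for FP8 with a per-block scale. All function names are hypothetical and this is not the FlashAttention-3 kernel.

```python
import math
import random


def hadamard_transform(vec):
    """Normalized fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def incoherent(vec, signs):
    """Apply the orthogonal map M = H * diag(signs) to vec."""
    return hadamard_transform([s * x for s, x in zip(signs, vec)])


def quantize(vec, levels=127):
    """Symmetric uniform quantization with one scale per vector.

    Assumption: this integer grid only stands in for FP8 with a per-block scale.
    """
    amax = max(abs(x) for x in vec) or 1.0
    step = amax / levels
    return [round(x / step) * step for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_errors(trials=200, d=64, outlier=50.0, seed=0):
    """Mean absolute error of the quantized q.k score, without and with the transform."""
    rng = random.Random(seed)
    plain_total = 0.0
    incoherent_total = 0.0
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d)]
        q[0] = outlier  # one large-magnitude feature in q
        k[1] = outlier  # and one in k, at a different position
        signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]
        exact = dot(q, k)
        plain_total += abs(dot(quantize(q), quantize(k)) - exact)
        qm, km = incoherent(q, signs), incoherent(k, signs)
        incoherent_total += abs(dot(quantize(qm), quantize(km)) - exact)
    return plain_total / trials, incoherent_total / trials


plain_err, incoherent_err = mean_errors()
print(f"plain: {plain_err:.4f}  with incoherent processing: {incoherent_err:.4f}")
```

Because the map is orthogonal, the attention score q·k is unchanged in exact arithmetic. The transform spreads each outlier across all features, so the shared scale wastes fewer quantization levels on a single large entry. This toy setup is not expected to reproduce the 2.6$\times$ figure reported in the abstract.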