

Poster

MINI-SEQUENCE TRANSFORMER: Optimizing Intermediate Memory for Long Sequences Training

Cheng Luo · Jiawei Zhao · Zhuoming Chen · Beidi Chen · Animashree Anandkumar

[ Project Page ]
Fri 13 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract:

We introduce MINI-SEQUENCE TRANSFORMER (MsT), a simple and effective methodology for highly efficient and accurate LLM training with extremely long sequences. By partitioning the input sequence and iteratively processing mini-sequences, MsT reduces the intermediate memory used by activations. Integrated with activation recomputation, it enables significant memory savings in both the forward and backward passes, and it supports arbitrary sequence lengths. Experiments show no degradation in throughput or convergence with our method even at sequence lengths up to 4x longer. On the Llama3-8B model, MsT supports sequence lengths 12x longer than the standard implementation and 4x longer than activation recomputation, while maintaining equivalent throughput and convergence, and it is fully general and implementation-agnostic.
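The core idea can be sketched in a few lines of PyTorch: split the sequence dimension into M mini-sequences before a memory-heavy block (such as an MLP up-projection), process each mini-sequence in turn, and concatenate the results, so that only one mini-sequence's large intermediate activation is materialized at a time. The sketch below is illustrative only and is not the authors' implementation; the module and parameter names (MiniSequenceMLP, num_chunks) are assumptions.

```python
import torch
import torch.nn as nn


class MiniSequenceMLP(nn.Module):
    """Illustrative sketch: apply an MLP block over the sequence in chunks
    so the large intermediate activation (hidden_dim -> intermediate_dim)
    exists for only one mini-sequence at a time."""

    def __init__(self, hidden_dim: int, intermediate_dim: int, num_chunks: int = 4):
        super().__init__()
        self.up_proj = nn.Linear(hidden_dim, intermediate_dim)
        self.down_proj = nn.Linear(intermediate_dim, hidden_dim)
        self.act = nn.GELU()
        self.num_chunks = num_chunks  # number of mini-sequences per input

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_dim)
        outputs = []
        for chunk in x.chunk(self.num_chunks, dim=1):  # split along the sequence dim
            # Only this mini-sequence's (batch, seq_len / M, intermediate_dim)
            # activation is materialized at once.
            outputs.append(self.down_proj(self.act(self.up_proj(chunk))))
        return torch.cat(outputs, dim=1)


# Usage: the output matches the unchunked MLP, but peak intermediate
# memory in the forward pass is reduced by roughly the chunk factor.
if __name__ == "__main__":
    mlp = MiniSequenceMLP(hidden_dim=512, intermediate_dim=2048, num_chunks=4)
    x = torch.randn(2, 1024, 512)
    y = mlp(x)
    print(y.shape)  # torch.Size([2, 1024, 512])
```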
