

Poster in Workshop: Optimization for ML Workshop

ACCO: Accumulate while you Communicate, Hiding Communications in Distributed LLM Training

Adel Nabli · Louis Fournier · Pierre ERBACHER · Louis Serrano · Eugene Belilovsky · Edouard Oyallon


Abstract: Training Large Language Models (LLMs) relies heavily on distributed implementations, employing multiple GPUs to compute stochastic gradients on model replicas in parallel. However, synchronizing gradients in data-parallel settings induces a communication overhead that increases with the number of distributed workers, impeding the efficiency gains of parallelization. To address this challenge, local optimization algorithms such as the ones used in Federated Learning have emerged. While effective in minimizing communication overhead, they incur significant memory costs that hinder scalability: in addition to extra momentum variables, optimizer states cannot be partitioned among workers, since communications are only allowed between rounds of local optimization steps. To conceal communication costs, we propose instead to synchronize delayed gradients *while* computing new ones between successive model updates. Accumulating local gradients on the workers until the communication finishes naturally reduces the idle time of GPUs and even allows the use of heterogeneous hardware. However, we show that the one-step delay inherent in the parallel execution of gradient computations and communications has drastic impacts on Transformers' convergence. To compensate for this delay, we introduce a novel technique, $\textbf{AC}$cumulate while $\textbf{CO}$mmunicate ($\textbf{ACCO}$), a memory-efficient optimization algorithm tailored for the distributed training of LLMs whose training dynamics are aligned with those of standard distributed optimization. Compared to ZeRO, our implementation and experiments on several LLM pre-training and fine-tuning tasks demonstrate that $\textbf{ACCO}$ reduces the learning time by up to 87\% and successfully allows both sharding optimizer states across workers and the use of heterogeneous hardware.
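
The core mechanism described above is to overlap gradient synchronization with gradient computation: while the all-reduce of the previous step's gradients is in flight, each worker keeps computing and accumulating new local gradients, and the model is updated with the delayed, synchronized gradients once the communication completes. The sketch below illustrates only this overlap and the resulting one-step delay, not the paper's $\textbf{ACCO}$ algorithm itself (which additionally compensates for the delay and shards optimizer states); the function name, the plain SGD update, and the flat `comm_buf` buffer are illustrative assumptions.

```python
import torch
import torch.distributed as dist


def train_overlapped(model, batches, lr, world_size):
    """Minimal sketch, assuming dist.init_process_group() has been called:
    synchronize the previous step's gradients while computing new ones."""
    pending = None    # handle of the in-flight asynchronous all-reduce
    comm_buf = None   # flattened gradients currently being synchronized

    for batch in batches:
        # Local gradient computation for the current step; it overlaps with
        # the all-reduce launched at the end of the previous iteration.
        loss = model(batch).mean()
        loss.backward()

        if pending is not None:
            # The previous step's gradients are now synchronized: average
            # them and apply a (one-step-delayed) update to the model.
            pending.wait()
            comm_buf /= world_size
            offset = 0
            with torch.no_grad():
                for p in model.parameters():
                    n = p.numel()
                    # Plain SGD update with the delayed gradient (for
                    # illustration only; any optimizer could be used here).
                    p -= lr * comm_buf[offset:offset + n].view_as(p)
                    offset += n

        # Snapshot the freshly accumulated local gradients and start their
        # all-reduce asynchronously; the next iteration computes meanwhile.
        comm_buf = torch.cat([p.grad.detach().flatten()
                              for p in model.parameters()])
        pending = dist.all_reduce(comm_buf, op=dist.ReduceOp.SUM,
                                  async_op=True)
        model.zero_grad(set_to_none=True)
```

In this sketch the update applied at step $t$ uses gradients computed at step $t-1$, which is exactly the delay the abstract identifies as harmful for Transformer convergence and that $\textbf{ACCO}$ is designed to compensate for.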
