

Poster in Workshop: Fine-Tuning in Modern Machine Learning: Principles and Scalability

Addax: Resource-Efficient Fine-Tuning of Language Models with a Combination of Forward-Backward and Forward-Only Passes

Zeman Li · Xinwei Zhang · Peilin Zhong · Yuan Deng · Vahab Mirrokni · Meisam Razaviyayn


Abstract: Fine-tuning language models (LMs) with the standard Adam optimizer often demands excessive memory, limiting accessibility. As a solution, the Memory-Efficient Zeroth-order Optimizer (MeZO) was recently introduced by Malladi et al. While MeZO uses less memory, it suffers from slow convergence and a loss in performance. We introduce a novel method, called Addax, that integrates MeZO with "in-place" Stochastic Gradient Descent (SGD). Addax obtains zeroth-order and first-order gradient estimates and combines them as the update direction in each step. Theoretically, we establish the convergence of Addax under mild assumptions, demonstrating faster convergence and less restrictive hyper-parameter choices than MeZO. Our extensive experiments with diverse LMs and tasks show that Addax consistently outperforms zero-shot and MeZO in terms of accuracy and time, while having a comparable memory footprint to MeZO. In particular, our experiments using one H100 GPU on the OPT-13B model reveal that, on average, Addax outperforms MeZO in terms of accuracy/F1 score by $10\%$ and runs $20\times$ faster, with a comparable memory footprint. We also developed an even more memory-efficient version of Addax, called Addax-P, for fine-tuning larger models. In our experiments on the larger OPT-30B model, on average, Addax-P outperforms MeZO in terms of accuracy/F1 score by more than $10\%$ and runs $30\times$ faster, with both running on a single H100 GPU. Moreover, Addax-P surpasses the performance of standard fine-tuning approaches, such as SGD and Adam, in most tasks in terms of accuracy/F1 score while requiring significantly less memory.
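
The idea of combining a zeroth-order and a first-order gradient estimate in one update can be illustrated with a minimal sketch on a toy problem. Everything below is an assumption made for illustration and is not taken from the abstract: the linear least-squares loss standing in for an LM loss, the two-point random-perturbation estimator, the mixing weight `alpha`, the convex-combination form of the update, and all function names. The sketch also does not model the memory savings of in-place updates; it only shows how the two estimates might be mixed into one direction.

```python
import numpy as np

def loss(theta, X, y):
    """Toy loss: half mean squared error of a linear model."""
    r = X @ theta - y
    return 0.5 * np.mean(r ** 2)

def first_order_grad(theta, X, y):
    """Exact gradient of the toy loss (stands in for a forward-backward pass)."""
    return X.T @ (X @ theta - y) / len(y)

def zeroth_order_grad(theta, X, y, eps, rng):
    """Two-point estimate using only two loss evaluations (forward-only passes)."""
    z = rng.standard_normal(theta.shape)  # random perturbation direction
    scale = (loss(theta + eps * z, X, y) - loss(theta - eps * z, X, y)) / (2 * eps)
    return scale * z

def combined_step(theta, batch_fo, batch_zo, lr, alpha, eps, rng):
    """One update that mixes both estimates (assumed form: convex combination)."""
    g1 = first_order_grad(theta, *batch_fo)
    g0 = zeroth_order_grad(theta, *batch_zo, eps, rng)
    return theta - lr * (alpha * g0 + (1 - alpha) * g1)

def run_demo(steps=500, seed=0):
    """Run the mixed update on synthetic data; return (initial, final) loss."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((256, 10))
    y = X @ rng.standard_normal(10)  # noiseless targets
    theta = np.zeros(10)
    initial = loss(theta, X, y)
    for _ in range(steps):
        # Separate mini-batches for the two estimates (an illustrative choice).
        i_fo = rng.choice(256, size=16, replace=False)
        i_zo = rng.choice(256, size=16, replace=False)
        theta = combined_step(theta, (X[i_fo], y[i_fo]), (X[i_zo], y[i_zo]),
                              lr=0.05, alpha=0.1, eps=1e-3, rng=rng)
    return initial, loss(theta, X, y)

if __name__ == "__main__":
    start, end = run_demo()
    print(f"loss before: {start:.4f}, loss after: {end:.6f}")
```

In this sketch the zeroth-order term needs only loss evaluations, so it never stores activations for backpropagation, while the first-order term supplies a lower-variance direction. How the two are weighted and which data each estimate sees are design choices that the sketch fixes arbitrarily.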
