

Poster in Workshop: Fine-Tuning in Modern Machine Learning: Principles and Scalability

GaLore-mini: Low Rank Gradient Learning with Fewer Learning Rates

WH Huang · Zhenyu Zhang · Yushun Zhang · Zhiquan Luo · Ruoyu Sun · Zhangyang "Atlas" Wang


Abstract:

Training large language models (LLMs) requires significant memory, mainly because of their vast number of model weights and the associated optimizer states. In this work, we introduce a new memory-efficient optimizer called GaLore-mini. GaLore-mini exploits the inherent low-rank structure of weight gradients and the Hessian structure of transformers. These two properties significantly reduce memory overhead: optimizer states are maintained in a low-rank format, and parameters are grouped into blocks that each share a single learning rate. However, directly combining these two strategies can easily lead to significant training instabilities. We explore several possible combinations and propose a strategy that stabilizes training. GaLore-mini reduces memory by 40% compared to GaLore and 81% compared to AdamW, while maintaining performance on par with vanilla AdamW. Experiments on LLaMA models ranging from 130M to 1B parameters demonstrate the effectiveness of GaLore-mini.
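To make the two memory-saving ideas concrete, below is a minimal, hypothetical sketch that combines a GaLore-style low-rank projection of the gradient (so Adam-style states live in a rank-r subspace) with an Adam-mini-style single second-moment scalar shared per parameter block. The abstract does not specify GaLore-mini's actual update rule or its stabilization strategy, so the class name `LowRankBlockAdam` and knobs such as `rank` and `update_proj_gap` are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: low-rank optimizer states + one shared second-moment
# scalar per block. Not the authors' GaLore-mini implementation.
import torch


class LowRankBlockAdam(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 rank=4, update_proj_gap=200):
        defaults = dict(lr=lr, betas=betas, eps=eps,
                        rank=rank, update_proj_gap=update_proj_gap)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None or p.grad.dim() != 2:
                    continue  # sketch: handle only 2-D weight matrices
                g = p.grad
                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["proj"] = None
                    # First moment kept in the low-rank subspace (m x r).
                    state["exp_avg"] = torch.zeros(
                        g.shape[0], group["rank"], device=g.device, dtype=g.dtype)
                    # One shared second-moment scalar per block (Adam-mini style).
                    state["exp_avg_sq"] = torch.zeros((), device=g.device, dtype=g.dtype)
                state["step"] += 1

                # Periodically refresh the projection from the gradient's top-r
                # right singular vectors (GaLore-style subspace tracking).
                if state["proj"] is None or state["step"] % group["update_proj_gap"] == 1:
                    _, _, vh = torch.linalg.svd(g, full_matrices=False)
                    state["proj"] = vh[: group["rank"]].T  # (n x r)

                g_low = g @ state["proj"]  # project gradient to (m x r)

                # Adam moments: per-coordinate first moment in the subspace,
                # a single block-shared second-moment scalar.
                state["exp_avg"].mul_(beta1).add_(g_low, alpha=1 - beta1)
                state["exp_avg_sq"].mul_(beta2).add_(g_low.pow(2).mean(), alpha=1 - beta2)

                bias1 = 1 - beta1 ** state["step"]
                bias2 = 1 - beta2 ** state["step"]
                denom = (state["exp_avg_sq"] / bias2).sqrt().add_(group["eps"])
                update_low = (state["exp_avg"] / bias1) / denom

                # Project the low-rank update back to the full parameter space.
                p.add_(update_low @ state["proj"].T, alpha=-group["lr"])
```

In this sketch the per-parameter storage is only the rank-r first moment plus one scalar per block, which is where the memory savings relative to full AdamW would come from; how GaLore-mini actually chooses blocks, refreshes the subspace, and avoids the training instabilities mentioned in the abstract is left to the paper.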
