

Poster in Workshop: The Fourth Workshop on Efficient Natural Language and Speech Processing (ENLSP-IV): Highlighting New Architectures for Future Foundation Models

Different Rates for Different Weights: Decoupled Relative Learning Rate Schedules

Jan Ludziejewski · Jan Małaśnicki · Maciej Pióro · Michał Krutul · Kamil Ciebiera · Jakub Krajewski · Marek Cygan · Kamil Adamczewski · Sebastian Jaszczur

Keywords: [ Efficient Training ]


Abstract:

In this work, we introduce a novel approach to optimizing neural network training by assigning different learning rates to the weights of different components in Transformer models. Traditional methods often apply a uniform learning rate across all network layers, potentially overlooking the distinct dynamics of each part. Remarkably, our Relative Learning Rate Schedules (RLRS) method accelerates training by up to 23%, particularly in complex models such as Mixture of Experts (MoE). The hyperparameters of RLRS can be efficiently tuned on smaller models and then extrapolated to models 27 times larger. This simple and effective method substantially reduces training time and computational resources, offering a practical and scalable solution for optimizing large-scale neural networks.
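The paper's implementation is not reproduced here; the following is a minimal sketch of the general idea, assuming a PyTorch-style training setup: each Transformer component (embedding, attention, feed-forward, output head) receives its own multiplier on a shared base learning rate and schedule, so the relative ratios between components are preserved throughout training. The component names and multiplier values are illustrative assumptions, not values from the paper.

```python
# Minimal sketch (not the authors' implementation): per-component relative
# learning rates layered on a shared base schedule, in PyTorch.
import torch
import torch.nn as nn

class TinyTransformerBlock(nn.Module):
    """A small Transformer-style module used only to produce named parameters."""
    def __init__(self, d_model=64, n_heads=4, d_ff=256, vocab=1000):
        super().__init__()
        self.embedding = nn.Embedding(vocab, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Linear(d_ff, d_model))
        self.head = nn.Linear(d_model, vocab)

    def forward(self, x):
        h = self.embedding(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        return self.head(h + self.ff(h + a))

model = TinyTransformerBlock()

# Hypothetical relative multipliers per component; in the paper these are
# tuned on a small model and then reused at much larger scale.
relative_lr = {"embedding": 2.0, "attn": 1.0, "ff": 1.0, "head": 0.5}

def component_of(param_name: str) -> str:
    # Map a parameter name to its component bucket by substring match.
    for key in relative_lr:
        if key in param_name:
            return key
    return "ff"  # fallback bucket for anything unmatched

base_lr = 3e-4
groups = {}
for name, p in model.named_parameters():
    groups.setdefault(component_of(name), []).append(p)

# One optimizer parameter group per component, each with its own scaled LR.
optimizer = torch.optim.AdamW(
    [{"params": ps, "lr": base_lr * relative_lr[c]} for c, ps in groups.items()]
)

# A single global schedule (cosine here) scales every group proportionally,
# so the relative learning rates between components stay fixed.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)
```

In this sketch the scheduler multiplies every group's initial learning rate by the same time-dependent factor, which is one straightforward way to keep the per-component ratios "decoupled" from the shape of the base schedule.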
