Poster in Workshop: Fine-Tuning in Modern Machine Learning: Principles and Scalability
Memory retaining finetuning via distillation
Zitong Yang · Aonan Zhang · Sam Wiseman · Xiang Kong · Ke Ye · Dong Yin
Large language models (LLMs) pretrained on large corpora of internet text acquire substantial world knowledge. Following pretraining, one often needs to conduct continued pretraining on certain capabilities such as math and coding, or apply “posttraining” (a.k.a. alignment) techniques to make the models follow users’ instructions and align them with human preferences. One challenge during these finetuning stages is that the model can lose pretraining knowledge or forget certain capabilities (e.g., in-context learning ability). Moreover, although there exist strong open-weight LLMs such as Llama 3, neither their pretraining nor their posttraining data is open to the public, making it difficult to mix the finetuning data with the models’ own pretraining data as a way of mitigating forgetting. We propose label annealing, a method that mitigates forgetting during finetuning without requiring access to the original pretraining data. Label annealing distills pretraining knowledge during finetuning by adding a KL divergence term to the loss function, regularizing the divergence between the finetuned model’s predictions and those of the initial pretrained model. In mathematics and code finetuning, label annealing improves the model’s performance in target domains without sacrificing other capabilities of the pretrained model. In alignment finetuning, our method introduces a smooth tradeoff between instruction-following capability and pretraining knowledge. We complement our empirical investigation with a mathematical model based on overparameterized linear regression that provides geometric intuition for why label annealing helps.
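The KL-regularized loss described above can be illustrated with a minimal per-token sketch in plain Python. The abstract does not specify the direction of the KL term, how it is weighted, or whether the weight follows a schedule, so the choice of KL(pretrained || finetuned), the fixed `kl_weight`, and all function names here are assumptions made for illustration rather than the authors' implementation.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_regularized_loss(student_logits, teacher_logits, label, kl_weight=0.5):
    """Per-token loss: cross-entropy on the finetuning label plus a KL penalty.

    student_logits: logits of the model being finetuned.
    teacher_logits: logits of the frozen initial pretrained model.
    kl_weight: hypothetical tradeoff coefficient (assumed, not from the source).
    Returns (total, cross_entropy, kl).
    """
    log_p_student = log_softmax(student_logits)
    log_p_teacher = log_softmax(teacher_logits)
    # Standard finetuning objective: negative log-likelihood of the target token.
    cross_entropy = -log_p_student[label]
    # Assumed direction: KL(teacher || student), penalizing drift from the
    # pretrained model's predictive distribution.
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    return cross_entropy + kl_weight * kl, cross_entropy, kl

# Usage: the teacher logits come from a frozen copy of the pretrained model.
total, ce, kl = kl_regularized_loss([2.0, 0.5, -1.0], [1.0, 1.0, 0.0], label=0)
```

Under these assumptions, setting `kl_weight` to zero recovers ordinary finetuning, while larger values keep the finetuned model's predictions closer to the pretrained model's, which is one way to picture the smooth tradeoff mentioned in the abstract.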