Poster in Workshop: Fine-Tuning in Modern Machine Learning: Principles and Scalability
Entropic Distribution Matching for Supervised Fine-tuning of LLMs: Less Overfitting and Better Diversity
Ziniu Li · Congliang Chen · Tian Xu · Zeyu Qin · Jiancong Xiao · Ruoyu Sun · Zhiquan Luo
Sat 14 Dec 8:50 a.m. PST — 5:30 p.m. PST
Large language models rely on Supervised Fine-Tuning (SFT) to specialize in downstream tasks, and Cross-Entropy (CE) loss is the de facto choice in SFT. However, CE often results in overfitting and limited output diversity due to its aggressive distribution matching strategy, which forces the model's generative distribution to closely mimic the empirical data distribution. This paper aims to address these issues by introducing the maximum entropy principle, encouraging models to resist overfitting while preserving output diversity. Specifically, we develop a new distribution matching method, called GEM, which solves reverse Kullback-Leibler divergence minimization with an entropy regularizer. For the SFT of Llama-3-8B models, GEM outperforms CE in several respects. First, when applied to acquire general instruction-following abilities, GEM exhibits reduced overfitting, as evidenced by lower perplexity and better performance on the IFEval benchmark. Second, this advantage also holds in domain-specific fine-tuning, where GEM continues to outperform CE on specialized math reasoning and code generation tasks. Finally, we show that GEM-tuned models offer better output diversity, which helps scale up test-time compute: with the same sampling budget, they achieve gains of up to 10 points on math reasoning and code generation tasks compared with CE-tuned models.
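To make the objective concrete, here is a minimal sketch of an entropy-regularized reverse-KL loss at the level of a single next-token distribution. This illustrates the general technique named in the abstract, not the paper's exact GEM algorithm: the function name `entropic_reverse_kl_loss` and the weight `beta` are hypothetical, and the paper's sequence-level formulation may differ.

```python
import torch
import torch.nn.functional as F

def entropic_reverse_kl_loss(model_logits: torch.Tensor,
                             target_logits: torch.Tensor,
                             beta: float = 0.1) -> torch.Tensor:
    """Entropy-regularized reverse-KL loss over a vocabulary distribution.

    model_logits:  (batch, vocab) logits from the model being fine-tuned
    target_logits: (batch, vocab) logits defining the target distribution
    beta:          hypothetical entropy-regularization weight (assumption)
    """
    log_p = F.log_softmax(model_logits, dim=-1)    # log p_model
    log_q = F.log_softmax(target_logits, dim=-1)   # log p_target
    p = log_p.exp()
    # Reverse KL: KL(p_model || p_target) = E_{p_model}[log p_model - log p_target]
    reverse_kl = (p * (log_p - log_q)).sum(dim=-1)
    # The -beta * H(p_model) term rewards high entropy, discouraging the
    # model from collapsing onto the empirical data distribution
    # (less overfitting, more output diversity).
    entropy = -(p * log_p).sum(dim=-1)
    return (reverse_kl - beta * entropy).mean()

# Example usage with random logits (vocab size 32 for illustration):
if __name__ == "__main__":
    model_logits = torch.randn(4, 32)
    target_logits = torch.randn(4, 32)
    print(entropic_reverse_kl_loss(model_logits, target_logits).item())
```

Setting `beta = 0` recovers plain reverse-KL minimization; a larger `beta` trades fidelity to the target distribution for higher-entropy, more diverse outputs.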