Poster in Workshop: MATH-AI: The 3rd Workshop on Mathematical Reasoning and AI
Temperature-scaled large language models for Lean proofstep prediction
Fabian Gloeckle · Baptiste Roziere · Amaury Hayat · Gabriel Synnaeve
Keywords: [ theorem proving ] [ overfitting ] [ large language models ] [ finetuning ] [ Lean ] [ interactive theorem proving ]
Leveraging the reasoning capabilities of large language models (LLMs) for theorem proving is a promising but challenging task because it requires in-domain finetuning, during which LLMs are known to be prone to overfitting. This issue is exacerbated by two properties that set theorem proving apart from more mainstream applications of LLMs: training data in formal environments like Lean or Isabelle is very scarce, and evaluation benchmarks are too costly to be run extensively for hyperparameter search and model selection. In this work, we propose temperature scaling as a regularization method for multi-epoch training on small datasets. We give a heuristic explanation of its theoretical purpose and demonstrate its effectiveness empirically, obtaining state-of-the-art supervised tactic generation models for Lean 3 with 1.5B, 7B, and 13B parameters. Model selection based on temperature-scaled perplexity increases scores on theorem proving benchmarks by up to four percentage points. We provide detailed ablations and analyses of the proof search behavior of the resulting models, allowing practitioners to pick optimal model sizes for their respective use cases.
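As an illustration of the general idea (not the authors' implementation), the sketch below shows one way to divide the logits by a fixed temperature tau in the finetuning cross-entropy loss and to compute a temperature-scaled perplexity as a model-selection criterion. The value of tau, the function names, and the Hugging Face-style model interface are placeholder assumptions.

```python
import torch
import torch.nn.functional as F


def temperature_scaled_loss(logits: torch.Tensor, targets: torch.Tensor,
                            tau: float = 2.0) -> torch.Tensor:
    """Cross-entropy on logits divided by a fixed temperature tau.

    logits: (num_tokens, vocab_size), targets: (num_tokens,).
    tau = 1.0 recovers the standard loss; 2.0 is a placeholder value.
    """
    return F.cross_entropy(logits / tau, targets)


@torch.no_grad()
def temperature_scaled_perplexity(model, token_ids: torch.Tensor,
                                  tau: float = 2.0) -> torch.Tensor:
    """Perplexity under the same temperature scaling, usable as a
    validation-set criterion for model selection.

    Assumes a Hugging Face-style causal LM whose forward pass returns
    an object with a .logits field of shape (batch, seq_len, vocab_size).
    """
    logits = model(token_ids).logits
    # Predict token t+1 from the prefix up to token t.
    shift_logits = logits[:, :-1, :] / tau
    shift_targets = token_ids[:, 1:]
    nll = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
    )
    return torch.exp(nll)
```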