

Poster in Workshop: MATH-AI: The 3rd Workshop on Mathematical Reasoning and AI

Temperature-scaled large language models for Lean proofstep prediction

Fabian Gloeckle · Baptiste Roziere · Amaury Hayat · Gabriel Synnaeve

Keywords: [ theorem proving ] [ overfitting ] [ large language models ] [ finetuning ] [ Lean ] [ interactive theorem proving ]


Abstract:

Leveraging the reasoning capabilities of large language models (LLMs) for theorem proving is a promising but challenging task, because it requires in-domain finetuning, during which LLMs are known to be prone to overfitting. This issue is exacerbated by two properties that set theorem proving apart from more mainstream applications of LLMs: training data for formal environments such as Lean or Isabelle is very scarce, and evaluation benchmarks are too costly to run extensively for hyperparameter search and model selection. In this work, we propose temperature scaling as a regularization method for multi-epoch training on small datasets. We give a heuristic explanation of its theoretical purpose and demonstrate its effectiveness empirically, obtaining state-of-the-art supervised tactic-generation models for Lean 3 with 1.5B, 7B, and 13B parameters. Model selection based on temperature-scaled perplexity increases scores on theorem-proving benchmarks by up to four percentage points. We provide detailed ablations and analyses of the proof-search behavior of the resulting models, allowing practitioners to pick optimal model sizes for their respective use cases.
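A minimal sketch of the idea, assuming "temperature scaling" here means dividing the logits by a fixed temperature before the softmax when computing the training loss and the validation perplexity used for model selection; the temperature value of 2.0 and the toy tensor shapes are illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical illustration of temperature-scaled loss and perplexity.
# The temperature value and shapes are assumptions for demonstration only.
import torch
import torch.nn.functional as F


def temperature_scaled_loss(logits: torch.Tensor,
                            targets: torch.Tensor,
                            temperature: float = 2.0) -> torch.Tensor:
    """Cross-entropy on logits divided by a fixed temperature.

    logits:  (batch, seq_len, vocab_size) raw model outputs
    targets: (batch, seq_len) target token ids
    """
    scaled = logits / temperature  # soften the predictive distribution
    return F.cross_entropy(
        scaled.view(-1, scaled.size(-1)),
        targets.view(-1),
    )


def temperature_scaled_perplexity(logits: torch.Tensor,
                                  targets: torch.Tensor,
                                  temperature: float = 2.0) -> torch.Tensor:
    """Exponential of the temperature-scaled loss, usable as a
    model-selection metric on held-out proofstep data."""
    with torch.no_grad():
        return torch.exp(temperature_scaled_loss(logits, targets, temperature))


if __name__ == "__main__":
    # Toy tensors standing in for a real tactic-generation model's outputs.
    vocab, batch, seq = 100, 2, 8
    logits = torch.randn(batch, seq, vocab)
    targets = torch.randint(0, vocab, (batch, seq))
    print(float(temperature_scaled_loss(logits, targets)))
    print(float(temperature_scaled_perplexity(logits, targets)))
```

Under this reading, training with a softened distribution acts as a regularizer during multi-epoch finetuning on scarce proofstep data, and checkpoints can be ranked by the temperature-scaled perplexity instead of running a full, expensive proof-search benchmark for every candidate.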
