Poster
in
Workshop: MATH-AI: The 4th Workshop on Mathematical Reasoning and AI
Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data
Huajian Xin · Daya Guo · Zhihong Shao · Z.Z. Ren · Qihao Zhu · Bo Liu · Chong Ruan · Wenda Li · Xiaodan Liang
Keywords: [ Autoformalization ] [ neural theorem proving ] [ language model ] [ Formal Math ] [ MiniF2F ]
Proof assistants like Lean have revolutionized mathematical proof verification by providing high levels of accuracy and reliability. Although large language models (LLMs) have demonstrated potential in mathematical reasoning, their advancement in formal theorem proving is hindered by the scarcity of large, high-quality training datasets. To address this challenge, we present a novel approach to generate extensive Lean 4 proof data from natural language mathematical problems at the high school and undergraduate levels. Specifically, we synthesize 8 million formal statements with corresponding proofs, leveraging this dataset to fine-tune the DeepSeekMath 7B model. The resulting model, DeepSeek-Prover, achieves a pass rate of 50\% on the Lean 4 miniF2F benchmark, surpassing the previous state-of-the-art result of 41.0\%. These findings underscore the potential of large-scale synthetic data in significantly enhancing the theorem-proving capabilities of LLMs.