

Poster in Workshop: MATH-AI: The 4th Workshop on Mathematical Reasoning and AI

InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning

Xiaotian Han · Yiren Jian · Xuefeng Hu · Haogeng Liu · Yiqi Wang · Qihang Fan · Yuang Ai · Huaibo Huang · Ran He · Zhenheng Yang · Quanzeng You

Keywords: [ LLM ] [ pretraining ] [ math ] [ multimodal ]


Abstract:

Pre-training on large, high-quality datasets is essential for improving the reasoning abilities of Large Language Models (LLMs), particularly in specialized fields like mathematics. However, the field of Multimodal LLMs (MLLMs) lacks a comprehensive, open-source dataset for mathematical reasoning. To fill this gap, we present InfiMM-WebMath-40B, a high-quality dataset of interleaved image-text documents. It consists of 24 million web pages, 85 million image URLs, and 40 billion text tokens, all carefully extracted and filtered from CommonCrawl. We outline our data collection and processing pipeline in detail. Models trained on InfiMM-WebMath-40B demonstrate strong performance in both text-only and multimodal settings, setting a new state-of-the-art on multimodal math benchmarks such as MathVerse and We-Math. We release our data at https://huggingface.co/datasets/Infi-MM/InfiMM-WebMath-40B.
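
For readers who want to explore the released corpus, here is a minimal sketch of loading it from the Hugging Face Hub with the `datasets` library. The repository ID comes from the link above; the split name and the record schema are assumptions, since the abstract does not describe the dataset's fields, so inspect the first record to see what is actually provided.

```python
# Minimal sketch: stream the InfiMM-WebMath-40B corpus from the Hugging Face Hub.
# Assumptions (not stated in the abstract): a "train" split exists, and each
# record is an interleaved image-text document; check the printed keys to
# confirm the real schema before building a training pipeline.
from datasets import load_dataset

# Stream to avoid downloading the full ~40B-token corpus up front.
ds = load_dataset("Infi-MM/InfiMM-WebMath-40B", split="train", streaming=True)

# Peek at one document to discover its fields (e.g. text and image URLs).
first_record = next(iter(ds))
print(first_record.keys())
```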
