Poster
in
Workshop: Foundation Models for Science: Progress, Opportunities, and Challenges
SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding
Sihang Li · Jin Huang · Jiaxi Zhuang · Yaorui Shi · Cai Xiaochen · Mingjun Xu · Xiang Wang · Linfeng Zhang · Guolin Ke · Hengxing Cai
Keywords: [ Pre-training ] [ Large Language Model ] [ Scientific Literature Understanding ] [ Supervised Fine-tuning ]
Scientific literature understanding is crucial for extracting targeted information and garnering insights from scientific documents. Despite the success of Large Language Models (LLMs), they face challenges in scientific literature understanding for two reasons: (1) insufficient scientific knowledge and (2) unfamiliarity with specialized tasks. To address this, we propose a hybrid strategy combining continual pre-training (CPT) and supervised fine-tuning (SFT) to enhance domain knowledge and instruction-following capabilities. Our approach tackles two key challenges: constructing high-quality CPT corpora and generating diverse SFT instructions. We address these challenges through a meticulous pipeline including PDF extraction, content correction, filtering, and synthetic instruction generation. Applying this strategy, we introduce SciLitLLM, a suite of LLMs tailored for scientific literature understanding. Specifically, the 7B model shows an average performance improvement of 3.6% on SciAssess and 10.1% on SciRIFF compared to leading LLMs with fewer than 15B parameters. Additionally, the 72B model, trained using QLoRA, achieves state-of-the-art performance among widely adopted open-source models.

Our contributions are threefold: (1) We present an effective framework that integrates CPT and SFT to adapt LLMs to scientific literature understanding, which can also be easily adapted to other domains. (2) We propose an LLM-based synthesis method to generate diverse and high-quality scientific instructions, resulting in a new instruction set – SciLitIns – for supervised fine-tuning in less-represented scientific domains. (3) SciLitLLM achieves promising performance improvements on scientific literature understanding benchmarks. Our models are available anonymously. Code and data will be released after administrative procedures.
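The abstract's corpus-construction pipeline (extraction, correction, quality filtering, then instruction synthesis and de-duplication) can be sketched at a high level as below. This is an illustrative sketch only, not the authors' released code: `quality_score` is a hypothetical heuristic standing in for the paper's LLM-based quality filter, and the de-duplication here is simple exact matching rather than the paper's full procedure.

```python
# Illustrative sketch of a two-stage data pipeline: quality filtering for the
# CPT corpus, then de-duplication of synthetic SFT instructions.
# All function names and heuristics here are assumptions for illustration.

def quality_score(text: str) -> float:
    """Toy stand-in for a learned quality scorer: rewards passages that are
    alphabetic-dense and reasonably long (hypothetical heuristic)."""
    if not text:
        return 0.0
    alpha_fraction = sum(c.isalpha() for c in text) / len(text)
    length_bonus = min(len(text.split()) / 50.0, 1.0)  # saturates at 50 words
    return alpha_fraction * length_bonus

def filter_corpus(docs: list[str], threshold: float = 0.5) -> list[str]:
    """CPT stage: keep only documents whose quality score clears a threshold,
    discarding extraction noise and low-quality fragments."""
    return [d for d in docs if quality_score(d) >= threshold]

def dedup_instructions(instructions: list[str]) -> list[str]:
    """SFT stage: drop duplicate synthetic instructions (exact match after
    normalization; a real pipeline would likely use semantic similarity)."""
    seen: set[str] = set()
    unique: list[str] = []
    for ins in instructions:
        key = ins.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(ins)
    return unique
```

For example, a garbled extraction fragment like `"@@@ ### 123 !!!"` scores 0.0 and is filtered out, while a long clean passage passes; near-identical synthetic instructions differing only in casing collapse to one entry.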