

Poster in Workshop: Foundation Models for Science: Progress, Opportunities, and Challenges

OPI: An Open Instruction Dataset for Adapting Large Language Models to Protein-Related Tasks

Hongwang Xiao · Wenjun Lin · Hui Wang · Zheng Liu · Qiwei Ye

Keywords: [ Protein modeling ] [ AI for Life Science ] [ Instruction Dataset ] [ Large Language Model ] [ Computational Biology ]


Abstract:

Large language models (LLMs) pretrained on extensive general corpora, such as GPT-4 and the Llama series, have demonstrated remarkable performance on a wide variety of natural language processing (NLP) tasks. These models provide a user-friendly and efficient interface capable of aligning with users' preferences through natural language instructions. However, LLMs' capabilities in the biomolecular sciences, especially protein-related research, remain limited, and the boundaries of those capabilities are unclear. To address this limitation, we introduce a comprehensive instruction dataset, Open Protein Instructions (OPI), containing over 1.64M samples (training: 98.38%, testing: 1.62%), specifically dedicated to protein-related research. It enables LLMs to handle a wide variety of protein-related tasks efficiently and cost-effectively through instruction tuning. Experiments on three types of tasks, namely Sequence Understanding (SU), Annotation Prediction (AP), and Knowledge Mining (KM), demonstrate the efficacy of OPI in adapting LLMs to protein-related tasks. This study indicates that it is feasible to apply LLMs to biomolecule-related tasks via instruction tuning. Data, code, and instruction-tuned models will be made publicly available to promote research within the community. To access the code, data, and models, please visit the anonymous GitHub repository at https://anonymous.4open.science/r/OPI-9D0B.
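As a rough illustration of the instruction-tuning setup the abstract describes, the sketch below formats one OPI-style sample (instruction, protein sequence input, target answer) into a prompt and computes a supervised fine-tuning loss where only the response tokens are trained on. The field names, the prompt template, and the small GPT-2 stand-in model are illustrative assumptions, not the paper's confirmed schema or base model; see the repository for the actual data format.

```python
# Minimal instruction-tuning sketch, assuming an OPI-style sample layout.
# Field names ("instruction", "input", "output") and the prompt template
# are hypothetical; the real OPI schema lives in the linked repository.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

sample = {
    "instruction": "Predict the subcellular localization of the given protein.",
    "input": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",  # toy amino-acid sequence
    "output": "Cytoplasm",
}

prompt = (
    f"### Instruction:\n{sample['instruction']}\n\n"
    f"### Input:\n{sample['input']}\n\n"
    f"### Response:\n"
)

# GPT-2 is only a small stand-in so the sketch runs without gated weights;
# the paper tunes larger general-purpose LLMs.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Concatenate prompt and target, then mask the prompt tokens out of the
# loss (-100 is the ignore index) so training only supervises the response.
full = tok(prompt + sample["output"] + tok.eos_token, return_tensors="pt")
prompt_len = tok(prompt, return_tensors="pt")["input_ids"].shape[1]
labels = full["input_ids"].clone()
labels[:, :prompt_len] = -100

loss = model(**full, labels=labels).loss  # one SFT step's loss
loss.backward()
print(f"loss: {loss.item():.4f}")
```

In practice this loop would run over all ~1.6M training samples with an optimizer and batching (e.g. via a standard SFT trainer); the snippet only shows how one instruction sample becomes a masked language-modeling target.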
