Poster in Workshop: Foundation Models for Science: Progress, Opportunities, and Challenges
Learning temperature-aware representations from millions of annotated protein sequences
Mingchen Li · Liang Zhang · Zilan Wang · Bozitao Zhong · Pan Tan · Jiabei Cheng · Bingxin Zhou · Liang Hong · Huiqun Yu
Keywords: [ Pre-trained Protein Models ] [ Protein ] [ Protein Temperature Prediction ]
Sun 15 Dec 8:30 a.m. PST — 5 p.m. PST
Temperature is a dominant environmental factor in determining how efficiently proteins function. Accurately predicting the thermal stability of proteins is crucial for fundamental biology, drug discovery, and protein engineering. Here, we introduce ThermoFormer, a transformer-based protein language model that learns both temperature-aware representations and sequence patterns. Specifically, we first build a large-scale dataset comprising more than 96 million protein sequences annotated with their optimal growth temperature (OGT). ThermoFormer is pre-trained on this dataset with a supervised OGT prediction task and an unsupervised masked language modeling (MLM) task. We evaluate ThermoFormer both on the pre-training task and when transferred to other temperature prediction datasets, including two melting temperature (TM) datasets and an optimal catalytic temperature (OCT) dataset. The results show that ThermoFormer achieves state-of-the-art performance on the OGT, TM, and OCT prediction tasks, outperforming previous unsupervised pre-trained models. In addition, we show that ThermoFormer enables zero-shot temperature prediction, i.e., even without further fine-tuning, ThermoFormer can still achieve comparable performance. We believe that ThermoFormer can serve as a foundation model for encoding protein sequences with temperature-aware representations, providing better transfer ability for temperature-related downstream tasks. The datasets, model weights, and source code are available at https://github.com/ginnm/ThermoFormer.
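To make the idea of joint pre-training concrete, the following is a minimal sketch of how a supervised OGT objective and an MLM objective could be combined into a single loss. It is not taken from the ThermoFormer code: the mean-squared-error regression loss, the additive weighting, the `ogt_weight` parameter, and all function names are assumptions chosen for illustration, since the abstract does not specify them.

```python
import math


def masked_lm_loss(logits, targets, mask):
    """Mean cross-entropy over masked positions only.

    logits:  list of per-position score lists (one score per vocabulary token)
    targets: list of true token ids, one per position
    mask:    list of bools, True where the input token was masked out
    """
    total, count = 0.0, 0
    for scores, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue  # unmasked positions do not contribute to the MLM loss
        # Numerically stable log-sum-exp for the softmax normaliser.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_z - scores[target]
        count += 1
    return total / count if count else 0.0


def ogt_regression_loss(predicted, actual):
    """Mean squared error between predicted and annotated OGT values.

    MSE is an assumption here; the abstract only says the task is supervised.
    """
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)


def joint_loss(logits, targets, mask, ogt_pred, ogt_true, ogt_weight=1.0):
    """Hypothetical combined objective: MLM loss plus weighted OGT loss."""
    mlm = masked_lm_loss(logits, targets, mask)
    ogt = ogt_regression_loss(ogt_pred, ogt_true)
    return mlm + ogt_weight * ogt
```

For example, a single masked position with uniform scores over a four-token vocabulary gives an MLM term of ln 4 (about 1.386), to which the weighted OGT error is added. In practice the two terms may live on very different scales, so the relative weight would likely need tuning.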