Poster in Workshop: Machine Learning in Structural Biology
Retrieval Augmented Protein Language Models for Protein Structure Prediction
Peter Lee · Xingyi Cheng · Eric Xing · Le Song
Abstract:
The advent of advanced artificial intelligence technology has significantly accelerated progress in protein structure prediction. AlphaFold2, a pioneering method in this field, set a new benchmark for prediction accuracy by leveraging the Evoformer module to automatically extract co-evolutionary information from multiple sequence alignments (MSA). However, the efficacy of structure prediction methods such as AlphaFold2 depends heavily on the depth and quality of the MSA. To address this limitation, we propose a novel approach termed Protein Language Model with Retrieved AuGmented MSA (RAGPLM). This approach integrates pre-trained protein language models with retrieved MSA, incorporating co-evolutionary information into structure prediction while compensating for insufficient MSA through large-scale pre-training. Our method surpasses single-sequence protein language models in perplexity, contact prediction, and fitness prediction. Using RAGPLM as the feature extractor for protein structure prediction, we developed RAGFold. When sufficient MSA is available, RAGFold achieves TM-scores comparable to AlphaFold2 and runs up to eight times faster. When MSA is insufficient, our method significantly outperforms AlphaFold2 ($\Delta$TM-score = 0.379, 0.116, and 0.059 for 0, 5, and 10 MSA sequences as input). Additionally, we developed an MSA retriever for MSA search against the UniClust30 database using hierarchical ID generation; it is 45 to 90 times faster than traditional methods and is used to expand the MSA training set for RAGPLM by 32%. Our findings suggest that RAGPLM provides an efficient and accurate solution for protein structure prediction.
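The abstract describes fusing a pre-trained single-sequence protein language model with retrieved MSA so that co-evolutionary signal can inform the representation. The toy sketch below shows one plausible shape such a fusion could take, with query residues cross-attending to residues of retrieved, aligned homologs. The module names, dimensions, and the cross-attention design are illustrative assumptions, not the authors' RAGPLM architecture.

```python
# Hypothetical sketch of retrieval-augmented fusion: a single-sequence encoder
# produces per-residue embeddings, and retrieved MSA rows are injected via
# cross-attention. All names and sizes are illustrative, not the paper's model.
import torch
import torch.nn as nn


class RetrievalAugmentedEncoder(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        # Token embeddings for the query sequence and the retrieved MSA rows
        # (small vocabulary of amino acids plus gap/padding symbols).
        self.query_embed = nn.Embedding(25, d_model)
        self.msa_embed = nn.Embedding(25, d_model)
        # Cross-attention: query residues attend to all retrieved MSA residues,
        # pulling co-evolutionary context into the single-sequence representation.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, query_tokens: torch.Tensor, msa_tokens: torch.Tensor) -> torch.Tensor:
        # query_tokens: (batch, L)     -- the target sequence
        # msa_tokens:   (batch, N, L)  -- N retrieved, aligned homologs
        b, n, l = msa_tokens.shape
        q = self.query_embed(query_tokens)                    # (b, L, d)
        m = self.msa_embed(msa_tokens).reshape(b, n * l, -1)  # flatten MSA rows
        fused, _ = self.cross_attn(q, m, m)                   # attend over MSA residues
        return self.norm(q + fused)                           # residual fusion


if __name__ == "__main__":
    enc = RetrievalAugmentedEncoder()
    query = torch.randint(0, 20, (1, 64))    # one sequence of length 64
    msa = torch.randint(0, 20, (1, 16, 64))  # 16 retrieved homologs
    print(enc(query, msa).shape)             # torch.Size([1, 64, 256])
```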
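The abstract also mentions an MSA retriever that searches UniClust30 via hierarchical ID generation. A common way to obtain such IDs is to recursively cluster database sequence embeddings so that each entry receives a path of cluster indices, which a generative retriever can then decode level by level instead of running an exhaustive alignment search. The sketch below is a minimal illustration of that ID-assignment step only; the branching factor, depth, and use of k-means are assumptions, not the authors' exact procedure.

```python
# Illustrative hierarchical ID assignment for generative retrieval: embeddings
# are recursively clustered and each sequence gets a path ID such as "3-0-2".
# This is a toy sketch under stated assumptions, not the paper's retriever.
import numpy as np
from sklearn.cluster import KMeans


def assign_hierarchical_ids(embeddings: np.ndarray, branching: int = 4, depth: int = 3):
    """Return one hierarchical ID string per embedding row."""
    ids = [[] for _ in range(len(embeddings))]

    def recurse(indices: np.ndarray, level: int) -> None:
        # Stop when the maximum depth is reached or the node is already small.
        if level == depth or len(indices) <= branching:
            return
        km = KMeans(n_clusters=branching, n_init=10, random_state=0)
        labels = km.fit_predict(embeddings[indices])
        for cluster in range(branching):
            members = indices[labels == cluster]
            for i in members:
                ids[i].append(str(cluster))
            recurse(members, level + 1)

    recurse(np.arange(len(embeddings)), 0)
    return ["-".join(path) for path in ids]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_embeddings = rng.normal(size=(200, 32))  # stand-in for sequence embeddings
    print(assign_hierarchical_ids(fake_embeddings)[:5])
```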