

Poster in Workshop: Machine Learning in Structural Biology

Tradeoffs of alignment-based and protein language models for predicting viral mutation effects

Noor Youssef · Sarah Gurev · Debora Marks


Abstract:

Protein mutation effect prediction has advanced several fields, from designing enzymes to forecasting viral evolution. These models are typically trained on sequence data, structural data, or a combination of both. Sequence-based models learn the constraints governing protein structure and function from sequence data and fall broadly into two categories: alignment-based models and protein language models (PLMs). We provide the first detailed comparison of these modeling approaches specifically for viruses. We curated a dataset of over 59 standardized viral deep mutational scanning assays and assessed the relative performance of three alignment-based models (PSSM, EVmutation, and EVE), three PLMs (ESM-1v, Tranception, and VESPA), and two versions of SaProt, a structurally aware protein language model (SaProt-AF2 and SaProt-PDB). Interestingly, deeper alignments often led to worse performance for alignment-based models. PLMs, on the other hand, tended to perform better as the size of the training database increased. Overall, alignment-based models outperformed sequence-only PLMs, while the best alignment-based model performed on par with SaProt-PDB. Our findings suggest that modeling strategies that are effective for other taxa may not translate directly to viruses, likely due to the limited number of viral sequences used for training. However, incorporating additional virus-specific data into PLMs could enhance their predictive power for viral mutation effects.
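To make the "alignment-based" category concrete, the sketch below shows how the simplest such model, a PSSM, scores a mutation: amino-acid frequencies are counted per alignment column (with pseudocounts), and a mutation's effect is the log-odds of the mutant versus the wild-type residue at that position. This is a minimal illustrative example, not the paper's implementation; the toy alignment, function names, and pseudocount value are assumptions.

```python
import numpy as np

# Standard 20 amino acids; columns of the PSSM follow this order.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def pssm_from_alignment(seqs, pseudocount=1.0):
    """Build a toy PSSM: per-column amino-acid frequencies with pseudocounts.

    `seqs` is a list of equal-length aligned sequences (rows of an MSA).
    """
    L = len(seqs[0])
    counts = np.full((L, len(ALPHABET)), pseudocount)
    for s in seqs:
        for i, aa in enumerate(s):
            if aa in ALPHABET:  # skip gaps and ambiguous residues
                counts[i, ALPHABET.index(aa)] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def mutation_score(pssm, pos, wt, mut):
    """Log-odds of mutant vs. wild-type residue at a 0-indexed position.

    Negative scores mean the mutant residue is rarer in the alignment
    column than the wild type, i.e. the mutation is predicted deleterious.
    """
    return (np.log(pssm[pos, ALPHABET.index(mut)])
            - np.log(pssm[pos, ALPHABET.index(wt)]))

# Toy usage: a 3-sequence, 2-column alignment.
pssm = pssm_from_alignment(["AC", "AC", "AD"])
print(mutation_score(pssm, 1, "C", "D"))  # negative: D is rarer than C here
```

EVmutation and EVE build on the same alignment-column statistics but additionally model pairwise (epistatic) and latent-variable dependencies between positions, which is where the sensitivity to alignment depth noted above enters.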
