Poster in Workshop: Machine Learning in Structural Biology
Protein language models learn evolutionary statistics of interacting sequence motifs
Zhidian Zhang · Hannah Wayment-Steele · Garyk Brixi · Matteo Dal Peraro · Dorothee Kern · Sergey Ovchinnikov
Protein language models (pLMs) have emerged as powerful tools for predicting protein structures and designing proteins, yet it remains unknown to what degree these models actually understand the biophysics underlying protein structure. Motivated by the discovery that pLMs erroneously predict non-physical structure fragments for protein isoforms, we investigated the sequence context required for contact prediction in ESM2, both by developing a "categorical Jacobian" approach, which provides a fully unsupervised way of assessing the coevolutionary signal stored in a model, and by artificially modifying input sequences. We found that pLMs make contact predictions conditioned on sequence motifs and on the relative linear distance between segment pairs. Our investigation highlights the limitations of current pLMs and underscores the importance of understanding the mechanisms underlying these models.
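The idea of a categorical Jacobian can be sketched as follows: for every position i and every substitute amino acid a, record how the model's output logits at every other position j shift, then reduce the resulting 4-dimensional tensor to a symmetric position-by-position map. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: `model_logits` is a toy stand-in for a pLM such as ESM2 (a real run would query the model's logits), and the toy coupling matrix `W` is invented purely so the code is self-contained and runnable.

```python
import numpy as np

A = 20  # amino-acid alphabet size
rng = np.random.default_rng(0)
# Toy pairwise coupling matrix, a stand-in for a trained pLM's knowledge.
W = rng.normal(size=(A, A)) * 0.1

def model_logits(seq):
    """Toy stand-in for a pLM: logits at each position j are the sum of
    couplings to the residues at all other positions. A real categorical
    Jacobian would call the language model here instead."""
    L = len(seq)
    out = np.zeros((L, A))
    for j in range(L):
        for i in range(L):
            if i != j:
                out[j] += W[seq[i]]
    return out

def categorical_jacobian(seq):
    """J[i, a, j, b] = change in logit for letter b at position j
    when position i is substituted with letter a."""
    L = len(seq)
    base = model_logits(seq)
    J = np.zeros((L, A, L, A))
    for i in range(L):
        for a in range(A):
            mutant = list(seq)
            mutant[i] = a
            J[i, a] = model_logits(mutant) - base
    return J

def contact_map(J):
    """Collapse the two categorical axes with a Frobenius norm and
    symmetrize, yielding an L x L coevolution-style map."""
    C = np.linalg.norm(J, axis=(1, 3))
    return 0.5 * (C + C.T)

seq = rng.integers(0, A, size=8)  # toy sequence of length 8
J = categorical_jacobian(seq)
C = contact_map(J)
print(C.shape)  # (8, 8)
```

In practice one would also apply an average-product correction (APC) to the resulting map before ranking contacts, as is standard for coevolution-based contact prediction; that step is omitted here for brevity.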