Poster in the Machine Learning in Structural Biology Workshop
Optimizing protein language models with Sentence Transformers
Istvan Redl
Protein language models (pLMs) have appeared in a wide range of in-silico protein engineering tasks and have shown impressive results. However, the ways in which they are applied remain largely standardised. Here, we introduce a set of finetuning techniques based on Sentence Transformers (STs), integrated with a novel data augmentation procedure, and show how they can deliver new state-of-the-art performance. Despite having initially been developed for classic NLP tasks, STs hold a natural appeal for pLM-related applications, largely due to their use of sequence pairs and triplets during training. We demonstrate this conceptual approach in two settings that frequently occur in this domain: a residue-level and a sequence-level prediction task. Beyond showing how these tools can extract more, and higher-quality, information from pLMs, we discuss the main differences between their applications in the NLP and protein spaces. We conclude by discussing the related challenges and provide a comprehensive outlook on potential applications.
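The abstract does not include an implementation, so the sketch below only illustrates the general pattern it describes: fine-tuning a pLM with a triplet objective via the sentence-transformers library. The backbone (Rostlab/prot_bert), the toy triplets, and all hyperparameters are illustrative assumptions, not the authors' setup; in the paper, the anchor/positive/negative triplets would presumably come from the proposed data augmentation procedure.

```python
# Minimal sketch: triplet-based fine-tuning of a protein language model with
# the sentence-transformers library. The backbone, toy triplets, and
# hyperparameters are assumptions for illustration, not the authors' setup.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Wrap a pLM backbone with mean pooling to get per-sequence embeddings.
backbone = models.Transformer("Rostlab/prot_bert", max_seq_length=512)
pooling = models.Pooling(backbone.get_word_embedding_dimension(),
                         pooling_mode="mean")
model = SentenceTransformer(modules=[backbone, pooling])

def spaced(seq: str) -> str:
    """ProtBERT's tokenizer expects amino acids separated by spaces."""
    return " ".join(seq)

# Hypothetical (anchor, positive, negative) triplets: anchor and positive
# share the property of interest, the negative does not.
train_examples = [
    InputExample(texts=[spaced("MKTAYIAKQR"), spaced("MKTAYIAKQK"),
                        spaced("GSHMLEDPVA")]),
    InputExample(texts=[spaced("AEVLKQGHWT"), spaced("AEVLKQGHWS"),
                        spaced("PPGFNDRRIC")]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
# TripletLoss pulls anchor-positive pairs together and pushes negatives away.
train_loss = losses.TripletLoss(model=model)

# Fine-tune; afterwards model.encode(...) yields embeddings usable by
# downstream residue- or sequence-level prediction heads.
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=10)

embeddings = model.encode([spaced("MKTAYIAKQR")])
```

The triplet objective is one of several pair/triplet losses in sentence-transformers; a pairwise contrastive or cosine-similarity loss would follow the same pattern with two-element `InputExample` texts.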