

Poster in Workshop: Machine Learning in Structural Biology

Benchmarking text-integrated protein language model embeddings and embedding fusion on diverse downstream tasks

Young Su Ko


Abstract:

Recent methods have attempted to enhance protein language models (pLMs) by integrating text-based protein annotations, referred to as "text+protein" language models (tpLMs). We evaluate and compare six tpLMs against ESM2, a text-free baseline pLM, across six downstream tasks. We additionally investigate the potential of embedding fusion, exploring whether combinations of tpLM embeddings can improve performance on the benchmarks by exploiting the strengths of multiple tpLMs. On five of the six benchmarks, some combination of tpLM embeddings outperformed any single tpLM embedding, highlighting the potential of embedding fusion as a useful strategy in the field of machine learning for proteins. To facilitate the practical application of embedding fusion, we outline a heuristic that efficiently identifies the optimal combination of embeddings, reducing the exponential time complexity of an exhaustive combination search to a manageable linear time complexity. Using our embedding fusion framework, we achieve state-of-the-art performance on the protein-protein interaction prediction and homologous sequence recovery tasks.
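The abstract does not spell out the heuristic, but a ranked prefix search is one way to obtain the stated linear cost. The sketch below is a minimal illustration under our own assumptions, not the authors' method: the function names (fuse, linear_fusion_search), the choice of concatenation as the fusion operation, and the logistic-regression probe used for scoring are all hypothetical. It scores each embedding on its own, sorts them best-first, and then evaluates only the growing prefixes of that ranking, so an n-embedding search costs 2n probe fits instead of the 2^n - 1 fits of an exhaustive combination search.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def fuse(embedding_list):
    """Fuse per-protein embeddings by concatenating along the feature axis."""
    return np.concatenate(embedding_list, axis=1)

def linear_fusion_search(embeddings, labels, cv=5):
    """Hypothetical linear-time fusion search.

    `embeddings` maps a model name to an (n_proteins, dim) array.
    Pass 1 scores each embedding alone (n probe fits); pass 2 scores
    each prefix of the resulting ranking (n more fits), versus the
    2**n - 1 fits an exhaustive combination search would need.
    """
    def score(X):
        probe = LogisticRegression(max_iter=1000)
        return cross_val_score(probe, X, labels, cv=cv).mean()

    # Pass 1: rank embeddings by their individual downstream performance.
    ranked = sorted(embeddings, key=lambda name: score(embeddings[name]),
                    reverse=True)

    # Pass 2: concatenate embeddings in ranked order, keeping the best prefix.
    best_names, best_score = [], -np.inf
    for k in range(1, len(ranked) + 1):
        prefix = ranked[:k]
        s = score(fuse([embeddings[name] for name in prefix]))
        if s > best_score:
            best_names, best_score = prefix, s
    return best_names, best_score

A greedy scheme like this can miss combinations whose members are individually weak but complementary, which is the usual trade-off accepted in exchange for the reduced search cost.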
