Poster in Workshop: Machine Learning in Structural Biology
Benchmarking text-integrated protein language model embeddings and embedding fusion on diverse downstream tasks
Young Su Ko
Recent methods have attempted to enhance protein language models (pLMs) by integrating text-based protein annotations; we refer to these models as "text+protein" language models (tpLMs). We evaluate and compare six tpLMs against ESM2, a text-free baseline pLM, across six downstream tasks. We additionally investigate embedding fusion, exploring whether combinations of tpLM embeddings can improve benchmark performance by exploiting the complementary strengths of multiple tpLMs. For five of the six benchmarks, some combination of tpLM embeddings outperformed every single tpLM embedding, highlighting the potential of embedding fusion as a useful strategy in machine learning for proteins. To facilitate the practical application of embedding fusion, we outline a heuristic that efficiently identifies the optimal combination of embeddings, reducing the exponential time complexity of an exhaustive combination search to a manageable linear time complexity. Using our embedding fusion framework, we achieve state-of-the-art performance on the protein-protein interaction prediction and homologous sequence recovery tasks.
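The abstract does not spell out the fusion operation or the selection heuristic, so the sketch below is an illustration under assumptions, not the authors' method: fusion is taken to be simple concatenation of per-protein embeddings, and the heuristic shown is one plausible scheme consistent with the stated linear cost, namely scoring each embedding individually (n evaluations), then greedily adding embeddings in ranked order and keeping each only if the fused score improves (another n evaluations), for 2n evaluations instead of the 2^n of an exhaustive search. The model names, dimensions, data, and downstream probe are all hypothetical stand-ins.

```python
# Hypothetical sketch of embedding fusion with a linear-time selection
# heuristic. The concatenation fusion and the ranked-greedy heuristic are
# assumptions consistent with the abstract, not the authors' exact method.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-in data: per-protein embeddings from four hypothetical tpLMs
# (each model may use a different embedding dimension) and binary labels.
n_proteins = 200
embeddings = {
    "tpLM_A": rng.normal(size=(n_proteins, 64)),
    "tpLM_B": rng.normal(size=(n_proteins, 128)),
    "tpLM_C": rng.normal(size=(n_proteins, 32)),
    "tpLM_D": rng.normal(size=(n_proteins, 96)),
}
labels = rng.integers(0, 2, size=n_proteins)

def score(feature_blocks):
    """Evaluate a fused embedding (column-wise concatenation) with a
    simple downstream probe, using cross-validated accuracy as the metric."""
    X = np.concatenate(feature_blocks, axis=1)
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, labels, cv=5).mean()

# Step 1 (n evaluations): score each embedding on its own and rank them.
solo = {name: score([emb]) for name, emb in embeddings.items()}
ranked = sorted(solo, key=solo.get, reverse=True)

# Step 2 (n evaluations): greedily add embeddings in ranked order,
# keeping each only if the fused score improves.
selected = [ranked[0]]
best = solo[ranked[0]]
for name in ranked[1:]:
    candidate = score([embeddings[m] for m in selected + [name]])
    if candidate > best:
        selected.append(name)
        best = candidate

print(f"Selected combination: {selected} (CV accuracy {best:.3f})")
```

With a fixed downstream probe, the total cost of this scheme grows linearly in the number of candidate tpLMs, which is what makes fusion practical when many embedding sources are available.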