Poster
MMSite: A Multi-modal Framework for the Identification of Active Sites in Proteins
Song Ouyang · Huiyu Cai · Kehua Su · Yong Luo · Lefei Zhang · Bo Du
Abstract:
The accurate identification of active sites in proteins is essential for the advancement of life science and pharmaceutical development, as these sites are of critical importance for enzyme activity and drug design. Recent advancements in protein language models (PLMs), trained on extensive datasets of amino acid sequences, have significantly improved our understanding of proteins. However, compared to the abundant protein sequence data, functional annotations, especially precise per-residue annotations, are scarce, which limits the performance of PLMs. On the other hand, textual descriptions of proteins, which can be written by human experts or generated by a pretrained protein sequence-to-text model, provide meaningful context that can assist functional annotation tasks such as the localization of active sites. This motivates us to construct a $\textbf{ProT}$ein-$\textbf{A}$ttribute text $\textbf{D}$ataset ($\textbf{ProTAD}$), comprising over 570,000 pairs of protein sequences and multi-attribute textual descriptions. Based on this dataset, we propose $\textbf{MMSite}$, a multi-modal framework that enhances the ability of PLMs to identify active sites by leveraging biomedical language models (BLMs). In particular, we incorporate manual prompting and design a MACross module to handle the multi-attribute characteristics of the textual descriptions. MMSite is a two-stage ("First Align, Then Fuse") framework: it first aligns the textual modality with the sequential modality through soft-label alignment, and then identifies active sites via multi-modal fusion. Experimental results demonstrate that MMSite achieves state-of-the-art performance compared to existing protein representation learning methods.