Poster in Workshop: AI for New Drug Modalities
Probing the Embedding Space of Protein Foundation Models through Intrinsic Dimension Analysis
Soojung Yang · Juno Nam · Tynan Perez · Jinyeop Song · Xiaochen Du · Rafael Gomez-Bombarelli
Abstract:
Protein foundation models produce embeddings that are valuable for various downstream tasks, yet the structure and information content of these embeddings remain poorly understood, particularly in relation to diverse pre-training tasks and input modalities. We apply intrinsic dimension ($I_d$) analysis to quantify the complexity of protein embeddings from several widely used models, including ESM-2, ESM-IF, ProstT5, and ProteinMPNN. We also employ $I_d$ correlation ($I_d$Cor) to measure the shared information between different embeddings. Our results reveal a universality in protein embeddings, with similar $I_d$ scales across models and strong correlations between protein and residue embeddings. We observe significant redundancy, with $I_d$ values much smaller than the original embedding dimensions. We also show that models capture both spatial and sequential long-range correlations, with correlation decay rates that differ across input modalities and pre-training tasks. Lastly, we analyze mutant embeddings, revealing that mutations cluster effectively by site and that fine-tuning further reduces the $I_d$ to capture task-specific representations.
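As an illustration of what an $I_d$ estimate involves, the following is a minimal sketch of one common estimator, TwoNN, which uses only the ratio of each point's second-nearest to nearest neighbour distance. The abstract does not state which estimator the authors use, so the choice of TwoNN, the function name `twonn_intrinsic_dimension`, and the input layout (one embedding per row) are assumptions made here for illustration only.

```python
import numpy as np


def twonn_intrinsic_dimension(X):
    """Estimate intrinsic dimension with the TwoNN maximum-likelihood formula.

    X is an (n_points, n_features) array, e.g. one embedding vector per row.
    This is an illustrative sketch, not the authors' pipeline.
    """
    X = np.asarray(X, dtype=float)

    # Pairwise squared Euclidean distances via the Gram-matrix identity.
    sq_norms = np.sum(X**2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(d2, 0.0, out=d2)      # clip tiny negatives from round-off
    np.fill_diagonal(d2, np.inf)     # exclude each point's distance to itself

    # Two smallest distances per row: r1 (nearest) and r2 (second nearest).
    nearest = np.sqrt(np.partition(d2, 1, axis=1)[:, :2])
    r1, r2 = nearest[:, 0], nearest[:, 1]

    # Drop exact duplicates (r1 == 0), for which the ratio is undefined.
    keep = r1 > 0
    mu = r2[keep] / r1[keep]

    # Under the TwoNN model mu is Pareto-distributed with exponent I_d,
    # so the maximum-likelihood estimate is N / sum(log(mu)).
    return keep.sum() / np.sum(np.log(mu))
```

Applied to points sampled from a two-dimensional plane embedded isometrically in a 64-dimensional space, an estimator of this kind should return a value close to 2 rather than 64, which is the sense in which an $I_d$ "much smaller than the original embedding dimension" indicates redundancy.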