Poster in Workshop: UniReps: Unifying Representations in Neural Models
Understanding Task Knowledge Entanglement in Protein Language Model Representations
Ria Vinod
Keywords: [ alignment ] [ subnetworks ] [ entanglement ] [ representations ] [ relational knowledge ]
Pretraining large-scale language models on protein sequences with a masked language modeling (MLM) objective has emerged as a popular modeling paradigm in recent years. In the MLM pretraining regime, protein language models (PLMs) are given only the objective of learning sequence patterns by improving accuracy at imputing masked residues. However, representations from these pretrained PLMs have demonstrated markedly improved performance on structure and function prediction tasks compared to models trained on simple sequence representations with explicit task-based objectives. Though PLM representations deliver strong general-purpose performance, it is unclear exactly how PLMs learn and organize task-specific knowledge within their parameters to produce these emergent properties. In this work, we propose an approach to understanding the task-knowledge organization of PLMs pretrained with only a sequence task objective, via the identification of knowledge-specific subnetworks. We initially identify two types of subnetworks of the flagship ESM-2 PLM, with parameters pruned to retain (1) structure knowledge and (2) LM-task knowledge. We identify both types of subnetworks for three ESM-2 model sizes: small (8M), medium (150M), and large (650M). We evaluate these subnetworks on four downstream global understanding tasks for which knowledge emerges in the learned representations of PLMs: sequence fitness prediction, contact prediction, subcellular localization, and architecture classification. Our initial results show that we can identify knowledge-critical PLM subnetworks with decoupled sequence and structure embeddings, and that structure-task performance correlates with model parameter size.
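The abstract does not specify the pruning criterion used to extract the subnetworks. As a rough, non-authoritative illustration of what pruning a PLM to obtain a subnetwork and embedding sequences with it might look like, the sketch below applies L1 unstructured magnitude pruning to the small ESM-2 checkpoint via the public fair-esm package; the 50% sparsity level and the magnitude-based criterion are illustrative assumptions, not the authors' method.

```python
# Minimal sketch: prune ESM-2 (8M) to a sparse subnetwork, then embed a sequence.
# ASSUMPTION: the paper's pruning procedure is unspecified; L1 magnitude pruning
# at 50% sparsity is a stand-in chosen only for illustration.
import torch
import torch.nn.utils.prune as prune
import esm  # pip install fair-esm

# Load the small ESM-2 checkpoint (8M parameters, 6 layers, 320-dim embeddings).
model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

# Prune 50% of the smallest-magnitude weights in every linear layer.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # bake the mask into the weights

# Embed an example sequence with the pruned subnetwork.
data = [("protein1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, tokens = batch_converter(data)
with torch.no_grad():
    out = model(tokens, repr_layers=[6])

# Mean-pool the final-layer residue representations into one vector per sequence.
embedding = out["representations"][6].mean(dim=1)
print(embedding.shape)  # torch.Size([1, 320])
```

In the study itself, such pruned-model embeddings would presumably be probed on the four downstream tasks (fitness, contacts, localization, architecture) to test which knowledge each subnetwork retains.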