Poster in Workshop: Foundation Models for Science: Progress, Opportunities, and Challenges
Scalable Universal T-Cell Receptor Embeddings from Adaptive Immune Repertoires
Paidamoyo Chapfuwa · Ilker Demirel · Lorenzo Pisani · Javier Zazo · Elon Portugaly · H. Zahid · Julia Greissl
Keywords: [ Scaling ] [ T-cell Receptor Embeddings ] [ GloVe ] [ Random Projection Theory ] [ Immunology ]
T cells are a key component of the adaptive immune system, targeting infections, cancers, and allergens with specificity encoded by their T-cell receptors (TCRs) and retaining a memory of their targets. High-throughput TCR repertoire sequencing captures a cross-section of TCRs that encode the immune history of any subject, though the data are heterogeneous, high-dimensional, sparse, and mostly unlabeled. Sets of TCRs responding to the same antigen, i.e. a protein fragment, co-occur in subjects sharing immune genetics and exposure history. Here we leverage TCR co-occurrence to derive a low-dimensional dense vector representation of TCRs, employing a previously proposed unsupervised natural language processing algorithm, GloVe, and leveraging random projection theory to reduce memory use and training time. Using TCR co-occurrence derived from a large set of TCR repertoires, we derive TCR representations and aggregate the subset of TCRs observed in any subject to provide a subject-level vector representation. We show that vectors for TCRs targeting the same pathogen point in similar directions and that our subject-level representations encode both immune genetics and pathogenic exposure history. Our work paves the way for combining our novel TCR and subject-level representations with complementary representations from other modalities.
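The pipeline described in the abstract (co-occurrence counts, a low-dimensional TCR vector, then aggregation per subject) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how GloVe and random projection are combined, so the sketch skips GloVe training entirely and simply applies a Gaussian random projection to log-scaled co-occurrence rows, then mean-pools TCR vectors per subject. The function names, the toy presence matrix, the `log1p` scaling, and the choice of mean pooling are all assumptions made for illustration.

```python
import numpy as np

def cooccurrence(presence):
    """Count, for each TCR pair, the subjects in which both are observed.

    presence: (n_subjects, n_tcrs) 0/1 matrix (hypothetical input format).
    """
    presence = np.asarray(presence, dtype=np.float64)
    counts = presence.T @ presence
    np.fill_diagonal(counts, 0.0)  # assumption: ignore self co-occurrence
    return counts

def random_projection(features, dim, seed=0):
    """Project rows of `features` to `dim` dimensions with a Gaussian matrix.

    Entries are drawn from N(0, 1/dim), so squared norms are preserved in
    expectation (the usual Johnson-Lindenstrauss style construction).
    """
    rng = np.random.default_rng(seed)
    proj = rng.normal(0.0, 1.0 / np.sqrt(dim), size=(features.shape[1], dim))
    return features @ proj

def subject_vectors(presence, tcr_vectors):
    """Mean-pool the vectors of the TCRs observed in each subject (assumed aggregation)."""
    presence = np.asarray(presence, dtype=np.float64)
    sizes = presence.sum(axis=1, keepdims=True)
    return (presence @ tcr_vectors) / np.maximum(sizes, 1.0)

# Toy data: 4 subjects, 6 TCRs. TCRs 0-2 co-occur in subjects 0 and 1,
# TCRs 3-5 co-occur in subjects 2 and 3.
presence = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
])

counts = cooccurrence(presence)
# Assumption: log-scaled counts stand in for the log co-occurrence GloVe fits.
tcr_vecs = random_projection(np.log1p(counts), dim=4)
subj_vecs = subject_vectors(presence, tcr_vecs)
```

In this toy setup, subjects 0 and 1 carry identical TCR sets and therefore receive identical subject-level vectors. A real repertoire dataset would have a far larger and sparser TCR vocabulary, which is where the memory and training-time savings mentioned in the abstract would matter.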