Poster in Workshop: AI for New Drug Modalities
Scaling Dense Representations for Single Cell Gene Expression with Transcriptome-Scale Context
Nicholas Ho · Caleb Ellington · Jinyu Hou · Sohan Addagudi · Shentong Mo · Tianhua Tao · Yonghao Zhuang · Hongyi Wang · Xingyi Cheng · Eric Xing · Le Song
Developing a unified model of cellular systems is a canonical challenge in biology. Recently, a wealth of public single-cell RNA sequencing data and the rapid scaling of self-supervised learning methods have provided new avenues to address this longstanding challenge. However, while rapid parameter scaling has been essential to the success of large language models for text and images, similar scaling has not been attempted with Transformer architectures for cellular modeling. To produce accurate, transferable, and biologically meaningful representations of cellular systems, we develop CellFoundation, a series of 3M-, 10M-, 100M-, and 650M-parameter encoder-only dense Transformer models pre-trained on 50 million human cells from diverse tissues using a read-depth-aware masked gene expression pretraining objective. Unlike previous models, CellFoundation handles the entire human transcriptome as input without truncation or sampling tricks, and thus learns accurate and general representations of a human cell's full transcriptional context. Pretraining with this longer context was enabled by FlashAttention-2, mixed-precision training, and large-scale distributed systems. CellFoundation (100M) achieves state-of-the-art results on tasks such as zero-shot clustering, cell-type classification, and perturbation modeling. Our findings also reveal interesting loss-scaling behaviors as we increase CellFoundation's parameters from 3M to 650M, providing insights for future directions in single-cell modeling.
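For readers unfamiliar with masked gene expression pretraining, the sketch below illustrates the general idea in PyTorch: an encoder-only Transformer takes every gene of a cell as a token, a fraction of expression values are masked, a read-depth signal conditions the model, and the loss is computed only on the masked positions. This is a minimal, hypothetical illustration under our own assumptions; the class and parameter names (e.g. `MaskedExpressionModel`, `n_genes`) and all design details are ours, not CellFoundation's actual implementation.

```python
# Minimal sketch (not the authors' code) of a read-depth-aware masked gene
# expression pretraining step with an encoder-only Transformer. All names and
# design choices here are illustrative assumptions.
import torch
import torch.nn as nn

class MaskedExpressionModel(nn.Module):
    def __init__(self, n_genes: int, d_model: int = 128, n_layers: int = 4, n_heads: int = 4):
        super().__init__()
        # One learned embedding per gene identity; expression enters as a scalar projection.
        self.gene_embed = nn.Embedding(n_genes, d_model)
        self.expr_proj = nn.Linear(1, d_model)
        # Read-depth conditioning: project log total counts per cell (assumed mechanism).
        self.depth_proj = nn.Linear(1, d_model)
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)  # regress masked (log-normalized) expression

    def forward(self, expr, mask, depth):
        # expr: (B, G) log-normalized expression; mask: (B, G) bool, True where masked
        # depth: (B, 1) log total read counts per cell
        B, G = expr.shape
        gene_ids = torch.arange(G, device=expr.device).expand(B, G)
        tokens = self.gene_embed(gene_ids) + self.expr_proj(expr.unsqueeze(-1))
        # Replace masked positions with a learned mask token.
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        tokens = tokens + self.depth_proj(depth).unsqueeze(1)  # broadcast depth signal
        hidden = self.encoder(tokens)
        return self.head(hidden).squeeze(-1)  # (B, G) predicted expression

# Toy training step on random data (2,000 "genes" stand in for the ~20k-gene transcriptome).
model = MaskedExpressionModel(n_genes=2000)
expr = torch.rand(8, 2000)
depth = expr.sum(dim=1, keepdim=True).log()
mask = torch.rand(8, 2000) < 0.15
pred = model(expr, mask, depth)
loss = ((pred - expr) ** 2)[mask].mean()  # loss only on masked positions
loss.backward()
```

At transcriptome scale (roughly 20,000 tokens per cell), the quadratic attention in this naive sketch becomes the bottleneck, which is where the memory-efficient kernels (e.g. FlashAttention-2) and mixed precision cited in the abstract come in.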