Poster in Workshop: AI for New Drug Modalities
Accurate and General DNA Representations Emerge from Genome Foundation Models at Scale
Caleb Ellington · Ning Sun · Nicholas Ho · Tianhua Tao · Sazan Mahbub · Yonghao Zhuang · Hongyi Wang · Eric Xing · Le Song
Language models applied to protein sequences have become a panacea, enabling therapeutics development, materials engineering, and core biology research. Despite the success of protein language models, genome language models are still nascent. Recent studies suggest the bottleneck is data volume or modeling context size, since long-range interactions are widely acknowledged but sparsely annotated. However, it may be the case that even short DNA sequences are modeled poorly by existing approaches, and current models are unable to represent the wide array of functions encoded by DNA. To study this, we develop DNA Foundation, a seven billion parameter encoder-only transformer trained on 10.6 billion nucleotides from a dataset of 796 species, achieving state-of-the-art performance on a wide range of quantitative and qualitative benchmarks related to functional genomics, therapeutics, and synthetic biology. By scaling model depth while maintaining a short context length of 4000 nucleotides, DNA FM shows substantial improvements across a breadth of tasks in functional genomics using transfer learning, sequence generation, and unsupervised annotation of functional elements. Notably, DNA FM outperforms prior encoder-only architectures without new data, suggesting that new scaling laws are needed to achieve compute-optimal DNA language models.