Poster in Workshop: AIM-FM: Advancements In Medical Foundation Models: Explainability, Robustness, Security, and Beyond
MSA-LM: Integrating DNA-level Inductive Biases into DNA Language Models
Vishrut Thoutam
Recent advances in DNA language modeling have been limited by computational constraints and by the difficulty of capturing long-range dependencies in genomic data. Traditional transformer-based models, while expressive, suffer from quadratic attention complexity and limited context windows, making them unsuitable for large-scale DNA modeling. Subquadratic models, in contrast, are efficient but often lack bidirectionality and struggle to scale during training. We introduce MSA-LM, an inductive-bias-aware subquadratic DNA Multiple Sequence Alignment (MSA) model that addresses these limitations. MSA-LM uses a bidirectional Mamba model for sequence mixing, providing transformer-like expressivity without the associated quadratic complexity. Through a sparse attention mechanism, MSA-LM selectively processes the main DNA sequence while incorporating evolutionary information from MSA data, significantly reducing computational overhead. Our results demonstrate that MSA-LM achieves state-of-the-art performance on long-context variant effect prediction tasks and Genomic Benchmarks, excelling in particular at regulatory sequence analysis. The proposed model not only surpasses existing transformer-based and subquadratic approaches in efficiency but also maintains high accuracy across diverse genomic tasks, marking a significant improvement in DNA language modeling capabilities.
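The abstract names two architectural ingredients: bidirectional Mamba sequence mixing and a sparse attention bridge from the main DNA sequence to the MSA. The PyTorch sketch below illustrates one plausible wiring of such components; it is a minimal illustration under stated assumptions, not the authors' implementation. The `Mamba` block comes from the real `mamba-ssm` package (whose kernels require a CUDA GPU), while the column-wise sparsity pattern, the module names `BiMambaBlock` and `ColumnwiseMSACrossAttention`, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # real package; its kernels require a CUDA GPU


class BiMambaBlock(nn.Module):
    """Bidirectional sequence mixing: one Mamba pass per direction, summed.

    A common way to obtain bidirectionality from a causal state-space model;
    whether MSA-LM uses exactly this wiring is an assumption.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.fwd = Mamba(d_model=d_model)   # left-to-right pass
        self.bwd = Mamba(d_model=d_model)   # right-to-left pass
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        h = self.fwd(x) + self.bwd(x.flip(1)).flip(1)
        return self.norm(x + h)  # residual + norm


class ColumnwiseMSACrossAttention(nn.Module):
    """Sparse cross-attention: each main-sequence position attends only to
    its own alignment column across the R MSA rows, costing O(L * R) rather
    than O(L^2 * R). The column-wise pattern is a hypothetical instance of
    the sparse attention the abstract describes.
    """

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, main, msa):  # main: (B, L, D); msa: (B, R, L, D)
        B, R, L, D = msa.shape
        q = main.reshape(B * L, 1, D)                      # one query per column
        kv = msa.permute(0, 2, 1, 3).reshape(B * L, R, D)  # that column's rows
        out, _ = self.attn(q, kv, kv)
        return main + out.reshape(B, L, D)                 # residual update


if torch.cuda.is_available():  # toy shapes; mamba-ssm has no CPU fallback
    x = torch.randn(2, 1024, 256, device="cuda")       # main DNA embeddings
    msa = torch.randn(2, 8, 1024, 256, device="cuda")  # 8 aligned MSA rows
    h = BiMambaBlock(256).cuda()(x)
    h = ColumnwiseMSACrossAttention(256).cuda()(h, msa)
    print(h.shape)  # torch.Size([2, 1024, 256])
```

Summing a forward and a reversed pass is the standard trick for making a causal state-space model bidirectional, and restricting the MSA bridge to alignment columns keeps its cost linear in sequence length, which is consistent with the abstract's subquadratic claim.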