Poster in Workshop: Machine Learning in Structural Biology
RNA-GPT: Multimodal Generative System for RNA Sequence Understanding
Yijia Xiao · Edward Sun · Yiqiao Jin · Wei Wang
RNAs are essential molecules that carry genetic information crucial for life, with significant applications in drug development and biotechnology. However, the sheer volume of RNA literature often slows research. To address this challenge, we present RNA-GPT, a multimodal RNA chat model designed to streamline RNA discovery by drawing on this extensive literature. RNA-GPT integrates RNA sequence encoders with linear projection layers and state-of-the-art large language models (LLMs) for precise representation alignment, enabling it to process user-uploaded RNA sequences and deliver concise, accurate responses. Our scalable training pipeline, powered by RNA-QA, automatically extracts RNA annotations from RNACentral using a divide-and-conquer strategy with GPT-4 and latent Dirichlet allocation (LDA) to manage large datasets and produce instruction tuning samples. Experimental results demonstrate that RNA-GPT efficiently addresses complex RNA queries, facilitating RNA research. Additionally, we introduce RNA-QA, a dataset of 407,616 RNA sequences intended for modality alignment and instruction tuning.
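The alignment step described in the abstract (sequence encoder, then linear projection, then LLM) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names (`make_linear`, `project`, `build_llm_input`), the toy dimensions, the random weights, and the choice to prepend the projected RNA embeddings to the prompt embeddings are all assumptions made for illustration; the abstract only states that linear projection layers sit between the RNA encoder and the LLM.

```python
import random


def make_linear(d_in, d_out, seed=0):
    """Create a hypothetical linear layer (weights, bias) with small random weights."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    bias = [0.0] * d_out
    return weights, bias


def project(embeddings, weights, bias):
    """Apply y = W x + b to each per-token embedding vector."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]
        for vec in embeddings
    ]


def build_llm_input(rna_embeddings, text_embeddings, weights, bias):
    """Project RNA embeddings to the LLM width and prepend them to the prompt.

    Prepending is an assumption; the abstract does not say how the two
    modalities are combined.
    """
    return project(rna_embeddings, weights, bias) + text_embeddings


# Toy dimensions, chosen only for illustration.
D_ENC, D_LLM = 4, 6
weights, bias = make_linear(D_ENC, D_LLM)

# Stand-ins for encoder output (5 RNA tokens) and prompt embeddings (3 tokens).
rna_embeddings = [[0.1 * (i + j) for j in range(D_ENC)] for i in range(5)]
text_embeddings = [[0.0] * D_LLM for _ in range(3)]

llm_input = build_llm_input(rna_embeddings, text_embeddings, weights, bias)
print(len(llm_input), len(llm_input[0]))
```

In this toy setup the combined input has 8 vectors, each of width 6, so the LLM sees RNA and text tokens in one embedding space. In a real system the projection weights would be learned during the modality alignment stage rather than drawn at random.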