Poster in Workshop: AI for New Drug Modalities
Latent Diffusion Models for Controllable RNA Sequence Generation
Kaixuan Huang · Yukang Yang · Kaidi Fu · Yanyi Chu · Le Cong · Mengdi Wang
This work presents RNADiffusion, a latent diffusion model for generating and optimizing discrete RNA sequences of variable lengths. RNA is a key intermediary between DNA and protein, exhibiting high sequence diversity and complex three-dimensional structures that support a wide range of functions. We use pretrained BERT-type models to encode raw RNA sequences into token-level, biologically meaningful representations. A Query Transformer compresses these representations into a set of fixed-length latent vectors, and an autoregressive decoder is trained to reconstruct RNA sequences from these latent variables. We then develop a continuous diffusion model within this latent space. To enable optimization, we integrate the gradients of reward models—surrogates for RNA functional properties—into the backward diffusion process, thereby generating RNAs with high reward scores. Empirical results confirm that RNADiffusion generates non-coding RNAs that align with natural distributions across various biological metrics. Furthermore, after fine-tuning on mRNA 5’ untranslated regions (5’-UTRs), we optimize generated sequences for high translation efficiency. Our guided diffusion model effectively generates diverse 5’-UTRs with high Mean Ribosome Loading (MRL) and Translation Efficiency (TE), outperforming baselines in balancing the trade-off between reward and structural stability. These findings hold potential for advancing RNA sequence-function research and therapeutic RNA design.
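To make the latent-compression step concrete, below is a minimal sketch of a Query-Transformer-style module: a fixed set of learned query vectors cross-attends to the token-level embeddings produced by a pretrained RNA BERT encoder, yielding a fixed number of latent vectors regardless of sequence length. The class name, dimensions, and layer choices are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class QueryCompressor(nn.Module):
    """Compress variable-length token embeddings into a fixed set of latent vectors."""
    def __init__(self, hidden_dim=768, latent_dim=256, num_queries=16, num_heads=8):
        super().__init__()
        # Learned query vectors; their count fixes the latent sequence length.
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim))
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(hidden_dim, latent_dim)  # project to the diffusion latent size

    def forward(self, token_embeddings, padding_mask=None):
        # token_embeddings: (batch, seq_len, hidden_dim) from the pretrained RNA encoder
        batch = token_embeddings.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        latents, _ = self.cross_attn(q, token_embeddings, token_embeddings,
                                     key_padding_mask=padding_mask)
        return self.proj(latents)  # (batch, num_queries, latent_dim)
```

An autoregressive decoder would then be trained to reconstruct the original RNA sequence from these latents, and the diffusion model operates in the resulting fixed-shape latent space.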
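The reward-guided sampling described above can be sketched as follows: at each reverse-diffusion step, the denoiser's estimate of the clean latent is nudged along the gradient of a surrogate reward model (e.g., a predictor of MRL or TE) before the update is taken. Function names, the DDIM-style update, and the guidance scale are assumptions for illustration, not the paper's exact procedure.

```python
import torch

@torch.no_grad()
def guided_reverse_diffusion(denoiser, reward_model, alphas_cumprod, num_steps,
                             latent_shape, guidance_scale=1.0, device="cpu"):
    """Sample a latent by reverse diffusion while steering toward high surrogate reward."""
    z = torch.randn(latent_shape, device=device)  # start from Gaussian noise z_T
    for t in reversed(range(num_steps)):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0, device=device)

        # Predict the noise component and the corresponding clean latent z_0.
        eps = denoiser(z, t)
        z0_hat = (z - torch.sqrt(1 - a_t) * eps) / torch.sqrt(a_t)

        # Reward guidance: gradient of the surrogate reward w.r.t. the predicted z_0.
        with torch.enable_grad():
            z0_req = z0_hat.detach().requires_grad_(True)
            reward = reward_model(z0_req).sum()
            grad = torch.autograd.grad(reward, z0_req)[0]
        z0_hat = z0_hat + guidance_scale * grad

        # Deterministic (DDIM-style) update toward the guided z_0 estimate.
        eps_guided = (z - torch.sqrt(a_t) * z0_hat) / torch.sqrt(1 - a_t)
        z = torch.sqrt(a_prev) * z0_hat + torch.sqrt(1 - a_prev) * eps_guided
    return z  # decode with the autoregressive decoder to obtain an RNA sequence
```

In this sketch, increasing `guidance_scale` trades sample fidelity to the learned latent distribution for higher predicted reward, which is the reward/structural-stability trade-off the abstract refers to.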