Poster in Workshop: Machine Learning in Structural Biology Workshop
Exploiting language models for protein discovery with latent walk-jump sampling
Sai Pooja Mahajan · Nathan Frey · Dan Berenberg · Joseph Kleinhenz · Richard Bonneau · Vladimir Gligorijevic · Andrew Watkins · Saeed Saremi
We introduce a single-step score-based denoising framework for generative modeling of antibody protein sequences from the higher-dimensional embeddings of pretrained language models. Our latent Walk-Jump Sampler (L-WJS) framework learns the manifold of a smoothed latent space of a pretrained protein language model. New sequences are generated by score-based exploration with Langevin MCMC (walk) on the smoothed latent space, followed by denoising (jump) back to the latent space. The framework thus combines the rich, semantically meaningful representations of protein language models pretrained on a large corpus of sequences with the improved sample quality of score-based modeling in the latent space. We demonstrate that L-WJS is data efficient, generates novel, diverse, and natural antibody sequences, and opens up avenues for both unguided and guided sampling from the latent space of various pretrained models.
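The walk and jump steps described in the abstract can be illustrated with a minimal toy sketch. This is not the authors' implementation. It assumes a Gaussian "latent" distribution whose smoothed score is known in closed form, where the actual method would use a learned score model on language-model embeddings. It also uses plain overdamped (unadjusted) Langevin dynamics for the walk and the standard single-step empirical-Bayes denoiser x_hat = y + sigma^2 * score(y) for the jump. All function names and parameter values are hypothetical, and the decoding of latents back to sequences is omitted.

```python
import numpy as np

def smoothed_score(y, mu, s2, sigma2):
    """Score of the smoothed density N(mu, (s2 + sigma2) I).

    Toy stand-in for a learned score network: the clean latents are assumed
    to follow N(mu, s2 I), smoothed with Gaussian noise of variance sigma2.
    """
    return -(y - mu) / (s2 + sigma2)

def walk(y0, score_fn, step, n_steps, rng):
    """Walk: unadjusted overdamped Langevin MCMC on the smoothed space."""
    y = y0.copy()
    for _ in range(n_steps):
        noise = rng.standard_normal(y.shape)
        y = y + step * score_fn(y) + np.sqrt(2.0 * step) * noise
    return y

def jump(y, score_fn, sigma2):
    """Jump: single-step denoising, x_hat = y + sigma^2 * score(y)."""
    return y + sigma2 * score_fn(y)

# Hypothetical toy settings (not taken from the source).
rng = np.random.default_rng(0)
dim = 8
mu = np.full(dim, 2.0)      # mean of the toy clean latent distribution
s2, sigma2 = 0.25, 1.0      # clean variance and smoothing-noise variance

def score_fn(y):
    return smoothed_score(y, mu, s2, sigma2)

y0 = rng.standard_normal((512, dim))                   # 512 independent chains
y = walk(y0, score_fn, step=0.05, n_steps=400, rng=rng)
x_hat = jump(y, score_fn, sigma2)                      # denoised latent samples
```

In this toy setting the jump contracts each noisy sample toward the mean of the clean distribution, so `x_hat` is less spread out than `y`. In a full pipeline the denoised latents would then be passed to a decoder to obtain sequences.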