Oral in Workshop: Machine Learning in Structural Biology
Controllable All-Atom Generation of Protein Sequence and Structure from Sequence-Only Inputs
Amy Lu · Wilson Yan · Kevin Yang · Vladimir Gligorijevic · Kyunghyun Cho · Richard Bonneau · Pieter Abbeel · Nathan Frey
Sun 15 Dec 8:30 a.m. PST — 5 p.m. PST
We propose PLAID (Protein Latent Induced Diffusion), a paradigm for generating the all-atom structure and sequence of protein domains by learning diffusion models over the compressed latent space of pre-trained protein folding models that take only sequence as input. Because only sequence data is required to train the generative model, the usable training dataset grows by 100x to 10,000x compared to other sequence-structure generative models. This also enlarges the set of annotations available for controllable generation, and we demonstrate compositional conditioning on function and organism, including a rich vocabulary of 2,219 Gene Ontology functions. Samples exhibit cross-modal consistency while possessing the desired properties, as measured by conditional Fréchet inception distance (FID). The PLAID paradigm avoids the strong priors and massive imbalances of structure databases, scales readily with data and compute, and enables controllable generation of all-atom protein structures and sequences.
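To make the evaluation metric concrete, the following is a minimal sketch of a Fréchet distance between two sets of feature vectors, the quantity underlying FID. The abstract does not say which embedding space or conditioning procedure is used, so the function name `frechet_distance` and the assumption that features arrive as `(n_samples, dim)` arrays with `dim >= 2` are illustrative only, not the authors' implementation.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (n_samples, dim) feature sets.

    Hypothetical helper: the feature extractor is assumed to have been applied already.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)

    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite covariances (up to numerical noise).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
```

Under these assumptions, a conditional variant could be obtained by calling the function on features of samples generated for one condition (for example, a single Gene Ontology term) and features of reference proteins carrying the same annotation.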