Poster in Workshop: Machine Learning in Structural Biology
Generating and evaluating diverse sequences for protein backbones
Yo Akiyama · Sergey Ovchinnikov
Generating diverse sequences for protein backbones remains an active challenge with important implications. De novo protein design typically requires screening large sets of diverse sequences to identify viable candidates under certain experimental conditions. Sequence design has also recently been employed to generate synthetic data for training models. Diverse sets of sequences can be trivially generated by increasing the sampling temperature of sequence design models; however, we find that the covariation between residues in these sequences does not recapitulate natural covariation or the structures for which they were designed. An alternative approach designs sequences for structural ensembles, motivated by previous studies demonstrating that natural sequence variation is strongly tied to structural variation rather than the constraints of a static backbone. RFdiffusion, with a reduced number of noising and denoising steps (partial diffusion), has demonstrated the ability to diversify structures via learned potentials. Here, we compare sequences generated using single fixed backbones and partial RFdiffusion ensembles. Our analyses reveal that structural variation from RFdiffusion results in increased sequence diversity at a given sampling temperature without compromising AlphaFold2 designability metrics. Moreover, the covariance from partial diffusion MSAs better recapitulates natural covariation and contacts. Lastly, we propose a new approach to evaluate the quality of sequences, which tests AlphaFold2 self-consistency using shallow synthetic MSAs. This method enables evaluation of sequences for which AlphaFold2 single-sequence self-consistency has limited efficacy.
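
To make the temperature effect concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a design model that emits independent per-position logits over the 20 amino acids; `toy_logits` is a hypothetical stand-in that draws random logits, and `mean_pairwise_identity` is an illustrative diversity measure chosen here, not one named in the abstract. Because positions are sampled independently, the sketch only shows how temperature trades identity for diversity and says nothing about residue covariation.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids


def sample_sequence(logits, temperature, rng):
    """Sample one sequence from per-position logits at a given temperature."""
    seq = []
    for position_logits in logits:
        scaled = [x / temperature for x in position_logits]
        peak = max(scaled)  # subtract the max for numerical stability
        weights = [math.exp(x - peak) for x in scaled]
        seq.append(rng.choices(ALPHABET, weights=weights, k=1)[0])
    return "".join(seq)


def mean_pairwise_identity(seqs):
    """Average fraction of identical positions over all sequence pairs."""
    total, pairs = 0.0, 0
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            same = sum(a == b for a, b in zip(seqs[i], seqs[j]))
            total += same / len(seqs[i])
            pairs += 1
    return total / pairs


def toy_logits(length, rng):
    """Hypothetical stand-in for a sequence design model's output logits."""
    return [[rng.gauss(0.0, 3.0) for _ in ALPHABET] for _ in range(length)]


rng = random.Random(0)
logits = toy_logits(50, rng)

# Same logits, two temperatures: low T concentrates on the top residue,
# high T flattens the distribution and spreads samples out.
low_t = [sample_sequence(logits, 0.1, rng) for _ in range(30)]
high_t = [sample_sequence(logits, 2.0, rng) for _ in range(30)]

identity_low = mean_pairwise_identity(low_t)
identity_high = mean_pairwise_identity(high_t)
print(f"mean pairwise identity at T=0.1: {identity_low:.2f}")
print(f"mean pairwise identity at T=2.0: {identity_high:.2f}")
```

In this toy setting the higher temperature yields lower mean pairwise identity, that is, a more diverse set. Whether such diversity preserves natural covariation is the separate question the abstract addresses, and it cannot be assessed with independent per-position sampling.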