Poster in Workshop: Machine Learning in Structural Biology
Bayesian Optimisation for Protein Sequence Design: Gaussian Processes with Zero-Shot Protein Language Model Prior Mean
Carolin Benjamins · Shikha Surana · Oliver Bent · Marius Lindauer · Paul Duckworth
Bayesian optimisation (BO) is a popular sequential decision-making approach for maximising black-box functions in low-data regimes. It can be used to find highly-fit protein sequence candidates, since gradient information is not available in vitro. Recent in silico protein design methods have leveraged large pre-trained protein language models (PLMs) as fitness predictors. However, PLMs have a number of shortcomings for sequential design tasks: i) their limited capability to model uncertainty, ii) the lack of closed-form Bayesian updates in light of new experimental data, and iii) the challenge of fine-tuning on small downstream task datasets. We take a step back to traditional BO by using Gaussian process (GP) surrogate models with sequence kernels, which are able to properly model uncertainty and update their beliefs over multi-round design tasks. In this work we empirically demonstrate that BO with GP surrogates is competitive with large pre-trained PLMs on the multi-round sequence design benchmark ProteinGym. Furthermore, we demonstrate improved performance by augmenting the GP with the strong zero-shot PLM predictions as a GP prior mean function, and show that by using a learned linear combination of zero-shot PLM and constant prior means, the GP surrogate can regulate the effects of the PLM-guided prior.
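The core idea of using zero-shot PLM predictions as a GP prior mean can be sketched numerically. The snippet below is a minimal, hypothetical illustration (not the authors' implementation): it computes the standard GP posterior mean under a non-zero prior mean, \(\mu(x_*) = m(x_*) + k_*^\top (K + \sigma^2 I)^{-1} (y - m(X))\), where \(m(x)\) is a linear combination of a zero-shot PLM score and a constant, with weights `alpha` and `beta` standing in for the learned coefficients. For simplicity it uses an RBF kernel on continuous inputs; the paper itself uses sequence kernels over protein sequences.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel; stand-in for a sequence kernel."""
    d2 = (np.sum(X1**2, axis=1)[:, None]
          + np.sum(X2**2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior_mean(X_train, y_train, X_test, prior_mean_fn, noise=1e-2):
    """GP posterior mean with a non-zero prior mean function.

    mu(x*) = m(x*) + k*^T (K + noise*I)^{-1} (y - m(X))
    """
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    Ks = rbf_kernel(X_train, X_test)
    resid = y_train - prior_mean_fn(X_train)   # subtract prior mean from targets
    return prior_mean_fn(X_test) + Ks.T @ np.linalg.solve(K, resid)

def make_plm_prior_mean(plm_score_fn, alpha, beta):
    """Prior mean m(x) = alpha * plm_score(x) + beta.

    `alpha` and `beta` are hypothetical placeholders for the learned
    linear-combination weights; `plm_score_fn` stands in for the
    zero-shot PLM fitness predictor.
    """
    return lambda X: alpha * plm_score_fn(X) + beta
```

Far from the training data the kernel terms vanish, so the posterior mean reverts to the PLM-guided prior; near observed points, the GP update corrects the prior with the experimental residuals, which is how the surrogate can regulate the prior's influence across rounds.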