Poster in Workshop: AI4Mat-2024: NeurIPS 2024 Workshop on AI for Accelerated Materials Design
Bayesian Optimization for Protein Sequence Design: Back to Simplicity with Gaussian Processes
Carolin Benjamins · Shikha Surana · Oliver Bent · Marius Lindauer · Paul Duckworth
Keywords: [ Bayesian Optimization ] [ Gaussian Process ] [ protein sequence design ] [ String kernels ] [ Fingerprint kernels ] [ encoding ]
Bayesian optimization (BO) is a popular sequential decision-making approach for maximizing black-box functions in low-data regimes. In biology, it has been used to find well-performing protein sequence candidates, since gradient information is not available from in vitro experimentation. Recent in silico design methods have leveraged large pre-trained protein language models (PLMs) to predict protein fitness. However, PLMs have a number of shortcomings for sequential design tasks: i) their limited ability to model uncertainty, ii) the lack of closed-form Bayesian updates in light of new experimental data, and iii) the challenge of fine-tuning on small downstream task datasets. We take a step back to traditional BO by investigating Gaussian process (GP) surrogate models with various sequence kernels, which are able to properly model uncertainty and update their beliefs over multi-round design tasks. We empirically evaluate our method on the sequence design benchmark ProteinGym, and demonstrate that BO with GPs is competitive with large SOTA pre-trained PLMs at a fraction of the compute budget.
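The approach described above can be illustrated with a minimal sketch: a k-mer count "fingerprint" encoding of sequences, a linear kernel on those counts (equivalent to a simple spectrum string kernel), a closed-form GP posterior update, and an upper-confidence-bound acquisition step. All names, the toy alphabet, and the toy data here are illustrative assumptions, not the paper's actual kernels or benchmark setup.

```python
import numpy as np
from itertools import product

ALPHABET = "ACDE"  # toy alphabet for illustration; real proteins use 20 amino acids

def kmer_features(seq, k=2):
    """Count k-mer occurrences: a simple 'fingerprint' encoding of a sequence."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    v = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        v[index[seq[i:i + k]]] += 1.0
    return v

def gp_posterior(X, y, Xs, noise=1e-2):
    """Closed-form GP posterior mean/variance with a linear kernel on features.

    A linear kernel on k-mer counts corresponds to a basic spectrum (string) kernel.
    """
    K = X @ X.T + noise * np.eye(len(X))          # train covariance + noise
    Ks = Xs @ X.T                                 # test-train covariance
    Kss = np.einsum("ij,ij->i", Xs, Xs)           # test prior variances (diagonal)
    mu = Ks @ np.linalg.solve(K, y)               # posterior mean
    var = Kss - np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T))
    return mu, np.maximum(var, 0.0)

# One toy BO round: observed sequences with (made-up) fitness values,
# and candidate sequences to score for the next wet-lab batch.
train = ["ACDE", "AADD", "CCEE"]
y = np.array([1.0, 0.5, -0.2])
cands = ["ACDD", "EEEE"]

X = np.stack([kmer_features(s) for s in train])
Xs = np.stack([kmer_features(s) for s in cands])
mu, var = gp_posterior(X, y, Xs)

ucb = mu + 2.0 * np.sqrt(var)                     # upper confidence bound acquisition
best = cands[int(np.argmax(ucb))]
```

Because the posterior is closed-form, incorporating a new round of experimental measurements is just appending rows to `X` and `y` and re-solving, in contrast to fine-tuning a large PLM.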