Poster
in
Workshop: AI4Mat-2024: NeurIPS 2024 Workshop on AI for Accelerated Materials Design

Bayesian Optimization for Protein Sequence Design: Back to Simplicity with Gaussian Processes

Carolin Benjamins · Shikha Surana · Oliver Bent · Marius Lindauer · Paul Duckworth

Keywords: [ Bayesian optimization ] [ Gaussian processes ] [ protein sequence design ] [ string kernels ] [ fingerprint kernels ] [ encodings ]


Abstract:

Bayesian optimization (BO) is a popular sequential decision-making approach for maximizing black-box functions in low-data regimes. In biology, it has been used to find well-performing protein sequence candidates, since gradient information is not available from in vitro experimentation. Recent in silico design methods have leveraged large pre-trained protein language models (PLMs) to predict protein fitness. However, PLMs have a number of shortcomings for sequential design tasks: i) their limited ability to model uncertainty, ii) the lack of closed-form Bayesian updates in light of new experimental data, and iii) the challenge of fine-tuning on small downstream task datasets. We take a step back to traditional BO by investigating Gaussian process (GP) surrogate models with various sequence kernels, which are able to properly model uncertainty and update their beliefs over multi-round design tasks. We empirically evaluate our method on the sequence design benchmark ProteinGym, and demonstrate that BO with GPs is competitive with large SOTA pre-trained PLMs at a fraction of the compute budget.
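To make the approach concrete, here is a minimal, self-contained sketch of the kind of BO loop the abstract describes: a GP surrogate over encoded sequences with a closed-form posterior update after each round. It is an illustration only, not the authors' implementation: the squared-exponential kernel on one-hot encodings stands in for the string/fingerprint kernels studied in the paper, the `fitness` oracle is a toy stand-in for a wet-lab assay, and the UCB acquisition and all parameter values are assumptions.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def one_hot(seq):
    # Flattened one-hot encoding of an amino-acid sequence.
    x = np.zeros((len(seq), len(AA)))
    for i, a in enumerate(seq):
        x[i, AA.index(a)] = 1.0
    return x.ravel()

def kernel(X1, X2, lengthscale=2.0):
    # Squared-exponential kernel on one-hot vectors; a simple
    # stand-in for the paper's string / fingerprint kernels.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X, y, Xs, noise=1e-4):
    # Closed-form GP posterior mean and variance at candidates Xs,
    # given observations (X, y) -- the Bayesian update step.
    K = kernel(X, X) + noise * np.eye(len(X))
    Ks = kernel(X, Xs)
    sol = np.linalg.solve(K, y)
    mu = Ks.T @ sol
    v = np.linalg.solve(K, Ks)
    var = np.diag(kernel(Xs, Xs)) - np.sum(Ks * v, axis=0)
    return mu, np.maximum(var, 1e-12)

def ucb(mu, var, beta=2.0):
    # Upper-confidence-bound acquisition: trade off mean vs. uncertainty.
    return mu + beta * np.sqrt(var)

def fitness(seq):
    # Toy black-box "fitness" oracle (hypothetical): count of alanines.
    return float(seq.count("A"))

rng = np.random.default_rng(0)
pool = ["".join(rng.choice(list(AA), 5)) for _ in range(50)]
observed = {pool[0]: fitness(pool[0])}  # one seed measurement

for _ in range(10):  # sequential design rounds
    X = np.array([one_hot(s) for s in observed])
    y = np.array(list(observed.values()))
    cand = [s for s in pool if s not in observed]
    Xs = np.array([one_hot(s) for s in cand])
    mu, var = gp_posterior(X, y, Xs)
    pick = cand[int(np.argmax(ucb(mu, var)))]
    observed[pick] = fitness(pick)  # "experimental" query

print(len(observed), max(observed.values()))
```

Because the GP posterior is available in closed form, each round costs only a linear solve against the observed data, which is the compute-budget contrast with fine-tuning a large PLM that the abstract draws.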
