

Poster in Workshop: Foundation Models for Science: Progress, Opportunities, and Challenges

Small Molecule Optimization with Large Language Models

Menua Bedrosian · Philipp Guevorguian · Tigran Fahradyan · Gayane Chilingaryan · Hrant Khachatrian · Armen Aghajanyan

Keywords: [ large language models ] [ foundation models ] [ optimization ] [ small molecules ]


Abstract:

The rise of large language models has created an opportunity for practical applications of machine learning in areas such as the life sciences. In this work, we leverage the learning abilities of large language models together with a training corpus of 110M small molecules to train models that can generate molecules and predict their properties. More specifically, we take three publicly available large language models of 125M, 1B, and 2B parameters and train them on roughly 40B tokens comprising molecules in SMILES format and their respective properties. These models demonstrate strong performance in generating molecules with specified properties and predicting new molecular characteristics from limited samples. We introduce a novel optimization algorithm that leverages our language models to optimize molecules for arbitrary properties given limited access to a black-box oracle. Our approach combines ideas from genetic algorithms, rejection sampling, and prompt optimization. It achieves state-of-the-art performance on multiple molecular optimization benchmarks, including an 8% improvement on Practical Molecular Optimization compared to previous methods. We publicly release the language models and the dataset.
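To make the described loop concrete, below is a minimal sketch of an optimization procedure combining genetic-style selection, LM sampling, and rejection under a limited oracle budget. The callables `sample_from_lm` and `oracle`, the prompt format, and all hyperparameters are hypothetical placeholders for illustration, not the authors' released implementation.

```python
# Illustrative sketch only: `sample_from_lm` stands in for the fine-tuned
# language model and `oracle` for the black-box property scorer; both are
# assumptions, as is the prompt format used to condition the model.
import random
from typing import Callable, List, Tuple


def optimize_molecules(
    sample_from_lm: Callable[[str, int], List[str]],  # (prompt, n) -> SMILES candidates
    oracle: Callable[[str], float],                    # black-box property scorer
    seed_pool: List[str],                              # initial SMILES strings
    oracle_budget: int = 1000,
    pool_size: int = 64,
    prompt_examples: int = 8,
    samples_per_round: int = 32,
) -> List[Tuple[float, str]]:
    # Score the seed molecules with the oracle (these calls count against the budget).
    pool = [(oracle(smi), smi) for smi in seed_pool[:oracle_budget]]
    calls = len(pool)

    while calls < oracle_budget:
        # Genetic-style selection: keep only the highest-scoring molecules.
        pool.sort(reverse=True)
        pool = pool[:pool_size]

        # Prompt construction: condition the LM on a few top-scoring molecules
        # and their scores so it proposes similar but improved structures.
        # The "[PROP ...]" tag format here is purely illustrative.
        exemplars = random.sample(pool, min(prompt_examples, len(pool)))
        prompt = "".join(f"[PROP {score:.3f}] {smi}\n" for score, smi in exemplars)

        # Rejection step: discard empty or duplicate candidates before
        # spending oracle calls on them.
        seen = {smi for _, smi in pool}
        for candidate in sample_from_lm(prompt, samples_per_round):
            if not candidate or candidate in seen:
                continue
            pool.append((oracle(candidate), candidate))
            seen.add(candidate)
            calls += 1
            if calls >= oracle_budget:
                break

    pool.sort(reverse=True)
    return pool  # (score, SMILES) pairs, best first
```

In this sketch the oracle budget is the binding constraint, so candidates are filtered before scoring and the pool is repeatedly truncated to its best members; how the actual method selects exemplars and formats prompts is described in the paper itself.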
