Spotlight in Workshop: AI for Accelerated Materials Design (AI4Mat-2023)
Exploring Organic Syntheses through Natural Language
Andres M Bran · Cheng-Hua Huang · Philippe Schwaller
Keywords: [ Exploration ] [ dataset ] [ Large language models ] [ organic synthesis ] [ chemical space ]
Chemists use several levels of abstraction to describe objects and communicate ideas. Most of this knowledge is expressed in natural language, through books, articles, and oral explanations, because of its flexibility and its capacity to connect the different levels of abstraction. Despite this, machine-learning models in chemistry are typically limited to low-level abstractions such as graph representations or dynamic point clouds that, although powerful, ignore important aspects such as procedural details. In this work, we propose methods for exploring chemical space at the rich level of natural language. In this setting, synthetic procedure paragraphs are split into segments, each assigned to one of four classes, and the segments are then mapped into a latent space where they can be conveniently studied. We explore the structure of this space and find interesting connections with experimental realisation that are beyond the scope of commonly used reaction SMILES. This work aims to chart a path towards LLM-based data processing and chemical space exploration by analysing chemical data in previously inaccessible ways that will ultimately allow for a better understanding of materials design.
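The segment, classify, and embed pipeline described in the abstract can be illustrated with a toy sketch. It is not the authors' method. The abstract does not name the four classes, so the labels and keyword cues below are hypothetical placeholders. A keyword rule stands in for the LLM-based classifier, and a hashed bag-of-words vector stands in for a learned latent space. The example procedure text is invented for illustration.

```python
import hashlib
import math
import re

# Hypothetical class labels and keyword cues (assumptions for illustration;
# the abstract does not specify the four classes or how they are assigned).
CLASS_KEYWORDS = {
    "setup": ("added", "stirred", "heated", "dissolved"),
    "workup": ("quenched", "extracted", "washed", "dried"),
    "purification": ("chromatography", "recrystallized", "distilled"),
    "analysis": ("nmr", "yield", "hrms"),
}


def split_segments(paragraph):
    """Split a procedure paragraph into segments at '.' or ';' followed by whitespace."""
    parts = re.split(r"(?<=[.;])\s+", paragraph.strip())
    return [p for p in parts if p]


def classify(segment):
    """Return the label whose keywords occur most often (ties go to the first label)."""
    text = segment.lower()
    scores = {
        label: sum(text.count(keyword) for keyword in keywords)
        for label, keywords in CLASS_KEYWORDS.items()
    }
    return max(scores, key=scores.get)


def embed(segment, dim=64):
    """Map a segment to a unit-length hashed bag-of-words vector (toy latent space)."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", segment.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


# Invented example procedure, used only to exercise the functions above.
paragraph = (
    "The aldehyde was dissolved in THF and stirred at 0 C for 2 h. "
    "The mixture was quenched with water and extracted with ethyl acetate. "
    "The residue was purified by column chromatography. "
    "The product was obtained in 85% yield."
)

segments = split_segments(paragraph)
labels = [classify(s) for s in segments]
vectors = [embed(s) for s in segments]

for label, segment in zip(labels, segments):
    print(f"{label:>12}: {segment}")
```

In a realistic setting, the keyword rule would be replaced by a language-model classifier and `embed` by a sentence-embedding model. The resulting vectors could then be compared with `cosine` or projected to two dimensions for inspection.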