Poster in Workshop: AI for New Drug Modalities
An Efficient Tokenization for Molecular Language Models
Seojin Kim · Jaehyun Nam · Jinwoo Shin
Recently, molecular language models have shown great potential in various chemical applications, e.g., drug discovery. These models adapt auto-regressive language models to molecular data by treating molecules as sequences of atoms, where each atom is mapped to an individual token. However, such atom-level tokenization limits the models' ability to capture the global structural context of molecules. To tackle this issue, we propose a novel molecular language model, coined Context-Aware Molecular T5 (CAMT5). Inspired by the importance of substructure-level contexts, e.g., ring systems, in understanding molecules, we introduce a substructure-level tokenization for molecular language models. Specifically, we construct a tree structure for each molecule whose nodes correspond to important substructures, i.e., motifs. We then train CAMT5 by treating a molecule as a sequence of motif tokens, whose order is determined by a tree-search algorithm. Under the proposed motif token space, one can incorporate chemical context with significantly shorter token sequences than atom-level tokenization, which helps mitigate issues in auto-regressive molecular generation, e.g., error propagation. In addition, CAMT5 is guaranteed to generate valid molecules without degeneracy, i.e., with no ambiguity in the meaning of each token, a property overlooked in previous models. Extensive experiments demonstrate the effectiveness of CAMT5 on the text-to-molecule generation task. Finally, we also propose a simple ensemble strategy that aggregates the outputs of molecular language models with different tokenizations, e.g., SMILES, SELFIES, and ours, further boosting the quality of the generated molecules.
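The abstract does not specify how motifs are extracted or how the tree search orders them, so the following is only a minimal sketch of substructure-level tokenization, assuming BRICS fragments (via RDKit) as stand-in motifs and a plain depth-first traversal of the fragment tree as the ordering step; the function name motif_tokenize is hypothetical and not from the paper.

from collections import defaultdict
from rdkit import Chem
from rdkit.Chem import BRICS

def motif_tokenize(smiles):
    """Sketch: split a molecule into BRICS fragments ("motifs") and order
    them by a depth-first traversal of the fragment-adjacency tree."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []
    # Bonds BRICS would cut; each item is ((atom_i, atom_j), (label_i, label_j)).
    cut_bonds = [pair for pair, _ in BRICS.FindBRICSBonds(mol)]
    if not cut_bonds:
        return [Chem.MolToSmiles(mol)]  # molecule is a single motif
    bond_ids = [mol.GetBondBetweenAtoms(a1, a2).GetIdx() for a1, a2 in cut_bonds]
    # addDummies=False keeps the original atom indices in the fragmented mol.
    frag_mol = Chem.FragmentOnBonds(mol, bond_ids, addDummies=False)
    # Map each atom to the fragment (motif) it belongs to.
    frag_atom_ids = Chem.GetMolFrags(frag_mol)
    atom_to_frag = {a: f for f, atoms in enumerate(frag_atom_ids) for a in atoms}
    # Fragment adjacency induced by the cut bonds (a tree for acyclic cuts).
    adj = defaultdict(set)
    for a1, a2 in cut_bonds:
        f1, f2 = atom_to_frag[a1], atom_to_frag[a2]
        adj[f1].add(f2)
        adj[f2].add(f1)
    # One SMILES token per motif.
    motif_smiles = [Chem.MolToSmiles(f) for f in Chem.GetMolFrags(frag_mol, asMols=True)]
    # Depth-first traversal of the fragment tree yields the token order.
    order, seen, stack = [], set(), [0]
    while stack:
        f = stack.pop()
        if f in seen:
            continue
        seen.add(f)
        order.append(motif_smiles[f])
        stack.extend(sorted(adj[f], reverse=True))
    return order

print(motif_tokenize("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin -> a handful of motif tokens

The resulting sequence is much shorter than an atom-level (character-level SMILES) token sequence for the same molecule, which illustrates the length reduction the abstract attributes to motif tokens; CAMT5's actual motif vocabulary and tree-search ordering may differ.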