Poster in Workshop: New Frontiers of AI for Drug Discovery and Development
SALSA: Semantically-Aware Latent Space Autoencoder
Kathryn E. Kirchoff · Travis Maxfield · Alexander Tropsha · Shawn Gomez
Keywords: [ Transformers ] [ Embedding Approaches ] [ Representation Learning ] [ Autoencoders ] [ Drug Discovery ] [ Molecular Data ] [ Contrastive Learning ]
For molecular representations, SMILES strings are a popular choice, as they allow modern NLP methodologies, such as the sequence-to-sequence autoencoder, to be leveraged. However, an autoencoder trained solely on SMILES reconstruction is insufficient to learn semantically meaningful representations, i.e., representations that capture structural similarities between molecules. We define native chemical similarity over chemical graphs, which enables the use of a rigorous metric such as graph edit distance (GED). We demonstrate by example that a standard SMILES autoencoder may map structurally similar molecules to distant latent vectors, resulting in an incoherent latent space. To address this shortcoming, we propose the Semantically-Aware Latent Space Autoencoder (SALSA), a transformer autoencoder augmented with a contrastive objective that maps structurally similar molecules to nearby vectors in the latent space. We evaluate the semantic awareness of SALSA representations by comparing them to those of a naive autoencoder as well as to ECFP4, a molecular fingerprint commonly used in cheminformatics. We show empirically that SALSA learns a representation that maintains 1) structural awareness, 2) physicochemical property awareness, 3) biological property awareness, and 4) semantic continuity.
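To make the core idea concrete, the sketch below illustrates one way a SMILES autoencoder can be paired with a contrastive objective so that reconstruction and the pulling-together of structural analogs are optimized jointly. This is not the authors' released implementation: the architecture sizes, the NT-Xent-style loss, and all names (`SmilesAutoencoder`, `nt_xent`, etc.) are illustrative assumptions, and the positive pairs here stand in for molecules a small graph edit distance apart.

```python
# Minimal sketch (assumed, not SALSA's actual code) of a transformer SMILES
# autoencoder whose latent space is shaped by a contrastive loss that pulls
# structurally similar molecules together.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmilesAutoencoder(nn.Module):
    """Toy transformer encoder-decoder over SMILES token ids."""
    def __init__(self, vocab_size: int, d_model: int = 128, latent_dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.to_latent = nn.Linear(d_model, latent_dim)
        self.from_latent = nn.Linear(latent_dim, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def encode(self, tokens):                   # (B, T) -> (B, latent_dim)
        h = self.encoder(self.embed(tokens))
        return self.to_latent(h.mean(dim=1))    # mean-pool over token positions

    def forward(self, tokens):
        z = self.encode(tokens)
        mem = self.from_latent(z).unsqueeze(1)  # latent vector as decoder memory
        h = self.decoder(self.embed(tokens), mem)
        return self.out(h), z

def nt_xent(z_a, z_b, temperature: float = 0.5):
    """NT-Xent contrastive loss: row i of z_a and z_b form a positive pair
    (e.g., a molecule and a structural analog a small GED away)."""
    z = F.normalize(torch.cat([z_a, z_b]), dim=1)
    sim = z @ z.t() / temperature
    n = z_a.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(mask, float("-inf"))  # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Joint objective: reconstruct SMILES and keep structural analogs close.
model = SmilesAutoencoder(vocab_size=64)
anchors = torch.randint(0, 64, (8, 20))         # placeholder token batches
positives = torch.randint(0, 64, (8, 20))       # structural analogs of anchors
logits, z_a = model(anchors)
_, z_b = model(positives)
recon = F.cross_entropy(logits.reshape(-1, 64), anchors.reshape(-1))
loss = recon + nt_xent(z_a, z_b)
loss.backward()
```

Under this kind of joint training, the reconstruction term preserves the information needed to decode SMILES while the contrastive term enforces the semantic coherence the abstract describes: analogs land near one another in latent space rather than being scattered by surface-level SMILES differences.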