Poster in Workshop: Machine Learning in Structural Biology
Balancing Locality and Reconstruction in Protein Structure Tokenizer
Jiayou Zhang · Barthélémy Meynard · Jing Gong · Xingyi Cheng · Eric Xing · Le Song
The structure of a protein is crucial to its biological function. With the rapid expansion of available protein structures, such as those in the AlphaFold Protein Structure Database (AFDB), there is an increasing need for efficient methods to index, search, and generate these structures. There is also growing interest in integrating structural information with models from other modalities, such as protein sequence language models.

We present Petal (Protein Equiformer Tokenizer for Aligning with Language models), a novel VQ-VAE-based protein structure tokenizer that combines an equivariant encoder with an invariant decoder and is trained as a large model with 300M parameters. During our experiments, we discovered an intriguing trade-off between the encoder's locality and the decoder's reconstruction capabilities.

In addition to evaluating simple structure reconstruction, we compared our model with Foldseek, Protoken, and ESM3. Our results demonstrate that striking a better balance between retrieval and reconstruction enables better integration into a protein language model (pLM) and better structure prediction performance.
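For readers unfamiliar with the tokenization step in a VQ-VAE, the following is a minimal sketch of generic nearest-neighbor vector quantization, not Petal's actual implementation. The function name `quantize`, the toy codebook, and all shapes are assumptions chosen for illustration; the abstract does not specify the codebook size, the embedding dimension, or the encoder outputs.

```python
import numpy as np

def quantize(embeddings, codebook):
    """Map each embedding to the index of its nearest codebook vector.

    embeddings: (N, D) array, e.g. one row per residue (hypothetical encoder output).
    codebook:   (K, D) array of code vectors (learned in a real VQ-VAE).
    Returns (tokens, quantized): integer tokens of shape (N,) and the
    corresponding code vectors of shape (N, D).
    """
    # Pairwise squared Euclidean distances, shape (N, K).
    diff = embeddings[:, None, :] - codebook[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    tokens = dist.argmin(axis=1)
    return tokens, codebook[tokens]

# Toy, deterministic codebook: 16 codes in 4 dimensions (illustrative only).
codebook = np.arange(16 * 4, dtype=float).reshape(16, 4)

# Pretend encoder outputs: slightly perturbed copies of codes 3, 7, 7 and 0.
embeddings = codebook[[3, 7, 7, 0]] + 0.1

tokens, quantized = quantize(embeddings, codebook)
print(tokens)  # one discrete structure token per input row
```

In a trained tokenizer the decoder would reconstruct coordinates from `quantized`, while the integer `tokens` are what a language model or a retrieval index would consume.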