Poster
BIOSCAN-5M: A Multimodal Dataset for Insect Biodiversity
Zahra Gharaee · Scott C. Lowe · ZeMing Gong · Pablo Millan Arias · Nicholas Pellegrino · Austin T. Wang · Joakim Bruslund Haurum · Iuliia Eyriay · Lila Kari · Dirk Steinke · Graham Taylor · Paul Fieguth · Angel Chang
As part of an ongoing worldwide effort to understand and monitor insect biodiversity, this paper presents the BIOSCAN-5M Insect dataset to the machine learning community and establishes several benchmark tasks. BIOSCAN-5M is a comprehensive dataset containing multi-modal information for over 5 million insect specimens. It significantly expands existing image-based biological datasets by including taxonomic labels, raw nucleotide barcode sequences, assigned barcode index numbers, and geographical information. We propose three benchmark experiments to demonstrate the impact of the multi-modal data types on classification and clustering accuracy. First, we pretrain a masked language model on the DNA barcode sequences of the BIOSCAN-5M dataset and demonstrate the impact of using this large reference library on species- and genus-level classification performance. Second, we propose a zero-shot transfer learning task in which feature embeddings of images and DNA barcodes, obtained from self-supervised learning, are clustered to investigate whether meaningful clusters can be derived from these representations. Third, we benchmark multi-modality by performing contrastive learning on DNA barcodes, image data, and taxonomic information. This yields a general shared embedding space that enables taxonomic classification using multiple types of information and modalities. The code repository of the BIOSCAN-5M Insect dataset is available at https://github.com/zahrag/BIOSCAN-5M.
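To make the first benchmark more concrete, the following is a minimal sketch of how a DNA barcode could be prepared for masked language modelling. It is illustrative only: the abstract does not specify the tokenizer, the k-mer length, or the masking ratio, so the non-overlapping k-mer scheme, the `N` padding, and the names `kmer_tokenize`, `mask_tokens`, and `MASK` are all assumptions made for this example and are not taken from the BIOSCAN-5M code.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, chosen for this sketch


def kmer_tokenize(seq: str, k: int = 4) -> list[str]:
    """Split a nucleotide sequence into non-overlapping k-mers.

    Assumption: the sequence is right-padded with 'N' so that its
    length is a multiple of k.
    """
    seq = seq.upper()
    seq += "N" * ((-len(seq)) % k)
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def mask_tokens(tokens: list[str], mask_ratio: float = 0.5, seed: int = 0):
    """Replace a random subset of tokens with MASK.

    Returns the masked token list and a dict mapping each masked
    position to its original token (the prediction targets).
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = min(len(tokens), max(1, round(mask_ratio * len(tokens))))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK
    return masked, targets


if __name__ == "__main__":
    tokens = kmer_tokenize("acgtacgtac", k=4)
    masked, targets = mask_tokens(tokens, mask_ratio=0.5, seed=0)
    print(tokens)
    print(masked)
    print(targets)
```

In such a setup, a model would be trained to predict the entries of `targets` from the `masked` sequence; the resulting encoder could then be reused for species- and genus-level classification.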