Poster
Noisy Ostracods: A Fine-Grained, Imbalanced Real-World Dataset for Benchmarking Robust Machine Learning and Label Correction Methods
Jiamian Hu · Hong Yuanyuan · Yihua Chen · He Wang · Moriaki Yasuhara
West Ballroom A-D #5309
We present the Noisy Ostracods, a noisy dataset for genus and species classificationof crustacean ostracods with specialists’ annotations. Over the 71466 specimenscollected, 5.58% of them are estimated to be noisy (possibly problematic) at genuslevel. The dataset is created to addressing a real-world challenge: creating aclean fine-grained taxonomy dataset. The Noisy Ostracods dataset has diversenoises from multiple sources. Firstly, the noise is open-set, including new classesdiscovered during curation that were not part of the original annotation. Thedataset has pseudo-classes, where annotators misclassified samples that shouldbelong to an existing class into a new pseudo-class. The Noisy Ostracods datasetis highly imbalanced with a imbalance factor ρ = 22429. This presents a uniquechallenge for robust machine learning methods, as existing approaches have notbeen extensively evaluated on fine-grained classification tasks with such diversereal-world noise. Initial experiments using current robust learning techniqueshave not yielded significant performance improvements on the Noisy Ostracodsdataset compared to cross-entropy training on the raw, noisy data. On the otherhand, noise detection methods have underperformed in error hit rate comparedto naive cross-validation ensembling for identifying problematic labels. Thesefindings suggest that the fine-grained, imbalanced nature, and complex noisecharacteristics of the dataset present considerable challenges for existing noiserobustalgorithms. By openly releasing the Noisy Ostracods dataset, our goalis to encourage further research into the development of noise-resilient machinelearning methods capable of effectively handling diverse, real-world noise in finegrainedclassification tasks. The dataset, along with its evaluation protocols, can beaccessed at https://github.com/H-Jamieu/Noisy_ostracods.