Poster
Norms for Managing Datasets: A Systematic Review of NeurIPS Datasets
Yiwei Wu · Leah Ajmani · Shayne Longpre · Hanlin Li
As new ML methods require larger training datasets, researchers and developers are left to resolve key challenges around data management. Despite the establishment of ethics review, documentation, and checklist practices, it remains unclear whether the community as a whole has consistent dataset management practices. The lack of a comprehensive overview hinders systematic diagnosis and resolution of the core tensions and ethical issues in managing large datasets. We present a systematic review of datasets published under the NeurIPS Datasets and Benchmarks track, focusing on four aspects: provenance, distribution, ethical disclosure, and licensing. We find that dataset provenance is not always traceable due to unclear filtering or curation processes. Datasets are hosted on a variety of sites, only a few of which support structured metadata and version control. These inconsistencies highlight the need for standardized data infrastructures for publishing and managing datasets.