

Poster

Norms for Managing Datasets: A Systematic Review of NeurIPS Datasets

Yiwei Wu · Leah Ajmani · Shayne Longpre · Hanlin Li

Wed 11 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract:

As new ML methods require larger training datasets, researchers and developers are left to resolve key challenges around data management. Despite the establishment of ethics review, documentation, and checklist practices, it remains unclear whether the community as a whole has consistent dataset management practices. The lack of a comprehensive overview keeps us from systematically diagnosing and addressing core tensions and ethical issues in managing large datasets. We present a systematic review of datasets published under the NeurIPS Datasets and Benchmarks track, focusing on four aspects: provenance, distribution, ethical disclosure, and licensing. We find that dataset provenance is not always traceable due to unclear filtering or curation processes. A variety of sites were used for dataset hosting, and only a few of them support structured metadata and version control. These inconsistencies highlight the need for standardized data infrastructures for publishing and managing datasets.
