Lightning Talk in Workshop: Data Centric AI
Data Cards: Purposeful and Transparent Documentation for Responsible AI
As we move towards large-scale models (BERT, LaMDA, DALL-E) capable of numerous downstream tasks, it becomes rapidly more complex to understand the multi-modal datasets that give shape and nuance to how these models might be used. A clear and thorough understanding of a dataset's origins, development, intent, ethical considerations, and evolution is therefore a necessary step for the responsible and informed deployment of these models, especially those used in people-facing contexts and high-risk domains. However, the burden of this understanding often falls on the intelligibility, conciseness, and comprehensiveness of the dataset's documentation. Moreover, because these models often depend on multiple datasets, achieving consistency and comparability across all dataset documentation demands a process akin to user-centric product development. In this position paper, we propose Data Cards for fostering transparent, purposeful, and human-centered documentation of datasets within the practical contexts of industry and research. Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders across a dataset's lifecycle for responsible AI development. These summaries provide explanations of the processes and rationales that shape the data and, consequently, the models, such as upstream sources, data collection and annotation methods, training and evaluation methods, intended use, and decisions affecting model performance. Using two case studies, we report on desirable characteristics that support adoption across domains, organizational structures, and audience groups. Finally, we present lessons learned from deploying over twenty Data Cards.