Workshop: Data Centric AI
Data vast and low in variance: Augment machine learning pipelines with dataset profiles to improve data quality without sacrificing scale
The recent discussion of data-centric artificial intelligence (DCAI) has galvanized researchers and practitioners to give data quality and dataset iteration the same importance as model iteration on fixed datasets. Some DCAI techniques successfully increase training data quality, but at the expense of the number of training examples. Meanwhile, production AI systems are increasingly deployed in new settings, producing even more inference data. Dataset profiling techniques provide systematic ways of transferring important characteristics and data examples from large, real-time inference data sources to the smaller datasets used for training, delivering higher-quality data without sacrificing scalability.