Poster in Workshop: Machine Learning for Systems
$\texttt{Mycroft}$: Towards Effective and Efficient External Data Augmentation
Zain Sarwar · Van Tran · Arjun Bhagoji · Nicholas Feamster · Ben Zhao · Supriyo Chakraborty
Abstract:
In data-scarce domains like networked systems, external data augmentation is often necessary to improve training data quality, as model trainers usually have visibility into only limited portions of the underlying data distribution. However, relevant data is often privately owned, making it both difficult and expensive for trainers to identify and acquire the training data they need. In this study, we introduce $\texttt{Mycroft}$, a data-efficient approach that allows model trainers to evaluate the utility of private data from various owners while operating under a limited data-sharing budget. $\texttt{Mycroft}$ leverages feature space distances to identify small, high-utility data subsets from each data owner, which serve as indicators of the overall dataset's utility. In domains with differentiable models, $\texttt{Mycroft}$ can also apply gradient matching techniques to identify these valuable data subsets. Our experiments, including novel threat detection in IoT networks and image classification in the vision domain, show that $\texttt{Mycroft}$ quickly reaches performance levels comparable to the baseline where all the data is shared.
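The idea of using feature space distances to pick a small subset under a sharing budget can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_nearest_subset`, the choice of Euclidean distance, the nearest-neighbor scoring rule, and the toy feature vectors are all assumptions made for illustration. It assumes the trainer holds feature vectors for samples its model handles poorly, and that a data owner ranks its own samples by distance to the closest such vector.

```python
from math import dist


def select_nearest_subset(owner_features, hard_features, budget):
    """Return indices of up to `budget` owner samples nearest to any hard sample.

    Hypothetical helper: scores each owner sample by its Euclidean distance
    to the closest trainer-side "hard" sample, then keeps the lowest scores.
    """
    scores = []
    for idx, feat in enumerate(owner_features):
        # Distance from this owner sample to its nearest hard sample.
        nearest = min(dist(feat, hard) for hard in hard_features)
        scores.append((nearest, idx))
    # Smaller distance means the sample is assumed to be more useful.
    scores.sort()
    return [idx for _, idx in scores[:budget]]


# Toy 2-D feature vectors (illustrative values only).
owner = [(0.0, 0.0), (5.0, 5.0), (1.0, 1.0), (9.0, 9.0)]
hard = [(1.0, 1.0), (5.0, 5.5)]
selected = select_nearest_subset(owner, hard, budget=2)
print(selected)
```

In this sketch the budget caps how many samples leave the data owner, which mirrors the limited data-sharing budget described in the abstract; a real system would presumably compute distances in a learned feature space rather than on raw coordinates.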