Poster
Pre-training Differentially Private Models with Limited Public Data
Zhiqi Bu · Xinwei Zhang · Sheng Zha · Mingyi Hong · George Karypis
East Exhibit Hall A-C #4204
The superior performance of large foundation models can be attributed to the use of massive amounts of high-quality data. However, such datasets often contain sensitive, private and copyrighted material that requires formal protection. While differential privacy (DP) is a prominent method used to gauge the degree of security provided to large foundation models, its application in large foundation models has been met with limited success because there are often significant performance compromises when applying DP during the pre-training phase. Consequently, DP is more commonly implemented during the model fine-tuning stage, hence not capable of protecting a substantial portion of the data used during the initial pre-training process. In this work, we first provide a theoretical understanding of the efficacy of DP training by analyzing the per-iteration improvement of loss through the lens of the Hessian. We observe that DP optimizers' deceleration can be significantly mitigated by the use of limited public data, and thus propose the DP continual pre-training strategy. Our DP continual pre-training on vision models, using only 10% of public data, have achieved DP accuracy of 41.5% on ImageNet-21k (with epsilon=8) and non-DP accuracy of 55.7% on Places365 and 60.0% on iNaturalist-2021, which are on par with state-of-the-art standard pre-training and outperform existing DP pertained models. Our DP pre-trained models are released in fastDP library (https://github.com/awslabs/fast-differential-privacy/releases/tag/v2.1)