Poster in Workshop: Attributing Model Behavior at Scale (ATTRIB)
Accumulating Data Avoids Model Collapse
Joshua Kazdan · Apratim Dey · Rylan Schaeffer · Matthias Gerstgrasser · Rafael Rafailov · David Donoho · Sanmi Koyejo
The increasing prevalence of AI-generated content on the internet raises a critical and timely question: What happens when generative machine learning models are pretrained on web-scale datasets containing data created by earlier generative models? Recent studies have highlighted a phenomenon termed "model collapse," whereby model performance degrades over iterations, rendering newer generative models unusable. However, other recent research has questioned the practical relevance of model collapse by providing evidence that (1) model collapse was caused by deleting past data en masse and then training largely (or entirely) on purely synthetic data from the latest generative model, and (2) model collapse is avoided if new synthetic data are instead added to existing real and synthetic data. These two claims are particularly important for forecasting likely futures of deep generative models pretrained on web-scale data because, in practice, web-scale data is not deleted en masse and new synthetic data accumulates alongside existing real and synthetic data. In this work, we test whether these two claims hold in three prominent new settings for studying model collapse: multivariate Gaussian modeling, supervised finetuning of language models, and kernel density estimation. In all three settings, we find that both claims hold: model collapse is indeed caused by deleting past data en masse, and model collapse is avoided by accumulating new synthetic data alongside past data.
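To make the "replace" versus "accumulate" protocol concrete, the following is a minimal illustrative sketch of the multivariate Gaussian setting, not the authors' implementation; the dimension, sample size, number of generations, and random seed are arbitrary assumptions chosen only for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, generations = 2, 100, 50  # assumed dimension, samples per generation, generations

# Real data drawn once from the ground-truth Gaussian.
real_data = rng.multivariate_normal(np.zeros(d), np.eye(d), size=n)

def fit_gaussian(data):
    """Maximum-likelihood fit of mean and covariance."""
    return data.mean(axis=0), np.cov(data, rowvar=False)

for strategy in ("replace", "accumulate"):
    data = real_data.copy()
    for _ in range(generations):
        mu, sigma = fit_gaussian(data)
        synthetic = rng.multivariate_normal(mu, sigma, size=n)
        if strategy == "replace":
            data = synthetic                     # discard all past data
        else:
            data = np.vstack([data, synthetic])  # keep real + all synthetic data
    _, sigma = fit_gaussian(data)
    print(f"{strategy}: final covariance trace = {np.trace(sigma):.3f}")
```

Under "replace," the fitted covariance tends to shrink toward zero across generations, the degradation described above, whereas accumulating new synthetic data alongside past data keeps the estimate close to the original distribution.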