Spotlight in Workshop: Socially Responsible Language Modelling Research (SoLaR)
An Archival Perspective on Pretraining Data
Meera Desai · Abigail Jacobs · Dallas Card
Research in NLP on pretraining data has largely focused on identifying and mitigating downstream risks in models. However, we argue that more critical attention is needed to pretraining data itself and the systems that produce it. To consider a broader range of impacts of pretraining corpora, we draw an analogy between pretraining datasets and archives. In doing so, we surface impacts of these datasets beyond their role in directly shaping model behavior, including their existence as independent data artifacts and the ways they can influence future practices. Within the broader ecosystem of datasets and models, we focus especially on the processes involved in the creation of pretraining data. In particular, we explore NLP research on algorithmic filtering of pretraining data, viewing such filtering as a kind of appraisal. Using the lens of measurement, we critically examine the problem formulations adopted by this work. In doing so, we underscore how choices about what is included in pretraining data are necessarily choices about values. We conclude by drawing on archival studies to offer insights on paths forward.