

Poster

Scaling Retrieval-Based Language Models with a Trillion-Token Datastore

Rulin Shao · Jacqueline He · Akari Asai · Weijia Shi · Tim Dettmers · Sewon Min · Luke Zettlemoyer · Pang Wei Koh

Wed 11 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

We consider the data used at inference time as a new dimension of scaling language models (LMs), in addition to the pretraining data and the number of parameters. This scaling is enabled by retrieval-based LMs, a class of LMs that can directly access a datastore (a large external collection of text documents) during inference. Although retrieval-based models are commonly used, datastore scaling trends have received very little study. First, we build a 1.4-trillion-token datastore, named MassiveDS, which is the largest and most diverse open-source datastore for retrieval-based LMs. We also design a pipeline that allows efficient study of the impact of different datastore features, such as data size, data filters, and decontamination strategies. Our experiments show that datastore scaling is log-linear across a variety of tasks, without obvious saturation, much like the widely observed data and parameter scaling trends. We also report a range of new analyses that point to future directions for improving these scaling trends, such as better retrieval. We will open-source both data and code to facilitate future research.
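For readers unfamiliar with the setup, the sketch below illustrates retrieval-based inference in general terms: encode the query, retrieve the top-k most similar documents from the datastore, and prepend them to the LM prompt. This is a minimal, self-contained illustration; the toy encoder, function names, and example documents are placeholders and do not reflect the MassiveDS pipeline or its actual retriever.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    """Toy stand-in for a real dense encoder (query/document embedder)."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "little")
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def retrieve_top_k(query: str, doc_embs: np.ndarray, docs: list[str], k: int = 3) -> list[str]:
    """Return the k datastore documents with highest inner-product similarity to the query."""
    scores = doc_embs @ embed(query)
    top = np.argsort(-scores)[:k]
    return [docs[i] for i in top]

# The "datastore" here is a tiny list of documents with precomputed embeddings;
# in a real system it is a large indexed corpus searched at inference time.
docs = [
    "Document about dense retrieval.",
    "Document about scaling laws for language models.",
    "Document about building large text datastores.",
]
doc_embs = np.stack([embed(d) for d in docs])

query = "How does datastore size affect language model performance?"
retrieved = retrieve_top_k(query, doc_embs, docs, k=2)

# Retrieved passages are prepended to the prompt before the LM generates an answer.
prompt = "\n\n".join(retrieved) + f"\n\nQuestion: {query}\nAnswer:"
print(prompt)
```

In this framing, the paper's scaling dimension is the size of `docs`: the abstract's finding is that task performance improves roughly linearly in the logarithm of the datastore's token count, holding the LM fixed.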
