Invited Talk in Workshop: Workshop on Scalable Continual Learning for Lifelong Foundation Models

Invited Talk 3 - Scaling LLMs with Synthetic Data Loops

Suchin Gururangan

Sat 14 Dec 2:20 p.m. PST — 2:50 p.m. PST

Abstract:

Recent advances in synthetic data generation have transformed the development of LLMs. Focusing on the post-training phase of Llama 3, I explore how synthetic data pipelines can overcome traditional data limitations and introduce new model capabilities. I discuss specific techniques for using synthetic data in continual learning, such as adapting to rare data distributions, improving data quality, and enabling specialized capabilities such as error correction. Along the way, I highlight lessons learned from training Llama 3. Finally, I reflect on open challenges in training LLMs with synthetic data, in both offline and online settings.
