Workshop on Scalable Continual Learning for Lifelong Foundation Models
Invited Talk 3 - Scaling LLMs with Synthetic Data Loops
Suchin Gururangan
Abstract:
Recent advances in synthetic data generation have transformed the development of LLMs. Focusing on the post-training phase of Llama 3, I explore how synthetic data pipelines can overcome traditional data limitations and introduce new model capabilities. I discuss specific techniques for using synthetic data in continual learning, such as adapting to rare data distributions, improving data quality, and enabling specialized capabilities like error correction. Along the way, I highlight lessons learned and insights gained from our journey training Llama 3. Finally, I reflect on open challenges in training LLMs with synthetic data, in both offline and online settings.