Tutorial

Opening the Language Model Pipeline: A Tutorial on Data Preparation, Model Training, and Adaptation

Kyle Lo · Akshita Bhagia · Nathan Lambert

West Ballroom B
Tue 10 Dec 9:30 a.m. PST — noon PST

Abstract:

Language models (LMs) have become a critical technology for tackling a wide range of natural language processing tasks, making them ubiquitous in both AI research and commercial products. As their commercial importance has surged, the most powerful models have become more secretive, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs. In this tutorial, we provide a detailed walkthrough of the language model development pipeline, including pretraining data, model architecture and training, and adaptation (e.g., instruction tuning, RLHF). For each of these development stages, we provide examples using open software and data, and discuss tips, tricks, pitfalls, and otherwise often inaccessible details about the full language model pipeline that we’ve uncovered in our own efforts to develop open models. We have opted not to hold the optional panel, given the extensive technical details and examples we need to include to cover this topic exhaustively.
