Title: Manual curation vs. AI distillation: Lessons learned for instruction following and feedback fine-tuning
Abstract: There has been a slew of work on training helpful conversational agents using large language models (LLMs). These models draw upon diverse datasets, including open-source repositories, private data, and even synthetic data generated by LLMs. However, curating datasets for SFT and RLHF involves critical decisions, such as defining task distributions, data volume, prompt length, and more. While prior research underscores the importance of data quality, the nuanced impact of these dataset factors on model performance remains unclear. I will present and compare our approaches to data curation using human labor and AI distillation in the context of training helpful chatbots, and delve into experimental results that illuminate the nuanced effects of different dataset attributes on chatbot helpfulness.
Bio: Most recently, Nazneen was a research lead at Hugging Face, working on alignment, AI safety, and evaluation. She and her team recently released the Zephyr model, which is already part of You.com's product offerings. Nazneen was selected by the UN Secretary-General to serve on the AI Advisory Body alongside other global experts in AI: https://www.un.org//ai-advisory-body. More details about her work can be found at https://www.nazneenrajani.com/