

Keynote Talk
in
Workshop: The Fourth Workshop on Efficient Natural Language and Speech Processing (ENLSP-IV): Highlighting New Architectures for Future Foundation Models

Multi-Teacher Distillation: An Ensemble-Then-Distill Approach

Lili Mou

Sat 14 Dec 9 a.m. PST — 9:30 a.m. PST

Abstract:

Knowledge distillation (KD) aims to transfer the knowledge in a large model (called a teacher) into a small one (called a student), and has become an increasingly active research topic as the sizes of deep learning models keep growing. Today, there are abundant readily available large models, such as ChatGPT, LLaMa, and T5. It then becomes natural to ask: Can we distill the knowledge from multiple teachers? At first glance, multi-teacher KD appears easy, as we can simply train the student on the union of the teachers’ predictions. However, I would argue that such a naïve attempt may not work well, because traditional KD adopts the cross-entropy loss: when the teachers disagree, fitting the union of their predictions pushes the student toward an overly smooth distribution that spreads probability mass across the teachers’ outputs rather than committing to any of them. In this talk, I will present a novel ensemble-then-distill approach, which builds an ensemble of teacher models to train the student. I will also discuss applications to text generation and syntactic parsing.
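
As a rough numerical illustration of the issue described above (my own sketch, not code from the talk): under the standard cross-entropy objective, the student distribution that best fits the union of several teachers' predictions is simply the average of their distributions, which can be far smoother than any single teacher. The teacher distributions below are hypothetical placeholders.

```python
import torch

def entropy(p):
    # Shannon entropy in nats; clamp avoids log(0).
    return -(p * p.clamp_min(1e-12).log()).sum()

# Three hypothetical teachers, each confident about a different token.
teachers = torch.tensor([
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
    [0.05, 0.05, 0.90],
])

# Minimizing the summed cross-entropy  sum_i CE(teacher_i, student)
# over the student distribution yields the mean of the teacher distributions.
student = teachers.mean(dim=0)

print("student minimizing the union cross-entropy:", student)   # ~uniform
print("student entropy:", entropy(student).item())               # high (smooth)
print("average teacher entropy:",
      torch.stack([entropy(t) for t in teachers]).mean().item()) # low (peaked)
```

The ensemble-then-distill approach presented in the talk instead combines the teacher models into an ensemble first and then distills the student from that ensemble, rather than fitting the raw union of predictions.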
