

Poster in Workshop: Workshop on Machine Learning and Compression

TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models

Makoto Shing · Kou Misaki · Han Bao · Sho Yokoi · Takuya Akiba


Abstract: Large language models have demonstrated remarkable capabilities, but their size poses significant challenges for deployment in resource-constrained environments. Knowledge distillation (KD) offers a promising approach for model compression, yet existing methods struggle with mode averaging, mode collapse, and the substantial capacity gap between teacher and student models. To address these issues, we introduce $\textit{Temporally Adaptive Interpolated Distillation (TAID)}$, a novel KD approach that dynamically interpolates student and teacher distributions through an adaptive intermediate distribution, gradually shifting from the student's initial distribution towards the teacher's distribution. This approach provides smooth knowledge transfer, effectively addressing the capacity gap while balancing mode-averaging and mode-collapse tendencies. Our comprehensive experiments demonstrate TAID's superior performance across various model sizes and architectures in both instruction tuning and pre-training scenarios. Furthermore, we showcase TAID's practical impact by developing two state-of-the-art compact foundation models: $\texttt{TAID-LLM-1.5B}$ for language tasks and $\texttt{TAID-VLM-2B}$ for vision-language tasks. These results demonstrate TAID's effectiveness in creating high-performing and efficient models, advancing the development of more accessible AI technologies.
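
The core idea in the abstract, an intermediate target distribution that starts near the student and is gradually moved toward the teacher over training, can be illustrated with a minimal sketch. This is not the paper's implementation: the interpolation is done here in probability space with a simple linear schedule for the mixing coefficient `lam`, and the student copy inside the target is detached; the paper's exact construction of the adaptive intermediate distribution and its schedule may differ.

```python
import torch
import torch.nn.functional as F


def interpolated_distillation_loss(student_logits, teacher_logits, lam):
    """KL divergence between the student and an interpolated target.

    lam in [0, 1]: at 0 the target is a detached copy of the student's own
    distribution; at 1 it is the teacher's distribution. Increasing lam over
    training gradually shifts the target from student toward teacher.
    (Illustrative sketch only; not the paper's exact formulation.)
    """
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    with torch.no_grad():
        student_probs = student_log_probs.exp()          # detached student copy
        teacher_probs = F.softmax(teacher_logits, dim=-1)
        # Assumed: linear mixture of probabilities as the intermediate target.
        target_probs = (1.0 - lam) * student_probs + lam * teacher_probs
    # KL(target || student) via F.kl_div (input = log-probs, target = probs).
    return F.kl_div(student_log_probs, target_probs, reduction="batchmean")


# Toy usage with random logits and an assumed linear schedule for lam.
total_steps = 1000
for step in range(total_steps):
    lam = step / (total_steps - 1)
    student_logits = torch.randn(8, 32000)  # stand-ins for model outputs
    teacher_logits = torch.randn(8, 32000)
    loss = interpolated_distillation_loss(student_logits, teacher_logits, lam)
```

The intent of this schedule is that early in training the target stays close to what the student can already represent, which avoids forcing it to match a far more capable teacher immediately, and only later does the target approach the teacher's full distribution.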
