Poster in Workshop: Workshop on Machine Learning and Compression
Formalizing Limits of Knowledge Distillation Using Partial Information Decomposition
Pasan Dissanayake · Faisal Hamman · Barproda Halder · Ilia Sucholutsky · Qiuyi (Richard) Zhang · Sanghamitra Dutta
Knowledge distillation provides an effective method for deploying complex machine learning models in resource-constrained environments. It typically involves training a smaller student model to emulate either the probabilistic outputs or the internal feature representations of a larger teacher model. By doing so, the student model often achieves substantially better performance on a downstream task compared to when it is trained independently. Nevertheless, the teacher's internal representations can also encode noise or additional information that may not be relevant to the downstream task. This observation motivates our primary question: What are the information-theoretic limits of knowledge transfer? To this end, we leverage a body of work in information theory called Partial Information Decomposition (PID) that unravels the joint information contained in several input random variables about another target variable, e.g., the downstream task labels. Our main contribution is to quantify the distillable and distilled knowledge of a teacher's representation for a given downstream task. Moreover, we demonstrate that this metric can be practically used in distillation to address challenges caused by the complexity gap between the teacher and the student representations.
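For readers unfamiliar with PID, the standard two-source decomposition can be sketched as follows. The symbols are assumptions introduced here for illustration and are not taken from the abstract: T is the teacher's representation, S is the student's representation, and Y is the downstream task label. The abstract does not say which of these terms the authors identify with distillable or distilled knowledge, so no such mapping is implied.

```latex
% Two-source Partial Information Decomposition (standard form).
% T, S, Y are illustrative labels: teacher representation,
% student representation, and downstream task label.
\begin{align}
I(Y; T, S) &= \mathrm{Red}(Y : T, S)
            + \mathrm{Uni}(Y : T \setminus S)
            + \mathrm{Uni}(Y : S \setminus T)
            + \mathrm{Syn}(Y : T, S), \\
I(Y; T)    &= \mathrm{Red}(Y : T, S) + \mathrm{Uni}(Y : T \setminus S), \\
I(Y; S)    &= \mathrm{Red}(Y : T, S) + \mathrm{Uni}(Y : S \setminus T).
\end{align}
```

Here Red is the information about Y that both T and S carry, Uni is the information that only one of them carries, and Syn is the information available only when T and S are observed together. These three equations do not determine the four terms on their own. Choosing a definition for one term, such as the unique or the redundant information, fixes the other three.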