Poster
ScaleKD: Strong Vision Transformers Could Be Excellent Teachers
Jiawei Fan · Chao Li · Xiaolong Liu · Anbang Yao
In this paper, we ask whether well pre-trained vision transformer (ViT) models can serve as teachers that exhibit scalable properties to advance cross-architecture knowledge distillation research, using mainstream large-scale visual recognition datasets for evaluation. To make this possible, our analysis underlines the importance of seeking effective strategies to align (1) feature computing paradigm differences, (2) model scale differences, and (3) knowledge density differences. By combining three closely coupled components, namely a cross attention projector, dual-view feature mimicking, and teacher parameter perception, tailored to address the alignment problems stated above, we present a simple and effective knowledge distillation method, called ScaleKD. Our method can train student backbones spanning a variety of convolutional neural network (CNN), multi-layer perceptron (MLP), and ViT architectures on image classification datasets with significantly improved accuracy, achieving state-of-the-art distillation performance. For instance, taking a well-trained Swin-L as the teacher model, our method achieves 75.15%|82.03%|84.16%|78.63%|81.96%|83.93%|83.80%|85.53% top-1 accuracies for MobileNet-V1|ResNet-50|ConvNeXt-T|Mixer-S/16|Mixer-B/16|ViT-S/16|Swin-T|ViT-B/16 models trained on the ImageNet-1K dataset from scratch, showing 3.05%|3.39%|2.02%|4.61%|5.52%|4.03%|2.62%|3.73% absolute gains over the individually trained counterparts under the same experimental settings. Intriguingly, when scaling up the size of teacher models or their pre-training datasets, our method yields larger gains for student models. Empirically, the student backbones trained by our method transfer well to the downstream MS-COCO and ADE20K datasets. Moreover, our method shows the potential to be an efficient substitute for the time-intensive pre-training of any target student on large-scale datasets if a strong pre-trained ViT is available. The code will be released soon.
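To make the feature-alignment idea behind the abstract more concrete, the sketch below shows a minimal cross-attention projector that maps a student's CNN feature map into a ViT teacher's token space before applying a feature-mimicking loss. All names (CrossAttentionProjector, mimic_loss), dimensions, and the choice of loss are illustrative assumptions based on the description above, not the authors' released ScaleKD implementation.

```python
# A minimal, illustrative sketch of cross-attention-based feature alignment for
# ViT-teacher / CNN-student distillation. Module names and hyper-parameters are
# assumptions for illustration, not the official ScaleKD code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttentionProjector(nn.Module):
    """Projects student feature maps into the teacher's token space via cross attention."""

    def __init__(self, student_dim, teacher_dim, num_queries=197, num_heads=8):
        super().__init__()
        # Learnable queries, one per teacher token (e.g., 196 patches + CLS for ViT-B/16).
        self.queries = nn.Parameter(torch.randn(1, num_queries, teacher_dim) * 0.02)
        self.kv_proj = nn.Linear(student_dim, teacher_dim)
        self.attn = nn.MultiheadAttention(teacher_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(teacher_dim)

    def forward(self, student_feat):
        # student_feat: (B, C, H, W) from a CNN stage -> flatten into a token sequence.
        b, c, h, w = student_feat.shape
        tokens = student_feat.flatten(2).transpose(1, 2)   # (B, H*W, C)
        kv = self.kv_proj(tokens)                          # (B, H*W, D_teacher)
        q = self.queries.expand(b, -1, -1)                 # (B, N_tokens, D_teacher)
        aligned, _ = self.attn(q, kv, kv)                  # queries attend to student features
        return self.norm(aligned)


def mimic_loss(student_tokens, teacher_tokens):
    """Simple feature-mimicking loss on normalized tokens (an assumed choice)."""
    s = F.normalize(student_tokens, dim=-1)
    t = F.normalize(teacher_tokens.detach(), dim=-1)
    return F.mse_loss(s, t)


# Toy usage: align a hypothetical ResNet-50 last-stage output (2048-d, 7x7)
# with ViT-B/16 token embeddings (197 tokens, 768-d).
projector = CrossAttentionProjector(student_dim=2048, teacher_dim=768, num_queries=197)
student_feat = torch.randn(4, 2048, 7, 7)
teacher_tokens = torch.randn(4, 197, 768)
loss = mimic_loss(projector(student_feat), teacher_tokens)
loss.backward()
```

In practice this alignment loss would be combined with the usual task loss and, per the abstract, with the dual-view feature mimicking and teacher parameter perception components, which are not sketched here.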