Poster
Linearly Decomposing and Recomposing Vision Transformers for Diverse-Scale Models
Shuxia Lin · Miaosen Zhang · Ruiming Chen · Xu Yang · Qiufeng Wang · Xin Geng
Vision Transformers (ViTs) are widely used across applications, but they usually have a fixed architecture that may not match the varying computational resources of different deployment environments. It is therefore necessary to adapt ViT architectures to devices with diverse computational budgets to achieve accuracy-efficiency trade-offs. To this end, inspired by polynomial decomposition in calculus, where a function can be approximated by a linear combination of basic components, we propose to linearly decompose the ViT model into a set of components during element-wise training. These components can then be recomposed into differently scaled, pre-initialized models to satisfy different computational resource constraints. This decomposition-recomposition strategy provides an economical and flexible way to generate ViT models of diverse scales for different deployment scenarios. Compared to model compression or training from scratch, which require repeated training on large datasets for each model scale, our strategy reduces computational cost because it requires training on the large dataset only once. Extensive experiments validate the effectiveness of our method: ViTs can be decomposed, and the decomposed components can be recomposed into diverse-scale ViTs that achieve comparable or better performance than traditional model compression and pre-training methods. The code for our experiments is available in the supplemental material.
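The abstract describes representing model weights as a linear combination of basis components that can be recombined at different scales. The sketch below is purely illustrative and assumes a particular parameterization: the class name `LinearlyComposedLinear`, the `num_components` count, and the per-component `coefficients` are all hypothetical stand-ins, not the authors' actual decomposition or training procedure.

```python
# Illustrative sketch only: component parameterization and interface are assumptions,
# not the paper's method.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearlyComposedLinear(nn.Module):
    """A linear layer whose weight is a linear combination of K basis components."""
    def __init__(self, in_features, out_features, num_components):
        super().__init__()
        # K basis weight components (hypothetically obtained during decomposition).
        self.components = nn.Parameter(
            torch.randn(num_components, out_features, in_features) * 0.02
        )
        # Mixing coefficients; a recomposed model of a given scale could keep or
        # learn its own coefficients over a subset of the components.
        self.coefficients = nn.Parameter(torch.ones(num_components) / num_components)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # Recompose the effective weight as a weighted sum of the components.
        weight = torch.einsum("k,koi->oi", self.coefficients, self.components)
        return F.linear(x, weight, self.bias)

# Usage: a stand-in for one ViT projection layer; fewer components would give a
# smaller, cheaper recomposed model at some accuracy cost.
layer = LinearlyComposedLinear(in_features=384, out_features=384, num_components=4)
out = layer(torch.randn(8, 197, 384))  # (batch, tokens, dim)
print(out.shape)
```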