Poster
Rethinking Decoders for Transformer-based Semantic Segmentation: Compression is All You Need
Qishuai Wen · Chun-Guang Li
State-of-the-art Transformer-based methods for semantic segmentation rely on a variety of specially designed decoders, which typically consist of learnable class embeddings, self- or cross-attention layers, and dot-product operations. However, these designs still lack theoretical explanations, which hinders principled improvements. In this paper, we attempt to derive their white-box counterparts from the perspective of compression. Consequently, we find that: 1) semantic segmentation can be viewed as Principal Component Analysis (PCA), where the principal directions act as classifiers; and 2) self-attention minimizes the reconstruction error for better compression, while cross-attention solves for the principal directions used in classification. Our derivation not only unlocks the mechanism behind the black-box decoders in Transformer-based methods for semantic segmentation, but also yields white-box DEcoders via solving PrIncipal direCTions (DEPICT) that achieve performance comparable to their black-box counterparts. In particular, when using ViT-B as the encoder, DEPICT outperforms the mask transformer on ADE20K with merely 1/8 of the parameters, which validates the effectiveness of our derivation.
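To make the "segmentation as PCA" view concrete, the following is a minimal illustrative sketch, not the authors' DEPICT implementation: encoder token features are compressed onto a few principal directions, and those directions are then used as per-class linear classifiers via a dot product. The function name, the assumption of one principal direction per class, and the use of a plain eigendecomposition (rather than the attention-based solver described in the abstract) are all assumptions made here for illustration.

import torch

def segment_via_principal_directions(tokens: torch.Tensor, num_classes: int):
    """tokens: (N, D) encoder features for N patches; returns (N,) per-patch labels."""
    # Center the features, as in standard PCA.
    centered = tokens - tokens.mean(dim=0, keepdim=True)
    # Covariance of the token features, shape (D, D).
    cov = centered.T @ centered / tokens.shape[0]
    # Top-K eigenvectors serve as the principal directions; here K = num_classes,
    # mirroring the abstract's "principal directions are classifiers" (assumption).
    eigvals, eigvecs = torch.linalg.eigh(cov)   # eigenvalues in ascending order
    directions = eigvecs[:, -num_classes:]      # (D, K)
    # Dot products between tokens and directions play the role of class logits;
    # argmax gives the per-patch segmentation label.
    logits = centered @ directions              # (N, K)
    return logits.argmax(dim=-1)

# Example: 196 patch tokens of dimension 768 (ViT-B-like), 3 classes.
labels = segment_via_principal_directions(torch.randn(196, 768), num_classes=3)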