[1st] Oral Presentation
in
Workshop: Vision Transformers: Theory and applications
Bi-Directional Self-Attention for Vision Transformers
George Stoica · Taylor Hearn · Bhavika Devnani · Judy Hoffman
Self-Attention (SA) maps a set of key-value pairs to an output by aggregating information from each pair according to its compatibility with a query. This allows SA to aggregate surrounding context (represented by key-value pairs) around a specific source (e.g., a query). Critically, however, this process cannot also refine a source (e.g., a query) based on the surrounding context (e.g., key-value pairs). We address this limitation by inverting the way key-value pairs and queries are processed. We propose Inverse Self-Attention (ISA), which instead maps a query (source) to an output based on its compatibility with a set of key-value pairs (scene). Leveraging the inherently complementary nature of ISA and SA, we further propose Bi-directional Self-Attention (BiSA), an attention layer that couples SA and ISA by convexly combining their outputs. BiSA can be easily adapted into any existing transformer architecture to improve the expressivity of its attention layers. We showcase this flexibility by extensively studying the effects of BiSA on CIFAR100 [1], ImageNet1K [2], and ADE20K [3]: we extend the Swin Transformer [4] and LeViT [5] with BiSA and observe substantial improvements.
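To make the coupling concrete, below is a minimal, hypothetical sketch of the BiSA idea described in the abstract, not the authors' reference implementation. It assumes standard scaled dot-product self-attention for SA, interprets ISA as the same attention logits normalized over the query axis (so the source is refined by its compatibility with the scene), and combines the two outputs convexly with a learnable coefficient `alpha`; the module name `BiSA` and parameter `alpha_logit` are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiSA(nn.Module):
    """Hypothetical single-head BiSA layer: convex combination of SA and ISA."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # Learnable mixing coefficient, squashed to (0, 1) to keep the combination convex.
        self.alpha_logit = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        logits = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)

        # Standard SA: each query aggregates context from the key-value pairs.
        sa_out = F.softmax(logits, dim=-1) @ v

        # Assumed ISA: normalize over the query axis, so each source (query)
        # is refined according to its compatibility with the scene (key-value pairs).
        isa_out = F.softmax(logits, dim=-2) @ v

        alpha = torch.sigmoid(self.alpha_logit)
        return self.out_proj(alpha * sa_out + (1 - alpha) * isa_out)


# Usage: a drop-in attention layer over a batch of 196 tokens of width 256.
tokens = torch.randn(2, 196, 256)
print(BiSA(256)(tokens).shape)  # torch.Size([2, 196, 256])
```

Because the layer only changes how attention outputs are mixed, a sketch like this could in principle replace the attention block of an existing transformer (e.g., a Swin or LeViT stage) without altering the surrounding architecture, which is the flexibility the abstract highlights.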