Chinese Researchers Propose Pale-Shaped Self-Attention (PS-Attention) and a General Vision Transformer Backbone Called Pale Transformer
Transformers have recently shown promising performance on a variety of vision tasks. Inspired by the success of the Transformer on a wide range of NLP tasks, Vision Transformer (ViT) was the first to use a pure Transformer architecture for image classification, demonstrating the architecture's potential for vision.
However, the quadratic complexity of global self-attention results in high computational cost and memory usage, especially for high-resolution inputs, which makes it impractical for many vision tasks. To increase efficiency, various strategies restrict attention to a local region, which removes the quadratic cost of global self-attention. As a result, however, their receptive fields in a single attention layer are not large enough, which leads to poor context modeling.
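To make the cost gap concrete, the sketch below counts query-key pairs for global attention and for attention restricted to non-overlapping square windows. The function names and the 56 x 56 feature map with 7 x 7 windows are illustrative assumptions, not figures from the article, and pair counts are only a rough proxy for compute and memory.

```python
def attention_pairs_global(h, w):
    """Query-key pairs when every token attends to every token."""
    n = h * w
    return n * n


def attention_pairs_windowed(h, w, win):
    """Query-key pairs when tokens attend only inside their own
    non-overlapping win x win window (an assumed local scheme)."""
    assert h % win == 0 and w % win == 0, "window must tile the feature map"
    return h * w * win * win


if __name__ == "__main__":
    h, w, win = 56, 56, 7  # hypothetical feature-map and window sizes
    g = attention_pairs_global(h, w)
    l = attention_pairs_windowed(h, w, win)
    print(g, l, g // l)
```

Global attention grows with the square of the token count, while the windowed variant grows linearly in the token count for a fixed window size. The price is that each token sees only the tokens of its own window in a single layer.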
A new Pale-Shaped Self-Attention (PS-Attention) method addresses this issue by performing self-attention inside a pale-shaped region. Compared with global self-attention, PS-Attention significantly reduces compute and memory costs. At the same time, it captures richer contextual information while keeping a computational complexity similar to that of previous local self-attention techniques.
A standard way to increase efficiency is to replace global self-attention with local self-attention. The critical and difficult question is how to improve modeling capacity within local regions. In axial self-attention, the local attention region is a single row or column of the feature map. Cross-shaped window self-attention has also been proposed, which can be seen as an extension of axial self-attention to several rows and columns.
While these approaches outperform their CNN counterparts, the dependencies captured in each self-attention layer are still too limited to collect adequate contextual information.
The proposed Pale-Shaped Self-Attention (PS-Attention) captures richer contextual relationships efficiently. Specifically, the input feature maps are first spatially divided into multiple pale-shaped regions. Each pale-shaped region (abbreviated as pale) contains the same number of interlaced rows and columns of the feature map. The spacing between neighboring rows or columns is the same for all pales. One of the pales, for example, is represented by the pink shadow in part (e) of the following figure.
Self-attention is then performed within each pale. Any token can interact directly with the other tokens in the same pale, which allows the model to capture richer contextual information in a single PS-Attention layer. A more efficient parallel implementation of PS-Attention has also been developed to further improve efficiency. Thanks to its larger receptive fields and stronger context modeling capacity, PS-Attention surpasses existing local self-attention mechanisms.
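The partition and the per-pale attention described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the indexing rule (pale k takes every n-th row and every n-th column starting at k, where n is the number of pales), the names `pale_mask` and `pale_attention`, the identity query/key/value projections, and the single attention head are all hypothetical choices made for clarity. It also does not reproduce the more efficient parallel implementation mentioned in the article.

```python
import numpy as np


def pale_mask(h, w, s, k):
    """Boolean (h, w) mask of pale k: s interlaced rows plus s interlaced
    columns, evenly spaced (assumed indexing, see lead-in)."""
    assert h % s == 0 and w % s == 0 and h // s == w // s
    n = h // s  # number of pales, also the spacing between rows/columns
    assert 0 <= k < n
    mask = np.zeros((h, w), dtype=bool)
    mask[k::n, :] = True  # rows k, k+n, k+2n, ...
    mask[:, k::n] = True  # columns k, k+n, k+2n, ...
    return mask


def pale_attention(x, s, k):
    """Single-head self-attention restricted to the tokens of pale k.
    x has shape (h, w, d); projections are identity for simplicity."""
    h, w, d = x.shape
    mask = pale_mask(h, w, s, k)
    tokens = x[mask]                              # (m, d) tokens in the pale
    scores = tokens @ tokens.T / np.sqrt(d)       # scaled dot-product scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = x.copy()
    out[mask] = weights @ tokens                  # update only pale tokens
    return out, mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 8, 4))
    out, mask = pale_attention(x, s=2, k=1)
    print(mask.astype(int))
    print("tokens in pale:", int(mask.sum()))
```

With this indexing, a pale on an H x W map holds s*W + s*H - s*s tokens, so one layer lets a token reach positions spread across the whole map rather than a single compact window. Each token lies on one row and one column, so it can belong to the pale of its row and the pale of its column.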
Mainstream research on improving the efficiency of Vision Transformer backbones follows two lines: removing redundant computation through pruning strategies, and designing more efficient self-attention mechanisms.
The Pale Transformer, a general vision transformer backbone with a hierarchical architecture built on the proposed PS-Attention, scales the technique into a family of models that surpass previous work: Pale-T (22M), Pale-S (48M), and Pale-B (85M). The smallest model, Pale-T, reaches 50.4% single-scale mIoU on ADE20K (semantic segmentation), and 47.4 box mAP (object detection) and 42.7 mask mAP (instance segmentation) on COCO. Across ImageNet-1K classification and these three benchmarks, it outperforms leading backbones by +0.7%, +1.1%, +0.7, and +0.5, respectively.
In summary, the Pale Transformer is a general Vision Transformer backbone based on PS-Attention that delivers state-of-the-art image classification performance on ImageNet-1K. It also outperforms previous Vision Transformer backbones on ADE20K for semantic segmentation and on COCO for object detection and instance segmentation.