CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications
GitHub: Tianfang-Zhang/CAS-ViT
arXiv: 2408.03703
Abstract
Vision Transformers (ViTs) mark a revolutionary advance in neural networks with their token mixer's powerful global context capability. However, the pairwise token affinity and complex matrix operations limit their deployment in resource-constrained scenarios and real-time applications such as mobile devices, despite considerable efforts in previous works.
In this paper, we introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers, to achieve a balance between efficiency and performance in mobile applications. Firstly, we argue that the capability of token mixers to obtain global contextual information hinges on multiple information interactions, such as spatial and channel domains. Subsequently, we construct a novel additive similarity function following this paradigm and present an efficient implementation named Convolutional Additive Token Mixer (CATM). This simplification leads to a significant reduction in computational overhead. We evaluate CAS-ViT across a variety of vision tasks, including image classification, object detection, instance segmentation, and semantic segmentation.
Our experiments, conducted on GPUs, ONNX, and iPhones, demonstrate that CAS-ViT achieves competitive performance compared to other state-of-the-art backbones, establishing it as a viable option for efficient mobile vision applications. Our code and model are available at: https://github.com/Tianfang-Zhang/CAS-ViT
Motivation: Vision Transformers (ViTs) have brought a revolutionary advance to neural networks: the powerful global-context capability of their token mixers lets them handle a wide range of vision tasks, including image classification, object detection, instance segmentation, and semantic segmentation. However, the pairwise token affinity computation and the complex matrix operations it requires limit the deployment of ViTs in resource-constrained scenarios and real-time applications such as mobile devices.
Method: To address this, the paper introduces CAS-ViT (Convolutional Additive Self-attention Vision Transformers), which aims to balance efficiency and performance for mobile applications. The authors first argue that a token mixer's ability to capture global context hinges on multiple kinds of information interaction, such as those across the spatial and channel domains. Following this idea, they construct a novel convolutional additive similarity function and propose the Convolutional Additive Token Mixer (CATM), an efficient, simplified implementation that significantly reduces computational overhead.
Results: Evaluated on multiple vision tasks, including image classification, object detection, instance segmentation, and semantic segmentation, with experiments run on GPUs, ONNX, and iPhones, CAS-ViT remains competitive with state-of-the-art backbones under limited resources, making it a viable choice for efficient mobile vision applications.
Method
1. Review of Self-Attention and Its Variants
The self-attention mechanism is the key component of Vision Transformers (ViTs) and effectively captures relationships between tokens at different positions. Conventional Multi-head Self-Attention (MSA) generates its output by computing a pairwise similarity matrix between the Query (Q) and Key (K) projections and using it to weight the Value (V). The cost of this similarity matrix, however, grows quadratically with the number of tokens (i.e., with the input resolution), which limits its use in resource-constrained scenarios; a minimal reference implementation is sketched below.
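For reference, the following is a minimal PyTorch sketch of standard scaled dot-product self-attention (single-head for brevity, not the paper's code). The explicit N × N affinity matrix is what makes both compute and memory grow quadratically with the token count N = H·W.

```python
import torch
import torch.nn.functional as F

def self_attention(q, k, v):
    # q, k, v: (B, N, C) token sequences
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1)) * scale   # (B, N, N) pairwise affinities
    attn = F.softmax(attn, dim=-1)
    return attn @ v                            # (B, N, C) weighted values
```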
To alleviate this, researchers have proposed several self-attention variants, including:
Separable Self-Attention: simplifies the matrix-form feature metric to a vector, reducing the computational complexity (a rough sketch follows this list).
Swift Self-Attention: reduces the number of Keys; each token is weighted only by coefficients from a linear transformation of the Query and then summed along the spatial dimension, enabling fast inference.
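As a rough, non-official sketch of the separable variant (the actual MobileViTv2 implementation differs in its projections and activations): per-token scalar context scores collapse the attention matrix into a single context vector, so the cost stays linear in the number of tokens.

```python
import torch.nn as nn

class SeparableSelfAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.to_score = nn.Linear(dim, 1)    # one context score per token
        self.to_key = nn.Linear(dim, dim)
        self.to_value = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                    # x: (B, N, C)
        scores = self.to_score(x).softmax(dim=1)                      # (B, N, 1)
        context = (scores * self.to_key(x)).sum(dim=1, keepdim=True)  # (B, 1, C)
        return self.proj(self.to_value(x) * context)                  # broadcast over tokens
```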
2. Convolutional Additive Self-attention
To further improve the efficiency and performance of the self-attention mechanism, the paper proposes Convolutional Additive Self-attention (CAS), whose core components are:
Additive similarity function: a new additive similarity function computes the affinity by directly summing the context scores of the Query (Q) and Key (K), i.e., Sim(Q, K) = Φ(Q) + Φ(K), where Φ(·) denotes a context mapping function, implemented here as sigmoid-activated channel attention C(·) and spatial attention S(·).
CATM module: built on the additive similarity function, the Convolutional Additive Token Mixer (CATM) eliminates the expensive operations of conventional self-attention (such as matrix multiplication and Softmax) and carries out the computation efficiently with convolutions. Its output can be written as O = Γ(Φ(Q) + Φ(K)) · V, where Γ(·) is a linear transformation that integrates the contextual information (a minimal sketch follows this list).
Complexity analysis: the computational complexity of CATM is linear in the input size, specifically Ω(CATM) = (47 + 10b)·HWC, where b is the batch size and H, W, and C are the height, width, and number of channels of the input.
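Below is a minimal sketch of the additive mixing idea, O = Γ(Φ(Q) + Φ(K)) · V, written from the description above rather than taken from the official repository: the choice of gating Q spatially and K channel-wise, the kernel sizes, and the exact form of Γ(·) are illustrative assumptions; refer to the authors' code for the actual CATM.

```python
import torch
import torch.nn as nn

class SpatialGate(nn.Module):
    """Sigmoid-activated spatial attention S(.) (illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depthwise context
        self.gate = nn.Conv2d(dim, 1, 1)                           # per-pixel gate

    def forward(self, x):                        # x: (B, C, H, W)
        return x * torch.sigmoid(self.gate(self.conv(x)))

class ChannelGate(nn.Module):
    """Sigmoid-activated channel attention C(.) (illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Conv2d(dim, dim, 1)         # per-channel gate

    def forward(self, x):
        return x * torch.sigmoid(self.fc(self.pool(x)))

class ConvAdditiveTokenMixer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Conv2d(dim, dim, 1)
        self.to_k = nn.Conv2d(dim, dim, 1)
        self.to_v = nn.Conv2d(dim, dim, 1)
        self.phi_q = SpatialGate(dim)            # Phi(Q): spatial context (assumed split)
        self.phi_k = ChannelGate(dim)            # Phi(K): channel context (assumed split)
        self.gamma = nn.Sequential(              # Gamma(.): context integration
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.Conv2d(dim, dim, 1),
        )
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                        # x: (B, C, H, W)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        sim = self.phi_q(q) + self.phi_k(k)      # additive similarity, no softmax
        return self.proj(self.gamma(sim) * v)    # O = Gamma(Phi(Q) + Phi(K)) . V
```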
3. CAS-ViT Network Architecture
The CAS-ViT network architecture, shown in Figure 3, consists of the following parts:
Input processing: two convolutional layers with stride 2 downsample the input image to H/4 × W/4 × C1.
Stage encoding layers: four stages, with 2× downsampling between consecutive stages via Patch Embedding, producing feature maps at different scales. Each stage stacks several blocks, and each block contains an Integration subnet, CATM, and an MLP, joined by residual connections.
Block structure: the block design follows hybrid networks such as EfficientViT and EdgeViT. The Integration subnet consists of three depthwise separable convolution layers for feature fusion, the CATM module provides efficient self-attention, and the MLP further extracts features (a rough sketch of one block follows below).
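Here is a rough sketch of one block as described above (Integration subnet → CATM → MLP, each wrapped in a residual connection), reusing ConvAdditiveTokenMixer from the earlier sketch; the normalization, activation, and kernel-size choices are assumptions for illustration, not the paper's exact configuration.

```python
import torch.nn as nn

def dw_separable(dim):
    # depthwise 3x3 followed by pointwise 1x1 (assumed layer recipe)
    return nn.Sequential(
        nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
        nn.Conv2d(dim, dim, 1),
        nn.BatchNorm2d(dim),
        nn.GELU(),
    )

class CASViTBlock(nn.Module):
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        # Integration subnet: three depthwise separable convolution layers
        self.integration = nn.Sequential(*[dw_separable(dim) for _ in range(3)])
        self.mixer = ConvAdditiveTokenMixer(dim)   # from the CATM sketch above
        hidden = dim * mlp_ratio
        self.mlp = nn.Sequential(                  # 1x1-conv MLP
            nn.Conv2d(dim, hidden, 1), nn.GELU(),
            nn.Conv2d(hidden, dim, 1),
        )

    def forward(self, x):               # x: (B, C, H, W)
        x = x + self.integration(x)     # local feature fusion
        x = x + self.mixer(x)           # convolutional additive token mixing
        x = x + self.mlp(x)             # channel MLP
        return x
```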
Figure: Comparison of diverse self-attention mechanisms. (a) The classical multi-head self-attention in ViT. (b) The separable self-attention in MobileViTv2, which reduces the feature metric from a matrix to a vector. (c) The swift self-attention in SwiftFormer, which achieves efficient feature association with only Q and K. (d) The proposed convolutional additive self-attention.
Figure 3: Upper: illustration of the classification backbone network; four stages downsample the original image to 1/4, 1/8, 1/16, and 1/32 resolution. Lower: block architecture, with N_i blocks stacked in each stage.