

Vision Transformer (PyTorch) Code Reading with Annotations



Preface

Google Research's official Vision Transformer source code is in TensorFlow, while I mostly use PyTorch, so I turned to rwightman's implementation on GitHub: rwightman/pytorch-image-models/timm/models/vision_transformer.py

For an introduction to Vision Transformer, see the blog post: Paper Reading Notes: Vision Transformer.

The code below uses vit_base_patch16_224 (ViT-B/16: patch_size=16, img_size=224) as the running example.

ViT Model

In the original paper, the model consists of three modules:
· Linear Projection of Flattened Patches
· Transformer Encoder
· MLP Head

These correspond to three modules in the code:
· patch embedding layer
· Block
· Representation layer + Classifier head

Linear Projection of Flattened Patches

The Linear Projection of Flattened Patches is implemented with a convolution whose kernel_size = stride = 16, followed by a flatten. Its job is to convert the 224×224×3 2D image into a 196×768 patch embedding. The code with annotations is as follows:
```python
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """
    2D Image to Patch Embedding
    """
    def __init__(self, img_size=224, patch_size=16, in_c=3, embed_dim=768, norm_layer=None):
        super().__init__()
        '''
        img_size   = (224, 224)
        patch_size = (16, 16)
        grid_size  = (224/16, 224/16) = (14, 14)
        num_patches = 14 * 14 = 196
        '''
        img_size = (img_size, img_size)
        patch_size = (patch_size, patch_size)
        self.img_size = img_size
        self.patch_size = patch_size
        self.grid_size = (img_size[0] // patch_size[0], img_size[1] // patch_size[1])
        self.num_patches = self.grid_size[0] * self.grid_size[1]
        '''
        Use a convolution with kernel size 16 and stride 16 to implement the embedding:
        the output is 14*14 with 768 channels (768 = 16*16*3, i.e. each patch is
        flattened into a 1D vector).
        '''
        self.proj = nn.Conv2d(in_c, embed_dim, kernel_size=patch_size, stride=patch_size)
        '''
        If norm_layer is given, use it (e.g. LayerNorm). The author does not use it here,
        so self.norm = nn.Identity(), which passes the input through unchanged.
        '''
        self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()

    def forward(self, x):
        B, C, H, W = x.shape
        assert H == self.img_size[0] and W == self.img_size[1], \
            f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})."
        '''
        self.proj(x):     [B,3,224,224] -> [B,768,14,14]
        flatten(2):       [B,768,14,14] -> [B,768,14*14] = [B,768,196]
        transpose(1, 2):  [B,768,196]   -> [B,196,768]
        self.norm(x) passes the input through unchanged
        '''
        x = self.proj(x).flatten(2).transpose(1, 2)
        x = self.norm(x)
        return x
```
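
To sanity-check the shapes, here is a minimal usage sketch (the tensor sizes follow ViT-B/16; the variable names are only illustrative):

```python
import torch

# assuming PatchEmbed from the snippet above is defined
patch_embed = PatchEmbed(img_size=224, patch_size=16, in_c=3, embed_dim=768)
dummy = torch.randn(2, 3, 224, 224)   # a dummy batch of 2 RGB images
out = patch_embed(dummy)
print(out.shape)                      # expected: torch.Size([2, 196, 768])
```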

Transformer Encoder

The Transformer Encoder is built from the Attention, MLP, and DropPath code below.

Multi-Head Attention

For the structure diagram and a detailed introduction to Multi-Head Attention, see the blog post: Paper Reading Notes: Attention Is All You Need.
The Attention code with annotations is as follows:

```python
class Attention(nn.Module):
    def __init__(self,
                 dim,                   # input token dim: 768
                 num_heads=8,
                 qkv_bias=False,
                 qk_scale=None,
                 attn_drop_ratio=0.,
                 proj_drop_ratio=0.):
        super(Attention, self).__init__()
        '''
        num_heads = 12
        head_dim = 768 // 12 = 64  (the d_k = d_v = d_model / h from "Attention Is All You Need")
        scale = 64 ^ -0.5 = 1/8    (the 1/sqrt(d_k) factor in Scaled Dot-Product Attention)
        qkv:  linear projection of the input to q, k, v
        proj: the output projection W_O of Multi-Head Attention, implemented with a Linear layer
        '''
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = qk_scale or head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop_ratio)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop_ratio)

    def forward(self, x):
        '''
        B = batch_size
        N = 197
        C = 768
        '''
        B, N, C = x.shape
        '''
        qkv(x):  [B,197,768] -> [B,197,768*3]
        reshape: [B,197,768*3] -> [B,197,3,12,64]  (3 for q/k/v, 12 heads, 64 dims per head)
        permute: [B,197,3,12,64] -> [3,B,12,197,64]
        '''
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        '''
        q, k, v: each [B,12,197,64]
        '''
        q, k, v = qkv[0], qkv[1], qkv[2]  # make torchscript happy (cannot use tensor as tuple)
        '''
        k.transpose(-2, -1):     [B,12,197,64] -> [B,12,64,197]
        q @ k.transpose(-2, -1): [B,12,197,64] @ [B,12,64,197] = [B,12,197,197]
        attn: [B,12,197,197]
        attn.softmax(dim=-1) applies softmax over the last dimension (i.e. each row)
        '''
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)
        '''
        attn @ v:        [B,12,197,197] @ [B,12,197,64] = [B,12,197,64]
        transpose(1, 2): [B,197,12,64]
        reshape:         [B,197,768]
        '''
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)
        return x
```
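
A quick shape check of the Attention module, matching the sizes in the comments (illustrative only):

```python
import torch

# assuming Attention from the snippet above is defined
attn_layer = Attention(dim=768, num_heads=12, qkv_bias=True)
tokens = torch.randn(2, 197, 768)   # [B, N, C]: 196 patch tokens + 1 class token
out = attn_layer(tokens)
print(out.shape)                    # expected: torch.Size([2, 197, 768])
```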

MLP

The MLP structure and code are both simple: fully connected layers plus an activation plus dropout. The activation used here is GELU:

$$\mathrm{GELU}(x) = 0.5x\left(1 + \tanh\!\left[\sqrt{\frac{2}{\pi}}\left(x + 0.044715x^{3}\right)\right]\right)$$
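
To see how close this tanh approximation is to the exact GELU, here is a small sketch comparing it against PyTorch's built-in (erf-based) GELU; the helper name is made up for illustration:

```python
import math
import torch

def gelu_tanh_approx(x):
    # 0.5 * x * (1 + tanh( sqrt(2/pi) * (x + 0.044715 * x^3) ))
    return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))

x = torch.linspace(-3, 3, 7)
print(gelu_tanh_approx(x))
print(torch.nn.functional.gelu(x))   # exact GELU; the two outputs agree closely
```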

The MLP module code is as follows:

```python
class Mlp(nn.Module):
    """
    MLP as used in Vision Transformer, MLP-Mixer and related networks
    """
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)
        return x
```
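
In ViT-B/16 the hidden dimension is mlp_ratio × embed_dim = 4 × 768 = 3072, and the token shape is preserved; a brief illustrative check:

```python
import torch

# assuming Mlp from the snippet above is defined
mlp = Mlp(in_features=768, hidden_features=768 * 4)
tokens = torch.randn(2, 197, 768)
print(mlp(tokens).shape)   # expected: torch.Size([2, 197, 768])
```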

DropPath

In the Transformer Encoder, the code uses DropPath in place of the Dropout described in the paper. The code with annotations is as follows:

```python
def drop_path(x, drop_prob: float = 0., training: bool = False):
    '''
    x.shape : [B,197,768]
    '''
    if drop_prob == 0. or not training:
        return x
    keep_prob = 1 - drop_prob
    '''
    shape = [B,1,1]
    i.e. keep the first dimension of x and set all other dimensions to 1
    '''
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)  # work with diff dim tensors, not just 2D ConvNets
    '''
    Generate a random tensor of this shape and add keep_prob to it
    '''
    random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
    '''
    Floor the random tensor so that each entry becomes either 0 or 1
    '''
    random_tensor.floor_()  # binarize
    '''
    Divide x by keep_prob and multiply by the random tensor:
    some samples are zeroed out, the others are kept (and rescaled)
    '''
    output = x.div(keep_prob) * random_tensor
    return output


class DropPath(nn.Module):
    """
    Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
    """
    def __init__(self, drop_prob=None):
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        return drop_path(x, self.drop_prob, self.training)
```
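
The Block class that combines Attention, Mlp, and DropPath is not shown in this excerpt. For reference, here is a minimal sketch consistent with how Block is called in VisionTransformer below and with the timm implementation (pre-norm residual structure); treat it as an illustration rather than the verbatim source:

```python
class Block(nn.Module):
    """Transformer encoder block: x + DropPath(Attn(LN(x))), then x + DropPath(MLP(LN(x)))."""
    def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, qk_scale=None,
                 drop_ratio=0., attn_drop_ratio=0., drop_path_ratio=0.,
                 act_layer=nn.GELU, norm_layer=nn.LayerNorm):
        super(Block, self).__init__()
        self.norm1 = norm_layer(dim)
        self.attn = Attention(dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale,
                              attn_drop_ratio=attn_drop_ratio, proj_drop_ratio=drop_ratio)
        # DropPath (stochastic depth) is applied on both residual branches
        self.drop_path = DropPath(drop_path_ratio) if drop_path_ratio > 0. else nn.Identity()
        self.norm2 = norm_layer(dim)
        self.mlp = Mlp(in_features=dim, hidden_features=int(dim * mlp_ratio),
                       act_layer=act_layer, drop=drop_ratio)

    def forward(self, x):
        x = x + self.drop_path(self.attn(self.norm1(x)))   # Multi-Head Attention branch
        x = x + self.drop_path(self.mlp(self.norm2(x)))    # MLP branch
        return x
```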

MLP Head

The MLP Head related code from the source:

```python
# Representation layer
if representation_size and not distilled:
    self.has_logits = True
    self.num_features = representation_size
    self.pre_logits = nn.Sequential(OrderedDict([
        ("fc", nn.Linear(embed_dim, representation_size)), ("act", nn.Tanh())]))
else:
    self.has_logits = False
    self.pre_logits = nn.Identity()
# Classifier head(s)
self.head = nn.Linear(self.num_features, num_classes) if num_classes > 0 else nn.Identity()
self.head_dist = None
if distilled:
    self.head_dist = nn.Linear(self.embed_dim, self.num_classes) if num_classes > 0 else nn.Identity()
```

This code is fairly simple, so it needs little extra annotation. In this code distilled = False, so:

```python
self.pre_logits = nn.Sequential(nn.Linear(embed_dim, representation_size), nn.Tanh())
self.head = nn.Linear(self.num_features, num_classes)
# MLPHead(x) = self.head(self.pre_logits(x[:, 0]))
```

VisionTransformer

The overall structure of ViT-B/16 is as follows.
The ViT-B/16 model takes 224×224×3 input images, uses 16×16×3 patches, a patch embedding dimension of 768, 12 transformer encoder blocks, and 12 heads in Multi-Head Attention. The last two parameters depend on how the model is instantiated: representation_size is the number of nodes in the pre_logits fully connected layer, and num_classes is the total number of predicted classes.

```python
def vit_base_patch16_224_in21k(num_classes: int = 21843, has_logits: bool = True):
    model = VisionTransformer(img_size=224,
                              patch_size=16,
                              embed_dim=768,
                              depth=12,
                              num_heads=12,
                              representation_size=768 if has_logits else None,
                              num_classes=num_classes)
    return model
```
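
A short usage sketch with a dummy input, assuming the VisionTransformer class shown next (plus Block, functools.partial, and collections.OrderedDict) is defined in the same file; the shapes are what ViT-B/16 should produce:

```python
import torch

model = vit_base_patch16_224_in21k(num_classes=21843, has_logits=True)
images = torch.randn(1, 3, 224, 224)   # a dummy batch with one image
logits = model(images)
print(logits.shape)                    # expected: torch.Size([1, 21843])
```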

The full VisionTransformer code with annotations is as follows:

```python
from collections import OrderedDict
from functools import partial


class VisionTransformer(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_c=3, num_classes=1000,
                 embed_dim=768, depth=12, num_heads=12, mlp_ratio=4.0, qkv_bias=True,
                 qk_scale=None, representation_size=None, distilled=False, drop_ratio=0.,
                 attn_drop_ratio=0., drop_path_ratio=0., embed_layer=PatchEmbed, norm_layer=None,
                 act_layer=None):
        """
        Args:
            img_size (int, tuple): input image size
            patch_size (int, tuple): patch size
            in_c (int): number of input channels
            num_classes (int): number of classes
            embed_dim (int): patch embedding dimension
            depth (int): number of transformer encoder modules (Block modules)
            num_heads (int): number of heads in Multi-Head Attention
            mlp_ratio (int): ratio of the MLP hidden dimension to embed_dim
            qkv_bias (bool): whether the Linear layer mapping the input to q, k, v uses a bias
            qk_scale (float): qk scaling factor; defaults to 1/sqrt(dim_k)
            representation_size (Optional[int]): number of nodes of the fully connected layer in
                pre_logits; if None there is no pre_logits (the MLP Head has a single Linear layer)
            distilled (bool): whether to build the DeiT model (knowledge-distillation transformer);
                defaults to False for ViT
            drop_ratio (float): dropout rate
            attn_drop_ratio (float): dropout rate inside attention
            drop_path_ratio (float): drop-path (stochastic depth) rate
            embed_layer (nn.Module): patch embedding layer
            norm_layer: (nn.Module): normalization layer
        """
        super(VisionTransformer, self).__init__()
        self.num_classes = num_classes
        '''
        self.num_features = self.embed_dim = 768
        self.num_tokens = 1
        norm_layer = nn.LayerNorm(eps=1e-6)
        act_layer = nn.GELU
        '''
        self.num_features = self.embed_dim = embed_dim  # num_features for consistency with other models
        self.num_tokens = 2 if distilled else 1
        norm_layer = norm_layer or partial(nn.LayerNorm, eps=1e-6)
        act_layer = act_layer or nn.GELU
        '''
        Build the patch embedding layer:
        num_patches = (224/16) * (224/16) = 196
        '''
        self.patch_embed = embed_layer(img_size=img_size, patch_size=patch_size, in_c=in_c, embed_dim=embed_dim)
        num_patches = self.patch_embed.num_patches
        '''
        Build the learnable parameters:
        self.cls_token  : [1,1,768]   class token
        self.dist_token : None
        self.pos_embed  : [1,197,768] position embedding
        '''
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim)) if distilled else None
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + self.num_tokens, embed_dim))
        self.pos_drop = nn.Dropout(p=drop_ratio)
        '''
        Build an arithmetic progression of length depth, starting at 0 and bounded by drop_path_ratio,
        i.e. the drop-path probability passed to each Block grows with depth.
        Here drop_path_ratio defaults to 0.
        Then build depth (12) Block layers and assign LayerNorm(embed_dim) to self.norm.
        '''
        dpr = [x.item() for x in torch.linspace(0, drop_path_ratio, depth)]  # stochastic depth decay rule
        self.blocks = nn.Sequential(*[
            Block(dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias, qk_scale=qk_scale,
                  drop_ratio=drop_ratio, attn_drop_ratio=attn_drop_ratio, drop_path_ratio=dpr[i],
                  norm_layer=norm_layer, act_layer=act_layer)
            for i in range(depth)])
        self.norm = norm_layer(embed_dim)
        '''
        Build pre_logits:
        1. a fully connected layer: input embed_dim (768), output representation_size (768)
        2. activation: Tanh
        '''
        # Representation layer
        if representation_size and not distilled:
            self.has_logits = True
            self.num_features = representation_size
            self.pre_logits = nn.Sequential(OrderedDict([
                ("fc", nn.Linear(embed_dim, representation_size)), ("act", nn.Tanh())]))
        else:
            self.has_logits = False
            self.pre_logits = nn.Identity()
        '''
        Build the classifier:
        self.num_features = 768
        '''
        # Classifier head(s)
        self.head = nn.Linear(self.num_features, num_classes) if num_classes > 0 else nn.Identity()
        self.head_dist = None
        if distilled:
            self.head_dist = nn.Linear(self.embed_dim, self.num_classes) if num_classes > 0 else nn.Identity()
        '''
        Initialize pos_embed and cls_token,
        then initialize the weights of the remaining layers.
        '''
        # Weight init
        nn.init.trunc_normal_(self.pos_embed, std=0.02)
        if self.dist_token is not None:
            nn.init.trunc_normal_(self.dist_token, std=0.02)
        nn.init.trunc_normal_(self.cls_token, std=0.02)
        self.apply(_init_vit_weights)

    def forward_features(self, x):
        '''
        self.patch_embed(x) : [B,3,224,224] -> [B,196,768]
        Concatenate the cls_token:
        self.cls_token : [1,1,768]
        cls_token      : [B,1,768]
        x = torch.cat((cls_token, x), dim=1) : [B,197,768]
        '''
        x = self.patch_embed(x)
        cls_token = self.cls_token.expand(x.shape[0], -1, -1)
        if self.dist_token is None:
            x = torch.cat((cls_token, x), dim=1)
        else:
            x = torch.cat((cls_token, self.dist_token.expand(x.shape[0], -1, -1), x), dim=1)
        '''
        Add the position embedding:
        x = x + self.pos_embed : [B,197,768]
        Pass through the Attention blocks and LayerNorm : [B,197,768]
        Finally return the class token and feed it to pre_logits:
        return self.pre_logits(x[:, 0]) : [B,768]
        '''
        x = self.pos_drop(x + self.pos_embed)
        x = self.blocks(x)
        x = self.norm(x)
        if self.dist_token is None:
            return self.pre_logits(x[:, 0])
        else:
            return x[:, 0], x[:, 1]

    def forward(self, x):
        '''
        self.forward_features(x) : [B,3,224,224] -> [B,768]
        x = self.head(x)         : [B,768] -> [B,num_classes]
        '''
        x = self.forward_features(x)
        if self.head_dist is not None:
            x, x_dist = self.head(x[0]), self.head_dist(x[1])
            # during training return both predictions; during inference return their average
            if self.training and not torch.jit.is_scripting():
                return x, x_dist
            else:
                return (x + x_dist) / 2
        else:
            x = self.head(x)
        return x


def _init_vit_weights(m):
    """
    ViT weight initialization
    :param m: module
    """
    if isinstance(m, nn.Linear):
        nn.init.trunc_normal_(m.weight, std=.01)
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode="fan_out")
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.LayerNorm):
        nn.init.zeros_(m.bias)
        nn.init.ones_(m.weight)
```
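
To make the stochastic depth decay rule in the comments concrete: with the default drop_path_ratio=0. every entry of dpr is 0, while for a non-zero value (0.1 here, chosen only as an example) the drop-path rate grows linearly from 0 to 0.1 across the 12 blocks:

```python
import torch

depth, drop_path_ratio = 12, 0.1   # 0.1 is only an example; the default is 0.
dpr = [x.item() for x in torch.linspace(0, drop_path_ratio, depth)]
print([round(v, 4) for v in dpr])
# [0.0, 0.0091, 0.0182, 0.0273, 0.0364, 0.0455, 0.0545, 0.0636, 0.0727, 0.0818, 0.0909, 0.1]
```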

The DeiT models involved via the distilled parameter above are not actually used in this code, nor mentioned in the ViT paper. If this is unclear, see the post “ViT和DeiT的原理与使用” (the principles and usage of ViT and DeiT).


This article is reposted from: https://blog.csdn.net/Z960515/article/details/122636814
Copyright belongs to the original author, HollowKnightZ. In case of infringement, please contact us for removal.
