Implementing ConvNext in PyTorch: A Complete Step-by-Step Walkthrough from ResNet to ConvNext

The ConvNext paper proposes a new convolution-based architecture that not only outperforms Transformer-based models such as Swin, but also scales with the amount of data! In this post we reproduce it in PyTorch. The figure below shows ConvNext accuracy for different dataset/model sizes.

The authors start from the well-known ResNet architecture and iteratively improve it using the new best practices and discoveries of the past decade. They focus on Swin Transformer and follow its design closely. We have recommended this paper before; if you haven't read it yet, we strongly recommend doing so :)

The figure below shows all the individual improvements and the performance after each of them.

The paper splits the design roadmap into two parts: macro design and micro design. Macro design covers all high-level changes, such as the overall architecture, while micro design is more about details, such as activation functions and normalization.

Below we start from a classic BottleNeck block and implement each change described in the paper, one by one, in PyTorch.

Starting from ResNet

ResNet is built from stacked residual (BottleNeck) blocks, so that is where we start.

    from torch import nn
    from torch import Tensor
    from typing import List


    class ConvNormAct(nn.Sequential):
        """
        A little util layer composed by (conv) -> (norm) -> (act) layers.
        """
        def __init__(
            self,
            in_features: int,
            out_features: int,
            kernel_size: int,
            norm=nn.BatchNorm2d,
            act=nn.ReLU,
            **kwargs
        ):
            super().__init__(
                nn.Conv2d(
                    in_features,
                    out_features,
                    kernel_size=kernel_size,
                    padding=kernel_size // 2,
                    **kwargs
                ),
                norm(out_features),
                act(),
            )


    class BottleNeckBlock(nn.Module):
        def __init__(
            self,
            in_features: int,
            out_features: int,
            reduction: int = 4,
            stride: int = 1,
        ):
            super().__init__()
            reduced_features = out_features // reduction
            self.block = nn.Sequential(
                # wide -> narrow
                ConvNormAct(
                    in_features, reduced_features, kernel_size=1, stride=stride, bias=False
                ),
                # narrow -> narrow
                ConvNormAct(reduced_features, reduced_features, kernel_size=3, bias=False),
                # narrow -> wide
                ConvNormAct(reduced_features, out_features, kernel_size=1, bias=False, act=nn.Identity),
            )
            self.shortcut = (
                nn.Sequential(
                    ConvNormAct(
                        in_features, out_features, kernel_size=1, stride=stride, bias=False
                    )
                )
                if in_features != out_features
                else nn.Identity()
            )
            self.act = nn.ReLU()

        def forward(self, x: Tensor) -> Tensor:
            res = x
            x = self.block(x)
            res = self.shortcut(res)
            x += res
            x = self.act(x)
            return x

Let's check that the code above works:

    import torch

    x = torch.rand(1, 32, 7, 7)
    block = BottleNeckBlock(32, 64)
    block(x).shape
    # torch.Size([1, 64, 7, 7])

Next we define a Stage: a stage is a collection of residual blocks. Each stage usually downsamples the input by a factor of 2.

    class ConvNexStage(nn.Sequential):
        def __init__(
            self, in_features: int, out_features: int, depth: int, stride: int = 2, **kwargs
        ):
            super().__init__(
                # downsample is done here
                BottleNeckBlock(in_features, out_features, stride=stride, **kwargs),
                *[
                    BottleNeckBlock(out_features, out_features, **kwargs)
                    for _ in range(depth - 1)
                ],
            )

Test:

    stage = ConvNexStage(32, 64, depth=2)
    stage(x).shape
    # torch.Size([1, 64, 4, 4])

We have reduced the input from 7x7 to 4x4.

ResNet also has a so-called stem, the first layer in the model, which heavily downsamples the input image.

    class ConvNextStem(nn.Sequential):
        def __init__(self, in_features: int, out_features: int):
            super().__init__(
                ConvNormAct(
                    in_features, out_features, kernel_size=7, stride=2
                ),
                nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            )
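
As a quick sanity check (this test is not in the original post, mirroring the shape checks used elsewhere), the classic stem takes a 224x224 image down to 56x56:

    # hypothetical check: conv stride 2 (224 -> 112), then maxpool stride 2 (112 -> 56)
    stem = ConvNextStem(3, 64)
    stem(torch.rand(1, 3, 224, 224)).shape
    # torch.Size([1, 64, 56, 56])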

Now we can define ConvNextEncoder, which stitches the stages together and takes an image as input to produce the final embedding.

    class ConvNextEncoder(nn.Module):
        def __init__(
            self,
            in_channels: int,
            stem_features: int,
            depths: List[int],
            widths: List[int],
        ):
            super().__init__()
            self.stem = ConvNextStem(in_channels, stem_features)
            in_out_widths = list(zip(widths, widths[1:]))
            self.stages = nn.ModuleList(
                [
                    ConvNexStage(stem_features, widths[0], depths[0], stride=1),
                    *[
                        ConvNexStage(in_features, out_features, depth)
                        for (in_features, out_features), depth in zip(
                            in_out_widths, depths[1:]
                        )
                    ],
                ]
            )

        def forward(self, x):
            x = self.stem(x)
            for stage in self.stages:
                x = stage(x)
            return x

The test result is as follows:

    image = torch.rand(1, 3, 224, 224)
    encoder = ConvNextEncoder(in_channels=3, stem_features=64, depths=[3, 4, 6, 3], widths=[256, 512, 1024, 2048])
    encoder(image).shape
    # torch.Size([1, 2048, 7, 7])

We have now completed the resnet50 encoder; attach a classification head and it works for image classification tasks. Now let's get to the main topic of this article: implementing ConvNext.

Macro Design

1. Changing the stage compute ratio

A traditional ResNet contains 4 stages, while Swin Transformer uses a 1:1:3:1 ratio across its 4 stages (for every block in the first stage there is one in the second, three in the third, and one in the fourth). Adjusting ResNet50 to this ratio ((3, 4, 6, 3) -> (3, 3, 9, 3)) improves performance from 78.8% to 79.4%.

    encoder = ConvNextEncoder(in_channels=3, stem_features=64, depths=[3, 3, 9, 3], widths=[256, 512, 1024, 2048])

2. Changing the stem to "Patchify"

The ResNet stem uses a very aggressive 7x7 conv and a maxpool to heavily downsample the input image. Transformers, however, use a "patchify" stem, meaning they embed the input image as patches. Vision Transformers use very aggressive patching (16x16), while the ConvNext authors use 4x4 patches implemented with a conv layer, which improves performance from 79.4% to 79.5%.

    class ConvNextStem(nn.Sequential):
        def __init__(self, in_features: int, out_features: int):
            super().__init__(
                nn.Conv2d(in_features, out_features, kernel_size=4, stride=4),
                nn.BatchNorm2d(out_features),
            )
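
A quick shape check (again, not in the original post) confirms the patchify stem downsamples by 4x in a single step:

    # hypothetical check: 4x4 non-overlapping patches, so 224 / 4 = 56
    stem = ConvNextStem(3, 64)
    stem(torch.rand(1, 3, 224, 224)).shape
    # torch.Size([1, 64, 56, 56])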

3. ResNeXt-ify

ResNeXt applies grouped convolution to the 3x3 conv layer in the BottleNeck to reduce FLOPS. ConvNext uses depth-wise convolution (as in MobileNet and later EfficientNet). Depth-wise convolution is a form of grouped convolution in which the number of groups equals the number of input channels.

The authors note that this is very similar to the weighted-sum operation in self-attention, which mixes information only in the spatial dimension. Using depth-wise convs lowers accuracy (since, unlike ResNeXt, the width has not been increased), which is expected given the speedup.

So we change the 3x3 conv inside the BottleNeck block to the following:

    ConvNormAct(reduced_features, reduced_features, kernel_size=3, bias=False, groups=reduced_features)
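
As an aside (a sketch of mine, not part of the original walkthrough), you can verify how much grouping shrinks the parameter count of a 3x3 conv:

    # A regular 3x3 conv has C_out * C_in * 3 * 3 weights; a depth-wise one
    # (groups = channels) has only C * 1 * 3 * 3.
    regular = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)
    depthwise = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False, groups=64)
    sum(p.numel() for p in regular.parameters())    # 36864
    sum(p.numel() for p in depthwise.parameters())  # 576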

4. Inverted Bottleneck

A normal BottleNeck first reduces the features with a 1x1 conv, then applies a 3x3 conv, and finally expands the features back to their original size; an inverted bottleneck block does the opposite.

So below we go from wide -> narrow -> wide to narrow -> wide -> narrow.

This is similar to Transformers: the MLP layer follows a narrow -> wide -> narrow design, with the hidden dense layer expanding the input features by a factor of four.

    class BottleNeckBlock(nn.Module):
        def __init__(
            self,
            in_features: int,
            out_features: int,
            expansion: int = 4,
            stride: int = 1,
        ):
            super().__init__()
            expanded_features = out_features * expansion
            self.block = nn.Sequential(
                # narrow -> wide
                ConvNormAct(
                    in_features, expanded_features, kernel_size=1, stride=stride, bias=False
                ),
                # wide -> wide (grouped conv)
                ConvNormAct(expanded_features, expanded_features, kernel_size=3, bias=False, groups=in_features),
                # wide -> narrow
                ConvNormAct(expanded_features, out_features, kernel_size=1, bias=False, act=nn.Identity),
            )
            self.shortcut = (
                nn.Sequential(
                    ConvNormAct(
                        in_features, out_features, kernel_size=1, stride=stride, bias=False
                    )
                )
                if in_features != out_features
                else nn.Identity()
            )
            self.act = nn.ReLU()

        def forward(self, x: Tensor) -> Tensor:
            res = x
            x = self.block(x)
            res = self.shortcut(res)
            x += res
            x = self.act(x)
            return x
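
A quick shape check (not in the post) shows the inverted block keeps the same interface as the original one:

    block = BottleNeckBlock(32, 64)
    block(torch.rand(1, 32, 7, 7)).shape
    # torch.Size([1, 64, 7, 7])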

5. Increasing the kernel size

Like Swin, ConvNext uses a larger kernel size (7x7). Increasing the kernel size makes computation more expensive, which is why the depth-wise convolution introduced above is moved earlier in the block: it then operates on fewer channels, keeping the cost down. The authors point out that this is similar to Transformer models, where multi-head self-attention (MSA) is done before the MLP layers.

    class BottleNeckBlock(nn.Module):
        def __init__(
            self,
            in_features: int,
            out_features: int,
            expansion: int = 4,
            stride: int = 1,
        ):
            super().__init__()
            expanded_features = out_features * expansion
            self.block = nn.Sequential(
                # narrow -> narrow (depth-wise, with the bigger 7x7 kernel)
                ConvNormAct(
                    in_features, in_features, kernel_size=7, stride=stride, bias=False, groups=in_features
                ),
                # narrow -> wide
                ConvNormAct(in_features, expanded_features, kernel_size=1),
                # wide -> narrow
                ConvNormAct(expanded_features, out_features, kernel_size=1, bias=False, act=nn.Identity),
            )
            self.shortcut = (
                nn.Sequential(
                    ConvNormAct(
                        in_features, out_features, kernel_size=1, stride=stride, bias=False
                    )
                )
                if in_features != out_features
                else nn.Identity()
            )
            self.act = nn.ReLU()

        def forward(self, x: Tensor) -> Tensor:
            res = x
            x = self.block(x)
            res = self.shortcut(res)
            x += res
            x = self.act(x)
            return x

This improves accuracy from 79.9% to 80.6%.

Micro Design

1. Replacing ReLU with GELU

Transformers use GELU, so why don't we? The authors tested the replacement and accuracy stayed unchanged. In PyTorch, GELU is available as nn.GELU.
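
For example (a hypothetical usage of mine, not from the post), the ConvNormAct helper defined earlier can swap activations without any other changes:

    # build a conv -> BatchNorm -> GELU layer instead of the default ReLU
    layer = ConvNormAct(32, 64, kernel_size=3, act=nn.GELU)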

2. Fewer activation functions

A residual block contains three activation functions, while a Transformer block has only one: the activation inside the MLP block. The authors removed all activations except the one after the middle conv layer. This matches Swin-T and raises accuracy to 81.3%!

3. Fewer normalization layers

Similarly to activations, Transformer blocks have fewer normalization layers. The authors decided to remove all BatchNorm layers except the one before the middle conv.

4. Replacing BN with LN

The authors replaced the BN layers with LN. They note that doing this in the original ResNet was reported to hurt performance, but after all of the changes above, performance improves to 81.5%.

Let's apply these four steps all together:

    class BottleNeckBlock(nn.Module):
        def __init__(
            self,
            in_features: int,
            out_features: int,
            expansion: int = 4,
            stride: int = 1,
        ):
            super().__init__()
            expanded_features = out_features * expansion
            self.block = nn.Sequential(
                # narrow -> narrow (depth-wise, 7x7 kernel)
                nn.Conv2d(
                    in_features, in_features, kernel_size=7, stride=stride, padding=3, bias=False, groups=in_features
                ),
                # GroupNorm with num_groups=1 is the same as LayerNorm but works for 2D data
                nn.GroupNorm(num_groups=1, num_channels=in_features),
                # narrow -> wide
                nn.Conv2d(in_features, expanded_features, kernel_size=1),
                nn.GELU(),
                # wide -> narrow
                nn.Conv2d(expanded_features, out_features, kernel_size=1),
            )
            self.shortcut = (
                nn.Sequential(
                    ConvNormAct(
                        in_features, out_features, kernel_size=1, stride=stride, bias=False
                    )
                )
                if in_features != out_features
                else nn.Identity()
            )

        def forward(self, x: Tensor) -> Tensor:
            res = x
            x = self.block(x)
            res = self.shortcut(res)
            x += res
            return x
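
As a side note on the GroupNorm trick above, here is a small sketch (an illustration of mine, not from the post) showing that GroupNorm with num_groups=1 normalizes each sample over all of (C, H, W), which is how LayerNorm is emulated for 2D feature maps:

    x = torch.randn(2, 8, 4, 4)
    gn = nn.GroupNorm(num_groups=1, num_channels=8, affine=False)
    out = gn(x)
    out.mean(dim=(1, 2, 3))  # ~0 for each sample
    out.std(dim=(1, 2, 3))   # ~1 for each sample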

Separate downsampling layers

In ResNet, downsampling is done by stride=2 convs. Transformers (and other conv nets) instead use a separate downsampling module. The authors removed the stride=2 and added a downsampling block before the three convs; to keep training stable, a normalization layer is needed before the downsampling operation. Adding this module to ConvNexStage reaches 82.0%, surpassing Swin!

    class ConvNexStage(nn.Sequential):
        def __init__(
            self, in_features: int, out_features: int, depth: int, **kwargs
        ):
            super().__init__(
                # add the downsampler
                nn.Sequential(
                    nn.GroupNorm(num_groups=1, num_channels=in_features),
                    nn.Conv2d(in_features, out_features, kernel_size=2, stride=2),
                ),
                *[
                    BottleNeckBlock(out_features, out_features, **kwargs)
                    for _ in range(depth)
                ],
            )

We now arrive at the final BottleNeckBlock:

    class BottleNeckBlock(nn.Module):
        def __init__(
            self,
            in_features: int,
            out_features: int,
            expansion: int = 4,
        ):
            super().__init__()
            expanded_features = out_features * expansion
            self.block = nn.Sequential(
                # narrow -> narrow (depth-wise, 7x7 kernel)
                nn.Conv2d(
                    in_features, in_features, kernel_size=7, padding=3, bias=False, groups=in_features
                ),
                # GroupNorm with num_groups=1 is the same as LayerNorm but works for 2D data
                nn.GroupNorm(num_groups=1, num_channels=in_features),
                # narrow -> wide
                nn.Conv2d(in_features, expanded_features, kernel_size=1),
                nn.GELU(),
                # wide -> narrow
                nn.Conv2d(expanded_features, out_features, kernel_size=1),
            )

        def forward(self, x: Tensor) -> Tensor:
            res = x
            x = self.block(x)
            x += res
            return x

Let's test the final stage code:

    stage = ConvNexStage(32, 62, depth=1)
    stage(torch.randn(1, 32, 14, 14)).shape
    # torch.Size([1, 62, 7, 7])

Some final changes

The paper also adds Stochastic Depth (also known as Drop Path) and Layer Scale.

    from torchvision.ops import StochasticDepth


    class LayerScaler(nn.Module):
        def __init__(self, init_value: float, dimensions: int):
            super().__init__()
            self.gamma = nn.Parameter(init_value * torch.ones((dimensions)),
                                      requires_grad=True)

        def forward(self, x):
            return self.gamma[None, ..., None, None] * x


    class BottleNeckBlock(nn.Module):
        def __init__(
            self,
            in_features: int,
            out_features: int,
            expansion: int = 4,
            drop_p: float = .0,
            layer_scaler_init_value: float = 1e-6,
        ):
            super().__init__()
            expanded_features = out_features * expansion
            self.block = nn.Sequential(
                # narrow -> narrow (depth-wise, 7x7 kernel)
                nn.Conv2d(
                    in_features, in_features, kernel_size=7, padding=3, bias=False, groups=in_features
                ),
                # GroupNorm with num_groups=1 is the same as LayerNorm but works for 2D data
                nn.GroupNorm(num_groups=1, num_channels=in_features),
                # narrow -> wide
                nn.Conv2d(in_features, expanded_features, kernel_size=1),
                nn.GELU(),
                # wide -> narrow
                nn.Conv2d(expanded_features, out_features, kernel_size=1),
            )
            self.layer_scaler = LayerScaler(layer_scaler_init_value, out_features)
            self.drop_path = StochasticDepth(drop_p, mode="batch")

        def forward(self, x: Tensor) -> Tensor:
            res = x
            x = self.block(x)
            x = self.layer_scaler(x)
            x = self.drop_path(x)
            x += res
            return x
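
Two properties worth checking (my own sketch, not in the post): StochasticDepth only drops the residual branch during training, and with the tiny layer_scaler_init_value the branch is almost muted at initialization, so the block starts out close to the identity:

    block = BottleNeckBlock(32, 32, drop_p=0.5)
    block.eval()  # StochasticDepth is a no-op in eval mode
    x = torch.randn(1, 32, 7, 7)
    out = block(x)
    # gamma starts at 1e-6, so the residual branch barely perturbs x
    torch.allclose(out, x, atol=1e-3)  # True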

OK, now let's look at the final result:

    stage = ConvNexStage(32, 62, depth=1)
    stage(torch.randn(1, 32, 14, 14)).shape
    # torch.Size([1, 62, 7, 7])

Finally, let's wire the Drop Path probabilities into the encoder:

    class ConvNextEncoder(nn.Module):
        def __init__(
            self,
            in_channels: int,
            stem_features: int,
            depths: List[int],
            widths: List[int],
            drop_p: float = .0,
        ):
            super().__init__()
            self.stem = ConvNextStem(in_channels, stem_features)
            in_out_widths = list(zip(widths, widths[1:]))
            # create drop paths probabilities (one for each stage)
            drop_probs = [x.item() for x in torch.linspace(0, drop_p, sum(depths))]
            self.stages = nn.ModuleList(
                [
                    ConvNexStage(stem_features, widths[0], depths[0], drop_p=drop_probs[0]),
                    *[
                        ConvNexStage(in_features, out_features, depth, drop_p=drop_p)
                        for (in_features, out_features), depth, drop_p in zip(
                            in_out_widths, depths[1:], drop_probs[1:]
                        )
                    ],
                ]
            )

        def forward(self, x):
            x = self.stem(x)
            for stage in self.stages:
                x = stage(x)
            return x

Test:

    image = torch.rand(1, 3, 224, 224)
    encoder = ConvNextEncoder(in_channels=3, stem_features=64, depths=[3, 4, 6, 4], widths=[256, 512, 1024, 2048])
    encoder(image).shape
    # torch.Size([1, 2048, 3, 3])

Note that the output resolution is now 3x3 rather than 7x7, since in this implementation every stage, including the first, downsamples by 2. To classify with the ConvNext features, we need to apply a classification head on top of the encoder. We also add a LayerNorm before the final linear layer.

    class ClassificationHead(nn.Sequential):
        def __init__(self, num_channels: int, num_classes: int = 1000):
            super().__init__(
                nn.AdaptiveAvgPool2d((1, 1)),
                nn.Flatten(1),
                nn.LayerNorm(num_channels),
                nn.Linear(num_channels, num_classes),
            )


    class ConvNextForImageClassification(nn.Sequential):
        def __init__(
            self,
            in_channels: int,
            stem_features: int,
            depths: List[int],
            widths: List[int],
            drop_p: float = .0,
            num_classes: int = 1000,
        ):
            super().__init__()
            self.encoder = ConvNextEncoder(in_channels, stem_features, depths, widths, drop_p)
            self.head = ClassificationHead(widths[-1], num_classes)

Final model test:

    image = torch.rand(1, 3, 224, 224)
    classifier = ConvNextForImageClassification(in_channels=3, stem_features=64, depths=[3, 4, 6, 4], widths=[256, 512, 1024, 2048])
    classifier(image).shape
    # torch.Size([1, 1000])
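
For reference, here is a sketch of something close to the paper's ConvNeXt-T configuration (the depths [3, 3, 9, 3] and widths [96, 192, 384, 768] come from the ConvNext paper; the stem_features and drop_p values are my choices for this post's implementation):

    convnext_tiny = ConvNextForImageClassification(
        in_channels=3,
        stem_features=96,
        depths=[3, 3, 9, 3],
        widths=[96, 192, 384, 768],
        drop_p=0.1,
    )
    convnext_tiny(torch.rand(1, 3, 224, 224)).shape
    # torch.Size([1, 1000])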

Summary

In this article we reproduced every step the authors took to turn a ResNet into ConvNext. If you want the complete code, you can find it here: https://github.com/FrancescoSaverioZuppichini/ConvNext

The paper's official code is also available for reference: https://github.com/facebookresearch/ConvNeXt

Author: Francesco Zuppichini
