Point Cloud 3D Detection, Part 3: SECOND

Paper: SECOND: Sparsely Embedded Convolutional Detection

Code: GitHub - traveller59/second.pytorch: SECOND for KITTI/NuScenes object detection

1. Introduction

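    **SECOND (Sparsely Embedded Convolutional Detection)** is another important work in **voxel-based point cloud detection**. Most 3D detection schemes before 2017 converted the point cloud into a 2D BEV or front-view representation and lost much of the spatial information. SECOND instead keeps the **voxel-based pipeline of VoxelNet** and, to address VoxelNet's slow inference and poor orientation estimation, adds a series of improvements on top of it.
    The main contributions are:

(1) An improved sparse 3D convolution method that runs faster.

(2) A novel angle loss regression approach that gives better orientation regression than other methods.

(3) A novel data augmentation method for LiDAR-only learning that greatly improves convergence speed and performance.

    Note: SECOND makes three classic changes on top of VoxelNet: an improved convolution structure, improved data augmentation, and an additional loss term.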

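Figure 1 SECOND network architecture

    Steps:

    **(1)** The voxelization module **points_to_voxel** voxelizes (grids) the raw input point cloud and outputs voxels.

    **(2)** A **voxel feature encoding (VFE)** layer extracts voxel features.

    **(3)** The improved **sparse 3D convolution** is applied to the voxel features from (2).

    **(4)** The convolved voxel features from (3) are passed to the **RPN**, which produces the **cls-head** and **reg-head** outputs.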

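2. Pipeline

2.1 Voxelwise feature extractor: voxel generation and voxel feature extraction

2.1.1 Principle:

    **Voxelization.** The **points_to_voxel** function converts point cloud data into voxels. It first allocates buffers sized by the voxel count preset in cfg, i.e. zero-initialized tensors of fixed size. It then iterates over the points, computes which voxel each point falls in, and records the voxel coordinates and the number of points in each voxel.
    A **voxel feature encoding (VFE) layer** takes all points in the same voxel as input and uses a fully connected network (FCN) made of a linear layer, a batch normalization (BatchNorm) layer, and an activation (ReLU) layer to extract point-wise features. It then applies element-wise max pooling across the voxel to get a locally aggregated feature for each voxel. Finally, it tiles the aggregated feature and concatenates it with each point-wise feature. An FCN(cout), itself a Linear-BatchNorm-ReLU layer, then maps these point-wise features to cout-dimensional outputs. Overall, the voxelwise feature extractor consists of several VFE layers and one FCN layer.
    **Note:** the voxelization here is implemented in **C++** and comes from the **points_to_voxel_3d** functions of the **spconv library**. For the principle and implementation details, see my VoxelNet article.
    **SECOND** does not actually use the **VFE** layer directly. It uses **SimpleVoxel** instead, which the author considers to serve the same purpose, so the features here are extracted by the SimpleVoxel layer, which simply averages the points in each voxel.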
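To make the two steps above concrete, here is a minimal pure-Python sketch of voxelization followed by the per-voxel mean that SimpleVoxel computes. It is not the spconv implementation: `points_to_voxel_sketch` and `simple_voxel_mean` are hypothetical names, a dict stands in for the dense `coor_to_voxelidx` map, and voxels are kept as ragged lists rather than zero-padded buffers.

```python
import math


def points_to_voxel_sketch(points, voxel_size, coors_range,
                           max_points=35, max_voxels=20000):
    """Group points into voxels; returns (voxels, zyx coordinates, point counts)."""
    # Number of cells along x, y, z.
    grid = [int(round((coors_range[i + 3] - coors_range[i]) / voxel_size[i]))
            for i in range(3)]
    coor_to_voxelidx = {}  # (z, y, x) -> row index in the output lists
    voxels, coordinates, num_points = [], [], []
    for p in points:
        cell = [math.floor((p[i] - coors_range[i]) / voxel_size[i]) for i in range(3)]
        if any(c < 0 or c >= g for c, g in zip(cell, grid)):
            continue  # point lies outside the detection range
        key = (cell[2], cell[1], cell[0])  # stored in zyx order
        idx = coor_to_voxelidx.get(key)
        if idx is None:
            if len(voxels) >= max_voxels:
                continue  # voxel budget exhausted, drop the point
            idx = len(voxels)
            coor_to_voxelidx[key] = idx
            voxels.append([])
            coordinates.append(key)
            num_points.append(0)
        if num_points[idx] < max_points:
            voxels[idx].append(list(p))
            num_points[idx] += 1
    return voxels, coordinates, num_points


def simple_voxel_mean(voxels, num_points):
    """Mean of the points in each voxel, one feature vector per voxel."""
    return [[sum(col) / n for col in zip(*v)] for v, n in zip(voxels, num_points)]


points = [
    (1.05, 0.10, -1.0, 0.5),
    (1.15, 0.15, -1.2, 0.7),   # same voxel as the first point
    (10.1, -3.05, 0.0, 0.1),
    (100.0, 0.0, 0.0, 0.9),    # outside the x range, dropped
]
voxels, coords, counts = points_to_voxel_sketch(
    points, voxel_size=(0.2, 0.2, 4), coors_range=(0, -40, -3, 70.4, 40, 1))
features = simple_voxel_mean(voxels, counts)
```

With the voxel size and range used as defaults in SimpleVoxel, the first two points share one voxel and their four values are averaged into a single feature vector.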

2.1.2 Code:

import numpy as np
from torch import nn


class SimpleVoxel(nn.Module):
    """Replaces the VFE layers: each voxel feature is the mean of its points."""

    def __init__(self,
                 num_input_features=4,
                 use_norm=True,
                 num_filters=[32, 128],
                 with_distance=False,
                 voxel_size=(0.2, 0.2, 4),
                 pc_range=(0, -40, -3, 70.4, 40, 1),
                 name='VoxelFeatureExtractor'):
        super(SimpleVoxel, self).__init__()
        self.name = name
        self.num_input_features = num_input_features

    def forward(self, features, num_voxels, coors):
        # features: [concated_num_points, num_voxel_size, 3(4)]    # [128108,5,4]
        # num_voxels: [concated_num_points], number of real points in each voxel
        # Padded rows are zero, so sum / count is the mean over the real points.
        points_mean = features[:, :, :self.num_input_features].sum(dim=1, keepdim=False) / num_voxels.type_as(features).view(-1, 1)
        return points_mean.contiguous()     # [128108,4]

def points_to_voxel(points,
                    voxel_size,
                    coors_range,
                    coor_to_voxelidx,
                    max_points=35,
                    max_voxels=20000,
                    full_mean=False,
                    block_filtering=True,
                    block_factor=1,
                    block_size=8,
                    height_threshold=0.2,
                    height_high_threshold=3.0,
                    pad_output=False):
    """convert 3d points(N, >=3) to voxels. This version calculate
    everything in one loop. now it takes only 0.8ms(~6k voxels) 
    with c++ and 3.2ghz cpu.
    Args:
        points: [N, ndim] float tensor. points[:, :3] contain xyz points and
            points[:, 3:] contain other information such as reflectivity.
        voxel_size: [3] list/tuple or array, float. xyz, indicate voxel size
        coors_range: [6] list/tuple or array, float. indicate voxel range.
            format: xyzxyz, minmax
        coor_to_voxelidx: int array. used as a dense map.
        max_points: int. indicate maximum points contained in a voxel.
        max_voxels: int. indicate maximum voxels this function create.
            for voxelnet, 20000 is a good choice. you should shuffle points
            before call this function because max_voxels may drop some points.
        full_mean: bool. if true, all empty points in voxel will be filled with mean
            of exist points.
        block_filtering: filter voxels by height. used for lidar point cloud.
            use some visualization tool to see filtered result.
    Returns:
        voxels: [M, max_points, ndim] float tensor. only contain points.
        coordinates: [M, 3] int32 tensor. zyx format.
        num_points_per_voxel: [M] int32 tensor.
    """
    if full_mean:
        assert block_filtering is False
    if not isinstance(voxel_size, np.ndarray):
        voxel_size = np.array(voxel_size, dtype=points.dtype)
    if not isinstance(coors_range, np.ndarray):
        coors_range = np.array(coors_range, dtype=points.dtype)
    voxelmap_shape = (coors_range[3:] - coors_range[:3]) / voxel_size
    voxelmap_shape = tuple(np.round(voxelmap_shape).astype(np.int32).tolist())
    voxelmap_shape = voxelmap_shape[::-1]
    num_points_per_voxel = np.zeros(shape=(max_voxels, ), dtype=np.int32)
    voxels = np.zeros(shape=(max_voxels, max_points, points.shape[-1]),dtype=points.dtype)
    voxel_point_mask = np.zeros(shape=(max_voxels, max_points),dtype=points.dtype)
    coors = np.zeros(shape=(max_voxels, 3), dtype=np.int32)
    res = {
        "voxels": voxels,
        "coordinates": coors,
        "num_points_per_voxel": num_points_per_voxel,
        "voxel_point_mask": voxel_point_mask,
    }
    if full_mean:
        means = np.zeros(shape=(max_voxels, points.shape[-1]),dtype=points.dtype)
        voxel_num = points_to_voxel_3d_np_mean(points, voxels,
                                               voxel_point_mask, means, coors,
                                               num_points_per_voxel,
                                               coor_to_voxelidx,
                                               voxel_size.tolist(),
                                               coors_range.tolist(),
                                               max_points, max_voxels)
    else:
        if block_filtering:
            block_shape = [*voxelmap_shape[1:]]
            block_shape = [b // block_factor for b in block_shape]
            mins = np.full(block_shape, 99999999, dtype=points.dtype)
            maxs = np.full(block_shape, -99999999, dtype=points.dtype)
            voxel_mask = np.zeros((max_voxels, ), dtype=np.int32)
            voxel_num = points_to_voxel_3d_with_filtering(
                points, voxels, voxel_point_mask, voxel_mask, mins, maxs,
                coors, num_points_per_voxel, coor_to_voxelidx,
                voxel_size.tolist(), coors_range.tolist(), max_points,
                max_voxels, block_factor, block_size, height_threshold,
                height_high_threshold)
            voxel_mask = voxel_mask.astype(np.bool_)
            coors_ = coors[voxel_mask]
            if pad_output:
                res["coordinates"][:voxel_num] = coors_
                res["voxels"][:voxel_num] = voxels[voxel_mask]
                res["voxel_point_mask"][:voxel_num] = voxel_point_mask[
                    voxel_mask]
                res["num_points_per_voxel"][:voxel_num] = num_points_per_voxel[
                    voxel_mask]
                res["coordinates"][voxel_num:] = 0
                res["voxels"][voxel_num:] = 0
                res["num_points_per_voxel"][voxel_num:] = 0
                res["voxel_point_mask"][voxel_num:] = 0
            else:
                res["coordinates"] = coors_
                res["voxels"] = voxels[voxel_mask]
                res["num_points_per_voxel"] = num_points_per_voxel[voxel_mask]
                res["voxel_point_mask"] = voxel_point_mask[voxel_mask]
            voxel_num = coors_.shape[0]
        else:
            voxel_num = points_to_voxel_3d_np(points, voxels, voxel_point_mask,
                                              coors, num_points_per_voxel,
                                              coor_to_voxelidx,
                                              voxel_size.tolist(),
                                              coors_range.tolist(), max_points,
                                              max_voxels)
    res["voxel_num"] = voxel_num
    res["voxel_point_mask"] = res["voxel_point_mask"].reshape(
        -1, max_points, 1)
    return res

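2.2 Sparse convolutional middle layer: sparse 3D convolution

2.2.1 Principle:

2.2.1.1 Problem definition

    Unlike conventional image data, **point cloud data** is highly **sparse**, so a standard convolutional network is poorly suited to extracting features from it, as shown in Figure 2. The same holds in 2D: if only a small subset of pixels carries data, standard convolution also works poorly and a 2D sparse convolution is needed. This subsection therefore explains sparse convolution starting from the 2D case. The extension to 3D simply adds a depth dimension D.

Figure 2 Point cloud data (left) and a sparse image (right)

    Accordingly, this subsection walks through a simple **2D sparse convolution** example.
    **Input data:** a 3-channel 5 × 5 image in which every pixel is (0, 0, 0) except at the positions of points P1 and P2. In **[N, C, H, W]** order the input tensor has shape [1, 3, 5, 5]. In sparse form, the data list for [P1, P2] is [[0.1, 0.1, 0.1], [0.2, 0.2, 0.2]] and the index list is [[1, 2], [2, 3]], as shown on the left of Figure 3.
    **Kernel:** a 3 × 3 convolution kernel with stride 1 and padding 0, as shown on the right of Figure 3.

Figure 3 Example input for 2D sparse convolution

    **Output data:** the output of a sparse convolution differs considerably from that of a conventional convolution, and there are two output definitions. The **sparse output definition** works like ordinary convolution: an output site is computed whenever the kernel covers at least one input site. The **submanifold output definition** computes an output only when the kernel center covers an input site.

    For the 5 × 5 input, 3 × 3 kernel, stride = 1, and padding = 0, the output tensor is 3 × 3. The result under the **sparse output definition** is on the left of Figure 4: position (0, 0) is marked A1, meaning its value depends only on P1, while position (0, 1) is marked A1A2, meaning it depends on both P1 and P2. The result under the **submanifold output definition** is on the right of Figure 4, where only A1 and A2 respond.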
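The two output definitions can be checked on the 5 × 5 example with a few lines of plain Python. This is only an illustrative sketch: the function names are made up, coordinates are (row, column), and stride 1 with padding 0 is assumed throughout.

```python
def sparse_output_sites(active, in_size, k=3):
    """Sparse definition: a site is active if the window covers any active input."""
    out_size = in_size - k + 1  # stride 1, padding 0
    sites = {}
    for r in range(out_size):
        for c in range(out_size):
            hit = [name for name, (pr, pc) in active.items()
                   if r <= pr < r + k and c <= pc < c + k]
            if hit:
                sites[(r, c)] = hit
    return sites


def submanifold_output_sites(active, in_size, k=3):
    """Submanifold definition: active only where the kernel center sits on an input."""
    out_size = in_size - k + 1
    half = k // 2
    sites = {}
    for name, (pr, pc) in active.items():
        r, c = pr - half, pc - half  # window whose center is the input site
        if 0 <= r < out_size and 0 <= c < out_size:
            sites[(r, c)] = [name]
    return sites


active = {"P1": (1, 2), "P2": (2, 3)}  # (row, col) from the index list
sparse = sparse_output_sites(active, 5)
subm = submanifold_output_sites(active, 5)
```

Here `sparse` ends up with eight active sites (six touched by P1, six by P2, four shared), while `subm` has only two, one per input point.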

Figure 4 Example output of 2D sparse convolution

2.2.1.2 Implementation

2.2.1.2.1 Building the hash tables

    **Step 1:** build **index-coordinate hash tables** from the **input tensor** and the **output tensor** (using the **sparse output definition** as the example).
    First build the input hash table $Hash_{in}$, where $key_{in}$ is the coordinate of an input pixel and $v_{in}$ is its index. Each row is one **active input site**. For input P1, the entry pairs index 0 with coordinate (2, 1); the tables appear to use (x, y) order, so the index [1, 2] becomes (2, 1). Six output pixels depend on P1 (the six A1 positions), and their coordinates are denoted $P_{out}$. From $P_{out}$ build the hash table $Hash_{out}$, where $key_{out}$ is a coordinate in the output tensor and $v_{out}$ is again an index.
    P2 is handled the same way: its entry pairs index 1 with coordinate (3, 2), and six output pixels (the A2 positions) depend on it. When these are added to $P_{out}$, some coordinates are already present. The duplicates are skipped and only the new ones are appended, namely **{6, (1, 2)}** and **{7, (2, 2)}**.
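The construction of the two hash tables can be sketched with Python dicts. This is an illustration under assumptions, not spconv's code: `output_coords` is a hypothetical helper, coordinates are taken as (x, y), and the enumeration order (row by row) is chosen so the indices match the ones quoted above.

```python
def output_coords(x, y, k=3, out_size=3):
    """Output sites (x, y) whose k x k window covers input (x, y); stride 1, padding 0."""
    return [(ox, oy)
            for oy in range(max(0, y - k + 1), min(out_size, y + 1))
            for ox in range(max(0, x - k + 1), min(out_size, x + 1))]


hash_in = {}    # input coordinate  -> input index
hash_out = {}   # output coordinate -> output index
for coord in [(2, 1), (3, 2)]:          # P1, P2 as (x, y)
    hash_in[coord] = len(hash_in)
    for out in output_coords(*coord):
        if out not in hash_out:         # coordinates already present are skipped
            hash_out[out] = len(hash_out)
```

P1 fills output indices 0 to 5, and P2 only contributes the two new coordinates (1, 2) and (2, 2), which receive indices 6 and 7.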

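Figure 5 Hash-table implementation of 2D sparse convolution

2.2.1.2.2 Building the Rulebook

    **Rulebook definition**

    What is a ***Rulebook***? In essence it is a table. Section **2.2.1.2.1** built the input and output hash tables, which map input and output tensor coordinates to indices. The Rulebook links the indices in the two tables. That link is essentially all sparse convolution needs, so it is the key to the implementation.

Figure 6 Rulebook definition (the mapping between the input and output hash tables)

    **Building the *Rulebook***

    **(1) Overall flow**

Figure 7 Rulebook construction

    **(2)** From $P_{out}$ to **GetOffset()**
    As shown in the figure below, the 5 × 5 input convolved with the 3 × 3 kernel gives a 3 × 3 output. Take output position **(0, 0)**: its value comes from the orange **3 × 3** window in the top-left corner of the **input**. In that window only the **P1 position on the right is nonzero**, and **all other positions are zero**. This convolution therefore only needs the kernel weight at that position and the corresponding input value. P1 corresponds to kernel position **(1, 0)**, an offset measured from the kernel center, so **(1, 0)** is added to the **GetOffset()** result.

Figure 8 From Pout to GetOffset()

    **Note:** the formula above (in the figure) assumes stride = 1 and padding = 0. In short, GetOffset() finds which kernel weight is needed to compute a given output position.
    **(3)** From **GetOffset()** to the ***Rulebook***
    The previous step recorded the position of the kernel weight. To complete the convolution, this step records which input value to use and where the result goes. As shown in Figure 9, in the Rulebook the red box is the kernel weight position from the previous step, the orange box is the input index of the input value, and the green box is the output index where the convolution result is written.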
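As a rough sketch of how such a Rulebook could be assembled for the running example, the snippet below loops over every active input and every kernel offset and records the (input index, output index) pair under that offset. The function name and the loop order are my own choices (the order is picked so that output indices match the hash table above); coordinates are (x, y), with stride 1 and padding 0.

```python
from collections import defaultdict


def build_rulebook(active_xy, in_size=5, k=3):
    """Return ({offset: [(in_idx, out_idx), ...]}, {output coord: out_idx})."""
    out_size = in_size - k + 1
    half = k // 2
    hash_out = {}
    rulebook = defaultdict(list)
    for in_idx, (x, y) in enumerate(active_xy):
        for dy in range(half, -half - 1, -1):
            for dx in range(half, -half - 1, -1):
                # Output site whose kernel center is at input - offset.
                ox, oy = x - half - dx, y - half - dy
                if 0 <= ox < out_size and 0 <= oy < out_size:
                    out_idx = hash_out.setdefault((ox, oy), len(hash_out))
                    rulebook[(dx, dy)].append((in_idx, out_idx))
    return dict(rulebook), hash_out


rulebook, hash_out = build_rulebook([(2, 1), (3, 2)])  # P1, P2
```

For instance, `rulebook[(1, 0)]` starts with `(0, 0)`: P1 (input index 0) contributes to output index 0 through the weight at offset (1, 0), which is the case worked through above. The number of pairs stored under an offset plays the role of the count column.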

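Figure 9 From GetOffset() to Rulebook

2.2.1.2.3 Computing the sparse convolution

Sparse convolution is computed by looking up the Rulebook. Because the lookups can run in parallel on the GPU, this is efficient.

Take the first Rulebook row (red box) as an example. First, the offset (-1, -1) selects the kernel weight F0. Next, the input index is looked up in the input hash table to find the corresponding feature vector (0.1, 0.1, 0.1).

Then note that in the figure below the red box and the dark-blue box both have output index 5. Their results land at the same output position and must be accumulated. The Output Sparse Tensor in Figure 9 has size 9 × 2: 9 for the 3 × 3 output map and 2 for the two output channels.

Finally, once the computation is done, the output index is mapped back to row and column coordinates and the result is written to the corresponding position of the output tensor.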
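The gather, multiply, and scatter-add pattern described above can be sketched as follows. Only an excerpt of the Rulebook is used, and the kernel weights are made-up numbers chosen for easy arithmetic; `apply_rules` is a hypothetical name, not an spconv function.

```python
def apply_rules(rules, in_feats, weights, num_out, c_out):
    """For each rule: gather the input row, multiply by the offset's weight, scatter-add."""
    out = [[0.0] * c_out for _ in range(num_out)]
    for offset, pairs in rules.items():
        w = weights[offset]                  # [C_in][C_out] matrix for this kernel position
        for in_idx, out_idx in pairs:
            x = in_feats[in_idx]
            for j in range(c_out):
                out[out_idx][j] += sum(x[i] * w[i][j] for i in range(len(x)))
    return out


in_feats = [[0.1, 0.1, 0.1], [0.2, 0.2, 0.2]]            # features of P1 and P2
rules = {(-1, -1): [(0, 5)], (0, 0): [(0, 1), (1, 5)]}   # excerpt of the Rulebook
weights = {                                              # illustrative values only
    (-1, -1): [[1.0, 0.0]] * 3,
    (0, 0): [[0.5, 2.0]] * 3,
}
out = apply_rules(rules, in_feats, weights, num_out=8, c_out=2)
```

Output row 5 receives two contributions, one from P1 through offset (-1, -1) and one from P2 through offset (0, 0), and they are summed. On a GPU each rule can be processed independently, which is why the accumulation has to be a scatter-add.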

Figure 9 Sparse convolution computation

2.2.2 Code:

    The **spconv.SparseConv3d** and **spconv.SubMConv3d** modules used here encapsulate the computation described in 2.2.1.

class SpMiddleFHD(nn.Module):
    def __init__(self,
                 output_shape,
                 use_norm=True,
                 num_input_features=128,
                 num_filters_down1=[64],
                 num_filters_down2=[64, 64],
                 name='SpMiddleFHD'):
        super(SpMiddleFHD, self).__init__()
        self.name = name
        if use_norm:
            BatchNorm2d = change_default_args(eps=1e-3, momentum=0.01)(nn.BatchNorm2d)
            BatchNorm1d = change_default_args(eps=1e-3, momentum=0.01)(nn.BatchNorm1d)
            Conv2d = change_default_args(bias=False)(nn.Conv2d)
            SpConv3d = change_default_args(bias=False)(spconv.SparseConv3d)
            SubMConv3d = change_default_args(bias=False)(spconv.SubMConv3d)
            ConvTranspose2d = change_default_args(bias=False)(nn.ConvTranspose2d)
        else:
            BatchNorm2d = Empty
            BatchNorm1d = Empty
            Conv2d = change_default_args(bias=True)(nn.Conv2d)
            SpConv3d = change_default_args(bias=True)(spconv.SparseConv3d)
            SubMConv3d = change_default_args(bias=True)(spconv.SubMConv3d)
            ConvTranspose2d = change_default_args(bias=True)(nn.ConvTranspose2d)
        sparse_shape = np.array(output_shape[1:4]) + [1, 0, 0]
        # sparse_shape[0] = 11
        print(sparse_shape)
        self.sparse_shape = sparse_shape
        self.voxel_output_shape = output_shape
        # input: # [1600, 1200, 41]
        self.middle_conv = spconv.SparseSequential(
            SubMConv3d(num_input_features, 16, 3, indice_key="subm0"),
            BatchNorm1d(16),
            nn.ReLU(),
            SubMConv3d(16, 16, 3, indice_key="subm0"),
            BatchNorm1d(16),
            nn.ReLU(),
            SpConv3d(16, 32, 3, 2, padding=1),  # [1600, 1200, 41] -> [800, 600, 21]
            BatchNorm1d(32),
            nn.ReLU(),
            SubMConv3d(32, 32, 3, indice_key="subm1"),
            BatchNorm1d(32),
            nn.ReLU(),
            SubMConv3d(32, 32, 3, indice_key="subm1"),
            BatchNorm1d(32),
            nn.ReLU(),
            SpConv3d(32, 64, 3, 2,padding=1),  # [800, 600, 21] -> [400, 300, 11]
            BatchNorm1d(64),
            nn.ReLU(),
            SubMConv3d(64, 64, 3, indice_key="subm2"),
            BatchNorm1d(64),
            nn.ReLU(),
            SubMConv3d(64, 64, 3, indice_key="subm2"),
            BatchNorm1d(64),
            nn.ReLU(),
            SubMConv3d(64, 64, 3, indice_key="subm2"),
            BatchNorm1d(64),
            nn.ReLU(),
            SpConv3d(64, 64, 3, 2, padding=[0, 1, 1]),  # [400, 300, 11] -> [200, 150, 5]
            BatchNorm1d(64),
            nn.ReLU(),
            SubMConv3d(64, 64, 3, indice_key="subm3"),
            BatchNorm1d(64),
            nn.ReLU(),
            SubMConv3d(64, 64, 3, indice_key="subm3"),
            BatchNorm1d(64),
            nn.ReLU(),
            SubMConv3d(64, 64, 3, indice_key="subm3"),
            BatchNorm1d(64),
            nn.ReLU(),
            SpConv3d(64, 64, (3, 1, 1), (2, 1, 1)),  # [200, 150, 5] -> [200, 150, 2]
            BatchNorm1d(64),
            nn.ReLU(),
        )
        self.max_batch_size = 6
        # self.grid = torch.full([self.max_batch_size, *sparse_shape], -1, dtype=torch.int32).cuda()
    def forward(self, voxel_features, coors, batch_size):
        # coors[:, 1] += 1
        coors = coors.int()     # [123862,4]
        ret = spconv.SparseConvTensor(voxel_features, coors, self.sparse_shape, batch_size)       # [123862,4]
        # t = time.time()
        # torch.cuda.synchronize()
        ret = self.middle_conv(ret)         # [57238,64]
        # torch.cuda.synchronize()
        # print("spconv forward time", time.time() - t)
        ret = ret.dense()                   # [8,64,2,200,176]
        N, C, D, H, W = ret.shape
        ret = ret.view(N, C * D, H, W)      # [8,128,200,176]
        return ret
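The shape comments in `middle_conv` follow from the usual convolution output-size formula, floor((n + 2p - k) / s) + 1. The short check below reproduces them; the layer list is transcribed from the four strided `SpConv3d` calls above, in (D, H, W) order, and `conv_out` is just a helper name. The `SubMConv3d` layers are left out because they keep the spatial shape unchanged.

```python
def conv_out(size, k, s, p):
    """Output length of a convolution along one axis."""
    return (size + 2 * p - k) // s + 1


layers = [  # (kernel, stride, padding), each in (D, H, W) order
    ((3, 3, 3), (2, 2, 2), (1, 1, 1)),
    ((3, 3, 3), (2, 2, 2), (1, 1, 1)),
    ((3, 3, 3), (2, 2, 2), (0, 1, 1)),
    ((3, 1, 1), (2, 1, 1), (0, 0, 0)),
]
shape = (41, 1200, 1600)  # the comments list the same sizes as [1600, 1200, 41]
trace = [shape]
for k, s, p in layers:
    shape = tuple(conv_out(n, ki, si, pi) for n, ki, si, pi in zip(shape, k, s, p))
    trace.append(shape)

dense_channels = 64 * trace[-1][0]  # C * D after ret.view(N, C * D, H, W)
```

The depth axis shrinks 41 → 21 → 11 → 5 → 2, so flattening depth into channels gives 64 × 2 = 128, which matches `num_input_features=128` of the RPN.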

2.3 Rule Generation Algorithm

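2.3.1 Principle:

    **SECOND** improves the rule generation step of the sparse convolution in **2.2**. The existing rule generation algorithm is **CPU-based** and uses a **hash table**. It is slow and requires transferring data between the **CPU and GPU**.
    The authors therefore designed a **GPU-based rule generation algorithm**, which runs faster. First, the input indices and their associated spatial indices are collected, rather than the output indices (**the first loop in 2.3.2**). Duplicate output positions are obtained at this stage. Then a parallel unique (deduplication) algorithm is run on the spatial index data to obtain the output indices and their associated spatial indices. From the previous result, a buffer with the same spatial dimensions as the sparse data is generated for table lookup in the next step (**the second loop in 2.3.2**). Finally, the rules are iterated and the stored spatial indices are used to obtain the output index for each input index (**the third loop in 2.3.2**).

Figure 10 GPU-based rule generation algorithm

2.3.2 Pseudocode: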
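A minimal sequential sketch of the three loops described in 2.3.1, applied to the 2D example from 2.2. It only illustrates the data flow as I understand it: on a GPU every loop body would run in parallel, the sorted set stands in for the parallel unique step, and all names are hypothetical.

```python
def generate_rules(active_xy, in_size=5, k=3):
    out_size = in_size - k + 1          # stride 1, padding 0
    half = k // 2

    # Loop 1: one record per (input, kernel offset) pair.
    # The flattened output position is stored; duplicates are allowed.
    pending = []
    for in_idx, (x, y) in enumerate(active_xy):
        for dy in range(-half, half + 1):
            for dx in range(-half, half + 1):
                ox, oy = x - half - dx, y - half - dy
                if 0 <= ox < out_size and 0 <= oy < out_size:
                    pending.append(((dx, dy), in_idx, oy * out_size + ox))

    # Unique step, then loop 2: fill a dense buffer of the output's
    # spatial size that maps spatial index -> output index.
    unique_spatial = sorted({s for _, _, s in pending})
    grid = [-1] * (out_size * out_size)
    for out_idx, s in enumerate(unique_spatial):
        grid[s] = out_idx

    # Loop 3: resolve each stored spatial index to its output index.
    rules = {}
    for offset, in_idx, s in pending:
        rules.setdefault(offset, []).append((in_idx, grid[s]))
    return rules, unique_spatial


rules, unique_spatial = generate_rules([(2, 1), (3, 2)])  # P1, P2 as (x, y)
```

No hash table is needed: the dense `grid` buffer replaces it, at the cost of memory proportional to the spatial size of the output.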

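2.4 Region Proposal Network

2.4.1 Principle:

    The RPN was first proposed in Faster R-CNN. Put simply, it is where the anchors of **SECOND** come in: at each **cell** or **voxel** location, **anchor boxes (prior boxes / candidate regions)** are generated in advance, and the network output is treated as offsets and scalings relative to those anchors. Combining the anchors with the network output and filtering by a score threshold gives the final detections.
    Structurally the RPN is very simple. It uses only 2D convolutions, concatenates the feature maps from its stages, and ends with a cls-head and a reg-head for classification and regression.
    Note: unlike VoxelNet, whose RPN covers both 2D and 3D convolutions, SECOND separates the two and keeps only the 2D convolutions in its RPN.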
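To illustrate what "offsets and scalings relative to the anchor" means, here is a sketch of a VoxelNet-style residual encoding with seven values per box, matching `box_code_size=7`. The value order (x, y, z, w, l, h, yaw) and the anchor numbers are assumptions for the example, and this is not a copy of the repository's box coder.

```python
import math


def encode(gt, anchor):
    """Regression targets of a ground-truth box relative to an anchor."""
    xg, yg, zg, wg, lg, hg, rg = gt
    xa, ya, za, wa, la, ha, ra = anchor
    diag = math.sqrt(la ** 2 + wa ** 2)   # BEV diagonal of the anchor
    return ((xg - xa) / diag, (yg - ya) / diag, (zg - za) / ha,
            math.log(wg / wa), math.log(lg / la), math.log(hg / ha),
            rg - ra)


def decode(residual, anchor):
    """Inverse of encode: turn network outputs back into a box."""
    xt, yt, zt, wt, lt, ht, rt = residual
    xa, ya, za, wa, la, ha, ra = anchor
    diag = math.sqrt(la ** 2 + wa ** 2)
    return (xt * diag + xa, yt * diag + ya, zt * ha + za,
            math.exp(wt) * wa, math.exp(lt) * la, math.exp(ht) * ha,
            rt + ra)


anchor = (10.0, 2.0, -1.0, 1.6, 3.9, 1.56, 0.0)   # illustrative car-sized anchor
gt = (10.5, 1.8, -0.9, 1.7, 4.2, 1.5, 0.3)
residual = encode(gt, anchor)
recovered = decode(residual, anchor)
```

Positions are normalized by the anchor size and dimensions are encoded as log ratios, so a zero output means "exactly the anchor".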

Figure 11 RPN structure

2.4.2 Code:

class RPNBase(RPNNoHeadBase):
    def __init__(self,
                 use_norm=True,
                 num_class=2,
                 layer_nums=(3, 5, 5),
                 layer_strides=(2, 2, 2),
                 num_filters=(128, 128, 256),
                 upsample_strides=(1, 2, 4),
                 num_upsample_filters=(256, 256, 256),
                 num_input_features=128,
                 num_anchor_per_loc=2,
                 encode_background_as_zeros=True,
                 use_direction_classifier=True,
                 use_groupnorm=False,
                 num_groups=32,
                 box_code_size=7,
                 num_direction_bins=2,
                 name='rpn'):
        """upsample_strides support float: [0.25, 0.5, 1]
        if upsample_strides < 1, conv2d will be used instead of convtranspose2d.
        """
        super(RPNBase, self).__init__(
            use_norm=use_norm,
            num_class=num_class,
            layer_nums=layer_nums,
            layer_strides=layer_strides,
            num_filters=num_filters,
            upsample_strides=upsample_strides,
            num_upsample_filters=num_upsample_filters,
            num_input_features=num_input_features,
            num_anchor_per_loc=num_anchor_per_loc,
            encode_background_as_zeros=encode_background_as_zeros,
            use_direction_classifier=use_direction_classifier,
            use_groupnorm=use_groupnorm,
            num_groups=num_groups,
            box_code_size=box_code_size,
            num_direction_bins=num_direction_bins,
            name=name)
        self._num_anchor_per_loc = num_anchor_per_loc
        self._num_direction_bins = num_direction_bins
        self._num_class = num_class
        self._use_direction_classifier = use_direction_classifier
        self._box_code_size = box_code_size
        if encode_background_as_zeros:
            num_cls = num_anchor_per_loc * num_class
        else:
            num_cls = num_anchor_per_loc * (num_class + 1)
        if len(num_upsample_filters) == 0:
            final_num_filters = self._num_out_filters
        else:
            final_num_filters = sum(num_upsample_filters)
        self.conv_cls = nn.Conv2d(final_num_filters, num_cls, 1)
        self.conv_box = nn.Conv2d(final_num_filters, num_anchor_per_loc * box_code_size, 1)
        if use_direction_classifier:
            self.conv_dir_cls = nn.Conv2d(final_num_filters, num_anchor_per_loc * num_direction_bins, 1)
    def forward(self, x):
        res = super().forward(x)        # [8,128,200,176]
        x = res["out"]                  # [8,128,200,176]
        box_preds = self.conv_box(x)        # [8,14,200,176]
        cls_preds = self.conv_cls(x)        # [8,2, 200,176]
        # [N, C, y(H), x(W)]
        C, H, W = box_preds.shape[1:]   # 14,200,176
        box_preds = box_preds.view(-1, self._num_anchor_per_loc,self._box_code_size, H, W).permute(0, 1, 3, 4, 2).contiguous()  # [8,2,200,176,7]
        cls_preds = cls_preds.view(-1, self._num_anchor_per_loc,self._num_class, H, W).permute(0, 1, 3, 4, 2).contiguous()      # [8,2,200,176,1]
        # box_preds = box_preds.permute(0, 2, 3, 1).contiguous()
        # cls_preds = cls_preds.permute(0, 2, 3, 1).contiguous()
        ret_dict = {"box_preds": box_preds,"cls_preds": cls_preds,}
        if self._use_direction_classifier:
            dir_cls_preds = self.conv_dir_cls(x)
            dir_cls_preds = dir_cls_preds.view(-1, self._num_anchor_per_loc, self._num_direction_bins, H, W).permute(0, 1, 3, 4, 2).contiguous()
            # dir_cls_preds = dir_cls_preds.permute(0, 2, 3, 1).contiguous()
            ret_dict["dir_cls_preds"] = dir_cls_preds
        return ret_dict

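2.5 Data Augmentation

    **SECOND**'s data augmentation is quite interesting. Besides the random rotation and scaling of point clouds inherited from **VoxelNet**, it adopts the copy-paste scheme common in industry. I could not find the corresponding part in the source code, so what follows is my own understanding.
    In essence, **SECOND**'s augmentation is **3D copy-paste**. It addresses the same problem as missing samples in 2D detection, where **cut-and-paste** is commonly used to reduce false positives and missed detections. As a simple example, suppose I need to train a fire and smoke detector. Under normal conditions fire does not occur, so the video frames from the scene contain no flame samples. A common industrial solution is to collect flame images from the **web** or generate them with a **generator such as a GAN**, paste them into the scene frames with an image-editing tool, and annotate the result with a **standard annotation tool** such as **labelimg**. This yields data with flames in the current scene, which can then be used to train a 2D detector.
    3D detection suffers from the same **sample shortage**, so SECOND adopts the same idea. In the paper's words, a database containing the labels of all ground truths and their associated point cloud data (the points inside the 3D bounding boxes of the ground truths) is generated from the training set. During training, several ground truths are randomly selected from this database and introduced into the current training point cloud via concatenation, which simulates objects existing in different environments.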
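The database sampling idea can be sketched in a few lines. Everything here is illustrative: the data layout and function names are made up, and the collision check is a simplified axis-aligned BEV test that ignores yaw (the SECOND paper describes a collision test after sampling, but this is not its implementation).

```python
import random


def bev_overlap(a, b):
    """Axis-aligned overlap test for BEV boxes (cx, cy, size_x, size_y)."""
    return (abs(a[0] - b[0]) * 2 < a[2] + b[2]) and (abs(a[1] - b[1]) * 2 < a[3] + b[3])


def sample_ground_truths(scene_points, scene_boxes, database, num_samples, rng):
    """Paste randomly chosen database objects into the scene if they do not collide."""
    points = list(scene_points)
    boxes = list(scene_boxes)
    for entry in rng.sample(database, min(num_samples, len(database))):
        if any(bev_overlap(entry["box"], b) for b in boxes):
            continue  # skip objects that would overlap an existing box
        boxes.append(entry["box"])
        points.extend(entry["points"])  # "concatenation" of the object's points
    return points, boxes


scene_points = [(5.0, 0.0, -1.0)]
scene_boxes = [(5.0, 0.0, 2.0, 4.0)]
database = [
    {"box": (5.5, 1.0, 2.0, 4.0), "points": [(5.5, 1.0, -1.0)]},   # collides with the scene box
    {"box": (20.0, -6.0, 2.0, 4.0), "points": [(20.0, -6.0, -1.0), (20.2, -6.1, -0.8)]},
]
points, boxes = sample_ground_truths(
    scene_points, scene_boxes, database, num_samples=2, rng=random.Random(0))
```

Only the second database object is pasted, so the scene ends up with two boxes and three points.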
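    Note: copy-paste is really an industrial trick. Using it directly as an augmentation on a public benchmark feels a bit opportunistic to me, and the comparison experiment seems of limited value because the result will almost certainly improve.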

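2.6 Loss

    SECOND's loss is still the classic regression loss plus classification loss. In the SECOND paper it is written as:

$$L_{total} = \beta_1 L_{cls} + \beta_2 \left(L_{reg\text{-}\theta} + L_{reg\text{-}other}\right) + \beta_3 L_{dir}$$

    where **Lcls** is the classification loss, **Lreg−other** is the regression loss for location and size, **Lreg−θ** is the new angle loss, **Ldir** is the direction classification loss, and β1, β2, β3 are weighting constants.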
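For the angle term, the SECOND paper uses a sine-error loss, SmoothL1(sin(θp − θt)), paired with a direction classifier because the sine cannot tell a box from the same box flipped by π. The sketch below shows both pieces and the weighted sum of the four terms. The function names are mine, and the default weights (1.0, 2.0, 0.2) are the values I recall from the paper, so treat them as an assumption.

```python
import math


def smooth_l1(x, beta=1.0):
    """Standard smooth L1: quadratic near zero, linear elsewhere."""
    x = abs(x)
    return 0.5 * x * x / beta if x < beta else x - 0.5 * beta


def sine_error_loss(theta_pred, theta_gt):
    """Angle regression loss applied to the sine of the angle difference."""
    return smooth_l1(math.sin(theta_pred - theta_gt))


def direction_target(theta_gt):
    """Binary direction label: 1 if the ground-truth yaw is above zero, else 0."""
    return 1 if theta_gt > 0 else 0


def total_loss(l_cls, l_reg_theta, l_reg_other, l_dir, betas=(1.0, 2.0, 0.2)):
    """Weighted sum of the four loss terms."""
    b1, b2, b3 = betas
    return b1 * l_cls + b2 * (l_reg_theta + l_reg_other) + b3 * l_dir


flipped = sine_error_loss(0.1, 0.1 + math.pi)    # same box, opposite heading
quarter = sine_error_loss(0.0, math.pi / 2)      # worst case for the sine term
```

A prediction that is off by exactly π gets almost no angle loss, which is why the direction classification loss is needed to resolve the heading.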

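Summary

    **(1) SECOND** makes three classic changes on top of **VoxelNet**: an improved convolution structure, improved data augmentation, and an improved loss function.
    **(2) SECOND**'s major innovation is in the 3D convolution structure. Exploiting the **sparsity of point clouds**, it replaces **conventional 3D convolution** with **sparse 3D convolution**, which greatly improves convolution efficiency. Its **GPU-based rule generation** also removes the slow **CPU-GPU** transfers of the earlier approach.
    **(3) SECOND**'s data augmentation greatly improves the network's generalization.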



Reposted from: https://blog.csdn.net/weixin_50206562/article/details/140275772
Copyright belongs to the original author, hunter@@. In case of infringement, please contact us for removal.