Keras Deep Learning Framework in Practice (6): Video Classification with a CNN-RNN Architecture

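1. Introduction

1.1 Overview of CNN-RNN

1.1.1 Components

The CNN-RNN architecture combines two different types of neural network: the convolutional neural network (CNN) and the recurrent neural network (RNN).

Convolutional neural network (CNN):
- Mainly used for data with a grid-like topology, such as images and video.
- Consists of convolutional layers, pooling layers, and fully connected layers.
- A convolutional layer slides a filter (also called a kernel) over the image and computes the dot product between the filter and the pixels it covers, extracting specific patterns or features.
- A pooling layer downsamples the feature maps, reducing the spatial dimensions of the data, which lowers computational cost and helps limit overfitting.
- The fully connected layers take the flattened output of the convolutional and pooling layers and produce the final prediction.

Recurrent neural network (RNN):
- Mainly used for sequence data such as time series, speech, and natural language.
- Its recurrent connections tie the output at the current time step to the input of the next time step, which lets it capture temporal dependencies in the sequence.
- Consists of an input layer, a recurrent layer, and an output layer.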
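The sliding dot product and the downsampling step described above can be sketched in a few lines of NumPy. This is only an illustration: the helper names conv2d_valid and max_pool2x2, the 6x6 input, and the edge filter are assumptions made for the example, and real frameworks use much faster implementations.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and take dot products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def max_pool2x2(feature_map):
    """Downsample by taking the maximum of each non-overlapping 2x2 window."""
    h, w = feature_map.shape
    trimmed = feature_map[: h - h % 2, : w - w % 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


image = np.arange(36, dtype=float).reshape(6, 6)   # toy single-channel image
kernel = np.array([[1.0, 0.0, -1.0]] * 3)          # simple vertical-edge filter

features = conv2d_valid(image, kernel)   # shape (4, 4)
pooled = max_pool2x2(features)           # shape (2, 2)
print(features.shape, pooled.shape)
```

Because the toy image increases by exactly 1 from column to column, every window gives the same filter response; on a real frame the responses would differ from place to place and highlight vertical edges.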

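1.1.2 How It Works

In a CNN-RNN architecture, the CNN first processes each frame of the video and extracts its spatial features. The RNN then takes the features extracted by the CNN as input and models how they change over time, in order to classify the video or perform other related tasks.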
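To make the data flow concrete, here is a toy sketch of the two stages. Both stages are stand-ins chosen for illustration: toy_frame_features replaces the CNN with a per-channel average, and toy_rnn is a plain tanh recurrence with random weights. Neither is the model used later in this article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy video: 6 frames of 8x8 RGB pixels.
frames = rng.random((6, 8, 8, 3))


def toy_frame_features(frame):
    # Stand-in for the CNN: average each colour channel over the frame.
    return frame.mean(axis=(0, 1))  # shape (3,)


def toy_rnn(features, hidden_size=4):
    # Stand-in for the RNN: a plain tanh recurrence with random weights.
    w_in = rng.normal(size=(features.shape[1], hidden_size))
    w_rec = rng.normal(size=(hidden_size, hidden_size))
    h = np.zeros(hidden_size)
    for x_t in features:  # one step per frame, in temporal order
        h = np.tanh(x_t @ w_in + h @ w_rec)
    return h  # final state summarises the whole sequence


features = np.stack([toy_frame_features(f) for f in frames])  # (frames, features)
video_state = toy_rnn(features)                               # (hidden_size,)
print(features.shape, video_state.shape)
```

The point of the sketch is the shapes: a video of T frames becomes a (T, feature_dim) sequence, and the recurrent stage reduces that sequence to one fixed-size vector that a classifier can consume.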

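1.1.3 Applications

The CNN-RNN architecture is widely used in video processing tasks such as video classification, action recognition, and event detection. In these tasks it can exploit both the spatial and the temporal information in a video, which improves model performance.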

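1.2 Characteristics of CNN-RNN

The CNN-RNN architecture combines the strengths of CNNs and RNNs and is particularly well suited to tasks such as video classification.

Feature extraction:
CNN: by combining convolutional and pooling layers, a CNN extracts local image features effectively and is very strong at capturing spatial features. This is why CNNs perform so well on tasks such as image recognition and image classification.
RNN: an RNN is particularly suited to sequence data such as text, speech, and time series. Its recurrent connections tie the output at the current time step to the input of the next, capturing temporal dependencies in the sequence.
Processing:
CNN-RNN combination: combining the two lets the architecture handle the spatial features and the temporal dependencies of a video at the same time. The CNN extracts spatial features from the video frames, and the RNN models how those features change over time in order to classify the video.
Learning:
The intermediate state of the RNN can represent relationships between labels. Combined with a CNN, the CNN-RNN structure can learn the dependencies among semantic labels and the relationship between images and labels, which gives it an advantage in tasks such as multi-label image classification.
Applicability:
The CNN-RNN architecture suits video data, especially tasks that need both spatial features and temporal dependencies. In action recognition and event detection, for example, it can capture the temporal relationships between frames as well as the spatial features within each frame.
Extensibility:
The CNN-RNN architecture can be combined with other deep learning techniques, such as attention mechanisms and Transformers, to improve performance further. With an attention mechanism, for example, the model can focus on the key parts of a video and classify it more accurately.

In short, by combining the strengths of CNNs and RNNs, the architecture offers strong feature extraction, joint spatial and temporal processing, label-dependency learning, broad applicability, and extensibility, which makes it a good fit for video classification.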

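2. The Video Classification Task in Detail

This chapter presents a video classification example. Video classification has wide-ranging applications in areas such as recommendation and security.
The example uses the UCF101 dataset to build the video classifier. The dataset consists of videos categorized into different actions, such as cricket shot, punching, and biking. It is commonly used to build action recognizers, which are one application of video classification.

A video consists of an ordered sequence of frames. Each frame carries spatial information, and the sequence of frames carries temporal information. To model both aspects, we use a hybrid architecture made of convolutions (for spatial processing) and recurrent layers (for temporal processing).

Specifically, the example uses a convolutional neural network (CNN) and a recurrent neural network (RNN) built from GRU layers.
This kind of hybrid architecture is commonly known as a CNN-RNN.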

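2.1 Preparation

2.1.1 Software Installation

The example requires TensorFlow 2.5 or higher, as well as TensorFlow Docs. Note that the code below also calls keras.ops, which is only available in Keras 3, so a recent Keras installation is needed in practice. TensorFlow Docs can be installed with the following command:

!pip install -q git+https://github.com/tensorflow/docs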

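2.1.2 Data Collection

To keep the runtime of this example relatively short, we use a subsampled version of the original UCF101 dataset. The dataset can be downloaded and unpacked as follows:

!wget -q https://github.com/sayakpaul/Action-Recognition-in-TensorFlow/releases/download/v1.0.0/ucf101_top5.tar.gz
!tar xf ucf101_top5.tar.gz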

2.1.3 Setup

Import the libraries used in the example:

import os
import keras
from imutils import paths
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import imageio
import cv2
from IPython.display import Image

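2.2 Preprocessing

2.2.1 Defining Hyperparameters

In a CNN-RNN video classification task, choosing suitable hyperparameters is critical to training and performance. They include the input image size, the convolutional layer parameters (kernel size, number of filters, stride, and padding), the pooling layer parameters (window size and stride), the fully connected layer parameters (number of units), and the RNN parameters: sequence length, RNN type (such as LSTM or GRU), number of hidden units, and number of stacked layers. Training hyperparameters such as the learning rate, batch size, and number of epochs also need care. The learning rate sets the step size of parameter updates, the batch size affects resource usage and training stability, and the number of epochs controls how many times the whole training set is traversed. These choices should be tuned to the task, the dataset, and the available hardware.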

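2.2.1.1 CNN Hyperparameters

Input image size:
- Definition: the pixel dimensions of the video frames fed to the CNN.
- Guidance: common sizes include 32x32, 96x96, and 224x224. Higher resolution usually helps performance but increases computational cost, so the choice should reflect the hardware available and the needs of the task.

Convolutional layer parameters:
- Kernel size: the window size of the convolution. Common settings are small 3x3 or 5x5 kernels. Stacks of small kernels let the network grow deeper and more expressive while using fewer parameters than a single large kernel.
- Number of filters: the number of kernels in each convolutional layer. Powers of two such as 64 or 128 are common. More filters extract richer features but increase computation and memory use.
- Stride: the step with which the kernel slides over the input. It is usually 1, which preserves spatial detail.
- Padding: pixels added around the border of the input. Zero padding is commonly used so that the output of a convolutional layer has the same spatial size as its input.

Pooling layer parameters:
- Pooling size: the window size of the pooling operation. A common setting is 2x2.
- Pooling stride: the step with which the pooling window slides over the feature map. It is usually equal to the window size, so a 2x2 window halves the height and the width, and the output has one quarter as many positions as the input.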
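The effect of kernel size, stride, and padding on the output size follows the standard formula floor((n + 2p - k) / s) + 1 along each spatial dimension. A minimal sketch, where the function name output_size is just an illustrative choice:

```python
def output_size(n, kernel, stride=1, padding=0):
    """Spatial size along one dimension after a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


# 3x3 convolution, stride 1, one pixel of zero padding: the size is preserved.
same_conv = output_size(224, kernel=3, stride=1, padding=1)

# 2x2 pooling with stride 2: each side is halved.
pooled = output_size(224, kernel=2, stride=2)

print(same_conv, pooled)                # 224 112
print(pooled * pooled / (224 * 224))    # 0.25
```

The last line confirms the statement about pooling: halving both sides leaves one quarter of the original positions.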

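2.2.1.2 RNN Hyperparameters

Sequence length:
- Definition: the number of video frames in the sequence fed to the RNN.
- Considerations: the sequence length should follow from the video content and the task. Longer sequences capture more temporal information but increase computational cost.

RNN type:
- Definition: the kind of recurrent layer used, such as LSTM or GRU.
- Considerations: LSTM and GRU handle long-term dependencies differently. LSTM has more gates and parameters, so it is usually more expressive but costs more to compute. GRU is a lighter alternative.

Number of hidden units:
- Definition: the number of units in the recurrent hidden layer.
- Considerations: more hidden units increase the expressive power of the model but can lead to overfitting. The number should be chosen according to the task and the performance on the validation set.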

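2.2.1.3 Training Hyperparameters

Learning rate:
- Definition: the step size used when the model parameters are updated.
- Considerations: a learning rate that is too large can make training unstable, and one that is too small makes training slow. Common schedules include step decay, exponential decay, and inverse-time (1/t) decay.

Batch size:
- Definition: the number of samples used in each training step.
- Considerations: a larger batch size can speed up training but needs more memory. A smaller batch size may improve generalization but can make training noisier.

Number of epochs:
- Definition: the number of times the model processes the whole training set.
- Considerations: more epochs can improve performance, but too many lead to overfitting. The number should be chosen according to the performance on the validation set.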
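The three learning-rate schedules named above can be written as small functions of the epoch number. The function names and the constants (drop, epochs_per_drop, k) are illustrative assumptions, not values used by the example in this article, which simply relies on the default settings of the Adam optimizer.

```python
import math


def step_decay(lr0, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the rate by `drop` once every `epochs_per_drop` epochs."""
    return lr0 * drop ** (epoch // epochs_per_drop)


def exponential_decay(lr0, epoch, k=0.1):
    """Shrink the rate continuously: lr0 * exp(-k * epoch)."""
    return lr0 * math.exp(-k * epoch)


def inverse_time_decay(lr0, epoch, k=0.1):
    """Shrink the rate as lr0 / (1 + k * epoch)."""
    return lr0 / (1.0 + k * epoch)


for epoch in (0, 10, 20):
    print(
        epoch,
        step_decay(0.01, epoch),
        round(exponential_decay(0.01, epoch), 5),
        round(inverse_time_decay(0.01, epoch), 5),
    )
```

All three start at the same initial rate and differ only in how quickly they reduce it, which is the trade-off to tune against the validation curve.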

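Tuning these hyperparameters carefully improves the performance of a CNN-RNN video classifier and adapts it to a specific task and dataset. The example uses the following values:

IMG_SIZE = 224  # height and width of each frame fed to the CNN
BATCH_SIZE = 64
EPOCHS = 10
MAX_SEQ_LENGTH = 20  # maximum number of frames kept per video
NUM_FEATURES = 2048  # length of the feature vector extracted per frame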

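2.2.2 Data Preparation

In a CNN-RNN video classification task, video preprocessing is a critical step that directly affects training and inference. It usually involves the following. First, the video file is split into consecutive image frames, which can be done with a video library such as OpenCV. Next, the frames are resized to a common size to match the input requirements of the CNN, keeping the aspect ratio to avoid distortion. Depending on the task, colour frames can be converted to grayscale to simplify computation. The frames are then normalized by scaling pixel values to [0, 1] or [-1, 1], which speeds up training and improves performance. To increase the diversity of the data and improve generalization, data augmentation such as random cropping, rotation, flipping, and brightness adjustment can be applied. The preprocessed frames are then arranged in their original order into a feature sequence, which becomes the input to the RNN. Finally, several feature sequences can be grouped into a batch for efficiency. After these steps the raw video files are in a form that a CNN-RNN model can process.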
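The two normalization ranges mentioned above differ only in the scaling constants. A minimal sketch on a hypothetical 2x2 RGB frame (the pixel values are made up for illustration):

```python
import numpy as np

# Hypothetical 2x2 RGB frame with 8-bit pixel values.
frame = np.array(
    [[[0, 128, 255], [64, 64, 64]],
     [[255, 255, 255], [0, 0, 0]]],
    dtype=np.uint8,
)

unit_range = frame.astype("float32") / 255.0          # values in [0, 1]
signed_range = frame.astype("float32") / 127.5 - 1.0  # values in [-1, 1]

print(unit_range.min(), unit_range.max())
print(signed_range.min(), signed_range.max())
```

Which range is appropriate depends on the network. The InceptionV3 preprocessing function used later in this example scales inputs to [-1, 1], so the frames are kept as raw 0-255 values until they reach that function.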

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

print(f"Total videos for training: {len(train_df)}")
print(f"Total videos for testing: {len(test_df)}")

train_df.sample(10)

Total videos for training: 594
Total videos for testing: 224

     video_name                   tag
492  v_TennisSwing_g10_c03.avi    TennisSwing
536  v_TennisSwing_g16_c05.avi    TennisSwing
413  v_ShavingBeard_g16_c05.avi   ShavingBeard
268  v_Punch_g12_c04.avi          Punch
288  v_Punch_g15_c03.avi          Punch
30   v_CricketShot_g12_c03.avi    CricketShot
449  v_ShavingBeard_g21_c07.avi   ShavingBeard
524  v_TennisSwing_g14_c07.avi    TennisSwing
145  v_PlayingCello_g12_c01.avi   PlayingCello
566  v_TennisSwing_g21_c03.avi    TennisSwing

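One of the many challenges of training a video classifier is finding a way to feed videos to the network. This blog post (https://blog.coast.ai/five-video-classification-methods-implemented-in-keras-and-tensorflow-99cad29cc0b5) discusses five such methods. Since a video is an ordered sequence of frames, we could extract the frames and put them in a 3D tensor. But the number of frames can differ from video to video, which would prevent us from stacking them into batches (unless we use padding). As an alternative, we can save video frames at a fixed interval until a maximum frame count is reached. In this example we do the following:

1. Capture the frames of a video.
2. Extract frames from the video until the maximum frame count is reached.
3. If a video has fewer frames than the maximum frame count, pad it with zeros.

Note that this workflow is the same as the one used for problems involving text sequences. Videos in the UCF101 dataset do not contain extreme variations in objects and actions across frames, so it may be enough to consider only a few frames for the learning task. This approach may not generalize well to other video classification problems, however. We use OpenCV's VideoCapture() method to read frames from the videos.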

# The following two functions are taken from this tutorial:
# https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub
def crop_center_square(frame):
    # Crop the largest centered square out of the frame.
    y, x = frame.shape[0:2]
    min_dim = min(y, x)
    start_x = (x // 2) - (min_dim // 2)
    start_y = (y // 2) - (min_dim // 2)
    return frame[start_y : start_y + min_dim, start_x : start_x + min_dim]


def load_video(path, max_frames=0, resize=(IMG_SIZE, IMG_SIZE)):
    # Read frames until the video ends or `max_frames` is reached
    # (max_frames=0 means no limit).
    cap = cv2.VideoCapture(path)
    frames = []
    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            frame = crop_center_square(frame)
            frame = cv2.resize(frame, resize)
            frame = frame[:, :, [2, 1, 0]]  # OpenCV returns BGR; reorder to RGB
            frames.append(frame)

            if len(frames) == max_frames:
                break
    finally:
        cap.release()
    return np.array(frames)
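As written, load_video reads consecutive frames from the start of the file, and the later processing keeps only the first MAX_SEQ_LENGTH of them. If frames spread over the whole clip are preferred, the fixed-interval idea can be sketched by choosing which frame indices to keep. The helper uniform_frame_indices below is a hypothetical addition and is not part of the original example.

```python
import numpy as np


def uniform_frame_indices(num_frames, max_frames):
    """Pick at most `max_frames` indices spread evenly over a video."""
    if num_frames <= max_frames:
        return np.arange(num_frames)
    return np.linspace(0, num_frames - 1, num=max_frames).round().astype(int)


long_clip = uniform_frame_indices(101, 5)   # indices 0, 25, 50, 75, 100
short_clip = uniform_frame_indices(3, 5)    # indices 0, 1, 2 (nothing to skip)
print(long_clip, short_clip)
```

One possible use, under the same assumption, is to index the array returned by load_video with these positions, for example frames[uniform_frame_indices(len(frames), MAX_SEQ_LENGTH)].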

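2.2.3 Video Feature Extraction

In a CNN-RNN video classification task, the CNN is responsible for extracting features from the video. The video is first split into a series of frames that serve as the CNN input. To meet the input requirements of the model, the frames are resized to a fixed size, such as 224x224 pixels, and normalized to a fixed range such as [0, 1]. The InceptionV3 preprocessing used below scales pixel values to [-1, 1].

The preprocessed frames are then passed through the CNN for feature extraction. The convolutional layers apply kernels of various sizes to extract local features from each frame, such as colour, texture, and edges. An activation function such as ReLU usually follows each convolutional layer to add non-linearity, so that the model can learn more complex feature representations.

To shrink the feature maps while keeping the important features, the CNN also contains pooling layers. Max pooling and average pooling are the common choices. They reduce the amount of computation and the number of parameters, and they help limit overfitting.

After several convolutional and pooling layers, the CNN outputs a feature map or a feature vector. The feature vector holds global information about the frame. It is largely insensitive to where in the frame a pattern appears, and consecutive frames of the same video tend to produce similar vectors. These vectors serve as the representation of the frames for the RNN stage.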
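The step from a feature map to a single feature vector is typically a global average over the spatial positions. A minimal NumPy sketch follows. The 5x5x2048 shape is an assumption about the final feature map for a 224x224 input, and the random values stand in for real activations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical final feature map for one frame: 5x5 positions, 2048 channels.
feature_map = rng.random((5, 5, 2048)).astype("float32")

# Global average pooling: average over the two spatial axes, keep the channels.
frame_vector = feature_map.mean(axis=(0, 1))
print(frame_vector.shape)   # (2048,)
```

This is the operation that the pooling="avg" option requests from the feature extractor in the example, and it explains why NUM_FEATURES is 2048.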

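In practice, feature extraction is usually done with a CNN that has been pre-trained on a large dataset (such as ResNet or VGG), which improves both accuracy and efficiency. These models have already learned rich image representations and can improve video classification considerably. The structure and parameters of the CNN can also be adjusted to suit the task and the dataset.

In summary, the CNN stage of a CNN-RNN video classifier extracts rich features from the video frames through a series of convolutional, pooling, and fully connected layers. These features capture the spatial information in each frame, benefit from pre-training and fine-tuning, and form the basis for the RNN stage.

In this article we use a pre-trained network to extract meaningful features from the extracted frames. The Keras Applications module provides a number of state-of-the-art models pre-trained on the ImageNet-1k dataset.
We use the InceptionV3 model for this purpose.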

def build_feature_extractor():
    feature_extractor = keras.applications.InceptionV3(
        weights="imagenet",
        include_top=False,
        pooling="avg",
        input_shape=(IMG_SIZE, IMG_SIZE, 3),
    )
    preprocess_input = keras.applications.inception_v3.preprocess_input

    inputs = keras.Input((IMG_SIZE, IMG_SIZE, 3))
    preprocessed = preprocess_input(inputs)

    outputs = feature_extractor(preprocessed)
    return keras.Model(inputs, outputs, name="feature_extractor")


feature_extractor = build_feature_extractor()

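The labels of the videos are strings. Neural networks do not understand string values, so the labels must be converted to a numerical form before they are fed to the model. Here we use the StringLookup layer to encode the class labels as integers.

label_processor = keras.layers.StringLookup(
    num_oov_indices=0, vocabulary=np.unique(train_df["tag"])
)
print(label_processor.get_vocabulary())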

['CricketShot', 'PlayingCello', 'Punch', 'ShavingBeard', 'TennisSwing']
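What the lookup does can be mimicked in plain Python: with no out-of-vocabulary slots, each label is mapped to its position in the sorted vocabulary. The short label list below is made up for illustration.

```python
# Hypothetical sample of tags, in the order they might appear in a CSV file.
labels = ["TennisSwing", "Punch", "CricketShot", "Punch", "PlayingCello", "ShavingBeard"]

vocabulary = sorted(set(labels))  # same order that np.unique would produce
label_to_index = {name: index for index, name in enumerate(vocabulary)}
encoded = [label_to_index[name] for name in labels]

print(vocabulary)
print(encoded)   # [4, 2, 0, 2, 1, 3]
```

The integer codes are what sparse_categorical_crossentropy expects as targets, and the vocabulary list is what turns a predicted index back into a class name at inference time.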

Finally, we can put all the pieces together to create our data processing utility.

def prepare_all_videos(df, root_dir):
    num_samples = len(df)
    video_paths = df["video_name"].values.tolist()
    labels = df["tag"].values
    labels = keras.ops.convert_to_numpy(label_processor(labels[..., None]))

    # `frame_masks` and `frame_features` are what we will feed to our sequence model.
    # `frame_masks` will contain a bunch of booleans denoting if a timestep is
    # masked with padding or not.
    frame_masks = np.zeros(shape=(num_samples, MAX_SEQ_LENGTH), dtype="bool")
    frame_features = np.zeros(
        shape=(num_samples, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
    )

    # For each video.
    for idx, path in enumerate(video_paths):
        # Gather all its frames and add a batch dimension.
        frames = load_video(os.path.join(root_dir, path))
        frames = frames[None, ...]

        # Initialize placeholders to store the masks and features of the current video.
        temp_frame_mask = np.zeros(shape=(1, MAX_SEQ_LENGTH), dtype="bool")
        temp_frame_features = np.zeros(
            shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
        )

        # Extract features from the frames of the current video.
        for i, batch in enumerate(frames):
            video_length = batch.shape[0]
            length = min(MAX_SEQ_LENGTH, video_length)
            for j in range(length):
                temp_frame_features[i, j, :] = feature_extractor.predict(
                    batch[None, j, :], verbose=0
                )
            temp_frame_mask[i, :length] = 1  # 1 = not masked, 0 = masked

        frame_features[idx,] = temp_frame_features.squeeze()
        frame_masks[idx,] = temp_frame_mask.squeeze()

    return (frame_features, frame_masks), labels


train_data, train_labels = prepare_all_videos(train_df, "train")
test_data, test_labels = prepare_all_videos(test_df, "test")

print(f"Frame features in train set: {train_data[0].shape}")
print(f"Frame masks in train set: {train_data[1].shape}")

Frame features in train set: (594, 20, 2048)
Frame masks in train set: (594, 20)

The code block above takes roughly 20 minutes to run, depending on the machine.
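The padding and masking logic inside prepare_all_videos can be isolated in a few lines, which makes it easier to check without running the feature extractor. In this sketch the helper pad_and_mask and the 12-frame dummy video are assumptions for illustration, and the per-frame features are filled with ones instead of real CNN outputs.

```python
import numpy as np

MAX_SEQ_LENGTH = 20
NUM_FEATURES = 2048


def pad_and_mask(video_features):
    """Zero-pad (or truncate) per-frame features to MAX_SEQ_LENGTH and build a mask."""
    length = min(MAX_SEQ_LENGTH, video_features.shape[0])
    padded = np.zeros((MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32")
    mask = np.zeros((MAX_SEQ_LENGTH,), dtype="bool")
    padded[:length] = video_features[:length]
    mask[:length] = True  # True = real frame, False = padding
    return padded, mask


short_video = np.ones((12, NUM_FEATURES), dtype="float32")  # hypothetical 12-frame video
padded, mask = pad_and_mask(short_video)
print(padded.shape, int(mask.sum()))   # (20, 2048) 12
```

The mask is what lets the recurrent layer skip the zero-filled time steps instead of treating them as real frames.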

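2.3 Sequence Modeling

To capture the temporal dependencies between video frames, a CNN-RNN video classifier uses a sequence model: a recurrent neural network (RNN) or one of its variants, such as long short-term memory (LSTM) or the gated recurrent unit (GRU). These models work well on frame sequences because they learn and retain the dependencies from frame to frame, which is essential for understanding continuous actions and events in a video.

First, a CNN extracts features from the video frames. Through convolution and pooling it produces representative features for each frame. These features carry rich spatial information such as edges, shapes, and textures.

Next, the frame features are arranged in temporal order into a sequence. The sequence preserves the temporal relationships between frames and is the basis for the RNN stage.

The sequence of frame features is then fed to the RNN. Because of its recurrent connections, the output at each time step depends on information from earlier time steps, so the RNN can capture dependencies between frames. A plain RNN, however, can suffer from vanishing or exploding gradients on long sequences and then fails to capture long-term dependencies. LSTM and GRU are the usual remedies.

LSTM introduces gating (an input gate, a forget gate, and an output gate) to control which information is kept and which is forgotten, which mitigates the vanishing-gradient problem. It maintains a cell state and a hidden state and updates both through the gates. The cell state stores long-term information, and the hidden state carries the output for the current time step.

GRU is a simplified variant of LSTM with only two gates (an update gate and a reset gate), yet it performs comparably to LSTM on many tasks. It uses the two gates to control the flow of information and capture long-term dependencies.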
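A single GRU time step can be sketched in NumPy to show how the two gates act. This is a textbook formulation with randomly initialized weights, written for illustration. Conventions differ between libraries on details such as bias handling and on whether the update gate weights the old or the new state, so this is not a line-by-line copy of the Keras layer.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_step(x, h_prev, params):
    """One GRU time step. x: (input_dim,), h_prev: (units,)."""
    # Update gate: how much of the previous state to keep.
    z = sigmoid(x @ params["W_z"] + h_prev @ params["U_z"] + params["b_z"])
    # Reset gate: how much of the previous state feeds the candidate.
    r = sigmoid(x @ params["W_r"] + h_prev @ params["U_r"] + params["b_r"])
    # Candidate state computed from the input and the reset-scaled previous state.
    h_cand = np.tanh(x @ params["W_h"] + (r * h_prev) @ params["U_h"] + params["b_h"])
    # Blend old state and candidate (here z weights the old state).
    return z * h_prev + (1.0 - z) * h_cand


rng = np.random.default_rng(0)
input_dim, units = 6, 4
params = {}
for gate in "zrh":
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(input_dim, units))
    params[f"U_{gate}"] = rng.normal(scale=0.1, size=(units, units))
    params[f"b_{gate}"] = np.zeros(units)

h = np.zeros(units)
for x_t in rng.normal(size=(5, input_dim)):   # hypothetical 5-step sequence
    h = gru_step(x_t, h, params)
print(h.shape)   # (4,)
```

Because the new state is a blend of the old state and a tanh output, its values stay between -1 and 1, and an update gate close to 1 lets information pass through many steps almost unchanged.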

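Finally, the output of the RNN is passed through fully connected layers to classify the video. During training, hyperparameters such as the hidden size, the learning rate, and the number of epochs need tuning. The hidden size determines the capacity and complexity of the model, the learning rate sets the step size of parameter updates, and the number of epochs sets how many times the model processes the whole training set.

In summary, the sequence model improves the accuracy of video classification by capturing the temporal dependencies between frames. The CNN extracts features, the RNN or one of its variants (LSTM, GRU) models time, and together they classify the video.
In this example we feed the data to a sequence model made of recurrent layers such as GRU.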

# Utility for our sequence model.
def get_sequence_model():
    class_vocab = label_processor.get_vocabulary()

    frame_features_input = keras.Input((MAX_SEQ_LENGTH, NUM_FEATURES))
    mask_input = keras.Input((MAX_SEQ_LENGTH,), dtype="bool")

    # Refer to the following tutorial to understand the significance of using `mask`:
    # https://keras.io/api/layers/recurrent_layers/gru/
    x = keras.layers.GRU(16, return_sequences=True)(
        frame_features_input, mask=mask_input
    )
    x = keras.layers.GRU(8)(x)
    x = keras.layers.Dropout(0.4)(x)
    x = keras.layers.Dense(8, activation="relu")(x)
    output = keras.layers.Dense(len(class_vocab), activation="softmax")(x)

    rnn_model = keras.Model([frame_features_input, mask_input], output)

    rnn_model.compile(
        loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
    )
    return rnn_model


# Utility for running experiments.
def run_experiment():
    filepath = "/tmp/video_classifier/ckpt.weights.h5"
    checkpoint = keras.callbacks.ModelCheckpoint(
        filepath, save_weights_only=True, save_best_only=True, verbose=1
    )

    seq_model = get_sequence_model()
    history = seq_model.fit(
        [train_data[0], train_data[1]],
        train_labels,
        validation_split=0.3,
        epochs=EPOCHS,
        callbacks=[checkpoint],
    )

    # Restore the weights with the best validation loss before evaluating.
    seq_model.load_weights(filepath)
    _, accuracy = seq_model.evaluate([test_data[0], test_data[1]], test_labels)
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")

    return history, seq_model


_, sequence_model = run_experiment()

The training output is as follows:

Epoch 1/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.3058 - loss: 1.5597
Epoch 1: val_loss improved from inf to 1.78077, saving model to /tmp/video_classifier/ckpt.weights.h5
 13/13 ━━━━━━━━━━━━━━━━━━━━ 2s 36ms/step - accuracy: 0.3127 - loss: 1.5531 - val_accuracy: 0.1397 - val_loss: 1.7808
Epoch 2/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.5216 - loss: 1.2704
Epoch 2: val_loss improved from 1.78077 to 1.78026, saving model to /tmp/video_classifier/ckpt.weights.h5
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step - accuracy: 0.5226 - loss: 1.2684 - val_accuracy: 0.1788 - val_loss: 1.7803
Epoch 3/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.6189 - loss: 1.1656
Epoch 3: val_loss did not improve from 1.78026
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.6174 - loss: 1.1651 - val_accuracy: 0.2849 - val_loss: 1.8322
Epoch 4/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.6518 - loss: 1.0645
Epoch 4: val_loss did not improve from 1.78026
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step - accuracy: 0.6515 - loss: 1.0647 - val_accuracy: 0.2793 - val_loss: 2.0419
Epoch 5/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.6833 - loss: 0.9976
Epoch 5: val_loss did not improve from 1.78026
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.6843 - loss: 0.9965 - val_accuracy: 0.3073 - val_loss: 1.9077
Epoch 6/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.7229 - loss: 0.9312
Epoch 6: val_loss did not improve from 1.78026
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.7241 - loss: 0.9305 - val_accuracy: 0.3017 - val_loss: 2.1513
Epoch 7/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.8023 - loss: 0.9132
Epoch 7: val_loss did not improve from 1.78026
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.8035 - loss: 0.9093 - val_accuracy: 0.3184 - val_loss: 2.1705
Epoch 8/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.8127 - loss: 0.8380
Epoch 8: val_loss did not improve from 1.78026
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.8128 - loss: 0.8356 - val_accuracy: 0.3296 - val_loss: 2.2043
Epoch 9/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.8494 - loss: 0.7641
Epoch 9: val_loss did not improve from 1.78026
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.8494 - loss: 0.7622 - val_accuracy: 0.3017 - val_loss: 2.3734
Epoch 10/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.8634 - loss: 0.6883
Epoch 10: val_loss did not improve from 1.78026
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.8649 - loss: 0.6882 - val_accuracy: 0.3240 - val_loss: 2.4410
 7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7816 - loss: 1.0624
Test accuracy: 56.7%
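The step counts in the log (13 per epoch and 7 for evaluation) can be reproduced with a little arithmetic. The sketch assumes the Keras default batch size of 32, because the calls to fit and evaluate above do not pass batch_size, which also means the BATCH_SIZE constant defined earlier is not used by them.

```python
import math

num_train_videos = 594
num_test_videos = 224
validation_split = 0.3
batch_size = 32  # Keras default when batch_size is not passed

num_fit_samples = int(num_train_videos * (1 - validation_split))
num_val_samples = num_train_videos - num_fit_samples

steps_per_epoch = math.ceil(num_fit_samples / batch_size)
eval_steps = math.ceil(num_test_videos / batch_size)

print(num_fit_samples, num_val_samples)   # 415 179
print(steps_per_epoch, eval_steps)        # 13 7
```

Keras takes the validation split from the end of the arrays before any shuffling. If train.csv happens to be ordered by class, the 179 validation videos would not cover every class, which is one possible reason (not verified here) for the low validation accuracy in the log.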

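Note: to keep the runtime of this example relatively short, we used only a small number of training examples. That number is low relative to the sequence model, which has 99,909 trainable parameters. You are encouraged to sample more data from the UCF101 dataset using the data preparation notebook (https://colab.research.google.com/github/sayakpaul/Action-Recognition-in-TensorFlow/blob/main/Data_Preparation_UCF101.ipynb) and train the same model.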
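The figure of 99,909 trainable parameters can be checked by hand from the layer sizes. The GRU formula below assumes the Keras default reset_after=True, which gives each gate two bias vectors. The helper names are illustrative.

```python
def gru_params(input_dim, units):
    # 3 gates, each with an input kernel, a recurrent kernel and two bias vectors
    # (the two biases assume the Keras default reset_after=True).
    return 3 * (input_dim * units + units * units + 2 * units)


def dense_params(input_dim, units):
    # One weight per input-unit pair plus one bias per unit.
    return input_dim * units + units


total = (
    gru_params(2048, 16)    # first GRU layer on 2048-dimensional frame features
    + gru_params(16, 8)     # second GRU layer
    + dense_params(8, 8)    # hidden Dense layer
    + dense_params(8, 5)    # softmax layer over the 5 classes
)
print(total)   # 99909
```

Almost all of the parameters (99,168) sit in the first GRU layer, because it is the one that sees the 2048-dimensional input.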

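2.4 Inference

In a CNN-RNN video classification task, prediction is the final stage of the pipeline. It builds on the processing of the frame sequence by the CNN and the RNN and outputs the class label of the video. The stage consists of the following parts.

First, the CNN extracts features frame by frame, giving a series of feature vectors that carry spatial information. These vectors are arranged in the temporal order of the frames to form a feature sequence.

Next, the RNN takes the feature sequence as input and, through its recurrent structure, captures the temporal dependencies between frames. At each time step the RNN updates its internal state from the current feature vector and the previous hidden state (the information from earlier frames), and may emit an output vector.

After the whole sequence has been processed, the RNN outputs one or more vectors that summarize the spatio-temporal features of the video. This can be the output at the last time step, or an aggregate of the outputs at all time steps, such as an average or a weighted sum.

The resulting feature vector is then passed to one or more fully connected layers, which further extract and combine the information relevant to classification. These layers can be adjusted as needed to improve classification performance.

A softmax function is applied to the output of the fully connected layers to turn it into a probability distribution over the classes. Softmax ensures that the class probabilities sum to 1, so the class with the highest probability can be taken directly as the prediction.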
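The softmax step and the ranking of classes can be sketched as follows. The logits are made-up numbers, and only the class names come from this example.

```python
import numpy as np


def softmax(logits):
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


class_names = ["CricketShot", "PlayingCello", "Punch", "ShavingBeard", "TennisSwing"]
logits = np.array([2.0, 0.1, 0.7, 1.1, 0.8])  # hypothetical scores for one video
probabilities = softmax(logits)

# Print the classes from most to least likely.
for i in np.argsort(probabilities)[::-1]:
    print(f"  {class_names[i]}: {probabilities[i] * 100:5.2f}%")
```

The loop mirrors the one in sequence_prediction in this article: argsort orders the indices by probability, and reversing the order puts the most likely class first.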

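Finally, the model outputs the predicted class label, which completes the classification. The model is also evaluated on a held-out dataset using metrics such as accuracy, recall, and F1 score, which quantify how well it predicts each class. If performance is unsatisfactory, the structure, parameters, or training strategy of the CNN and the RNN can be adjusted, and training and prediction repeated.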
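For reference, the metrics named above can be computed from true and predicted labels with a few lines of plain Python. The label lists are invented for illustration, and per_class_metrics is a hypothetical helper. In practice a library such as scikit-learn would normally be used.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 score for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical integer labels for six test videos across three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
precision, recall, f1 = per_class_metrics(y_true, y_pred, label=1)
print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```

In this toy case class 1 is always found (recall 1.0) but one video of another class is also labelled as class 1, so its precision is lower. Accuracy alone would hide that difference.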

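The prediction stage as a whole relies on the strength of the CNN at extracting image features and of the RNN at handling sequence data, and together they classify video content automatically.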

def prepare_single_video(frames):
    # Add a batch dimension, then extract features and build the mask
    # in the same way as prepare_all_videos.
    frames = frames[None, ...]
    frame_mask = np.zeros(shape=(1, MAX_SEQ_LENGTH), dtype="bool")
    frame_features = np.zeros(shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32")

    for i, batch in enumerate(frames):
        video_length = batch.shape[0]
        length = min(MAX_SEQ_LENGTH, video_length)
        for j in range(length):
            frame_features[i, j, :] = feature_extractor.predict(batch[None, j, :])
        frame_mask[i, :length] = 1  # 1 = not masked, 0 = masked

    return frame_features, frame_mask


def sequence_prediction(path):
    class_vocab = label_processor.get_vocabulary()

    frames = load_video(os.path.join("test", path))
    frame_features, frame_mask = prepare_single_video(frames)
    probabilities = sequence_model.predict([frame_features, frame_mask])[0]

    # Print the classes from most to least likely.
    for i in np.argsort(probabilities)[::-1]:
        print(f"  {class_vocab[i]}: {probabilities[i] * 100:5.2f}%")
    return frames


# This utility is for visualization.
# Referenced from:
# https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub
def to_gif(images):
    converted_images = images.astype(np.uint8)
    imageio.mimsave("animation.gif", converted_images, duration=100)
    return Image("animation.gif")


test_video = np.random.choice(test_df["video_name"].values.tolist())
print(f"Test video path: {test_video}")
test_frames = sequence_prediction(test_video)
to_gif(test_frames[:MAX_SEQ_LENGTH])

The output is as follows:

Test video path: v_TennisSwing_g03_c01.avi
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 32ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 166ms/step
 CricketShot: 46.99%
 ShavingBeard: 18.83%
 TennisSwing: 14.65%
 Punch: 12.41%
 PlayingCello: 7.12%
<IPython.core.display.Image object>

3. Source Code of the Video Classification Experiment

"""shell
pip install -q git+https://github.com/tensorflow/docs
"""

"""
## Data collection

In order to keep the runtime of this example relatively short, we will be using a
subsampled version of the original UCF101 dataset. You can refer to
[this notebook](https://colab.research.google.com/github/sayakpaul/Action-Recognition-in-TensorFlow/blob/main/Data_Preparation_UCF101.ipynb)
to know how the subsampling was done.
"""

"""shell
wget -q https://github.com/sayakpaul/Action-Recognition-in-TensorFlow/releases/download/v1.0.0/ucf101_top5.tar.gz
tar xf ucf101_top5.tar.gz
"""

"""
## Setup
"""

import os

import keras
from imutils import paths

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import imageio
import cv2
from IPython.display import Image

"""
## Define hyperparameters
"""

IMG_SIZE = 224
BATCH_SIZE = 64
EPOCHS = 10

MAX_SEQ_LENGTH = 20
NUM_FEATURES = 2048

"""
## Data preparation
"""

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

print(f"Total videos for training: {len(train_df)}")
print(f"Total videos for testing: {len(test_df)}")

train_df.sample(10)

"""
One of the many challenges of training video classifiers is figuring out a way to feed
the videos to a network. [This blog post](https://blog.coast.ai/five-video-classification-methods-implemented-in-keras-and-tensorflow-99cad29cc0b5)
discusses five such methods. Since a video is an ordered sequence of frames, we could
just extract the frames and put them in a 3D tensor. But the number of frames may differ
from video to video which would prevent us from stacking them into batches
(unless we use padding). As an alternative, we can **save video frames at a fixed
interval until a maximum frame count is reached**. In this example we will do
the following:

1. Capture the frames of a video.
2. Extract frames from the videos until a maximum frame count is reached.
3. In the case where a video's frame count is less than the maximum frame count, we
will pad the video with zeros.

Note that this workflow is identical to [problems involving text sequences](https://developers.google.com/machine-learning/guides/text-classification/). Videos of the UCF101 dataset are [known](https://www.crcv.ucf.edu/papers/UCF101_CRCV-TR-12-01.pdf)
to not contain extreme variations in objects and actions across frames. Because of this,
it may be okay to only consider a few frames for the learning task. But this approach may
not generalize well to other video classification problems. We will be using
[OpenCV's `VideoCapture()` method](https://docs.opencv.org/master/dd/d43/tutorial_py_video_display.html)
to read frames from videos.
"""

# The following two methods are taken from this tutorial:
# https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub


def crop_center_square(frame):
    y, x = frame.shape[0:2]
    min_dim = min(y, x)
    start_x = (x // 2) - (min_dim // 2)
    start_y = (y // 2) - (min_dim // 2)
    return frame[start_y : start_y + min_dim, start_x : start_x + min_dim]


def load_video(path, max_frames=0, resize=(IMG_SIZE, IMG_SIZE)):
    cap = cv2.VideoCapture(path)
    frames = []
    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            frame = crop_center_square(frame)
            frame = cv2.resize(frame, resize)
            frame = frame[:, :, [2, 1, 0]]  # BGR -> RGB
            frames.append(frame)

            if len(frames) == max_frames:
                break
    finally:
        cap.release()
    return np.array(frames)


"""
We can use a pre-trained network to extract meaningful features from the extracted
frames. The [`Keras Applications`](https://keras.io/api/applications/) module provides
a number of state-of-the-art models pre-trained on the [ImageNet-1k dataset](http://image-net.org/).
We will be using the [InceptionV3 model](https://arxiv.org/abs/1512.00567) for this purpose.
"""


def build_feature_extractor():
    feature_extractor = keras.applications.InceptionV3(
        weights="imagenet",
        include_top=False,
        pooling="avg",
        input_shape=(IMG_SIZE, IMG_SIZE, 3),
    )
    preprocess_input = keras.applications.inception_v3.preprocess_input

    inputs = keras.Input((IMG_SIZE, IMG_SIZE, 3))
    preprocessed = preprocess_input(inputs)

    outputs = feature_extractor(preprocessed)
    return keras.Model(inputs, outputs, name="feature_extractor")


feature_extractor = build_feature_extractor()

"""
The labels of the videos are strings. Neural networks do not understand string values,
so they must be converted to some numerical form before they are fed to the model. Here
we will use the [`StringLookup`](https://keras.io/api/layers/preprocessing_layers/categorical/string_lookup)
layer to encode the class labels as integers.
"""

label_processor = keras.layers.StringLookup(
    num_oov_indices=0, vocabulary=np.unique(train_df["tag"])
)
print(label_processor.get_vocabulary())

"""
Finally, we can put all the pieces together to create our data processing utility.
"""


def prepare_all_videos(df, root_dir):
    num_samples = len(df)
    video_paths = df["video_name"].values.tolist()
    labels = df["tag"].values
    labels = keras.ops.convert_to_numpy(label_processor(labels[..., None]))

    # `frame_masks` and `frame_features` are what we will feed to our sequence model.
    # `frame_masks` will contain a bunch of booleans denoting if a timestep is
    # masked with padding or not.
    frame_masks = np.zeros(shape=(num_samples, MAX_SEQ_LENGTH), dtype="bool")
    frame_features = np.zeros(
        shape=(num_samples, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
    )

    # For each video.
    for idx, path in enumerate(video_paths):
        # Gather all its frames and add a batch dimension.
        frames = load_video(os.path.join(root_dir, path))
        frames = frames[None, ...]

        # Initialize placeholders to store the masks and features of the current video.
        temp_frame_mask = np.zeros(
            shape=(
                1,
                MAX_SEQ_LENGTH,
            ),
            dtype="bool",
        )
        temp_frame_features = np.zeros(
            shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
        )

        # Extract features from the frames of the current video.
        for i, batch in enumerate(frames):
            video_length = batch.shape[0]
            length = min(MAX_SEQ_LENGTH, video_length)
            for j in range(length):
                temp_frame_features[i, j, :] = feature_extractor.predict(
                    batch[None, j, :],
                    verbose=0,
                )
            temp_frame_mask[i, :length] = 1  # 1 = not masked, 0 = masked

        frame_features[idx,] = temp_frame_features.squeeze()
        frame_masks[idx,] = temp_frame_mask.squeeze()

    return (frame_features, frame_masks), labels


train_data, train_labels = prepare_all_videos(train_df, "train")
test_data, test_labels = prepare_all_videos(test_df, "test")

print(f"Frame features in train set: {train_data[0].shape}")
print(f"Frame masks in train set: {train_data[1].shape}")

"""
The above code block will take ~20 minutes to execute depending on the machine it's being
executed on.
"""

"""
## The sequence model

Now, we can feed this data to a sequence model consisting of recurrent layers like `GRU`.
"""


# Utility for our sequence model.
def get_sequence_model():
    class_vocab = label_processor.get_vocabulary()

    frame_features_input = keras.Input((MAX_SEQ_LENGTH, NUM_FEATURES))
    mask_input = keras.Input((MAX_SEQ_LENGTH,), dtype="bool")

    # Refer to the following tutorial to understand the significance of using `mask`:
    # https://keras.io/api/layers/recurrent_layers/gru/
    x = keras.layers.GRU(16, return_sequences=True)(
        frame_features_input, mask=mask_input
    )
    x = keras.layers.GRU(8)(x)
    x = keras.layers.Dropout(0.4)(x)
    x = keras.layers.Dense(8, activation="relu")(x)
    output = keras.layers.Dense(len(class_vocab), activation="softmax")(x)

    rnn_model = keras.Model([frame_features_input, mask_input], output)

    rnn_model.compile(
        loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
    )
    return rnn_model


# Utility for running experiments.
def run_experiment():
    filepath = "/tmp/video_classifier/ckpt.weights.h5"
    checkpoint = keras.callbacks.ModelCheckpoint(
        filepath, save_weights_only=True, save_best_only=True, verbose=1
    )

    seq_model = get_sequence_model()
    history = seq_model.fit(
        [train_data[0], train_data[1]],
        train_labels,
        validation_split=0.3,
        epochs=EPOCHS,
        callbacks=[checkpoint],
    )

    seq_model.load_weights(filepath)
    _, accuracy = seq_model.evaluate([test_data[0], test_data[1]], test_labels)
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")

    return history, seq_model


_, sequence_model = run_experiment()

"""
**Note**: To keep the runtime of this example relatively short, we just used a few
training examples. This number of training examples is low with respect to the sequence
model being used that has 99,909 trainable parameters. You are encouraged to sample more
data from the UCF101 dataset using [the notebook](https://colab.research.google.com/github/sayakpaul/Action-Recognition-in-TensorFlow/blob/main/Data_Preparation_UCF101.ipynb) mentioned above and train the same model.
"""

"""
## Inference
"""


def prepare_single_video(frames):
    frames = frames[None, ...]
    frame_mask = np.zeros(
        shape=(
            1,
            MAX_SEQ_LENGTH,
        ),
        dtype="bool",
    )
    frame_features = np.zeros(shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32")

    for i, batch in enumerate(frames):
        video_length = batch.shape[0]
        length = min(MAX_SEQ_LENGTH, video_length)
        for j in range(length):
            frame_features[i, j, :] = feature_extractor.predict(batch[None, j, :])
        frame_mask[i, :length] = 1  # 1 = not masked, 0 = masked

    return frame_features, frame_mask


def sequence_prediction(path):
    class_vocab = label_processor.get_vocabulary()

    frames = load_video(os.path.join("test", path))
    frame_features, frame_mask = prepare_single_video(frames)
    probabilities = sequence_model.predict([frame_features, frame_mask])[0]

    for i in np.argsort(probabilities)[::-1]:
        print(f"  {class_vocab[i]}: {probabilities[i] * 100:5.2f}%")
    return frames


# This utility is for visualization.
# Referenced from:
# https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub
def to_gif(images):
    converted_images = images.astype(np.uint8)
    imageio.mimsave("animation.gif", converted_images, duration=100)
    return Image("animation.gif")


test_video = np.random.choice(test_df["video_name"].values.tolist())
print(f"Test video path: {test_video}")
test_frames = sequence_prediction(test_video)
to_gif(test_frames[:MAX_SEQ_LENGTH])

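4. Summary

Video classification is an important computer vision task that assigns videos to predefined categories. It has applications in many areas, such as recommender systems, security monitoring, and action recognition. Video data carries rich spatial and temporal information, so processing it requires attention to both. The main points of the article are summarized below.

4.1 Dataset and Preprocessing

Video classification models are usually trained on large, varied datasets. UCF101, for example, contains videos from many action categories and gives the model a rich set of training samples. Preprocessing is a critical step and includes extracting the frames, resizing them, normalizing them, and optionally augmenting the data. These steps help the model cope with videos of different origins and quality.

4.2 Model Construction and Training

A common architecture for video classification is the hybrid CNN-RNN model, which combines the strength of the CNN in extracting spatial features with the strength of the RNN in handling temporal features. To build it, a pre-trained CNN (such as InceptionV3) first extracts features from the video frames. The features are then arranged in temporal order into a sequence and fed to an RNN (such as a GRU). Finally, fully connected layers and a softmax function produce the class prediction.

Training requires suitable hyperparameters, such as the learning rate, the batch size, and the number of epochs. The dataset also has to be divided into training, validation, and test sets so that the model can be evaluated and adjusted during training. Over repeated epochs the model gradually learns to extract the key features from the video data and classify them.

4.3 Experimental Results and Inference

After training, the performance of the model on the test set is usually measured with metrics such as accuracy. Comparing the results of different models makes it possible to choose the best one for later use. For inference on a single video, the frames are fed to the trained model, which returns the predicted class and the corresponding probability distribution. These results can be used in applications such as video recommendation and anomaly detection.

4.4 Visualization and Directions for Improvement

Visualizing the predictions helps in understanding the behaviour and performance of the model. For example, the frames of the predicted video can be converted to a GIF animation and viewed alongside the predicted class probabilities. Several directions may improve performance further: fine-tuning the pre-trained model through transfer learning, trying different CNN and RNN architectures, adjusting parameters such as the sequence length, and adding features such as self-attention.

In short, video classification is a challenging and valuable computer vision task. With a suitable architecture and training strategy, a model can process video data effectively and extract the key features needed for classification. As the technology continues to develop, video classification can be expected to reach better performance and wider use.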

Tags: deep learning, keras, cnn

Reposted from: https://blog.csdn.net/MUKAMO/article/details/139428919
Copyright belongs to the original author, MUKAMO. In case of infringement, please contact us for removal.