

Backpropagation in a General Neural Network (DNN)

DNN Backpropagation

Differentiation of multivariate functions

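A loss function is always a scalar function: it uses a norm to turn a vector into a scalar. Differentiating the loss with respect to the input of layer L is therefore a scalar-by-vector derivative. In fact, an array with any number of dimensions can be treated as a flat list of the independent variables of a multivariate function.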
For example, an $m\times n$ matrix $\{W_{ij}\}$ can be flattened into a list of the independent variables of a multivariate function:

$$\{W_{ij}\}\rightarrow(W_{11},W_{12},\dots,W_{mn})$$

A scalar function of $\{W_{ij}\}$ can then be viewed as a multivariate function of $(W_{11},W_{12},\dots,W_{mn})$. The gradient of that multivariate function is exactly the derivative of the scalar function with respect to the matrix. Recall that the gradient of a multivariate function looks like this:

$$\frac{\partial f}{\partial \overrightarrow{x}}=\left(\frac{\partial f}{\partial x_{1}},\frac{\partial f}{\partial x_{2}},\dots,\frac{\partial f}{\partial x_{n}}\right)$$
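The flattening idea can be tried out numerically. The sketch below is only an illustration: the scalar function `f` (half the sum of squared entries) and the helper names are hypothetical choices, not part of the original derivation. It flattens a matrix, takes a central-difference partial derivative for every entry, and reshapes the result back into the layout of the matrix.

```python
def flatten(W):
    """Lay out the entries row by row: (W11, W12, ..., Wmn)."""
    return [w for row in W for w in row]


def unflatten(values, n_cols):
    """Rebuild a matrix with n_cols columns from the flat list."""
    return [values[k:k + n_cols] for k in range(0, len(values), n_cols)]


def f(W):
    """Hypothetical scalar function: half the sum of squared entries."""
    return 0.5 * sum(w * w for w in flatten(W))


def numerical_gradient(func, W, eps=1e-6):
    """Central-difference partial derivative for every entry of W."""
    n_cols = len(W[0])
    x = flatten(W)
    grad = []
    for k in range(len(x)):
        plus, minus = x[:], x[:]
        plus[k] += eps
        minus[k] -= eps
        diff = func(unflatten(plus, n_cols)) - func(unflatten(minus, n_cols))
        grad.append(diff / (2 * eps))
    # Reshape so the gradient has the same m x n layout as W.
    return unflatten(grad, n_cols)


W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
grad = numerical_gradient(f, W)
```

For this particular `f` the analytic gradient is `W` itself, so `grad` should agree with `W` up to rounding error.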

Vector-by-vector derivatives

A vector function can be viewed as a vector made up of several scalar multivariate functions. For example, take a vector function that maps the vector B to A:

$$A=G(B),\quad \text{where}\ A\in R^{n\times 1},\ B\in R^{m\times 1}$$

If we treat the vector A as a vector of scalar multivariate functions, differentiation becomes much easier:

$$
\begin{aligned}
A&=\left(a_{1}(b_{1},b_{2},\dots,b_{m}),\ a_{2}(b_{1},b_{2},\dots,b_{m}),\ \dots,\ a_{n}(b_{1},b_{2},\dots,b_{m})\right)\\
\frac{\partial A}{\partial B}&=\left(\frac{\partial a_{1}}{\partial B},\frac{\partial a_{2}}{\partial B},\dots,\frac{\partial a_{n}}{\partial B}\right)\\
&=\left(
\begin{array}{ccc}
\frac{\partial a_{1}}{\partial b_{1}} & \dots & \frac{\partial a_{1}}{\partial b_{m}}\\
\frac{\partial a_{2}}{\partial b_{1}} & \dots & \frac{\partial a_{2}}{\partial b_{m}}\\
\dots & \dots & \dots\\
\frac{\partial a_{n}}{\partial b_{1}} & \dots & \frac{\partial a_{n}}{\partial b_{m}}
\end{array}
\right)
\end{aligned}
$$

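Wow, see, vector derivatives are much clearer now. Of course, whether you lay the derivative out as an $n\times m$ matrix or as an $m\times n$ matrix makes no difference, as long as you use the same layout throughout the derivation.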
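As a rough illustration of the $n\times m$ layout (one row per output component, one column per input component), the sketch below builds the matrix of partial derivatives by central differences. The map `G` from $R^{2}$ to $R^{3}$ is a made-up example, and `numerical_jacobian` is a hypothetical helper name.

```python
def G(b):
    """Hypothetical vector function from R^2 to R^3."""
    b1, b2 = b
    return [b1 * b2, b1 + b2, b1 * b1]


def numerical_jacobian(func, b, eps=1e-6):
    """Return J with J[i][j] = d a_i / d b_j (n rows, m columns)."""
    n = len(func(b))
    m = len(b)
    J = [[0.0] * m for _ in range(n)]
    for j in range(m):
        plus, minus = b[:], b[:]
        plus[j] += eps
        minus[j] -= eps
        a_plus, a_minus = func(plus), func(minus)
        for i in range(n):
            J[i][j] = (a_plus[i] - a_minus[i]) / (2 * eps)
    return J


B = [2.0, 3.0]
J = numerical_jacobian(G, B)

# Analytic derivative of G at (2, 3), written in the same n x m layout.
expected = [[3.0, 2.0], [1.0, 1.0], [4.0, 0.0]]
```

Transposing `J` gives the $m\times n$ layout; either works as long as every later step uses the same one.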

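Differentiating the DNN loss function

The loss function of a neural network is always a scalar function. Common choices include the L1 and L2 norm losses, among others. Taking the L2 norm loss as an example, the loss function of an ordinary fully connected neural network is:

$$\epsilon=\frac{1}{2}\|\sigma(\mathbf{a^{L}})-\mathbf{y}\|^{2}\qquad\text{(Eq.1)}$$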

where $\mathbf{a^{L}}=\mathbf{W^{L}}\cdot\mathbf{a^{L-1}}+\mathbf{b^{L}}$, with $\mathbf{a^{L}},\mathbf{b^{L}}\in R^{N_{L}}$ and $\mathbf{W^{L}}\in R^{N_{L}\times N_{L-1}}$, is the output of layer L that is passed to the activation function $\sigma$, and $\mathbf{y}$ is the ground truth. Now, how do we find the gradient of the loss with respect to $\mathbf{W^{L}}$ and $\mathbf{b^{L}}$? We only have to expand Eq.1 into the following expression:

 
  
   
    
     
      
       
$$
\begin{aligned}
\epsilon &= \frac{1}{2}\sum_{i=1}^{N}\left[\sigma\left(\sum_{j=1}^{M}W_{ij}^{L}\cdot a_{j}^{L-1}+b_{i}^{L}\right)-y_{i}\right]^{2}\\
\frac{\partial\epsilon}{\partial W_{xy}^{L}} &= \left[\sigma\left(\sum_{j=1}^{M}W_{xj}^{L}\cdot a_{j}^{L-1}+b_{x}^{L}\right)-y_{x}\right]\times\sigma'\left(\sum_{j=1}^{M}W_{xj}^{L}\cdot a_{j}^{L-1}+b_{x}^{L}\right)\times a_{y}^{L-1}\\
\text{so, }\frac{\partial\epsilon}{\partial\mathbf{W^{L}}} &= \left\{\frac{\partial\epsilon}{\partial W_{xy}^{L}}\right\}_{x:1\rightarrow N,\ y:1\rightarrow M}\\
&\text{Then, surprisingly,}\\
&=\left[\left(\sigma(\mathbf{W^{L}}\cdot\mathbf{a^{L-1}}+\mathbf{b^{L}})-\mathbf{y}\right)\odot\sigma'(\mathbf{W^{L}}\cdot\mathbf{a^{L-1}}+\mathbf{b^{L}})\right]\cdot(\mathbf{a^{L-1}})^{T}
\end{aligned}
$$

Here $N=N_{L}$ and $M=N_{L-1}$. Note that the error term $\sigma(\cdot)-\mathbf{y}$ from the element-wise derivative carries over into the matrix form.
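The element-wise formula $[\sigma(\cdot)-y_{x}]\times\sigma'(\cdot)\times a_{y}^{L-1}$ can be sanity-checked against finite differences. The sketch below is only an illustration: it assumes a sigmoid for $\sigma$ (the text leaves the activation unspecified), and all numeric values and function names are made up.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def sigmoid_prime(t):
    s = sigmoid(t)
    return s * (1.0 - s)


def pre_activation(W, a_prev, b):
    """z_i = sum_j W_ij * a_prev_j + b_i."""
    return [sum(w * a for w, a in zip(row, a_prev)) + b_i
            for row, b_i in zip(W, b)]


def loss(W, a_prev, b, y):
    """L2 loss of Eq.1, assuming a sigmoid activation."""
    z = pre_activation(W, a_prev, b)
    return 0.5 * sum((sigmoid(z_i) - y_i) ** 2 for z_i, y_i in zip(z, y))


def analytic_grad_W(W, a_prev, b, y):
    """Entry (x, y): [sigma(z_x) - y_x] * sigma'(z_x) * a_prev_y."""
    z = pre_activation(W, a_prev, b)
    return [[(sigmoid(z_x) - y_x) * sigmoid_prime(z_x) * a_j for a_j in a_prev]
            for z_x, y_x in zip(z, y)]


def numeric_grad_W(W, a_prev, b, y, eps=1e-6):
    """Central difference on each weight, one entry at a time."""
    grad = [[0.0] * len(row) for row in W]
    for i in range(len(W)):
        for j in range(len(W[0])):
            original = W[i][j]
            W[i][j] = original + eps
            up = loss(W, a_prev, b, y)
            W[i][j] = original - eps
            down = loss(W, a_prev, b, y)
            W[i][j] = original  # restore the entry
            grad[i][j] = (up - down) / (2 * eps)
    return grad


W = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]  # N = 2 outputs, M = 3 inputs
a_prev = [1.0, 0.5, -1.5]
b = [0.05, -0.1]
y = [1.0, 0.0]

analytic = analytic_grad_W(W, a_prev, b, y)
numeric = numeric_grad_W(W, a_prev, b, y)
max_error = max(abs(p - q)
                for row_a, row_n in zip(analytic, numeric)
                for p, q in zip(row_a, row_n))
```

If the formula is right, `max_error` should be tiny compared with the gradient entries themselves.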

Similarly, differentiating the loss with respect to the bias gives:

$$\frac{\partial\epsilon}{\partial\mathbf{b^{L}}}=\left(\sigma(\mathbf{W^{L}}\cdot\mathbf{a^{L-1}}+\mathbf{b^{L}})-\mathbf{y}\right)\odot\sigma'(\mathbf{W^{L}}\cdot\mathbf{a^{L-1}}+\mathbf{b^{L}})$$

We usually write $\mathbf{z^{L}}=\mathbf{W^{L}}\cdot\mathbf{a^{L-1}}+\mathbf{b^{L}}$ for the pre-activation output and $\boldsymbol{\delta}^{L}=\left(\sigma(\mathbf{z^{L}})-\mathbf{y}\right)\odot\sigma'(\mathbf{z^{L}})$ for the result of the Hadamard product. The gradients of the loss with respect to the parameters of the last layer are then:

$$
\begin{aligned}
\frac{\partial\epsilon}{\partial\mathbf{W^{L}}}&=\boldsymbol{\delta}^{L}\cdot(\mathbf{a^{L-1}})^{T}\\
\frac{\partial\epsilon}{\partial\mathbf{b^{L}}}&=\boldsymbol{\delta}^{L}
\end{aligned}
$$
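In code, the compact form amounts to one vector and one outer product. The sketch below is a minimal illustration under two assumptions that are not spelled out in the text: $\sigma$ is a sigmoid, and $\boldsymbol{\delta}^{L}$ carries the error term $\sigma(\mathbf{z^{L}})-\mathbf{y}$ from the element-wise derivative. The function names and sample values are hypothetical.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def forward_z(W, a_prev, b):
    """Pre-activation output z = W . a_prev + b."""
    return [sum(w * a for w, a in zip(row, a_prev)) + b_i
            for row, b_i in zip(W, b)]


def loss(W, a_prev, b, y):
    z = forward_z(W, a_prev, b)
    return 0.5 * sum((sigmoid(z_i) - y_i) ** 2 for z_i, y_i in zip(z, y))


def last_layer_gradients(W, a_prev, b, y):
    """Return (delta, dLoss/dW, dLoss/db) for the last layer."""
    z = forward_z(W, a_prev, b)
    # delta = (sigma(z) - y) Hadamard sigma'(z), with sigma' = s * (1 - s)
    delta = [(sigmoid(z_i) - y_i) * sigmoid(z_i) * (1.0 - sigmoid(z_i))
             for z_i, y_i in zip(z, y)]
    grad_W = [[d * a for a in a_prev] for d in delta]  # delta . a_prev^T
    grad_b = delta[:]
    return delta, grad_W, grad_b


W = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]
a_prev = [1.0, 0.5, -1.5]
b = [0.05, -0.1]
y = [1.0, 0.0]

delta, grad_W, grad_b = last_layer_gradients(W, a_prev, b, y)

# Finite-difference check of the bias gradient.
eps = 1e-6
numeric_b = []
for i in range(len(b)):
    up, down = b[:], b[:]
    up[i] += eps
    down[i] -= eps
    numeric_b.append((loss(W, a_prev, up, y) - loss(W, a_prev, down, y)) / (2 * eps))
```

The weight gradient has the same shape as `W`, and `grad_b` should match `numeric_b` up to rounding error.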

Wait a moment, it looks like we have derived something remarkable. For the parameters of layer $H$, we have:

$$
\begin{aligned}
\frac{\partial\epsilon}{\partial\mathbf{W^{H}}}&=\boldsymbol{\delta}^{H}\cdot(\mathbf{a^{H-1}})^{T}\qquad\text{(Eq.2)}\\
\frac{\partial\epsilon}{\partial\mathbf{b^{H}}}&=\boldsymbol{\delta}^{H}\qquad\text{(Eq.3)}\\
\text{where }\boldsymbol{\delta}^{H}&=\frac{\partial\epsilon}{\partial\mathbf{z^{L}}}\cdot\frac{\partial\mathbf{z^{L}}}{\partial\mathbf{z^{L-1}}}\cdots\frac{\partial\mathbf{z^{H+1}}}{\partial\mathbf{z^{H}}}
\end{aligned}
$$
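Eq.2 and Eq.3 can be checked without committing to any closed form for $\boldsymbol{\delta}^{H}$: measure $\partial\epsilon/\partial\mathbf{z^{H}}$ by finite differences, form the outer product with the layer input, and compare with direct finite differences on the weights and bias. The sketch below does this for a tiny two-layer network. It assumes sigmoid activations and that each layer feeds $\sigma(\mathbf{z})$ to the next layer; both are assumptions, and every name and number is made up for illustration.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def affine(W, a, b):
    """z = W . a + b."""
    return [sum(w * x for w, x in zip(row, a)) + b_i for row, b_i in zip(W, b)]


def loss_from_z1(z1, W2, b2, y):
    """Run the rest of the network starting from layer 1's pre-activation."""
    a1 = [sigmoid(t) for t in z1]
    z2 = affine(W2, a1, b2)
    return 0.5 * sum((sigmoid(t) - y_i) ** 2 for t, y_i in zip(z2, y))


def loss(W1, b1, W2, b2, a0, y):
    return loss_from_z1(affine(W1, a0, b1), W2, b2, y)


def central_difference(func, x, eps=1e-6):
    """Numerical gradient of a scalar func with respect to the list x."""
    grad = []
    for k in range(len(x)):
        plus, minus = x[:], x[:]
        plus[k] += eps
        minus[k] -= eps
        grad.append((func(plus) - func(minus)) / (2 * eps))
    return grad


a0 = [0.5, -1.0, 2.0]                      # network input
W1 = [[0.2, -0.1, 0.4], [-0.3, 0.6, 0.1]]  # hidden layer (H = 1)
b1 = [0.1, -0.2]
W2 = [[0.7, -0.5]]                         # last layer (L = 2)
b2 = [0.05]
y = [1.0]

# delta^H measured as the gradient of the loss with respect to z^H.
z1 = affine(W1, a0, b1)
delta1 = central_difference(lambda z: loss_from_z1(z, W2, b2, y), z1)

# Eq.2 and Eq.3: outer product with the layer input, and delta itself.
grad_W1 = [[d * a for a in a0] for d in delta1]
grad_b1 = delta1[:]

# Direct finite differences on the parameters, for comparison.
n_cols = len(a0)
flat_W1 = [w for row in W1 for w in row]


def loss_of_flat_W1(flat):
    rows = [flat[k:k + n_cols] for k in range(0, len(flat), n_cols)]
    return loss(rows, b1, W2, b2, a0, y)


direct_W1 = central_difference(loss_of_flat_W1, flat_W1)
direct_b1 = central_difference(lambda b: loss(W1, b, W2, b2, a0, y), b1)
```

Flattening `grad_W1` row by row should reproduce `direct_W1`, and `grad_b1` should reproduce `direct_b1`, both up to finite-difference error.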

Clearly, the key step is the derivative of one layer's pre-activation output with respect to the previous layer's pre-activation output, namely:

$$
\begin{aligned}
\frac{\partial\mathbf{Z}^{L}}{\partial \mathbf{Z}^{L-1}} &= \left\{\frac{\partial Z^{L}_{i}}{\partial Z^{L-1}_{j}}\right\}\\
\frac{\partial Z^{L}_{i}}{\partial Z^{L-1}_{j}} &= W^{L}_{ij}\cdot a^{L-1}_{j}\\
\text{which indicates}\quad \frac{\partial\mathbf{Z}^{L}}{\partial \mathbf{Z}^{L-1}} &= \mathbf{W}^{L}\cdot \mathrm{diag}(\mathbf{a}^{L-1})\\
\text{where}\quad \mathrm{diag}(\mathbf{a}^{L-1}) &= \begin{pmatrix} a_{1}^{L-1} & 0 & \cdots & 0\\ 0 & a_{2}^{L-1} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & a_{N^{L-1}}^{L-1} \end{pmatrix}
\end{aligned}
$$
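The layer-to-layer Jacobian can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses a sigmoid activation (the text does not fix one) and takes the diagonal factor to hold the activation's derivative at $Z^{L-1}$, which for an activation whose derivative equals its own output is exactly $\mathrm{diag}(\mathbf{a}^{L-1})$. All function names and numeric values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_preactivation(W, b, z_prev):
    """Z^L = W^L a^{L-1} + b^L, with a^{L-1} = sigmoid(Z^{L-1}) (assumed activation)."""
    a_prev = [sigmoid(v) for v in z_prev]
    return [sum(w * a for w, a in zip(row, a_prev)) + bi for row, bi in zip(W, b)]

def analytic_jacobian(W, z_prev):
    """Entry (i, j) is W[i][j] * sigmoid'(z_prev[j]), i.e. W times a diagonal matrix."""
    d = [sigmoid(v) * (1.0 - sigmoid(v)) for v in z_prev]
    return [[w * dj for w, dj in zip(row, d)] for row in W]

def numeric_jacobian(W, b, z_prev, h=1e-6):
    """Central finite differences: column j comes from perturbing z_prev[j]."""
    n_out, n_in = len(W), len(z_prev)
    J = [[0.0] * n_in for _ in range(n_out)]
    for j in range(n_in):
        plus = list(z_prev)
        minus = list(z_prev)
        plus[j] += h
        minus[j] -= h
        z_plus = layer_preactivation(W, b, plus)
        z_minus = layer_preactivation(W, b, minus)
        for i in range(n_out):
            J[i][j] = (z_plus[i] - z_minus[i]) / (2 * h)
    return J

# Hypothetical layer with N^L = 2 outputs and N^{L-1} = 3 inputs.
W = [[0.5, -1.0, 2.0], [1.5, 0.3, -0.7]]
b = [0.1, -0.2]
z_prev = [0.2, -0.4, 1.0]

J_analytic = analytic_jacobian(W, z_prev)
J_numeric = numeric_jacobian(W, b, z_prev)
max_err = max(
    abs(x - y)
    for row_a, row_n in zip(J_analytic, J_numeric)
    for x, y in zip(row_a, row_n)
)
print(max_err)  # small: only finite-difference error remains
```

The Jacobian has shape $N^{L}\times N^{L-1}$, one row per unit of layer $L$ and one column per unit of layer $L-1$.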

Substituting the expression above into $\delta^{H}$ gives:

$$
\begin{aligned}
\delta^{H} &= \left(\frac{\partial\mathbf{Z}^{L}}{\partial \mathbf{Z}^{L-1}}\cdots\frac{\partial\mathbf{Z}^{H+1}}{\partial \mathbf{Z}^{H}}\right)^{T}\cdot\delta^{L}\\
&= \Pi^{T}\left(\mathbf{W}^{L}\cdot \mathrm{diag}(\mathbf{a}^{L-1})\right)\cdot\delta^{L} &&\text{(Eq.4)}
\end{aligned}
$$

Here $\Pi^{T}(\cdot)$ denotes the transpose of the product of the per-layer factors $\mathbf{W}^{k}\cdot\mathrm{diag}(\mathbf{a}^{k-1})$, taken in the order $k = L, L-1, \dots, H+1$.
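As a worked instance of Eq.4, consider the case $H = L-2$, where the product contains only two factors. The expansion below uses nothing beyond $(AB)^{T}=B^{T}A^{T}$ and the symmetry of a diagonal matrix. It is meant as an illustration of how the transpose reverses the order of the factors, not as an additional result.

$$
\begin{aligned}
\delta^{L-2} &= \left(\mathbf{W}^{L}\cdot\mathrm{diag}(\mathbf{a}^{L-1})\cdot\mathbf{W}^{L-1}\cdot\mathrm{diag}(\mathbf{a}^{L-2})\right)^{T}\cdot\delta^{L}\\
&= \mathrm{diag}(\mathbf{a}^{L-2})\cdot(\mathbf{W}^{L-1})^{T}\cdot\mathrm{diag}(\mathbf{a}^{L-1})\cdot(\mathbf{W}^{L})^{T}\cdot\delta^{L}
\end{aligned}
$$

Read from right to left, $\delta^{L}$ is multiplied by one transposed layer factor at a time, so each step is a matrix-vector product and the full matrix product never has to be formed.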

From a dimensional point of view, writing each factor as (rows $*$ columns), the shapes in Eq.4 are:

$$
[(N^{L}*N^{L-1})\times(N^{L-1}*N^{L-2})\times\cdots\times(N^{H+1}*N^{H})]^{T}\times(N^{L}*1)=(N^{H}*1)
$$
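The shape bookkeeping can be replayed mechanically. The following sketch only tracks (rows, columns) pairs and assumes a hypothetical list `layer_sizes` with `layer_sizes[k]` standing for $N^{k}$. The helper names are illustrative and do not come from the original text.

```python
def matmul_shape(left, right):
    """Shape of the product of a (r, c) matrix with a (c, k) matrix."""
    if left[1] != right[0]:
        raise ValueError(f"inner dimensions differ: {left} x {right}")
    return (left[0], right[1])

def delta_shape(layer_sizes, H):
    """Shape of delta^H, where layer_sizes[k] plays the role of N^k for k = 0..L."""
    L = len(layer_sizes) - 1
    # Start from an identity-shaped (N^L, N^L) factor so that H == L also works.
    product = (layer_sizes[L], layer_sizes[L])
    # Each Jacobian dZ^k/dZ^{k-1} has shape (N^k, N^{k-1}), for k = L, L-1, ..., H+1.
    for k in range(L, H, -1):
        product = matmul_shape(product, (layer_sizes[k], layer_sizes[k - 1]))
    transposed = (product[1], product[0])
    # delta^L is a column vector of shape (N^L, 1).
    return matmul_shape(transposed, (layer_sizes[L], 1))

sizes = [4, 5, 3, 2]          # hypothetical N^0..N^3, so L = 3
print(delta_shape(sizes, 1))  # (5, 1), i.e. (N^1, 1)
```

A mismatch between adjacent layer sizes would raise `ValueError`, which is the same check the dimension formula performs by eye.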

It is then straightforward to write the parameter gradients of any layer:

$$
\begin{aligned}
\frac{\partial\epsilon}{\partial \mathbf{W}^{H}} &= \Pi^{T}\left(\mathbf{W}^{L}\cdot \mathrm{diag}(\mathbf{a}^{L-1})\right)\cdot\delta^{L}\cdot(\mathbf{a}^{H-1})^{T}\\
\frac{\partial\epsilon}{\partial \mathbf{b}^{H}} &= \Pi^{T}\left(\mathbf{W}^{L}\cdot \mathrm{diag}(\mathbf{a}^{L-1})\right)\cdot\delta^{L}
\end{aligned}
$$
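To see these gradient expressions at work, here is a minimal pure-Python sketch that propagates $\delta$ backwards one layer at a time and compares the resulting weight gradients with finite differences. Several things are assumptions made for illustration: a sigmoid activation, a squared-error loss $\epsilon=\tfrac12\lVert\mathbf{a}^{L}-\mathbf{y}\rVert^{2}$, the diagonal factor taken as the activation derivative, and the tiny 2-3-2 network with its numeric values. None of the function names come from the original text.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(Ws, bs, x):
    """Return the list of activations; acts[0] is the network input."""
    acts = [x]
    for W, b in zip(Ws, bs):
        z = [sum(w * a for w, a in zip(row, acts[-1])) + bi for row, bi in zip(W, b)]
        acts.append([sigmoid(v) for v in z])
    return acts

def loss(Ws, bs, x, y):
    """Assumed squared-error loss 0.5 * ||a^L - y||^2."""
    out = forward(Ws, bs, x)[-1]
    return 0.5 * sum((a - t) ** 2 for a, t in zip(out, y))

def backward(Ws, bs, x, y):
    """Gradients of the loss w.r.t. every weight matrix and bias vector."""
    acts = forward(Ws, bs, x)
    # Output layer: delta^L = d(loss)/dZ^L for sigmoid plus squared error.
    delta = [(a - t) * a * (1.0 - a) for a, t in zip(acts[-1], y)]
    dWs, dbs = [None] * len(Ws), [None] * len(Ws)
    for k in reversed(range(len(Ws))):
        a_prev = acts[k]
        dWs[k] = [[d * a for a in a_prev] for d in delta]  # delta times a_prev^T
        dbs[k] = list(delta)
        if k > 0:
            # Apply the transposed Jacobian: sigmoid'(z_j) * sum_i W[i][j] * delta[i].
            delta = [
                a_prev[j] * (1.0 - a_prev[j])
                * sum(Ws[k][i][j] * delta[i] for i in range(len(delta)))
                for j in range(len(a_prev))
            ]
    return dWs, dbs

# Hypothetical 2-3-2 network and a single training pair.
Ws = [
    [[0.2, -0.5], [0.7, 0.1], [-0.3, 0.8]],
    [[0.4, -0.6, 0.9], [-0.2, 0.3, 0.5]],
]
bs = [[0.1, -0.1, 0.05], [0.0, 0.2]]
x, y = [0.5, -1.2], [1.0, 0.0]

dWs, dbs = backward(Ws, bs, x, y)

def numeric_dW(k, i, j, h=1e-6):
    """Central finite difference of the loss w.r.t. one weight."""
    original = Ws[k][i][j]
    Ws[k][i][j] = original + h
    up = loss(Ws, bs, x, y)
    Ws[k][i][j] = original - h
    down = loss(Ws, bs, x, y)
    Ws[k][i][j] = original
    return (up - down) / (2 * h)

max_err = max(
    abs(dWs[k][i][j] - numeric_dW(k, i, j))
    for k in range(len(Ws))
    for i in range(len(Ws[k]))
    for j in range(len(Ws[k][i]))
)
print(max_err)  # small: only finite-difference error remains
```

The backward loop never builds the full product of Jacobians. It reuses the $\delta$ of the layer above, which is why the cost per layer stays at one matrix-vector product.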

Reposted from: https://blog.csdn.net/qq_40840924/article/details/124454175
Copyright belongs to the original author, 粉粉Shawn.
