The DNN Backpropagation Process
Differentiation of Multivariate Functions
A loss function is always a scalar function: a norm-based loss maps a vector to a scalar. Computing the derivative of the loss with respect to the input of layer $L$ is therefore a scalar-by-vector derivative. In fact, an array of any dimensionality can be viewed as a flat list of independent variables of a multivariate function.
For example, an $m\times n$ matrix $\{W_{ij}\}$ can be flattened into such a list of independent variables:

$$\{W_{ij}\}\rightarrow(W_{11},W_{12},\dots,W_{mn})$$
A scalar function of $\{W_{ij}\}$ can then be viewed as a multivariate function of $(W_{11},W_{12},\dots,W_{mn})$, and the gradient of that multivariate function is exactly the result of differentiating the scalar function with respect to the matrix. Recall that the gradient of a multivariate function is written as:
$$\frac{\partial f}{\partial \overrightarrow{x}}=\left(\frac{\partial f}{\partial x_{1}},\ \frac{\partial f}{\partial x_{2}},\ \dots,\ \frac{\partial f}{\partial x_{n}}\right)$$
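To make the flattening idea concrete, here is a minimal NumPy sketch. The scalar function `f` and all names are made up purely for illustration; it treats a matrix as a flat list of variables and estimates every partial derivative by central finite differences:

```python
import numpy as np

def f(W):
    # An arbitrary scalar function of a matrix, used only for illustration.
    return np.sum(np.tanh(W)) ** 2

def numerical_gradient(f, W, eps=1e-6):
    """Estimate df/dW_ij for every entry by central differences,
    i.e. treat W as a flat list of independent variables."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (f(W_plus) - f(W_minus)) / (2 * eps)
    return grad

W = np.random.randn(3, 4)
print(numerical_gradient(f, W))   # same shape as W: one partial derivative per entry
```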
Vector-by-Vector Differentiation
A vector-valued function can be viewed as a vector of scalar multivariate functions. For example, consider a vector function $G$ that maps a vector $B$ to a vector $A$:
$$A=G(B),\quad\text{where } A\in R^{N\times 1},\ B\in R^{M\times 1}$$
If we view $A$ as a vector of scalar multivariate functions, differentiation becomes much more convenient:
$$\begin{aligned} A&=\big(a_{1}(b_{1},b_{2},\dots,b_{m}),\ a_{2}(b_{1},b_{2},\dots,b_{m}),\ \dots\big)\\ \frac{\partial A}{\partial B}&=\left(\frac{\partial a_{1}}{\partial B},\ \frac{\partial a_{2}}{\partial B},\ \dots\right)\\ &=\begin{pmatrix} \frac{\partial a_{1}}{\partial b_{1}} & \cdots & \frac{\partial a_{1}}{\partial b_{m}}\\ \frac{\partial a_{2}}{\partial b_{1}} & \cdots & \frac{\partial a_{2}}{\partial b_{m}}\\ \vdots & \ddots & \vdots\\ \frac{\partial a_{n}}{\partial b_{1}} & \cdots & \frac{\partial a_{n}}{\partial b_{m}} \end{pmatrix} \end{aligned}$$
Wow, see, vector differentiation is much clearer now. Of course, it does not matter whether you lay out the derivative as an $n\times m$ matrix or an $m\times n$ matrix, as long as you stay consistent throughout the derivation.
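The same finite-difference idea extends to vector-valued functions: stacking the gradient of each component $a_{i}(B)$ as a row reproduces exactly the $n\times m$ layout above. A small sketch, with an arbitrary made-up function `G` (the shapes are assumptions for the example):

```python
import numpy as np

def G(B):
    # Arbitrary vector function R^3 -> R^2, for illustration only.
    return np.array([B[0] * B[1], np.sin(B[2]) + B[0] ** 2])

def numerical_jacobian(G, B, eps=1e-6):
    """Column j holds the finite-difference derivative of G w.r.t. b_j,
    giving an (n x m) Jacobian in the layout used above."""
    B = B.astype(float)
    n, m = G(B).shape[0], B.shape[0]
    J = np.zeros((n, m))
    for j in range(m):
        B_plus, B_minus = B.copy(), B.copy()
        B_plus[j] += eps
        B_minus[j] -= eps
        J[:, j] = (G(B_plus) - G(B_minus)) / (2 * eps)
    return J

B = np.array([0.5, -1.0, 2.0])
print(numerical_jacobian(G, B))   # shape (2, 3): entry (i, j) is d a_i / d b_j
```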
Differentiating the DNN Loss Function
The loss function of a neural network is always a scalar function. Common choices include the L1 norm loss, the L2 norm loss, and so on. Taking the L2 norm loss as an example, the loss of a typical fully connected network is:
$$\epsilon = \frac{1}{2}\,||\sigma(\mathbf{a^{L}})-\mathbf{y}||^{2} \qquad @Eq.1$$
where $\mathbf{a^{L}}=\mathbf{W^{L}}\cdot\mathbf{a^{L-1}}+\mathbf{b^{L}}$, with $\mathbf{a^{L}},\mathbf{b^{L}}\in R^{N_{L}}$ and $\mathbf{W^{L}}\in R^{N_{L}\times N_{L-1}}$, is the pre-activation output of layer $L$, and $\mathbf{y}$ is the ground truth. Now, how do we compute the gradient of the loss with respect to $\mathbf{W^{L}}$ and $\mathbf{b^{L}}$? We only have to expand Eq.1 into the following expression:
$$\begin{aligned} \epsilon &= \frac{1}{2}\sum_{i}^{N}\Big[\sigma\Big(\sum_{j}^{M}W_{ij}^{L}\,a^{L-1}_{j}+b_{i}^{L}\Big)-y_{i}\Big]^{2}\\ \frac{\partial\epsilon}{\partial W_{xy}^{L}} &= \Big[\sigma\Big(\sum_{j}^{M}W_{xj}^{L}\,a^{L-1}_{j}+b_{x}^{L}\Big)-y_{x}\Big]\times\sigma'\Big(\sum_{j}^{M}W_{xj}^{L}\,a^{L-1}_{j}+b_{x}^{L}\Big)\times a_{y}^{L-1}\\ \text{so,}\ \frac{\partial\epsilon}{\partial \mathbf{W^{L}}}&=\Big\{\frac{\partial\epsilon}{\partial W_{xy}^{L}}\Big\}_{x:1\rightarrow N,\ y:1\rightarrow M}\\ &\quad\text{then, surprisingly,}\\ &=\Big[\big(\sigma(\mathbf{W^{L}}\cdot\mathbf{a^{L-1}}+\mathbf{b^{L}})-\mathbf{y}\big)\odot\sigma'(\mathbf{W^{L}}\cdot\mathbf{a^{L-1}}+\mathbf{b^{L}})\Big]\cdot (\mathbf{a^{L-1}})^{T} \end{aligned}$$
Similarly, differentiating the loss with respect to the bias gives:
$$\frac{\partial\epsilon}{\partial \mathbf{b^{L}}}=\big(\sigma(\mathbf{W^{L}}\cdot\mathbf{a^{L-1}}+\mathbf{b^{L}})-\mathbf{y}\big)\odot\sigma'(\mathbf{W^{L}}\cdot\mathbf{a^{L-1}}+\mathbf{b^{L}})$$
We usually write $\mathbf{z^{L}}=\mathbf{W^{L}}\cdot\mathbf{a^{L-1}}+\mathbf{b^{L}}$ for the pre-activation output, and $\mathbf{\delta^{L}}=\big(\sigma(\mathbf{z^{L}})-\mathbf{y}\big)\odot\sigma'(\mathbf{z^{L}})$ for the Hadamard-product term above. The gradient of the loss with respect to the parameters of the last layer is then:
$$\begin{aligned} \frac{\partial\epsilon}{\partial \mathbf{W^{L}}}&=\mathbf{\delta^{L}}\cdot (\mathbf{a^{L-1}})^{T}\\ \frac{\partial\epsilon}{\partial \mathbf{b^{L}}}&=\mathbf{\delta^{L}} \end{aligned}$$
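As a sanity check, here is a minimal NumPy sketch of these last-layer gradients. It assumes a sigmoid activation and purely illustrative sizes and values, and compares one entry of $\partial\epsilon/\partial\mathbf{W^{L}}$ against a finite-difference estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Toy sizes and parameters for the last layer only (all values are illustrative).
N_L, N_Lm1 = 3, 4
rng = np.random.default_rng(0)
W_L = rng.standard_normal((N_L, N_Lm1))
b_L = rng.standard_normal(N_L)
a_Lm1 = rng.standard_normal(N_Lm1)                    # activation of layer L-1
y = rng.standard_normal(N_L)                          # ground truth

z_L = W_L @ a_Lm1 + b_L                               # pre-activation output of layer L
delta_L = (sigmoid(z_L) - y) * sigmoid_prime(z_L)     # delta^L = (sigma(z^L) - y) ⊙ sigma'(z^L)

grad_W_L = np.outer(delta_L, a_Lm1)                   # d eps / d W^L = delta^L (a^{L-1})^T
grad_b_L = delta_L                                    # d eps / d b^L = delta^L

# Finite-difference check of one weight entry against the formula.
def loss(W):
    return 0.5 * np.sum((sigmoid(W @ a_Lm1 + b_L) - y) ** 2)

eps = 1e-6
W_pert = W_L.copy(); W_pert[1, 2] += eps
print(grad_W_L[1, 2], (loss(W_pert) - loss(W_L)) / eps)   # the two numbers should match
```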
Hold on, it seems we have just derived something remarkable. If we differentiate with respect to the parameters of layer $H$, we get:
$$\begin{aligned} \frac{\partial\epsilon}{\partial \mathbf{W^{H}}}&=\mathbf{\delta^{H}}\cdot (\mathbf{a^{H-1}})^{T} \qquad @Eq.2\\ \frac{\partial\epsilon}{\partial \mathbf{b^{H}}}&=\mathbf{\delta^{H}} \qquad\qquad\qquad\ \, @Eq.3\\[2mm] \text{where}\ \mathbf{\delta^{H}}&=\frac{\partial\epsilon}{\partial \mathbf{z^{L}}}\cdot\frac{\partial\mathbf{z^{L}}}{\partial \mathbf{z^{L-1}}}\cdots\frac{\partial\mathbf{z^{H+1}}}{\partial \mathbf{z^{H}}} \end{aligned}$$
Clearly, the key step is the derivative of one layer's pre-activation output with respect to the previous layer's pre-activation output, i.e.:
$$\begin{aligned} \frac{\partial\mathbf{z^{L}}}{\partial \mathbf{z^{L-1}}}&=\Big\{\frac{\partial z^{L}_{i}}{\partial z^{L-1}_{j}}\Big\}\\ \frac{\partial z^{L}_{i}}{\partial z^{L-1}_{j}}&=W^{L}_{ij}\cdot \sigma'(z^{L-1}_{j})\\ \text{which indicates}\ \frac{\partial\mathbf{z^{L}}}{\partial \mathbf{z^{L-1}}}&=\mathbf{W^{L}}\cdot \mathrm{diag}\big(\sigma'(\mathbf{z^{L-1}})\big)\\ \text{where}\ \mathrm{diag}\big(\sigma'(\mathbf{z^{L-1}})\big)&=\begin{pmatrix} \sigma'(z_{1}^{L-1}) & 0 & \cdots & 0\\ 0 & \sigma'(z_{2}^{L-1}) & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \sigma'(z_{N_{L-1}}^{L-1}) \end{pmatrix} \end{aligned}$$

(Here $\mathbf{a^{L-1}}=\sigma(\mathbf{z^{L-1}})$, so $z^{L}_{i}=\sum_{j}W^{L}_{ij}\,\sigma(z^{L-1}_{j})+b^{L}_{i}$, which is where the $\sigma'$ factor comes from.)
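This Jacobian identity can be checked numerically. The sketch below assumes a sigmoid activation and arbitrary toy shapes, and compares the analytic $\mathbf{W^{L}}\cdot \mathrm{diag}\big(\sigma'(\mathbf{z^{L-1}})\big)$ against a finite-difference Jacobian of the map $\mathbf{z^{L-1}}\mapsto\mathbf{z^{L}}$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(1)
N_L, N_Lm1 = 3, 5
W_L = rng.standard_normal((N_L, N_Lm1))
b_L = rng.standard_normal(N_L)
z_Lm1 = rng.standard_normal(N_Lm1)                # pre-activation of layer L-1

def z_L(z_prev):
    # z^L = W^L sigma(z^{L-1}) + b^L
    return W_L @ sigmoid(z_prev) + b_L

# Analytic Jacobian: W^L · diag(sigma'(z^{L-1}))
J_analytic = W_L @ np.diag(sigmoid_prime(z_Lm1))

# Finite-difference Jacobian, built column by column.
eps = 1e-6
J_numeric = np.zeros((N_L, N_Lm1))
for j in range(N_Lm1):
    z_plus, z_minus = z_Lm1.copy(), z_Lm1.copy()
    z_plus[j] += eps
    z_minus[j] -= eps
    J_numeric[:, j] = (z_L(z_plus) - z_L(z_minus)) / (2 * eps)

print(np.max(np.abs(J_analytic - J_numeric)))     # tiny value: the two Jacobians agree
```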
Substituting the Jacobian $\partial\mathbf{z^{L}}/\partial\mathbf{z^{L-1}}$ above into $\delta^{H}$, we obtain:
$$\begin{aligned} \delta^{H} &= \Big(\frac{\partial\mathbf{z^{L}}}{\partial \mathbf{z^{L-1}}}\cdots\frac{\partial\mathbf{z^{H+1}}}{\partial \mathbf{z^{H}}}\Big)^{T}\cdot\delta^{L}\\ &= \Big[\mathbf{W^{L}}\,\mathrm{diag}\big(\sigma'(\mathbf{z^{L-1}})\big)\cdot\mathbf{W^{L-1}}\,\mathrm{diag}\big(\sigma'(\mathbf{z^{L-2}})\big)\cdots\mathbf{W^{H+1}}\,\mathrm{diag}\big(\sigma'(\mathbf{z^{H}})\big)\Big]^{T}\cdot\delta^{L} \qquad @Eq.4 \end{aligned}$$
Analyzing Eq.4 from the dimension perspective:
$$\big[(N_{L}\times N_{L-1})\,(N_{L-1}\times N_{L-2})\cdots(N_{H+1}\times N_{H})\big]^{T}\times(N_{L}\times 1)=(N_{H}\times 1)$$
From this, it is not hard to obtain the gradient expression for the parameters of an arbitrary layer:
$$\begin{aligned} \frac{\partial\epsilon}{\partial \mathbf{W^{H}}}&=\Big[\mathbf{W^{L}}\,\mathrm{diag}\big(\sigma'(\mathbf{z^{L-1}})\big)\cdots\mathbf{W^{H+1}}\,\mathrm{diag}\big(\sigma'(\mathbf{z^{H}})\big)\Big]^{T}\cdot\delta^{L}\cdot (\mathbf{a^{H-1}})^{T}\\ \frac{\partial\epsilon}{\partial \mathbf{b^{H}}}&=\Big[\mathbf{W^{L}}\,\mathrm{diag}\big(\sigma'(\mathbf{z^{L-1}})\big)\cdots\mathbf{W^{H+1}}\,\mathrm{diag}\big(\sigma'(\mathbf{z^{H}})\big)\Big]^{T}\cdot\delta^{L} \end{aligned}$$
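Putting Eq.2 through Eq.4 together: peeling the product in Eq.4 one layer at a time gives the recursion $\delta^{H}=\sigma'(\mathbf{z^{H}})\odot\big((\mathbf{W^{H+1}})^{T}\delta^{H+1}\big)$, which is how backpropagation is usually implemented. The sketch below is a compact illustration under the assumptions of a sigmoid activation and a tiny random network (all sizes, names, and data are made up for the example); it ends with a finite-difference check of one weight gradient:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# A toy fully connected net with sigmoid activations (sizes are arbitrary).
sizes = [4, 5, 3, 2]
rng = np.random.default_rng(2)
Ws = [rng.standard_normal((n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal(n_out) for n_out in sizes[1:]]
x = rng.standard_normal(sizes[0])
y = rng.standard_normal(sizes[-1])

# Forward pass: keep every z^h and a^h for use in the backward pass.
a, zs, activations = x, [], [x]
for W, b in zip(Ws, bs):
    z = W @ a + b
    zs.append(z)
    a = sigmoid(z)
    activations.append(a)

# Backward pass: delta^L = (sigma(z^L) - y) ⊙ sigma'(z^L), then peel Eq.4 layer by layer.
delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
grads_W, grads_b = [], []
for h in reversed(range(len(Ws))):
    grads_W.insert(0, np.outer(delta, activations[h]))   # Eq.2: delta^H (a^{H-1})^T
    grads_b.insert(0, delta)                              # Eq.3: delta^H
    if h > 0:
        # delta^H = sigma'(z^H) ⊙ (W^{H+1})^T delta^{H+1}
        delta = sigmoid_prime(zs[h - 1]) * (Ws[h].T @ delta)

# Finite-difference check of one first-layer weight against Eq.2.
def loss(W0):
    a = x
    for W, b in zip([W0] + Ws[1:], bs):
        a = sigmoid(W @ a + b)
    return 0.5 * np.sum((a - y) ** 2)

eps = 1e-6
W0_pert = Ws[0].copy(); W0_pert[1, 2] += eps
print(grads_W[0][1, 2], (loss(W0_pert) - loss(Ws[0])) / eps)   # should match closely
```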