Paper title: Learning Important Features Through Propagating Activation Differences
Paper authors: Avanti Shrikumar, Peyton Greenside, Anshul Kundaje
Publication venue and date: ICML 2017 (PMLR volume 70)
Paper link: http://proceedings.mlr.press/v70/shrikumar17a/shrikumar17a.pdf
The DeepLIFT Method
1. DeepLIFT Theory
DeepLIFT explains the difference between the target input/output and a "reference" input/output. The reference input is a neutral input chosen by hand.
Let $x_{i}$ denote the neurons of some layer or set of layers, and let $x_{i}^{0}$ be the corresponding reference value, so that $\Delta x_{i}=x_{i}-x_{i}^{0}$. Let $t$ denote the output obtained when the target input passes through the $x_{i}$ (when the $x_{i}$ are the full set of neurons, $t$ is the target output), and let $t^{0}$ denote the reference output, so that $\Delta t=t-t^{0}$. As in Eq. (1), $\Delta t$ is the sum of the contribution scores of the individual inputs.
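For reference, the summation-to-delta property that the text calls Eq. (1) reads:

$$\sum_{i=1}^{n} C_{\Delta x_{i}\Delta t}=\Delta t$$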
2. Multipliers and the Chain Rule
A multiplier is analogous to a partial derivative: the partial derivative $\frac{\partial t}{\partial x}$ is the rate of change of $t$ under an infinitesimal change in $x$, whereas the multiplier is the rate of change of $t$ over a finite change in $x$.
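The corresponding definitions in the paper are the multiplier (Eq. (2)) and the chain rule it obeys (Eq. (3)):

$$m_{\Delta x\Delta t}=\frac{C_{\Delta x\Delta t}}{\Delta x},\qquad m_{\Delta x_{i}\Delta t}=\sum_{j} m_{\Delta x_{i}\Delta y_{j}}\,m_{\Delta y_{j}\Delta t}$$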
Here $\Delta y_{j}$ can be understood as the $\Delta t$ of an intermediate layer. Given the multipliers of every neuron with respect to its immediate successors, the multiplier of any neuron with respect to the target neuron can be computed by this chain rule.
3. Defining the Reference
For MNIST, an all-black image is used as the reference.
For CIFAR-10, using a blurred version of the original image as the reference highlights the outline of the target input, whereas an all-black reference produces some hard-to-interpret pixels.
For DNA sequence classification, the expected frequencies of A, T, G and C are used as the reference: the target input is a four-channel one-hot encoding, and the reference input has the same shape but holds the expected ATGC frequencies at every position. There is a further approach described there that I did not fully understand; see Appendix J.
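A minimal NumPy sketch of these three reference choices (my own illustration, not code from the paper; the blur width and the background base frequencies are placeholder values):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# MNIST: an all-black (all-zero) image as the reference.
mnist_image = np.random.rand(28, 28)            # placeholder target input
mnist_reference = np.zeros_like(mnist_image)

# CIFAR-10: a blurred copy of the original image as the reference.
# The blur width (sigma) is an arbitrary illustrative choice.
cifar_image = np.random.rand(32, 32, 3)
cifar_reference = gaussian_filter(cifar_image, sigma=(3, 3, 0))

# DNA: the target input is a one-hot sequence of shape (length, 4);
# the reference repeats the expected base frequencies at every position
# (the frequencies below are assumed values, not the paper's).
seq_onehot = np.eye(4)[np.random.randint(0, 4, size=100)]   # (100, 4), order A, T, G, C
background = np.array([0.3, 0.3, 0.2, 0.2])                 # assumed expected A, T, G, C frequencies
dna_reference = np.tile(background, (seq_onehot.shape[0], 1))
```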
4. Separating Positive and Negative Contributions
Separating positive from negative contributions matters in particular when the RevealCancel rule is applied.
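Concretely, every difference and every contribution is split into a positive and a negative part:

$$\Delta x_{i}=\Delta x_{i}^{+}+\Delta x_{i}^{-},\qquad C_{\Delta x_{i}\Delta t}=C_{\Delta x_{i}^{+}\Delta t}+C_{\Delta x_{i}^{-}\Delta t}$$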
5. Rules for Assigning Contribution Scores
Linear rule
Used for dense and convolutional layers; not applicable to nonlinear layers. For a linear function $y=b+\sum_{i}\omega_{i}x_{i}$, we have $\Delta y=\sum_{i}\omega_{i}\Delta x_{i}$.
The formulas look a bit involved, so here is a worked example. The reference input gives $[0,0]\cdot[3,4]^{T}=0$ and the target input gives $[1,2]\cdot[3,4]^{T}=11$. By Eq. (6), $\Delta y^{+}=(1-0)\times 3+(2-0)\times 4=11$; by Eq. (8), the contribution scores of the two features of $[1,2]$ are 3 and 8; and by Eq. (12), the multipliers of the two neurons are 3 and 4. The point of the multipliers is composition: if the network has two linear layers, with $[3,4]^{T}$ as the first-layer weights and $[5]$ as the second-layer weight, then the multiplier of the second layer is 5, and by Eq. (3) the multiplier of the whole network is 15 for the first feature and 20 for the second; multiplying each input difference by its multiplier gives its contribution score.
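A short NumPy sketch (my own, not the paper's code) that reproduces the numbers in this example:

```python
import numpy as np

# Linear rule on one layer: weights w = [3, 4], reference input [0, 0], target input [1, 2].
w = np.array([3.0, 4.0])
x, x_ref = np.array([1.0, 2.0]), np.array([0.0, 0.0])

dx = x - x_ref                 # differences from the reference: [1, 2]
contrib = w * dx               # linear-rule contributions: [3, 8]
dy = contrib.sum()             # summation-to-delta: 11
multipliers = w                # for a linear layer the multipliers are the weights: [3, 4]

# Stack a second linear layer with weight 5: by the chain rule the end-to-end
# multipliers become [15, 20], and multiplying each input difference by its
# multiplier gives contributions that again sum to delta-t.
m_total = multipliers * 5.0            # [15, 20]
total_contrib = dx * m_total           # [15, 40], sums to 55 = 5 * 11
```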
Rescale rule
Used for nonlinear layers such as ReLU, tanh or sigmoid. Since a nonlinearity $y=f(x)$ has a single input, $C_{\Delta x\Delta y}=\Delta y$ and $m_{\Delta x\Delta y}=\frac{\Delta y}{\Delta x}$, and $\Delta y^{+}$ and $\Delta y^{-}$ are, respectively:
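$$\Delta y^{+}=\frac{\Delta y}{\Delta x}\,\Delta x^{+}=m_{\Delta x\Delta y}\,\Delta x^{+},\qquad \Delta y^{-}=\frac{\Delta y}{\Delta x}\,\Delta x^{-}=m_{\Delta x\Delta y}\,\Delta x^{-}$$

That is, the positive and negative parts of $\Delta x$ are scaled by the same multiplier.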
When $x\rightarrow x^{0}$, $m_{\Delta x\Delta y}$ can be replaced by the gradient to avoid numerical instability. The Rescale rule addresses the gradient-saturation problem and the thresholding problem; see the paper for examples.
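A small sketch of the Rescale multiplier with the gradient fallback near the reference (the ReLU example values below are my own illustration):

```python
def rescale_multiplier(f, grad_f, x, x_ref, eps=1e-7):
    """Rescale-rule multiplier m = (f(x) - f(x_ref)) / (x - x_ref) for a
    one-input nonlinearity, falling back to the gradient when x is very
    close to its reference to avoid numerical instability."""
    dx = x - x_ref
    if abs(dx) < eps:
        return grad_f(x)
    return (f(x) - f(x_ref)) / dx

relu = lambda v: max(0.0, v)
relu_grad = lambda v: 1.0 if v > 0 else 0.0

# In the saturated region the plain gradient is 0, but the multiplier is not:
m = rescale_multiplier(relu, relu_grad, x=-2.0, x_ref=1.0)   # (0 - 1) / (-3) = 1/3
```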
RevealCancel rule
This rule explains why $\Delta y^{+}$ and $\Delta y^{-}$ need to be computed separately. The figure in the paper shows a network that computes a minimum, $o=\min(i_{1},i_{2})$, via $h_{1}=i_{1}-i_{2}$, $h_{2}=\max(0,h_{1})$ and $o=i_{1}-h_{2}$. Suppose $i_{1}^{0}=i_{2}^{0}=0$ and the target inputs are $i_{1}=3$, $i_{2}=1$; then $h_{1}=3-1=2>0$ and $h_{2}=\max(0,h_{1})=2$. By the linear rule, $C_{\Delta i_{1}\Delta h_{1}}=i_{1}=3$ and $C_{\Delta i_{2}\Delta h_{1}}=-i_{2}=-1$. By the Rescale rule, $m_{\Delta h_{1}\Delta h_{2}}=\frac{\Delta h_{2}}{\Delta h_{1}}=1$, so $C_{\Delta i_{1}\Delta h_{2}}=m_{\Delta h_{1}\Delta h_{2}}C_{\Delta i_{1}\Delta h_{1}}=3$ and $C_{\Delta i_{2}\Delta h_{2}}=m_{\Delta h_{1}\Delta h_{2}}C_{\Delta i_{2}\Delta h_{1}}=-1$. The total contribution of $i_{1}$ is then $C_{\Delta i_{1}\Delta o}=\Delta i_{1}m_{\Delta i_{1}\Delta o}=\Delta i_{1}(1+m_{\Delta i_{1}\Delta h_{1}}m_{\Delta h_{1}\Delta h_{2}}m_{\Delta h_{2}\Delta o})=0$, and the total contribution of $i_{2}$ is $C_{\Delta i_{2}\Delta o}=\Delta i_{2}m_{\Delta i_{2}\Delta o}=\Delta i_{2}m_{\Delta i_{2}\Delta h_{1}}m_{\Delta h_{1}\Delta h_{2}}m_{\Delta h_{2}\Delta o}=1$.
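A minimal sketch of the chain-rule computation above (references are all zero, so each delta equals the raw value):

```python
# Network from the example: h1 = i1 - i2, h2 = ReLU(h1), o = i1 - h2 = min(i1, i2).
i1, i2 = 3.0, 1.0
h1 = i1 - i2                        # 2
h2 = max(0.0, h1)                   # 2
o = i1 - h2                         # 1 = min(i1, i2)

m_h1_h2 = h2 / h1                   # Rescale rule at the ReLU: 1
m_i1_o = 1 + 1 * m_h1_h2 * (-1)     # direct edge plus the path through h1 and h2: 0
m_i2_o = (-1) * m_h1_h2 * (-1)      # path through h1 and h2: 1

print(i1 * m_i1_o, i2 * m_i2_o)     # contributions 0.0 and 1.0
```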
The gradient and input*gradient methods likewise assign a contribution score of 0 to one of the two features, which ignores the dependence between them. The formulas for computing $\Delta y^{+}$ and $\Delta y^{-}$ separately are:
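$$\Delta y^{+}=\tfrac{1}{2}\Big(f(x^{0}+\Delta x^{+})-f(x^{0})\Big)+\tfrac{1}{2}\Big(f(x^{0}+\Delta x^{-}+\Delta x^{+})-f(x^{0}+\Delta x^{-})\Big)$$
$$\Delta y^{-}=\tfrac{1}{2}\Big(f(x^{0}+\Delta x^{-})-f(x^{0})\Big)+\tfrac{1}{2}\Big(f(x^{0}+\Delta x^{+}+\Delta x^{-})-f(x^{0}+\Delta x^{+})\Big)$$

In words, the impact of $\Delta x^{+}$ is averaged over adding it before and after $\Delta x^{-}$, and vice versa.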
With this rule, the contribution scores of $i_{1}$ and $i_{2}$ both come out to 0.5. Roughly speaking, the output of each neuron is split into positive and negative parts, multipliers and contribution scores are computed along the two paths separately, and the results are then summed. The computation is somewhat involved; see Appendix C 3.4 for details.
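A rough numerical sketch (my own, based on the RevealCancel formulas above) of the computation at the ReLU of the min network, showing how both inputs end up with 0.5:

```python
relu = lambda v: max(0.0, v)
h1_ref = 0.0
dh1_pos, dh1_neg = 3.0, -1.0    # positive part of delta-h1 (from i1) and negative part (from i2)

# Average the effect of adding the positive part before and after the negative part.
dh2_pos = 0.5 * (relu(h1_ref + dh1_pos) - relu(h1_ref)) \
        + 0.5 * (relu(h1_ref + dh1_neg + dh1_pos) - relu(h1_ref + dh1_neg))    # 2.5
dh2_neg = 0.5 * (relu(h1_ref + dh1_neg) - relu(h1_ref)) \
        + 0.5 * (relu(h1_ref + dh1_pos + dh1_neg) - relu(h1_ref + dh1_pos))    # -0.5

m_pos = dh2_pos / dh1_pos       # multiplier for the positive part: 2.5 / 3
m_neg = dh2_neg / dh1_neg       # multiplier for the negative part: 0.5

# Propagate to o = i1 - h2 (i1 also has a direct edge to o with weight 1):
c_i1 = 3.0 * 1 + (-1) * (3.0 * m_pos)     # 3 - 2.5 = 0.5
c_i2 = (-1) * (-1.0 * m_neg)              # 0.5
print(c_i1, c_i2)
```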