
1. Overview

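Bidirectional Encoder Representations from Transformers (BERT) [1] is a context-based pre-trained model proposed in 2018. It learns a general-purpose embedding for each word from a large corpus, and because the resulting semantic vectors depend on the surrounding context, it can model polysemous words. Its relationship to the pre-trained language models ELMo [2] and GPT [3] is shown in the figure below:
Embeddings from Language Models (ELMo) [2], Generative Pre-Training (GPT) [3] and BERT [1] are all context-based pre-trained models, and all follow a two-stage process: the first stage pre-trains a language model in an unsupervised way, and the second stage fine-tunes it on a specific language task with supervision. They differ in the feature extractor. ELMo uses a bidirectional LSTM. GPT uses the Transformer [4] as a unidirectional language model; compared with the LSTM in ELMo, Transformer-based models have better feature extraction capability. BERT also uses a Transformer-based feature extractor, but differs from GPT in two respects: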

  • First, the Transformer in BERT is bidirectional, which further improves feature extraction.
  • Second, GPT uses the Decoder of the Transformer, whereas BERT uses the Encoder of the Transformer.

2. Algorithm Principles

2.1. Transformer Structure



2.2. Basic Principles of BERT



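To make BERT adaptable to more applications, the pre-training stage uses two tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP). MLM learns word-level embeddings, and NSP learns sentence-level embeddings. The Transformer adds a position vector to each word vector at the input; to fit both MLM and NSP, the input embedding in BERT is instead the sum of three embeddings, as shown in the figure below: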


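Here, Token Embeddings are the word vectors; the first token is the [CLS] marker, which can be used for downstream classification tasks. Segment Embeddings distinguish the two sentences, which serves the NSP input during pre-training. Position Embeddings are the position vectors, but unlike in the Transformer they are learned, just like the word vectors. The input contains two special tokens: one is [CLS] and the other is [SEP].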
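The sum of the three embeddings can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy vocabulary, the embedding size of 4, the random initialisation, and the helper names `make_table` and `embed` are all hypothetical and do not come from the BERT code base.

```python
import random

def make_table(num_rows, dim, rng):
    """Create a small randomly initialised lookup table (num_rows x dim)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]

def embed(token_ids, segment_ids, token_table, segment_table, position_table):
    """Return one vector per position: token + segment + position embedding."""
    output = []
    for position, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vec = [t + s + p for t, s, p in zip(token_table[tok],
                                            segment_table[seg],
                                            position_table[position])]
        output.append(vec)
    return output

rng = random.Random(0)
# Toy vocabulary (assumption, for illustration only).
vocab = ["[PAD]", "[CLS]", "[SEP]", "my", "dog", "is", "hairy",
         "he", "likes", "playing"]
dim = 4
token_table = make_table(len(vocab), dim, rng)
segment_table = make_table(2, dim, rng)     # row 0: sentence A, row 1: sentence B
position_table = make_table(16, dim, rng)   # one learned row per position

tokens = ["[CLS]", "my", "dog", "is", "hairy", "[SEP]",
          "he", "likes", "playing", "[SEP]"]
token_ids = [vocab.index(t) for t in tokens]
segment_ids = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
embeddings = embed(token_ids, segment_ids,
                   token_table, segment_table, position_table)
```

All three tables would be trained jointly with the rest of the model; here they are only randomly initialised so that the shapes and the element-wise sum are visible.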


2.2.1. Pre-training: MLM

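The Masked Language Model randomly replaces some words with [MASK] and, during training, predicts the masked words from their context. Reference [1] gives the following example: for “my dog is hairy”, suppose the randomly selected word is “hairy”; the sample becomes “my dog is [MASK]”, and training aims to make the BERT model predict that “[MASK]” here is “hairy”. Each word is selected with probability $15\%$. The selected $15\%$ of words are then handled in one of three ways:

  • $80\%$ of the selected words are replaced with [MASK], e.g. “my dog is [MASK]”
  • $10\%$ of the selected words are replaced with a random word, e.g. apple: “my dog is apple”
  • $10\%$ of the selected words are left unchanged: “my dog is hairy”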
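The masking scheme above can be sketched as follows. This is a minimal illustration, not the reference implementation: it selects each token independently with probability 15%, and the function name `mask_tokens`, the `select_prob` parameter and the toy vocabulary are assumptions made for the example.

```python
import random

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Apply the 15% selection and 80/10/10 replacement; return (inputs, labels)."""
    inputs, labels = [], []
    for token in tokens:
        # Special tokens are never selected; other tokens are selected with select_prob.
        if token in ("[CLS]", "[SEP]") or rng.random() >= select_prob:
            inputs.append(token)
            labels.append(None)       # not selected: nothing to predict here
            continue
        labels.append(token)          # selected: the model must recover the original
        roll = rng.random()
        if roll < 0.8:
            inputs.append("[MASK]")            # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs.append(rng.choice(vocab))   # 10%: replace with a random word
        else:
            inputs.append(token)               # 10%: keep the original word
    return inputs, labels

rng = random.Random(0)
toy_vocab = ["apple", "dog", "hairy", "is", "my"]   # assumption: toy vocabulary
inputs, labels = mask_tokens(["[CLS]", "my", "dog", "is", "hairy", "[SEP]"],
                             toy_vocab, rng)
```

Keeping the label even when the word is left unchanged or replaced at random means the model cannot tell which positions it will be asked about, so it has to maintain a good representation of every input token.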


2.2.2. Pre-training: NSP

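Next Sentence Prediction aims to make the model understand the relationship between two sentences. The training input is a pair of sentences, and the BERT model must judge whether the second sentence actually follows the first. The Segment Embeddings in the input mark which sentence each token belongs to. When selecting training data for an input pair A and B, B is the actual next sentence of A with probability 50%.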
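Building NSP training pairs could look roughly like the sketch below. It assumes a corpus given as a list of documents, each a list of at least two sentences, with at least two documents, and it draws negative examples from a different document. The function names `make_nsp_example` and `pack`, the whitespace tokenisation and the toy corpus are illustrative assumptions.

```python
import random

def make_nsp_example(documents, rng):
    """Return (sentence_a, sentence_b, is_next) sampled from the documents."""
    doc_index = rng.randrange(len(documents))
    doc = documents[doc_index]
    i = rng.randrange(len(doc) - 1)          # leave room for a following sentence
    sentence_a = doc[i]
    if rng.random() < 0.5:
        return sentence_a, doc[i + 1], True  # positive: B really follows A
    # Negative: B is a random sentence taken from a different document.
    others = [j for j in range(len(documents)) if j != doc_index]
    sentence_b = rng.choice(documents[rng.choice(others)])
    return sentence_a, sentence_b, False

def pack(sentence_a, sentence_b):
    """Join the pair as [CLS] A [SEP] B [SEP] and build the segment ids."""
    a, b = sentence_a.split(), sentence_b.split()
    tokens = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]
    segment_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    return tokens, segment_ids

# Toy corpus (assumption, for illustration only).
documents = [
    ["the man went to the store", "he bought a gallon of milk", "then he went home"],
    ["penguins are flightless birds", "they live in the southern hemisphere"],
]
rng = random.Random(0)
sentence_a, sentence_b, is_next = make_nsp_example(documents, rng)
tokens, segment_ids = pack(sentence_a, sentence_b)
```

The `is_next` flag is the binary label for the NSP task, and `segment_ids` is what indexes the Segment Embeddings.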


2.3. BERT Network Structure




```python
import tensorflow as tf  # TensorFlow 1.x API (tf.variable_scope, tf.layers)

# gelu, get_shape_list, reshape_to_matrix, reshape_from_matrix, attention_layer,
# create_initializer, dropout and layer_norm are helper functions assumed to be
# defined elsewhere in the same module.


def transformer_model(input_tensor,
                      attention_mask=None,
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False):
  """Multi-headed, multi-layer Transformer from "Attention is All You Need".

  This is almost an exact implementation of the original Transformer encoder.

  See the original paper:
  https://arxiv.org/abs/1706.03762

  Also see:
  https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
    attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
      seq_length], with 1 for positions that can be attended to and 0 in
      positions that should not be.
    hidden_size: int. Hidden size of the Transformer.
    num_hidden_layers: int. Number of layers (blocks) in the Transformer.
    num_attention_heads: int. Number of attention heads in the Transformer.
    intermediate_size: int. The size of the "intermediate" (a.k.a., feed
      forward) layer.
    intermediate_act_fn: function. The non-linear activation function to apply
      to the output of the intermediate/feed-forward layer.
    hidden_dropout_prob: float. Dropout probability for the hidden layers.
    attention_probs_dropout_prob: float. Dropout probability of the attention
      probabilities.
    initializer_range: float. Range of the initializer (stddev of truncated
      normal).
    do_return_all_layers: Whether to also return all layers or just the final
      layer.

  Returns:
    float Tensor of shape [batch_size, seq_length, hidden_size], the final
    hidden layer of the Transformer.

  Raises:
    ValueError: A Tensor shape or parameter is invalid.
  """
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)  # size of each self-attention head
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]  # batch size
  seq_length = input_shape[1]  # sentence length
  input_width = input_shape[2]

  # The Transformer performs sum residuals on all layers so the input needs
  # to be the same as the hidden size.
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  # We keep the representation as a 2D tensor to avoid re-shaping it back and
  # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
  # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
  # help the optimizer.
  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
      layer_input = prev_output

      with tf.variable_scope("attention"):  # attention computation
        attention_heads = []
        with tf.variable_scope("self"):
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)  # multi-head attention output

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # In the case where we have other sequences, we just concatenate
          # them to the self-attention head before the projection.
          attention_output = tf.concat(attention_heads, axis=-1)  # concatenate the outputs

        # Run a linear projection of `hidden_size` then add a residual
        # with `layer_input`.
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)  # dropout
          attention_output = layer_norm(attention_output + layer_input)  # residual + layer norm

      # The activation is only applied to the "intermediate" hidden layer.
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      # Down-project back to `hidden_size` then add the residual.
      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)
        prev_output = layer_output
        all_layer_outputs.append(layer_output)

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output
```

2.3.1. BERT Is a Bidirectional Transformer



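As the figure above shows, the only difference lies in the Multi-Head Attention part (the red box in the figure): BERT uses Multi-Head Attention, while GPT uses Masked Multi-Head Attention. Masked Multi-Head Attention is what the Decoder of a generative model uses: at time $t$, the word at time $t$ is predicted from the words at time $t-1$ and earlier, while the words at time $t$ and later are not visible. Masked Multi-Head Attention is therefore a unidirectional model, and because generation proceeds one step at a time, it is harder to parallelize.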
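The difference between the two attention variants comes down to which positions each row of the attention matrix may look at. The sketch below is a simplified single-head illustration in plain Python: the function name `attention_weights` and the all-zero toy scores are assumptions. In the causal case, position $i$ attends to itself and to earlier positions only, and its output would be used to predict the next word.

```python
import math

def attention_weights(scores, causal=False):
    """Row-wise softmax over a square score matrix.

    With causal=True, position i may only attend to positions j <= i
    (masked attention); otherwise every position sees every other position.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = [j for j in range(len(row)) if not causal or j <= i]
        peak = max(row[j] for j in visible)               # for numerical stability
        exps = {j: math.exp(row[j] - peak) for j in visible}
        total = sum(exps.values())
        weights.append([exps[j] / total if j in exps else 0.0
                        for j in range(len(row))])
    return weights

# Toy scores: every pair of positions is equally similar.
scores = [[0.0] * 4 for _ in range(4)]
bidirectional = attention_weights(scores)                 # BERT-style
unidirectional = attention_weights(scores, causal=True)   # GPT-style
```

With uniform scores, every row of `bidirectional` spreads its weight over all four positions, while row $i$ of `unidirectional` spreads it only over positions $0$ to $i$ and puts zero weight on later ones.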

For Multi-Head Attention, the computation is shown in the figure below:





2.3.2. Fine-Tuning


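  • Sequence labeling, such as Chinese word segmentation, part-of-speech tagging and named entity recognition (the model must assign a class to every word in the sentence based on its context)
  • Classification, such as text classification and sentiment analysis (a single class is given for the whole input)
  • Sentence-pair relationship judgment, such as QA and paraphrase detection (given two sentences, the model judges whether they hold a certain semantic relationship)
  • Generation, such as machine translation, text summarization, poetry and sentence writing, and image captioning (given the input, the model must generate another piece of text on its own)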






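For classification tasks, the final hidden state corresponding to [CLS] is taken as $C\in \mathbb{R}^H$. In the fine-tuning stage, a weight matrix $W\in \mathbb{R}^{K\times H}$ is added, where $K$ is the number of classes. The final output probabilities are then obtained through the Softmax function.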
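The classification head can be sketched as a single matrix-vector product followed by a softmax. The function name `classify` and the toy values of the [CLS] vector and of $W$ are assumptions chosen only to show the shapes ($H = 3$, $K = 2$).

```python
import math

def classify(c, w):
    """Return softmax(W C), where c has length H and w has K rows of length H."""
    logits = [sum(w_kh * c_h for w_kh, c_h in zip(row, c)) for row in w]
    peak = max(logits)                       # subtract the max for stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

c = [0.5, -1.0, 2.0]                 # toy [CLS] vector, H = 3
w = [[1.0, 0.0, 0.0],                # toy weights, K = 2 classes
     [0.0, 0.0, 1.0]]
probs = classify(c, w)               # one probability per class
```

During fine-tuning, $W$ is learned together with the pre-trained BERT parameters, so only this small matrix is new for each classification task.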






3. Summary



