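One of the most common problems in natural language processing (NLP) projects is a lack of labeled data. Labeling data is expensive and time-consuming. Data augmentation techniques help us build better models by expanding the training data, which reduces overfitting and makes models more robust. In this article, I will show how to use the Transformers library and pretrained models such as BERT, GPT-2, and T5 to easily augment text data. I also want to mention an interesting paper by Google researchers on Unsupervised Data Augmentation (UDA). They showed that with only 20 labeled examples, and data augmentation combined with other techniques, their model outperformed the previous state of the art on the IMDB dataset; the same technique also showed good results on image classification tasks (arXiv:1904.12848).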

Back Translation

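This is the technique I find most interesting. We first use a model to translate the sentence into a different language and then translate it back into the original language. Because we use ML models for this, the result is a sentence that has the same meaning as the original but uses different words. The Hugging Face model hub offers a variety of pretrained models, such as Google T5 and Facebook NMT (neural machine translation). In the code below, I use T5-base for English-to-German translation and then a Bert2Bert model for German-to-English translation. We could also use the Fairseq models, which are available for both English-to-German and German-to-English.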

 from transformers import pipeline
 from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
 
 # English to German using the pipeline and T5
 translator_en_to_de = pipeline("translation_en_to_de", model='t5-base')
 
 # German to English using the Bert2Bert model
 tokenizer = AutoTokenizer.from_pretrained("google/bert2bert_L-24_wmt_de_en", pad_token="<pad>", eos_token="</s>", bos_token="<s>")
 model_de_to_en = AutoModelForSeq2SeqLM.from_pretrained("google/bert2bert_L-24_wmt_de_en")
 
 input_text = "I went to see a movie in the theater"
 en_to_de_output = translator_en_to_de(input_text)
 translated_text = en_to_de_output[0]['translation_text']
 print("Translated text->", translated_text)
 # Ich ging ins Kino, um einen Film zu sehen.
 
 # Translate the German sentence back to English
 input_ids = tokenizer(translated_text, return_tensors="pt", add_special_tokens=False).input_ids
 output_ids = model_de_to_en.generate(input_ids)[0]
 augmented_text = tokenizer.decode(output_ids, skip_special_tokens=True)
 print("Augmented text->", augmented_text)
 # I went to the cinema to see a film.

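We can see that for the input text "I went to see a movie in the theater", the output text is "I went to the cinema to see a film", which conveys the same meaning but uses different words in a different order. We can also pivot through other languages (for example, English to French and back) to create more variations.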
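To create variations through several pivot languages, the round trip can be wrapped in a small helper. The following is a minimal sketch rather than the article's own code: `back_translate` and `augment_with_pivots` are hypothetical names, and the translators are passed in as plain callables so that any model can be plugged in. The lookup tables are toy stand-ins for real translation models.

```python
from typing import Callable, Dict, List, Tuple

Translator = Callable[[str], str]


def back_translate(text: str, forward: Translator, backward: Translator) -> str:
    """Translate text into a pivot language and back again."""
    return backward(forward(text))


def augment_with_pivots(
    text: str, pivots: Dict[str, Tuple[Translator, Translator]]
) -> List[str]:
    """Back-translate through every pivot language and keep the unique
    results that differ from the original sentence."""
    variants: List[str] = []
    for forward, backward in pivots.values():
        candidate = back_translate(text, forward, backward)
        if candidate != text and candidate not in variants:
            variants.append(candidate)
    return variants


# Toy stand-ins for real translation models (assumption: lookup tables only).
EN_DE = {"I went to see a movie in the theater": "Ich ging ins Kino, um einen Film zu sehen."}
DE_EN = {"Ich ging ins Kino, um einen Film zu sehen.": "I went to the cinema to see a film."}

pivots = {
    "de": (lambda s: EN_DE.get(s, s), lambda s: DE_EN.get(s, s)),
}
variants = augment_with_pivots("I went to see a movie in the theater", pivots)
print(variants)
```

With real models, each entry in `pivots` would wrap one forward and one backward translator, and round trips that return the sentence unchanged are dropped automatically.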

Random Insertion

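In this technique, we insert a word at a random position in the given sentence. One approach is to insert an arbitrary random word, but we can also use a pretrained model such as BERT to insert a word that fits the context. Here we use the "fill-mask" task of the Transformers pipeline to insert a word.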

 from transformers import pipeline
 import random
 
 unmasker = pipeline('fill-mask', model='bert-base-cased')
 
 input_text = "I went to see a movie in the theater"
 
 orig_text_list = input_text.split()
 len_input = len(orig_text_list)
 # Random index where we want to insert the word, excluding the start and end
 rand_idx = random.randint(1, len_input - 2)
 
 # Insert the mask token at the chosen position
 new_text_list = orig_text_list[:rand_idx] + ['[MASK]'] + orig_text_list[rand_idx:]
 new_mask_sent = ' '.join(new_text_list)
 print("Masked sentence->", new_mask_sent)
 # I went to see a [MASK] movie in the theater
 
 # The pipeline returns candidates sorted by score; take the best one
 augmented_text_list = unmasker(new_mask_sent)
 augmented_text = augmented_text_list[0]['sequence']
 print("Augmented text->", augmented_text)
 # I went to see a new movie in the theater

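We can see that for the input text "I went to see a movie in the theater", the BERT model inserts the word "new" at a random position to create the new sentence "I went to see a new movie in the theater"; the pipeline actually returns 5 different candidates. Because the index is chosen at random, the word is inserted at a different place on each run. Afterwards, we can use a similarity measure, for example with the Universal Sentence Encoder, to select the candidate most similar to the original sentence.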
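The selection step can be sketched as follows. This is only an illustration under stated assumptions: `most_similar` is a hypothetical helper, and `bag_of_words` is a crude stand-in for a real sentence encoder such as the Universal Sentence Encoder, used here so that the example runs without downloading a model. In practice the `embed` argument would be replaced by a real embedding function.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def bag_of_words(sentence: str) -> Vector:
    """Stand-in embedding: lowercase word counts (NOT a real sentence encoder)."""
    return dict(Counter(sentence.lower().split()))


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    original: str,
    candidates: List[str],
    embed: Callable[[str], Vector] = bag_of_words,
) -> str:
    """Return the candidate whose embedding is closest to the original."""
    reference = embed(original)
    return max(candidates, key=lambda c: cosine_similarity(reference, embed(c)))


original = "I went to see a movie in the theater"
candidates = [
    "We saw a film",
    "I went to see a new movie in the theater",
]
best = most_similar(original, candidates)
print(best)
```

Word counts cannot distinguish between candidates that each insert one different word, which is why a learned sentence embedding is the better choice for real data.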

Random Replacement

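In this technique, we replace a random word with a new one. We can use a prebuilt dictionary to substitute synonyms, or we can use a pretrained model such as BERT. Here we again use the "fill-mask" pipeline.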

 from transformers import pipeline
 import random
 
 unmasker = pipeline('fill-mask', model='bert-base-cased')
 
 input_text = "I went to see a movie in the theater"
 
 orig_text_list = input_text.split()
 len_input = len(orig_text_list)
 # Random index of the word we want to replace (never the first word)
 rand_idx = random.randint(1, len_input - 1)
 # Remember the original word so we can reject identical predictions
 orig_word = orig_text_list[rand_idx]
 
 new_text_list = orig_text_list.copy()
 new_text_list[rand_idx] = '[MASK]'
 new_mask_sent = ' '.join(new_text_list)
 print("Masked sentence->", new_mask_sent)
 # I went to [MASK] a movie in the theater
 
 augmented_text_list = unmasker(new_mask_sent)
 # Fall back to the input if every prediction equals the original word
 augmented_text = input_text
 # Ensure the new word and the old word are not the same
 for res in augmented_text_list:
   if res['token_str'] != orig_word:
     augmented_text = res['sequence']
     break
 print("Augmented text->", augmented_text)
 # I went to watch a movie in the theater

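In the example run above, the word "see" was chosen at random and BERT replaced it with "watch", producing the sentence "I went to watch a movie in the theater", which has the same meaning but different wording. We can also replace multiple words using the same technique. For both random insertion and replacement, we can use other models that support the "fill-mask" task, such as DistilBERT (small and fast), RoBERTa, or even multilingual models. Note that the mask token differs between models (RoBERTa, for example, uses "<mask>" rather than "[MASK]").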
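Replacing several words can be sketched by masking one position at a time. This is a minimal sketch, not the author's code: `replace_words` is a hypothetical helper, and `toy_fill_mask` is a stand-in that only mimics the shape of the fill-mask pipeline output (a list of dicts with a `token_str` key), so the example runs without a model.

```python
import random
from typing import Any, Callable, Dict, List

FillMask = Callable[[str], List[Dict[str, Any]]]


def replace_words(text: str, fill_mask: FillMask, num_words: int, rng: random.Random) -> str:
    """Replace up to `num_words` randomly chosen words, one mask at a time."""
    words = text.split()
    # Never pick the first word, mirroring the single-word example.
    k = max(0, min(num_words, len(words) - 1))
    positions = rng.sample(range(1, len(words)), k=k)
    for idx in positions:
        original_word = words[idx]
        masked = words.copy()
        masked[idx] = "[MASK]"
        for prediction in fill_mask(" ".join(masked)):
            # Skip predictions that merely restore the original word.
            if prediction["token_str"] != original_word:
                words[idx] = prediction["token_str"]
                break
    return " ".join(words)


def toy_fill_mask(masked_sentence: str) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for a fill-mask pipeline: always proposes
    'see' first and 'watch' second, whatever the context."""
    return [{"token_str": "see"}, {"token_str": "watch"}]


source = "I went to see a movie in the theater"
augmented = replace_words(source, toy_fill_mask, num_words=2, rng=random.Random(0))
print(augmented)
```

Because each replacement is predicted with the earlier replacements already in place, later predictions are conditioned on the modified sentence. With a real BERT pipeline, `token_str` can also be a subword piece, so a production version would likely need to filter such predictions.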

Text Generation

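In this technique, we use generative models such as GPT-2 or DistilGPT2 to make the sentence longer. We feed the original text as the prompt, and the model generates additional words based on it; in this way we add random noise to the sentence. If we add only a few words and use a similarity measure to make sure the result stays close to the original, we can generate additional sentences without changing the meaning.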

 from transformers import pipeline
 generator = pipeline('text-generation', model='gpt2')
 
 input_text = "I went to see a movie in the theater"
 input_length = len(input_text.split())
 num_new_words = 5
 # Note: max_length is measured in tokens, so the word count is only approximate
 output_length = input_length + num_new_words
 gpt_output = generator(input_text, max_length=output_length, num_return_sequences=5)
 # Take the first of the 5 generated sequences
 augmented_text = gpt_output[0]['generated_text']
 print("Augmented text->", augmented_text)
 # I went to see a movie in the theater, and the director was

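Here we use the "text-generation" pipeline and the GPT-2 model to add 5 new words to our original sentence, which gives a new sentence such as "I went to see a movie in the theater, and the director was". If we decide to add 10 new words instead, we get a correspondingly longer continuation of the same sentence. So, depending on our use case, we can generate many sentences of different lengths.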
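Since the generator's length limit counts tokens rather than words, one option is to generate a little extra and trim afterwards. The sketch below is an assumption-laden illustration: `trim_to_new_words` and `collect_variants` are hypothetical helpers, and `sample_outputs` is hand-written data that only mimics the shape of the text-generation pipeline output (a list of dicts with a `generated_text` key).

```python
from typing import Dict, List


def trim_to_new_words(original: str, generated: str, num_new_words: int) -> str:
    """Keep the original words plus at most `num_new_words` extra words.
    Words are split on whitespace, so runs of whitespace are normalized."""
    keep = len(original.split()) + num_new_words
    return " ".join(generated.split()[:keep])


def collect_variants(
    original: str, outputs: List[Dict[str, str]], num_new_words: int
) -> List[str]:
    """Trim every generated sequence and drop duplicates and non-extensions."""
    normalized_original = " ".join(original.split())
    variants: List[str] = []
    for output in outputs:
        candidate = trim_to_new_words(original, output["generated_text"], num_new_words)
        if candidate != normalized_original and candidate not in variants:
            variants.append(candidate)
    return variants


input_text = "I went to see a movie in the theater"
# Hand-written sample data (assumption), shaped like the pipeline output.
sample_outputs = [
    {"generated_text": "I went to see a movie in the theater, and the director was there"},
    {"generated_text": "I went to see a movie in the theater, and the director left early"},
    {"generated_text": "I went to see a movie in the theater"},
]
variants = collect_variants(input_text, sample_outputs, num_new_words=3)
print(variants)
```

Trimming by whitespace keeps the number of added words predictable, at the cost of occasionally cutting a sentence mid-phrase, which is acceptable when the extra words only serve as noise.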

Author: Manu Suryavansh

https://towardsdatascience.com/nlp-data-augmentation-using-transformers-89a44a993bab
