

Fine-tuning a Transformer for Chinese-to-English Translation

This post explains how to fine-tune a Chinese-to-English translation model with the Transformers library and PyTorch. We take the open-source opus-mt-zh-en model and fine-tune it to improve its performance on a specific corpus. I am a beginner, so corrections are welcome if anything here is wrong.

1. Data Preparation

We use the translation2019zh corpus as the dataset. It contains 5.2 million Chinese-English parallel sentence pairs and is suitable for training Chinese-English translation models. The dataset only provides a train/validation split, with 5.16 million and 39,000 samples respectively. The corpus is distributed in JSON format, with one Chinese-English sentence pair per line:

    {'english': 'The fact that foreign brands dominate the China market rankles with central government economic planners, and could embolden local officials to take action, analysts say.', 'chinese': '分析人士指出,外国品牌主导中国市场的现实使中央政府的经济规划者们感到愤怒,并有可能使地方官员大胆采取行动。'}
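To get a quick feel for the format, the first pair can be printed with a couple of lines (the file path is a placeholder; adjust it to wherever the corpus is stored):

    import json

    # Peek at the first line of the corpus
    with open('translation2019zh_train.json', 'rt', encoding='utf-8') as f:
        first_pair = json.loads(next(f).strip())
    print(first_pair['chinese'], '=>', first_pair['english'])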

1.1 Building the Dataset

Since the corpus does not come with a test set, and training on more than five million samples would take too long, we only take the first 330,000 samples of the training file and carve 30,000 of them out as the validation set; the original translation2019zh validation set is then used as the test set. This is implemented with PyTorch's Dataset:

    from torch.utils.data import Dataset, random_split
    import json

    max_dataset_size = 330000
    train_set_size = 300000
    valid_set_size = 30000

    class TRANS(Dataset):
        def __init__(self, data_file):
            self.data = self.load_data(data_file)

        def load_data(self, data_file):
            Data = {}
            with open(data_file, 'rt', encoding='utf-8') as f:
                for idx, line in enumerate(f):
                    if idx >= max_dataset_size:
                        break
                    sample = json.loads(line.strip())
                    Data[idx] = sample
            return Data

        def __len__(self):
            return len(self.data)

        def __getitem__(self, idx):
            return self.data[idx]

    # Replace these paths with the actual locations of the data files
    data = TRANS('/mnt/workspace/translation/data/translation2019zh_train.json')
    train_data, valid_data = random_split(data, [train_set_size, valid_set_size])
    test_data = TRANS('/mnt/workspace/translation/data/translation2019zh_valid.json')
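As a quick sanity check (assuming the paths above point at the downloaded files), the three splits can be inspected like this:

    # Verify the split sizes and look at one raw sample
    print(len(train_data), len(valid_data), len(test_data))   # expect 300000, 30000 and ~39000
    print(train_data[0])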

1.2 Data Preprocessing

During training we first split the dataset into many mini-batches, feed the samples to the model batch by batch, and repeat this process; one complete pass over all samples is called an epoch. This is exactly what PyTorch's DataLoader does. We also need to convert the text into token IDs that the model can accept, and for a translation task the tokenizer has to encode both the source text and the target text.

The loss is computed between the label sequence predicted by the model and the reference label sequence, so we also need to set the padded label positions to -100, the index that the cross-entropy loss inside Transformers models ignores by default, so that they are skipped when the sequence loss is computed:

    import torch
    from transformers import AutoTokenizer

    # Hugging Face checkpoint (requires access to huggingface.co)
    # model_checkpoint = "Helsinki-NLP/opus-mt-zh-en"
    # ModelScope mirror, directly accessible from mainland China
    model_checkpoint = "moxying/opus-mt-zh-en"
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

    max_input_length = 128
    max_target_length = 128

    # Take 4 samples as a small demo batch
    inputs = [train_data[s_idx]["chinese"] for s_idx in range(4)]
    targets = [train_data[s_idx]["english"] for s_idx in range(4)]

    model_inputs = tokenizer(
        inputs,
        padding=True,
        max_length=max_input_length,
        truncation=True,
        return_tensors="pt"
    )
    # By default the tokenizer encodes text with the source-language settings.
    # To encode the target language we use the as_target_tokenizer() context manager,
    # otherwise the Chinese tokenizer may fail to recognize most English words.
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            targets,
            padding=True,
            max_length=max_target_length,
            truncation=True,
            return_tensors="pt"
        )["input_ids"]

    end_token_index = torch.where(labels == tokenizer.eos_token_id)[1]
    for idx, end_idx in enumerate(end_token_index):
        labels[idx][end_idx+1:] = -100

The model we use appends the special token '</s>' to the end of the tokenized sequence, so we use tokenizer.eos_token_id to locate its index in the token ID sequence and then set every pad position after it to -100.
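It can help to print the encoded tensors to confirm that the masking did what we expect (the shapes depend on the longest sentence in this small demo batch):

    # Inspect the demo batch built above
    print(model_inputs['input_ids'].shape)   # (4, source_sequence_length)
    print(labels.shape)                      # (4, target_sequence_length)
    print(labels[0])                         # trailing pad positions are now -100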

A translation model is a typical Encoder-Decoder (also called Seq2Seq) architecture: the Encoder encodes the input sequence, and the Decoder generates the output tokens one by one in a loop. For every sample we therefore also need to prepare decoder input IDs as the Decoder's input. The decoder input IDs are a shifted version of the label sequence, with a special "sequence start token" added at the beginning; a minimal sketch of the shift is shown below.
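To make the shift concrete, here is a minimal hand-rolled sketch of the operation. The helper name shift_right is ours and purely illustrative; it mirrors the usual logic of prepending the decoder start token, dropping the last label, and mapping the -100 loss-mask positions back to real pad tokens.

    import torch

    def shift_right(labels: torch.Tensor, pad_token_id: int, decoder_start_token_id: int) -> torch.Tensor:
        # Illustrative sketch of the label shift used to build decoder input IDs
        shifted = labels.new_zeros(labels.shape)
        shifted[:, 1:] = labels[:, :-1].clone()   # the gold token at step t-1 becomes the input at step t
        shifted[:, 0] = decoder_start_token_id    # sequence start token at position 0
        # -100 is only a loss mask, not a real token ID, so map it back to the pad token
        shifted.masked_fill_(shifted == -100, pad_token_id)
        return shifted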

Since the exact shift operation may differ from model to model, we rely on the model's own prepare_decoder_input_ids_from_labels function to do it. The complete collate function is:

    import torch
    from torch.utils.data import DataLoader
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # Hugging Face checkpoint (requires access to huggingface.co)
    # model_checkpoint = "Helsinki-NLP/opus-mt-zh-en"
    # ModelScope mirror, directly accessible from mainland China
    model_checkpoint = "moxying/opus-mt-zh-en"
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

    max_input_length = 128
    max_target_length = 128

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f'Using {device} device')

    model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
    model = model.to(device)

    def collate_fn(batch_samples):
        batch_inputs, batch_targets = [], []
        for sample in batch_samples:
            batch_inputs.append(sample['chinese'])
            batch_targets.append(sample['english'])
        batch_data = tokenizer(
            batch_inputs,
            padding=True,
            max_length=max_input_length,
            truncation=True,
            return_tensors="pt"
        )
        with tokenizer.as_target_tokenizer():
            labels = tokenizer(
                batch_targets,
                padding=True,
                max_length=max_target_length,
                truncation=True,
                return_tensors="pt"
            )["input_ids"]
        batch_data['decoder_input_ids'] = model.prepare_decoder_input_ids_from_labels(labels)
        end_token_index = torch.where(labels == tokenizer.eos_token_id)[1]
        for idx, end_idx in enumerate(end_token_index):
            labels[idx][end_idx+1:] = -100
        batch_data['labels'] = labels
        return batch_data

    train_dataloader = DataLoader(train_data, batch_size=32, shuffle=True, collate_fn=collate_fn)
    valid_dataloader = DataLoader(valid_data, batch_size=32, shuffle=False, collate_fn=collate_fn)
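A quick way to confirm that the collate function behaves as expected is to pull a single batch from the dataloader and look at its fields:

    # Inspect one batch produced by collate_fn (shapes vary with padding)
    batch = next(iter(train_dataloader))
    print(batch.keys())   # input_ids, attention_mask, decoder_input_ids, labels
    print(batch['input_ids'].shape, batch['decoder_input_ids'].shape, batch['labels'].shape)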

2. Fine-tuning the Model

Above we built the model directly with the AutoModelForSeq2SeqLM class from the Transformers library, and the collate function already calls the model's prepare_decoder_input_ids_from_labels function, so all that remains is to implement the per-epoch "training loop" and "validation/test loop".

A model constructed with AutoModelForSeq2SeqLM already wraps the appropriate loss function, and the computed loss is included directly in the model output outputs, where it can be read via outputs.loss. The training loop is therefore:

    from tqdm.auto import tqdm

    def train_loop(dataloader, model, optimizer, lr_scheduler, epoch, total_loss):
        progress_bar = tqdm(range(len(dataloader)))
        progress_bar.set_description(f'loss: {0:>7f}')
        finish_batch_num = (epoch - 1) * len(dataloader)

        model.train()
        for batch, batch_data in enumerate(dataloader, start=1):
            batch_data = batch_data.to(device)
            outputs = model(**batch_data)
            loss = outputs.loss

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            lr_scheduler.step()

            total_loss += loss.item()
            progress_bar.set_description(f'loss: {total_loss/(finish_batch_num + batch):>7f}')
            progress_bar.update(1)
        return total_loss
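Before launching full training, a single forward pass over one batch is a cheap sanity check that the loss really is returned by the model (this assumes the model and train_dataloader defined in the previous section):

    # One-off check: the loss is computed inside the model from the `labels` field
    batch = next(iter(train_dataloader)).to(device)
    outputs = model(**batch)
    print(outputs.loss)           # scalar tensor
    print(outputs.logits.shape)   # (batch_size, target_length, vocab_size)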

The validation/test loop evaluates model performance. For translation, the classic metric is BLEU, which measures how closely two token sequences match, but it does not capture semantic coherence or grammatical correctness.

Since computing BLEU requires pre-tokenized text, and different tokenization schemes affect the result, the more common metric nowadays is SacreBLEU, which standardizes the tokenization step. SacreBLEU takes untokenized text directly as input and can accept multiple references for the same input. Although each sentence in our translation2019zh corpus has only one reference, it still has to be wrapped in a list.
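As a tiny illustration of the metric itself (completely separate from the training code, with made-up sentences), SacreBLEU can also score a single hypothesis against one or more references:

    from sacrebleu.metrics import BLEU

    # Illustrative only: sentence-level BLEU; effective_order=True is recommended for short segments
    toy_bleu = BLEU(effective_order=True)
    result = toy_bleu.sentence_score(
        "The cat sat on the mat.",
        ["The cat sat on the mat.", "There is a cat on the mat."],
    )
    print(result.score)   # an exact match with a reference scores 100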

In the validation/test loop we first obtain the predictions with model.generate(), then convert both the predictions and the gold labels into the text-list form that SacreBLEU accepts (replacing the -100 values in the label sequences with the pad token ID so that the tokenizer can decode them), and finally pass them to SacreBLEU to compute the BLEU score:

    import numpy as np
    from sacrebleu.metrics import BLEU

    bleu = BLEU()

    def test_loop(dataloader, model):
        preds, labels = [], []

        model.eval()
        for batch_data in tqdm(dataloader):
            batch_data = batch_data.to(device)
            with torch.no_grad():
                generated_tokens = model.generate(
                    batch_data["input_ids"],
                    attention_mask=batch_data["attention_mask"],
                    max_length=max_target_length,
                ).cpu().numpy()
            label_tokens = batch_data["labels"].cpu().numpy()

            decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
            label_tokens = np.where(label_tokens != -100, label_tokens, tokenizer.pad_token_id)
            decoded_labels = tokenizer.batch_decode(label_tokens, skip_special_tokens=True)

            preds += [pred.strip() for pred in decoded_preds]
            labels += [[label.strip()] for label in decoded_labels]
        bleu_score = bleu.corpus_score(preds, labels).score
        print(f"BLEU: {bleu_score:>0.2f}\n")
        return bleu_score
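With test_loop defined, it can also be run once on the validation set before any fine-tuning to record the pretrained checkpoint's baseline (the final section of this post quotes a baseline of roughly 42 BLEU):

    # Optional: baseline BLEU of the unmodified opus-mt-zh-en checkpoint
    baseline_bleu = test_loop(valid_dataloader, model)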

3. Saving the Model

We tune the hyperparameters and select the best model weights according to performance on the validation set, and then apply the selected model to the test set to evaluate final performance. We keep using the AdamW optimizer and define the learning-rate scheduler with the get_scheduler() function:

    from transformers import AdamW, get_scheduler
    # Note: AdamW in transformers is deprecated; on newer versions torch.optim.AdamW can be used instead.

    learning_rate = 2e-5
    epoch_num = 3

    optimizer = AdamW(model.parameters(), lr=learning_rate)
    lr_scheduler = get_scheduler(
        "linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=epoch_num*len(train_dataloader),
    )

    total_loss = 0.
    best_bleu = 0.
    for t in range(epoch_num):
        print(f"Epoch {t+1}/{epoch_num}\n-------------------------------")
        total_loss = train_loop(train_dataloader, model, optimizer, lr_scheduler, t+1, total_loss)
        valid_bleu = test_loop(valid_dataloader, model)
        if valid_bleu > best_bleu:
            best_bleu = valid_bleu
            print('saving new weights...\n')
            torch.save(model.state_dict(), f'epoch_{t+1}_valid_bleu_{valid_bleu:0.2f}_model_weights.bin')
    print("Done!")

4. Complete Code

    import random
    import os
    import numpy as np
    import torch
    from torch.utils.data import Dataset, DataLoader, random_split
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    from transformers import AdamW, get_scheduler
    from sacrebleu.metrics import BLEU
    from tqdm.auto import tqdm
    import json
    from modelscope import snapshot_download

    def seed_everything(seed=1029):
        random.seed(seed)
        os.environ['PYTHONHASHSEED'] = str(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f'Using {device} device')
    seed_everything(42)

    max_dataset_size = 220000
    train_set_size = 200000
    valid_set_size = 20000
    max_input_length = 128
    max_target_length = 128
    batch_size = 32
    learning_rate = 1e-5
    epoch_num = 3

    class TRANS(Dataset):
        def __init__(self, data_file):
            self.data = self.load_data(data_file)

        def load_data(self, data_file):
            Data = {}
            with open(data_file, 'rt', encoding='utf-8') as f:
                for idx, line in enumerate(f):
                    if idx >= max_dataset_size:
                        break
                    sample = json.loads(line.strip())
                    Data[idx] = sample
            return Data

        def __len__(self):
            return len(self.data)

        def __getitem__(self, idx):
            return self.data[idx]

    data = TRANS('/mnt/workspace/translation/data/translation2019zh_train.json')
    train_data, valid_data = random_split(data, [train_set_size, valid_set_size])
    test_data = TRANS('/mnt/workspace/translation/data/translation2019zh_valid.json')

    model_checkpoint = snapshot_download('moxying/opus-mt-zh-en')
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
    model = model.to(device)

    def collate_fn(batch_samples):
        batch_inputs, batch_targets = [], []
        for sample in batch_samples:
            batch_inputs.append(sample['chinese'])
            batch_targets.append(sample['english'])
        batch_data = tokenizer(
            batch_inputs,
            padding=True,
            max_length=max_input_length,
            truncation=True,
            return_tensors="pt"
        )
        with tokenizer.as_target_tokenizer():
            labels = tokenizer(
                batch_targets,
                padding=True,
                max_length=max_target_length,
                truncation=True,
                return_tensors="pt"
            )["input_ids"]
        batch_data['decoder_input_ids'] = model.prepare_decoder_input_ids_from_labels(labels)
        end_token_index = torch.where(labels == tokenizer.eos_token_id)[1]
        for idx, end_idx in enumerate(end_token_index):
            labels[idx][end_idx+1:] = -100
        batch_data['labels'] = labels
        return batch_data

    train_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
    valid_dataloader = DataLoader(valid_data, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)
    test_dataloader = DataLoader(test_data, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)

    def train_loop(dataloader, model, optimizer, lr_scheduler, epoch, total_loss):
        progress_bar = tqdm(range(len(dataloader)))
        progress_bar.set_description(f'loss: {0:>7f}')
        finish_batch_num = (epoch - 1) * len(dataloader)

        model.train()
        for batch, batch_data in enumerate(dataloader, start=1):
            batch_data = batch_data.to(device)
            outputs = model(**batch_data)
            loss = outputs.loss

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            lr_scheduler.step()

            total_loss += loss.item()
            progress_bar.set_description(f'loss: {total_loss/(finish_batch_num + batch):>7f}')
            progress_bar.update(1)
        return total_loss

    bleu = BLEU()

    def test_loop(dataloader, model):
        preds, labels = [], []

        model.eval()
        for batch_data in tqdm(dataloader):
            batch_data = batch_data.to(device)
            with torch.no_grad():
                generated_tokens = model.generate(
                    batch_data["input_ids"],
                    attention_mask=batch_data["attention_mask"],
                    max_length=max_target_length,
                ).cpu().numpy()
            label_tokens = batch_data["labels"].cpu().numpy()

            decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
            label_tokens = np.where(label_tokens != -100, label_tokens, tokenizer.pad_token_id)
            decoded_labels = tokenizer.batch_decode(label_tokens, skip_special_tokens=True)

            preds += [pred.strip() for pred in decoded_preds]
            labels += [[label.strip()] for label in decoded_labels]
        bleu_score = bleu.corpus_score(preds, labels).score
        print(f"BLEU: {bleu_score:>0.2f}\n")
        return bleu_score

    optimizer = AdamW(model.parameters(), lr=learning_rate)
    lr_scheduler = get_scheduler(
        "linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=epoch_num*len(train_dataloader),
    )

    total_loss = 0.
    best_bleu = 0.
    for t in range(epoch_num):
        print(f"Epoch {t+1}/{epoch_num}\n-------------------------------")
        total_loss = train_loop(train_dataloader, model, optimizer, lr_scheduler, t+1, total_loss)
        valid_bleu = test_loop(valid_dataloader, model)
        if valid_bleu > best_bleu:
            best_bleu = valid_bleu
            print('saving new weights...\n')
            torch.save(
                model.state_dict(),
                f'epoch_{t+1}_valid_bleu_{valid_bleu:0.2f}_model_weights.bin'
            )
    print("Done!")

5. Model Testing

After training is finished, we load the model weights that performed best on the validation set and report their performance on the test set. Because AutoModelForSeq2SeqLM wraps the whole decoding process, we only need to call the generate() function to automatically find the best token ID sequence via beam search, and then use the tokenizer to convert the token IDs back into text to obtain the translation:

    model.load_state_dict(torch.load('epoch_1_valid_bleu_53.38_model_weights.bin'))

    model.eval()
    with torch.no_grad():
        print('evaluating on test set...')
        sources, preds, labels = [], [], []
        for batch_data in tqdm(test_dataloader):
            batch_data = batch_data.to(device)
            generated_tokens = model.generate(
                batch_data["input_ids"],
                attention_mask=batch_data["attention_mask"],
                max_length=max_target_length,
            ).cpu().numpy()
            label_tokens = batch_data["labels"].cpu().numpy()

            decoded_sources = tokenizer.batch_decode(
                batch_data["input_ids"].cpu().numpy(),
                skip_special_tokens=True,
                use_source_tokenizer=True
            )
            decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
            label_tokens = np.where(label_tokens != -100, label_tokens, tokenizer.pad_token_id)
            decoded_labels = tokenizer.batch_decode(label_tokens, skip_special_tokens=True)

            sources += [source.strip() for source in decoded_sources]
            preds += [pred.strip() for pred in decoded_preds]
            labels += [[label.strip()] for label in decoded_labels]
        bleu_score = bleu.corpus_score(preds, labels).score
        print(f"Test BLEU: {bleu_score:>0.2f}\n")

After fine-tuning, the model reaches a BLEU score of 54.87 on the test set (it was around 42 before fine-tuning), which shows that our fine-tuning was successful.
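As a final, informal check (the input sentence below is arbitrary), the fine-tuned model can also be used to translate a single sentence directly:

    # Translate one arbitrary sentence with the fine-tuned model
    sentence = "今天天气很好,我们去公园散步吧。"
    encoded = tokenizer(sentence, return_tensors="pt").to(device)
    with torch.no_grad():
        generated = model.generate(
            encoded["input_ids"],
            attention_mask=encoded["attention_mask"],
            max_length=max_target_length,
            num_beams=4,   # beam search; 4 beams is an arbitrary choice here
        )
    print(tokenizer.decode(generated[0], skip_special_tokens=True))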


Reposted from: https://blog.csdn.net/WB231444/article/details/140737820
Copyright belongs to the original author 星辰境末. In case of infringement, please contact us for removal.
