

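From HuggingFace to the Ascend AI Computing Center in Just 5 Lines of Code

Today's Topic

Optimum Ascend

Good news for HuggingFace Transformers users: Ascend has released Optimum Ascend, a high-level migration tool. With only a few lines of code, Transformers users can train, fine-tune, and evaluate models on an Ascend AI Computing Center.

The steps are as follows:

01

Install Optimum Ascend:

Run the following two commands to install Optimum Ascend.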

git clone https://gitee.com/ascend/transformers.git -b optimum && cd transformers

pip install -e .

02

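Complete the automatic migration:

Use the automatic migration built into optimum.ascend to run Transformers on the NPU. Add the following imports at the top of the training script, then run the script as usual.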

import torch_npu

from torch_npu.contrib import transfer_to_npu

from optimum.ascend import transfor_to_npu

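As shown above, five lines in total (two commands to install, three imports) complete the migration, after which the model can be trained and evaluated.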
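When the same training script also has to run on machines without an NPU, the three imports can be guarded. The following is a minimal sketch and not part of Optimum Ascend: the helper name enable_npu_migration is hypothetical, and it assumes that importing the three modules is what activates the migration, as described above.

```python
def enable_npu_migration():
    """Try the three migration imports; return True if all of them succeed.

    Hypothetical helper for illustration. The module names are the ones
    used in this article.
    """
    try:
        import torch_npu  # noqa: F401  (Ascend adapter for PyTorch)
        from torch_npu.contrib import transfer_to_npu  # noqa: F401
        from optimum.ascend import transfor_to_npu  # noqa: F401
    except ImportError:
        # A missing package means the script stays on the stock PyTorch path.
        return False
    return True


if __name__ == "__main__":
    print("NPU migration active:", enable_npu_migration())
```

On a machine without the Ascend packages the helper simply returns False, so the rest of the script can fall back to CPU or GPU.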

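Recently, a user trained and fine-tuned T5 and CodeT5 models on an Ascend AI Computing Center, with training reported to be 10% faster than on mainstream GPUs. Compared with scattered compute resources, a computing center provides a multi-node, multi-card training environment, which improves training stability and shortens the model iteration cycle. The computing center also offers affordable, widely accessible compute, which lowers model development costs to some extent.

Case Study

Ascend processors are NPUs based on Huawei's Da Vinci architecture. Ascend training processors deliver up to 320 TFLOPS of FP16 compute and support the DeepSpeed framework.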


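1) Prepare the environment

Open a computing center account and create a notebook instance in the development environment. Select the PyTorch 1.8 image when creating it, then open the instance and go to the terminal. (Alternatively, prepare an Ascend server with the training-card driver, torch, torch_npu, and CANN 6.0.1 or later installed.) The following packages need to be imported:

import torch
import torch_npu
from torch_npu.contrib import transfer_to_npu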


2) Install the native Transformers framework (requires Transformers 4.25.1 and PyTorch 1.8.1)

git clone https://gitee.com/ascend/transformers.git -b optimum && cd transformers

pip install -e .
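Because this step pins exact versions, it can help to check the environment before training. The sketch below is an illustration only: the function name find_version_mismatches and the REQUIRED table are hypothetical, the version numbers are the ones stated above, and it assumes Python 3.8 or later for importlib.metadata. Whether the Ascend fork reports exactly "4.25.1" after an editable install is not confirmed here.

```python
from importlib import metadata

# Versions required by this article.
REQUIRED = {"transformers": "4.25.1", "torch": "1.8.1"}


def find_version_mismatches(required, lookup=metadata.version):
    """Return {package: (wanted, found)} for every package that does not match.

    `found` is None when the package is not installed. `lookup` can be
    replaced by any function mapping a package name to a version string.
    """
    mismatches = {}
    for name, wanted in required.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        # Ignore a local build tag such as "+cpu" when comparing.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for pkg, (wanted, found) in find_version_mismatches(REQUIRED).items():
        print(f"{pkg}: wanted {wanted}, found {found}")
```

An empty result means both packages match the required versions.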


3) Choose supervised training

from optimum.ascend import transfor_to_npu

from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-small")

tokenizer = T5Tokenizer.from_pretrained("t5-small")

input_ids = tokenizer('translate English to German: The house is wonderful.', return_tensors='pt').input_ids

labels = tokenizer('Das Haus ist wunderbar.', return_tensors='pt').input_ids

# the forward function automatically creates the correct decoder_input_ids
loss = model(input_ids=input_ids, labels=labels).loss
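The note about decoder_input_ids refers to a right shift of the labels: T5 prepends the decoder start token (the pad token, id 0) and drops the last label, and any label equal to -100 is replaced by the pad id. The plain-Python sketch below illustrates the idea on nested lists rather than tensors; the function name shift_right and the example token ids are made up for illustration.

```python
def shift_right(labels, decoder_start_token_id=0, pad_token_id=0):
    """Build decoder inputs from labels by shifting each row one step right."""
    shifted = []
    for row in labels:
        # Prepend the start token and drop the final label.
        new_row = [decoder_start_token_id] + list(row[:-1])
        # -100 marks positions ignored by the loss; it is not a valid token id.
        new_row = [pad_token_id if tok == -100 else tok for tok in new_row]
        shifted.append(new_row)
    return shifted


if __name__ == "__main__":
    example_labels = [[644, 4598, 229, 1]]  # hypothetical ids, 1 = end of sequence
    print(shift_right(example_labels))
```

At each position the decoder therefore sees the previous target token and is trained to predict the current one.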


4) Build the model

Use T5ForConditionalGeneration to build a simple model that translates text into SQL statements. The PyTorch implementation is as follows:

class T5ForTextToSQL(torch.nn.Module):
    '''
    A basic T5 model for Text-to-SQL task.
    '''

    def __init__(self):
        super(T5ForTextToSQL, self).__init__()
        self.t5 = T5ForConditionalGeneration.from_pretrained('t5-small')

    def forward(self, input_ids, labels):
        # Returns the model output; its .loss field is used during training
        out = self.t5(input_ids=input_ids, labels=labels)
        return out

    def generate(self, input_ids):
        # Generates output token ids with the default generation settings
        result = self.t5.generate(input_ids=input_ids)
        return result


5) Prepare the dataset

A small custom dataset is enough to get started quickly. It is defined as follows:

class TextToSQL_Dataset(torch.utils.data.Dataset):
    '''
    A simple text-to-sql dataset example.
    '''

    def __init__(self, text_l, schema_l, sql_l, tokenizer, block_size=1):
        self.tokenizer = tokenizer
        self.max_len = block_size  # stored but not used below
        self.text = text_l
        self.schema = schema_l
        self.sql = sql_l

    def _text_to_encoding(self, item):
        # Tokenize one string into input_ids and attention_mask
        return self.tokenizer(item)

    def _text_to_item(self, text):
        # Return None for missing text or when tokenization fails
        try:
            if text is not None:
                return self._text_to_encoding(text)
            else:
                return None
        except Exception:
            return None

    def __len__(self):
        return len(self.sql)

    def __getitem__(self, _id):
        text = self.text[_id]
        sql = self.sql[_id]
        schema = self.schema[_id]
        # Prefix the question with the task prompt
        text_encodings = self._text_to_item("translate Text to SQL: " + text)
        sql_encodings = self._text_to_item(sql)
        schema_encodings = self._text_to_item(schema)
        item = dict()
        item['text_encodings'] = {key: torch.tensor(value) for key, value in text_encodings.items()}
        item['sql_encodings'] = {key: torch.tensor(value) for key, value in sql_encodings.items()}
        item['schema_encodings'] = {key: torch.tensor(value) for key, value in schema_encodings.items()}
        return item

The training and test sets are as follows.

The following is train_set:

text_l = [

"Find all student names in student database.",

"Count student's number for class 1. ",

"Given the max student age in class 1.",

"Please find the minium student age in class 1.",

"Tell me the number of classes.",

"Who is the student that older than 15."

]

schema_l = [

'Table: student$$header: name%%age%%class%%',

]*len(text_l)

sql_l = [

"SELECT name FROM student",

"SELECT COUNT(*) FROM student WHERE class=1",

"SELECT MAX(age) FROM student WHERE class=1",

"SELECT MIN(age) FROM student WHERE class=1",

"SELECT COUNT(class) FROM student",

"SELECT name FROM student WHERE age>15",

]

The following is test_set:

test_text_l = [

"Find all student ages in student database.",

"Count student's number for class 3. ",

"Given the min student age in class 2.",

"Please find the maxium student age in class 2.",

"Who is the student that younger than 14."

]

test_schema_l = [

'Table: student$$header: name%%age%%class%%',

]*len(test_text_l)

test_sql_l = [

"SELECT age FROM student",

"SELECT COUNT(*) FROM student WHERE class=3",

"SELECT MIN(age) FROM student WHERE class=2",

"SELECT MAX(age) FROM student WHERE class=2",

"SELECT name FROM student WHERE age<14",

]
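The schema strings in both sets use a compact format that the article does not spell out. Judging from the single example, `$$` separates the table name from the header and `%%` terminates each column name. Under that assumption, a small parser (the name parse_schema is hypothetical) makes the structure explicit:

```python
def parse_schema(schema):
    """Split 'Table: t$$header: a%%b%%' into ('t', ['a', 'b']).

    The meaning of the '$$' and '%%' separators is inferred from the
    example string in this article, not from a documented format.
    """
    table_part, header_part = schema.split("$$", 1)
    table = table_part.split(":", 1)[1].strip()
    raw_columns = header_part.split(":", 1)[1].split("%%")
    columns = [col.strip() for col in raw_columns if col.strip()]
    return table, columns


if __name__ == "__main__":
    print(parse_schema("Table: student$$header: name%%age%%class%%"))
```

For the schema used here this yields the table student with the columns name, age, and class, which are exactly the identifiers that appear in the SQL targets.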

Once the dataset text is ready, define and create the dataset and dataloader objects:

from torch.utils.data import DataLoader

train_dataset = TextToSQL_Dataset(text_l, schema_l, sql_l, tokenizer)
test_dataset = TextToSQL_Dataset(test_text_l, test_schema_l, test_sql_l, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=1, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=1, shuffle=True)
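Both loaders use batch_size=1, so sequences of different lengths never have to be stacked into one tensor. With a larger batch size the default collation would fail on unequal lengths, and a padding step would be needed. The plain-Python sketch below shows the idea; the name pad_batch is hypothetical. Pad id 0 matches T5's pad token, and -100 is the label value that the loss in Transformers ignores.

```python
def pad_batch(sequences, pad_value):
    """Right-pad token id lists to a common length.

    Returns (padded, mask), where mask holds 1 for real tokens and 0 for padding.
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


if __name__ == "__main__":
    input_ids, attention_mask = pad_batch([[1, 2, 3], [4]], pad_value=0)
    labels, _ = pad_batch([[9, 8], [7]], pad_value=-100)
    print(input_ids, attention_mask, labels)
```

A function like this could be wrapped into a collate_fn for DataLoader, converting the padded lists to tensors before returning them.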


6) Initialize the model

First, initialize a simple model:

model = T5ForTextToSQL()

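To verify that training has an effect, first look at how the original T5 model, without any fine-tuning, performs on the Text-to-SQL task.

Evaluate directly on the test set. The code calls the generate function to produce an output sequence, which can then be compared with the ground truth.

# 'cuda:3' is kept from the original run; adjust the index to your own hardware
device = torch.device('cuda:3') if torch.cuda.is_available() else torch.device('cpu')
model.eval()
model = model.to(device)
for i, batch in enumerate(test_loader):
    input_ids = batch['text_encodings']['input_ids'].to(device)
    # Ground-truth SQL token ids, available for comparison
    sql_ids = batch['sql_encodings']['input_ids'].to(device)
    result = model.generate(input_ids)
    print("===" * 20)
    print("Question:")
    print(tokenizer.decode(input_ids[0]))
    print("SQL:")
    print(tokenizer.decode(result[0]))
    print()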

Running the code above produces the following output:

============================================================

Question:

translate Text to SQL: Who is the student that younger than 14.

SQL:

<pad> Text zu SQL: Wer ist der Student, der unter 14 Jahren ist?

============================================================

Question:

translate Text to SQL: Find all student ages in student database.

SQL:

<pad> Text in SQL: Finden Sie alle Studenten Alter in der Studentendatenbank.

============================================================

Question:

translate Text to SQL: Given the min student age in class 2.

SQL:

<pad> Text zu SQL: Angesichts des min Studenten Alters in der Klasse 2

============================================================

Question:

translate Text to SQL: Count student's number for class 3.

SQL:

<pad> Text in SQL: Count student's number for class 3.

============================================================

Question:

translate Text to SQL: Please find the maxium student age in class 2.

SQL:

<pad> Text in SQL: Bitte finden Sie das maxium Studentenalter in der Klasse

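The evaluation script hardcodes 'cuda:3', which assumes a host with at least four devices. A more portable choice can be sketched as follows. The helper name pick_device_name is hypothetical, and the sketch assumes that torch_npu registers torch.npu with an is_available() function once it is imported.

```python
def pick_device_name(index=0):
    """Return 'npu:<index>', 'cuda:<index>' or 'cpu', whichever is usable."""
    try:
        import torch
    except ImportError:
        return "cpu"
    try:
        import torch_npu  # noqa: F401  (assumed to register torch.npu)
    except ImportError:
        pass
    npu = getattr(torch, "npu", None)
    if npu is not None and npu.is_available():
        return f"npu:{index}"
    if torch.cuda.is_available():
        return f"cuda:{index}"
    return "cpu"


if __name__ == "__main__":
    print(pick_device_name())
```

The returned string can be passed to torch.device. The sketch does not check that the requested index actually exists on the machine.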

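7) Train the model

Before training, set up the optimizer. This example uses AdamW with a learning rate of 5e-5 (T5 itself was pre-trained with Adafactor rather than Adam). These hyperparameters can be tuned carefully later.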
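For intuition about what one optimizer step does with this learning rate, here is the textbook AdamW update applied to a single scalar parameter. It is an illustrative sketch, not the PyTorch implementation; the function name adamw_step is made up, and the default betas, eps, and weight decay mirror the defaults of torch.optim.AdamW.

```python
import math


def adamw_step(theta, grad, m, v, t, lr=5e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """Apply one AdamW update to a scalar; t is the 1-based step count."""
    # Decoupled weight decay: shrink the parameter directly.
    theta = theta * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialized averages.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


if __name__ == "__main__":
    print(adamw_step(theta=1.0, grad=1.0, m=0.0, v=0.0, t=1))
```

On the first step the bias-corrected ratio is close to 1 in magnitude, so the parameter moves by roughly one learning rate regardless of the gradient scale.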

# AdamW is not imported in the original; torch.optim.AdamW is assumed here
from torch.optim import AdamW

optim = AdamW(model.parameters(), lr=5e-5)
device = torch.device('cuda:3') if torch.cuda.is_available() else torch.device('cpu')
model.train()
model = model.to(device)
for epoch in range(100):
    for i, batch in enumerate(train_loader):
        optim.zero_grad()
        input_ids = batch['text_encodings']['input_ids'].to(device)
        sql_ids = batch['sql_encodings']['input_ids'].to(device)
        # The SQL token ids serve as labels; the model returns the loss
        loss = model(input_ids=input_ids, labels=sql_ids).loss
        loss.backward()
        optim.step()
        # Log the first step of every tenth epoch
        if epoch % 10 == 0 and i % 10 == 0:
            print("Epoch: ", epoch, " , step: ", i)
            print("training loss: ", loss.item())

Running the code above produces the following output:

Epoch: 0 , step: 0

training loss: 5.786584854125977

Epoch: 10 , step: 0

training loss: 3.1280531883239746

Epoch: 20 , step: 0

training loss: 1.795115351676941

Epoch: 30 , step: 0

training loss: 0.7517924308776855

Epoch: 40 , step: 0

training loss: 0.2508695125579834

Epoch: 50 , step: 0

training loss: 0.0881464034318924

Epoch: 60 , step: 0

training loss: 0.3708261251449585

Epoch: 70 , step: 0

training loss: 0.0828586220741272

Epoch: 80 , step: 0

training loss: 0.03668573126196861

Epoch: 90 , step: 0

training loss: 0.02559477463364601

As training proceeds, the loss falls from about 5.79 to about 0.026 and largely converges at a low level.
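The log shows the loss of a single step every ten epochs. With six training samples and batch_size=1, the condition i % 10 == 0 only matches step 0, so one noisy step, such as the 0.37 at epoch 60, can look like a regression. Averaging the loss over each epoch gives a smoother signal. A minimal sketch, with the hypothetical helper name RunningMean:

```python
class RunningMean:
    """Accumulate values and report their mean."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        self.total += float(value)
        self.count += 1

    @property
    def mean(self):
        # Report 0.0 before any value has been recorded.
        return self.total / self.count if self.count else 0.0


if __name__ == "__main__":
    meter = RunningMean()
    for step_loss in (1.0, 3.0):  # made-up values for illustration
        meter.update(step_loss)
    print("mean loss:", meter.mean)
```

In the training loop, one would create a new RunningMean at the start of each epoch, call update(loss.item()) after every step, and print mean once the epoch ends.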


8) Evaluate the model

device = torch.device('cuda:3') if torch.cuda.is_available() else torch.device('cpu')
model.eval()
model = model.to(device)
for i, batch in enumerate(test_loader):
    input_ids = batch['text_encodings']['input_ids'].to(device)
    # Ground-truth SQL token ids, available for comparison
    sql_ids = batch['sql_encodings']['input_ids'].to(device)
    result = model.generate(input_ids)
    print("===" * 20)
    print("Question:")
    print(tokenizer.decode(input_ids[0]))
    print("SQL:")
    print(tokenizer.decode(result[0]))
    print()

Running the code above produces the following output:

============================================================

Question:

translate Text to SQL: Given the min student age in class 2.

SQL:

<pad> SELECT MAX(age) FROM student WHERE class=2

============================================================

Question:

translate Text to SQL: Who is the student that younger than 14.

SQL:

<pad> SELECT name FROM student WHERE age>14

============================================================

Question:

translate Text to SQL: Find all student ages in student database.

SQL:

<pad> SELECT MIN(age) FROM student

============================================================

Question:

translate Text to SQL: Please find the maxium student age in class 2.

SQL:

<pad> SELECT MAX(age) FROM student WHERE class=2

============================================================

Question:

translate Text to SQL: Count student's number for class 3.

SQL:

<pad> SELECT COUNT(*) FROM student WHERE class=3
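The article prints the generated SQL but does not score it. One simple way to do so is exact string match against test_sql_l after removing special tokens and extra whitespace. The sketch below uses hypothetical helper names; the predictions are copied from the output above and the references from the test set, paired by question.

```python
def normalize_sql(text):
    """Drop special tokens and collapse whitespace."""
    for token in ("<pad>", "</s>"):
        text = text.replace(token, " ")
    return " ".join(text.split())


def exact_match(predictions, references):
    """Fraction of predictions equal to their reference after normalization."""
    hits = sum(normalize_sql(p) == normalize_sql(r)
               for p, r in zip(predictions, references))
    return hits / len(references)


# Generated SQL, copied from the output above.
predictions = [
    "<pad> SELECT MAX(age) FROM student WHERE class=2",
    "<pad> SELECT name FROM student WHERE age>14",
    "<pad> SELECT MIN(age) FROM student",
    "<pad> SELECT MAX(age) FROM student WHERE class=2",
    "<pad> SELECT COUNT(*) FROM student WHERE class=3",
]
# Ground truth from test_sql_l, in the same question order.
references = [
    "SELECT MIN(age) FROM student WHERE class=2",
    "SELECT name FROM student WHERE age<14",
    "SELECT age FROM student",
    "SELECT MAX(age) FROM student WHERE class=2",
    "SELECT COUNT(*) FROM student WHERE class=3",
]

if __name__ == "__main__":
    print("exact match:", exact_match(predictions, references))
```

On the output shown above this gives 0.4 (2 of 5). After fine-tuning the model emits well-formed SQL, but it still confuses MIN with MAX and the direction of the age comparison, which is unsurprising with only six training examples.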


This article is reposted from: https://blog.csdn.net/chillfrog/article/details/135736164
Copyright belongs to the original author, chillfrog. In case of infringement, please contact us for removal.
