0


Llama中文大模型-模型量化

对中文微调的模型参数进行了量化,方便以更少的计算资源运行。目前已经在Hugging Face上传了13B中文微调模型FlagAlpha/Llama2-Chinese-13b-Chat的4bit压缩版本FlagAlpha/Llama2-Chinese-13b-Chat-4bit,具体调用方式如下:

环境准备:

pip install git+https://github.com/PanQiWei/AutoGPTQ.git
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM
model = AutoGPTQForCausalLM.from_quantized('FlagAlpha/Llama2-Chinese-13b-Chat-4bit', device="cuda:0")
tokenizer = AutoTokenizer.from_pretrained('FlagAlpha/Llama2-Chinese-13b-Chat-4bit',use_fast=False)
input_ids = tokenizer(['<s>Human: 怎么登上火星\n</s><s>Assistant: '], return_tensors="pt",add_special_tokens=False).input_ids.to('cuda')        
generate_input = {
    "input_ids":input_ids,
    "max_new_tokens":512,
    "do_sample":True,
    "top_k":50,
    "top_p":0.95,
    "temperature":0.3,
    "repetition_penalty":1.3,
    "eos_token_id":tokenizer.eos_token_id,
    "bos_token_id":tokenizer.bos_token_id,
    "pad_token_id":tokenizer.pad_token_id
}
generate_ids  = model.generate(**generate_input)
text = tokenizer.decode(generate_ids[0])
print(text)
标签: llama

本文转载自: https://blog.csdn.net/TH_NUM/article/details/136274435
版权归原作者 蓝鲸123 所有, 如有侵权,请联系我们删除。

“Llama中文大模型-模型量化”的评论:

还没有评论