Llama Chinese Large Model - Model Quantization

The parameters of the Chinese fine-tuned model have been quantized so that it can run with fewer compute resources. A 4-bit compressed version of the 13B Chinese fine-tuned model FlagAlpha/Llama2-Chinese-13b-Chat has been uploaded to Hugging Face as FlagAlpha/Llama2-Chinese-13b-Chat-4bit. It can be used as follows:
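
As a rough sense of the savings (our estimate, not a figure from the original post): 13B parameters stored as fp16 take about 13e9 × 2 bytes ≈ 26 GB, while 4-bit weights take about 13e9 × 0.5 bytes ≈ 6.5 GB plus quantization metadata, so the quantized model fits comfortably on a single 24 GB consumer GPU.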

Environment setup:

  pip install git+https://github.com/PanQiWei/AutoGPTQ.git

Inference code:

  from transformers import AutoTokenizer
  from auto_gptq import AutoGPTQForCausalLM

  # Load the 4-bit quantized model and its tokenizer
  model = AutoGPTQForCausalLM.from_quantized('FlagAlpha/Llama2-Chinese-13b-Chat-4bit', device="cuda:0")
  tokenizer = AutoTokenizer.from_pretrained('FlagAlpha/Llama2-Chinese-13b-Chat-4bit', use_fast=False)

  # Build the prompt in the model's chat format (the question asks "How do I get to Mars?")
  input_ids = tokenizer(['<s>Human: 怎么登上火星\n</s><s>Assistant: '], return_tensors="pt", add_special_tokens=False).input_ids.to('cuda')

  # Sampling-based generation settings
  generate_input = {
      "input_ids": input_ids,
      "max_new_tokens": 512,
      "do_sample": True,
      "top_k": 50,
      "top_p": 0.95,
      "temperature": 0.3,
      "repetition_penalty": 1.3,
      "eos_token_id": tokenizer.eos_token_id,
      "bos_token_id": tokenizer.bos_token_id,
      "pad_token_id": tokenizer.pad_token_id,
  }

  generate_ids = model.generate(**generate_input)
  text = tokenizer.decode(generate_ids[0])
  print(text)
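
The snippet above only loads a checkpoint that has already been quantized. To produce such a 4-bit checkpoint from a full-precision model, AutoGPTQ also exposes the quantization side of the API. The sketch below is a minimal example under our own assumptions, not from the original post: the calibration sample and the output directory name are ours, and real calibration should use more, domain-relevant text.

  from transformers import AutoTokenizer
  from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

  base_model = 'FlagAlpha/Llama2-Chinese-13b-Chat'  # full-precision model to quantize
  tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=False)

  # A tiny calibration set; GPTQ uses activations on these samples to choose quantized weights
  examples = [tokenizer('<s>Human: 怎么登上火星\n</s><s>Assistant: 你可以先成为一名宇航员。\n</s>')]

  # 4-bit weights with group size 128 are common GPTQ settings
  quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

  model = AutoGPTQForCausalLM.from_pretrained(base_model, quantize_config)
  model.quantize(examples)
  model.save_quantized('Llama2-Chinese-13b-Chat-4bit')  # hypothetical output directory

Loading the result afterwards works exactly like the inference snippet above, via AutoGPTQForCausalLM.from_quantized.
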
Tags: llama

Reposted from: https://blog.csdn.net/TH_NUM/article/details/136274435
Copyright belongs to the original author 蓝鲸123; if there is any infringement, please contact us for removal.
