Multi-GPU LLM Inference with the Accelerate Library

Large language models (LLMs) have transformed natural language processing. As these models grow in size and complexity, the computational demands of inference grow with them. To meet this challenge, using multiple GPUs becomes essential.

This article shows how to run inference in parallel across multiple GPUs. It covers an introduction to the Accelerate library, a simple approach with working code examples, and performance benchmarks with different numbers of GPUs.

We will use several RTX 3090s to scale llama2-7b inference across GPUs.

Basic example

We start with a simple example that demonstrates multi-GPU "message passing" with Accelerate.

    from accelerate import Accelerator
    from accelerate.utils import gather_object

    accelerator = Accelerator()

    # each GPU creates a string
    message=[ f"Hello this is GPU {accelerator.process_index}" ]

    # collect the messages from all GPUs
    messages=gather_object(message)

    # output the messages only on the main process with accelerator.print()
    accelerator.print(messages)

The output looks like this:

    ['Hello this is GPU 0',
     'Hello this is GPU 1',
     'Hello this is GPU 2',
     'Hello this is GPU 3',
     'Hello this is GPU 4']

Multi-GPU inference

Below is a simple, non-batched inference approach. The code stays short because the Accelerate library already does most of the work for us; we can use it directly:

    from accelerate import Accelerator
    from accelerate.utils import gather_object
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from statistics import mean
    import torch, time, json

    accelerator = Accelerator()

    # 10*10 Prompts. Source: https://www.penguin.co.uk/articles/2022/04/best-first-lines-in-books
    prompts_all=[
        "The King is dead. Long live the Queen.",
        "Once there were four children whose names were Peter, Susan, Edmund, and Lucy.",
        "The story so far: in the beginning, the universe was created.",
        "It was a bright cold day in April, and the clocks were striking thirteen.",
        "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.",
        "The sweat wis lashing oafay Sick Boy; he wis trembling.",
        "124 was spiteful. Full of Baby's venom.",
        "As Gregor Samsa awoke one morning from uneasy dreams he found himself transformed in his bed into a gigantic insect.",
        "I write this sitting in the kitchen sink.",
        "We were somewhere around Barstow on the edge of the desert when the drugs began to take hold.",
    ] * 10

    # load a base model and tokenizer
    model_path="models/llama2-7b"
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map={"": accelerator.process_index},
        torch_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # sync GPUs and start the timer
    accelerator.wait_for_everyone()
    start=time.time()

    # divide the prompt list onto the available GPUs
    with accelerator.split_between_processes(prompts_all) as prompts:
        # store output of generations in dict
        results=dict(outputs=[], num_tokens=0)

        # have each GPU do inference, prompt by prompt
        for prompt in prompts:
            prompt_tokenized=tokenizer(prompt, return_tensors="pt").to("cuda")
            output_tokenized = model.generate(**prompt_tokenized, max_new_tokens=100)[0]

            # remove prompt from output
            output_tokenized=output_tokenized[len(prompt_tokenized["input_ids"][0]):]

            # store outputs and number of tokens in result{}
            results["outputs"].append( tokenizer.decode(output_tokenized) )
            results["num_tokens"] += len(output_tokenized)

        results=[ results ] # transform to list, otherwise gather_object() will not collect correctly

    # collect results from all the GPUs
    results_gathered=gather_object(results)

    if accelerator.is_main_process:
        timediff=time.time()-start
        num_tokens=sum([r["num_tokens"] for r in results_gathered ])

        print(f"tokens/sec: {num_tokens//timediff}, time {timediff}, total tokens {num_tokens}, total prompts {len(prompts_all)}")
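One detail worth knowing: if the prompt list does not divide evenly across the launched processes, split_between_processes simply gives some GPUs slightly fewer prompts, and gather_object still collects everything. A minimal sketch to check the split, assuming the script is started with accelerate launch:

    from accelerate import Accelerator
    from accelerate.utils import gather_object

    accelerator = Accelerator()

    # 10 items split across however many processes were launched
    items = list(range(10))
    with accelerator.split_between_processes(items) as shard:
        counts = gather_object([len(shard)])

    # printed only on the main process; shows how many items each GPU received
    accelerator.print(f"items per GPU: {counts}")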

Using multiple GPUs introduces some communication overhead: in this particular setup, throughput scales roughly linearly up to 4 GPUs and then levels off. The exact numbers depend on many parameters, such as model size and quantization, prompt length, number of generated tokens, and sampling strategy, so we only discuss the general trend here:

1 GPU: 44 tokens/sec, time: 225.5s

2 GPUs: 88 tokens/sec, time: 112.9s

3 GPUs: 128 tokens/sec, time: 77.6s

4 GPUs: 137 tokens/sec, time: 72.7s

5 GPUs: 119 tokens/sec, time: 83.8s

Batched inference on multiple GPUs

In the real world we can use batched inference to speed things up. Batching reduces per-prompt overhead and accelerates inference considerably. We only need to add a prepare_prompts function that feeds the model batches of prompts instead of one prompt at a time:

    from accelerate import Accelerator
    from accelerate.utils import gather_object
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from statistics import mean
    import torch, time, json

    accelerator = Accelerator()

    def write_pretty_json(file_path, data):
        import json
        with open(file_path, "w") as write_file:
            json.dump(data, write_file, indent=4)

    # 10*10 Prompts. Source: https://www.penguin.co.uk/articles/2022/04/best-first-lines-in-books
    prompts_all=[
        "The King is dead. Long live the Queen.",
        "Once there were four children whose names were Peter, Susan, Edmund, and Lucy.",
        "The story so far: in the beginning, the universe was created.",
        "It was a bright cold day in April, and the clocks were striking thirteen.",
        "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.",
        "The sweat wis lashing oafay Sick Boy; he wis trembling.",
        "124 was spiteful. Full of Baby's venom.",
        "As Gregor Samsa awoke one morning from uneasy dreams he found himself transformed in his bed into a gigantic insect.",
        "I write this sitting in the kitchen sink.",
        "We were somewhere around Barstow on the edge of the desert when the drugs began to take hold.",
    ] * 10

    # load a base model and tokenizer
    model_path="models/llama2-7b"
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map={"": accelerator.process_index},
        torch_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    tokenizer.pad_token = tokenizer.eos_token

    # batch, left pad (for inference), and tokenize
    def prepare_prompts(prompts, tokenizer, batch_size=16):
        batches=[prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]
        batches_tok=[]
        tokenizer.padding_side="left"
        for prompt_batch in batches:
            batches_tok.append(
                tokenizer(
                    prompt_batch,
                    return_tensors="pt",
                    padding='longest',
                    truncation=False,
                    pad_to_multiple_of=8,
                    add_special_tokens=False).to("cuda")
                )
        tokenizer.padding_side="right"
        return batches_tok

    # sync GPUs and start the timer
    accelerator.wait_for_everyone()
    start=time.time()

    # divide the prompt list onto the available GPUs
    with accelerator.split_between_processes(prompts_all) as prompts:
        results=dict(outputs=[], num_tokens=0)

        # have each GPU do inference in batches
        prompt_batches=prepare_prompts(prompts, tokenizer, batch_size=16)

        for prompts_tokenized in prompt_batches:
            outputs_tokenized=model.generate(**prompts_tokenized, max_new_tokens=100)

            # remove prompt from gen. tokens
            outputs_tokenized=[ tok_out[len(tok_in):]
                for tok_in, tok_out in zip(prompts_tokenized["input_ids"], outputs_tokenized) ]

            # count and decode gen. tokens
            num_tokens=sum([ len(t) for t in outputs_tokenized ])
            outputs=tokenizer.batch_decode(outputs_tokenized)

            # store in results{} to be gathered by accelerate
            results["outputs"].extend(outputs)
            results["num_tokens"] += num_tokens

        results=[ results ] # transform to list, otherwise gather_object() will not collect correctly

    # collect results from all the GPUs
    results_gathered=gather_object(results)

    if accelerator.is_main_process:
        timediff=time.time()-start
        num_tokens=sum([r["num_tokens"] for r in results_gathered ])

        print(f"tokens/sec: {num_tokens//timediff}, time elapsed: {timediff}, num_tokens {num_tokens}")
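Two small follow-ups are worth noting. In a batch, sequences that finish early are padded on the right by generate, so passing skip_special_tokens=True to batch_decode keeps the pad/eos tokens out of the decoded text. Also, the write_pretty_json helper defined at the top of the script is never called; appended to the end of the script above, a minimal sketch for saving the gathered results on the main process (the file name is just an example) could look like this:

    # continues the script above: persist the gathered generations on the main process
    if accelerator.is_main_process:
        all_outputs = [text for r in results_gathered for text in r["outputs"]]
        write_pretty_json("outputs_batched.json", {
            "num_gpus": accelerator.num_processes,
            "tokens_per_sec": num_tokens / timediff,
            "outputs": all_outputs,
        })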

As you can see, batching speeds things up considerably:

1 GPU: 520 tokens/sec, time: 19.2s

2 GPUs: 900 tokens/sec, time: 11.1s

3 GPUs: 1205 tokens/sec, time: 8.2s

4 GPUs: 1655 tokens/sec, time: 6.0s

5 GPUs: 1658 tokens/sec, time: 6.0s

Summary

As of this writing, llama.cpp and ctransformers do not support multi-GPU inference. There was apparently a multi-GPU merge for llama.cpp in June, but I have not seen an official update, so for now I will treat multi-GPU as unsupported there. If anyone can confirm that it works, please leave a comment.

Hugging Face's Accelerate package gives us a very convenient way to use multiple GPUs. Inference on multiple GPUs can significantly improve throughput, but the communication overhead between GPUs grows noticeably as the number of GPUs increases.

Author: Geronimo
