LLMPerf Testing Tool User Guide

Note:
Translated from the README file of the official repository.

A tool for evaluating the performance of LLM APIs.

Installation

git clone https://github.com/ray-project/llmperf.git
cd llmperf
pip install -e .

Basic Usage

We implement two tests for evaluating LLMs: a load test to check performance and a correctness test to check correctness.

Load Test

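The load test spawns a number of concurrent requests to the LLM API and measures the inter-token latency and generation throughput, both per request and across concurrent requests. The prompt sent with each request has the following format: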

Randomly stream lines from the following text. Don't generate eos tokens:
LINE 1,
LINE 2,
LINE 3,
...

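where the lines are randomly sampled from a collection of lines from Shakespeare's sonnets. Tokens are counted with the LlamaTokenizer regardless of which LLM API is being tested. This ensures that the prompts are consistent across different LLM APIs.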
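The prompt construction described above can be sketched in a few lines of Python. This is only an illustration: count_tokens is a whitespace-based stand-in for the Llama tokenizer so that the example runs without downloading anything, and build_prompt, HEADER, and the three sample lines are hypothetical names and data, not LLMPerf's own code.

```python
import random
from typing import List, Tuple

HEADER = (
    "Randomly stream lines from the following text. "
    "Don't generate eos tokens:\n"
)


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting stands in for the Llama tokenizer.
    return len(text.split())


def build_prompt(
    lines: List[str], target_tokens: int, rng: random.Random
) -> Tuple[str, int]:
    """Append randomly chosen lines until the prompt reaches target_tokens."""
    prompt = HEADER
    while count_tokens(prompt) < target_tokens:
        prompt += rng.choice(lines) + ",\n"
    return prompt, count_tokens(prompt)


sample_lines = [
    "Shall I compare thee to a summer's day?",
    "Thou art more lovely and more temperate:",
    "Rough winds do shake the darling buds of May,",
]
prompt, n_tokens = build_prompt(sample_lines, target_tokens=40, rng=random.Random(0))
print(n_tokens)
print(prompt)
```

Counting every prompt with the same tokenizer, whatever API is under test, is what keeps the reported input lengths comparable between providers.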

To run the most basic load test, you can use the token_benchmark_ray script.
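The two quantities the load test measures, inter-token latency and generation throughput, can be pictured with a small Python sketch. The RequestTrace record and the exact formulas below are illustrative assumptions (mean gap between tokens, tokens divided by elapsed time); the script's own definitions may differ in detail.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    start: float              # time the request was sent, in seconds
    token_times: List[float]  # arrival time of each generated token, in seconds


def inter_token_latency(trace: RequestTrace) -> float:
    """Mean gap between consecutive tokens of one request."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


def request_throughput(trace: RequestTrace) -> float:
    """Tokens per second for one request, from send time to last token."""
    elapsed = trace.token_times[-1] - trace.start
    return len(trace.token_times) / elapsed


def overall_throughput(traces: List[RequestTrace]) -> float:
    """Total tokens divided by the wall-clock window covering all requests.

    Assumes every trace contains at least one token.
    """
    total_tokens = sum(len(t.token_times) for t in traces)
    window = max(t.token_times[-1] for t in traces) - min(t.start for t in traces)
    return total_tokens / window


# Two concurrent requests that were sent at the same moment.
traces = [
    RequestTrace(start=0.0, token_times=[0.5, 0.6, 0.7, 0.8]),
    RequestTrace(start=0.0, token_times=[0.4, 0.6, 0.8, 1.0]),
]
print(inter_token_latency(traces[0]))
print(request_throughput(traces[0]))
print(overall_throughput(traces))
```

The overall figure is computed over the shared wall-clock window rather than by summing per-request rates, so it reflects what the endpoint delivers while requests overlap.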

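Caveats and Disclaimers

The endpoint provider's backend may vary widely, so the results do not reflect how the software runs on any particular hardware.
The results may vary with the time of day.
The results may vary with the load.
The results may not correlate with users' workloads.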

OpenAI Compatible APIs

export OPENAI_API_KEY=secret_abcdefg
export OPENAI_API_BASE="https://api.endpoints.anyscale.com/v1"

python token_benchmark_ray.py \
--model "meta-llama/Llama-2-7b-chat-hf" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api openai \
--additional-sampling-params '{}'
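The --mean-input-tokens / --stddev-input-tokens flags (and their output counterparts) in the command above describe a spread of lengths rather than one fixed length. A minimal Python sketch of how such per-request lengths could be drawn is shown here. It assumes a normal distribution with non-positive draws rejected, which the text does not state, and sample_positive_int is a hypothetical helper name.

```python
import random


def sample_positive_int(mean: float, stddev: float, rng: random.Random) -> int:
    """Draw from a normal distribution, retrying until the value is positive."""
    while True:
        value = int(rng.gauss(mean, stddev))
        if value > 0:
            return value


rng = random.Random(42)
# Same numbers as the flags in the example command.
input_lengths = [sample_positive_int(550, 150, rng) for _ in range(1000)]
output_lengths = [sample_positive_int(150, 10, rng) for _ in range(1000)]
print(min(input_lengths), max(input_lengths))
print(sum(input_lengths) / len(input_lengths))
```

Under this assumption a larger standard deviation exercises the endpoint with a wider mix of short and long prompts, while a standard deviation of 0 would make every request the same length.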

Anthropic

export ANTHROPIC_API_KEY=secret_abcdefg

python token_benchmark_ray.py \
--model "claude-2" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api anthropic \
--additional-sampling-params '{}'

TogetherAI

export TOGETHERAI_API_KEY="YOUR_TOGETHER_KEY"

python token_benchmark_ray.py \
--model "together_ai/togethercomputer/CodeLlama-7b-Instruct" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api "litellm" \
--additional-sampling-params '{}'

Hugging Face

export HUGGINGFACE_API_KEY="YOUR_HUGGINGFACE_API_KEY"
export HUGGINGFACE_API_BASE="YOUR_HUGGINGFACE_API_ENDPOINT"

python token_benchmark_ray.py \
--model "huggingface/meta-llama/Llama-2-7b-chat-hf" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api "litellm" \
--additional-sampling-params '{}'

LiteLLM

LLMPerf can use LiteLLM to send prompts to LLM APIs. For the environment variables to set for a given provider, and the values to pass for --model and --additional-sampling-params, see the LiteLLM provider documentation.

python token_benchmark_ray.py \
--model "meta-llama/Llama-2-7b-chat-hf" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api "litellm" \
--additional-sampling-params '{}'

Vertex AI

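Here, --model is used for logging only, not for selecting the model. The model is determined by the Vertex AI endpoint ID.

GCLOUD_ACCESS_TOKEN needs to be set periodically by running gcloud auth print-access-token, because the generated token expires after about 15 minutes.

Vertex AI does not return the total number of tokens generated by its endpoint, so tokens are counted with the Llama tokenizer.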
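Because the access token expires after roughly 15 minutes, a longer benchmarking session has to re-run gcloud auth print-access-token from time to time. One way to handle this is sketched below in Python. RefreshingToken, fetch_gcloud_token, and the 10-minute refresh interval are hypothetical choices for illustration; only the gcloud command itself comes from the text.

```python
import subprocess
import time
from typing import Callable, Optional


def fetch_gcloud_token() -> str:
    """Run the gcloud command named in the text and return the token it prints."""
    result = subprocess.run(
        ["gcloud", "auth", "print-access-token"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


class RefreshingToken:
    """Caches a token and fetches a new one once it is older than max_age_s."""

    def __init__(
        self,
        fetch: Callable[[], str] = fetch_gcloud_token,
        max_age_s: float = 10 * 60,  # assumption: refresh well before ~15 min
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._max_age_s = max_age_s
        self._clock = clock
        self._token: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is None or now - self._fetched_at >= self._max_age_s:
            self._token = self._fetch()
            self._fetched_at = now
        return self._token
```

A driver script could then set os.environ["GCLOUD_ACCESS_TOKEN"] = token.get() before launching each benchmark run, so that every run starts with a token that is at most ten minutes old.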

gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID

export GCLOUD_ACCESS_TOKEN=$(gcloud auth print-access-token)
export GCLOUD_PROJECT_ID=YOUR_PROJECT_ID
export GCLOUD_REGION=YOUR_REGION
export VERTEXAI_ENDPOINT_ID=YOUR_ENDPOINT_ID

python token_benchmark_ray.py \
--model "meta-llama/Llama-2-7b-chat-hf" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api "vertexai" \
--additional-sampling-params '{}'

SageMaker

SageMaker does not return the total number of tokens generated by its endpoint, so tokens are counted with the Llama tokenizer.

export AWS_ACCESS_KEY_ID="YOUR_ACCESS_KEY_ID"
export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_ACCESS_KEY"
export AWS_SESSION_TOKEN="YOUR_SESSION_TOKEN"
export AWS_REGION_NAME="YOUR_ENDPOINTS_REGION_NAME"

python llm_correctness.py \
--model "llama-2-7b" \
--llm-api "sagemaker" \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs"

Run

python token_benchmark_ray.py --help

to see more details on the arguments.

Correctness Test

The correctness test spawns a number of concurrent requests to the LLM API, each with the following format:

Convert the following sequence of words into a number: {random_number_in_word_format}. Output just your final answer.

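For example, random_number_in_word_format could be "one hundred and twenty-three". The test then checks that the response contains that number in digit form, which in this case is 123.

The test does this for many randomly generated numbers and reports the number of responses that contain a mismatch.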
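The check described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: to_words is a hand-written stand-in that only covers 0 to 999, and response_matches is one possible way to look for the digits; the text does not say how LLMPerf spells out the numbers or parses the responses.

```python
import re
from typing import Iterable, Tuple

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def to_words(n: int) -> str:
    """Spell out 0 <= n < 1000 in English."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" and " + to_words(rest) if rest else "")


def make_prompt(n: int) -> str:
    return (
        "Convert the following sequence of words into a number: "
        f"{to_words(n)}. Output just your final answer."
    )


def response_matches(response: str, expected: int) -> bool:
    """True if expected appears as a whole digit group (commas ignored)."""
    digit_groups = re.findall(r"\d+", response.replace(",", ""))
    return str(expected) in digit_groups


def count_mismatches(results: Iterable[Tuple[int, str]]) -> int:
    """Count (expected number, response text) pairs that do not match."""
    return sum(not response_matches(text, n) for n, text in results)


print(make_prompt(123))
print(count_mismatches([(123, "The answer is 123."), (45, "54")]))
```

Matching whole digit groups instead of substrings keeps a response such as "1234" from being accepted when the expected number is 123.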

To run the most basic correctness test, you can run the llm_correctness.py script.

OpenAI Compatible APIs

export OPENAI_API_KEY=secret_abcdefg
export OPENAI_API_BASE=https://console.endpoints.anyscale.com/m/v1

python llm_correctness.py \
--model "meta-llama/Llama-2-7b-chat-hf" \
--max-num-completed-requests 150 \
--timeout 600 \
--num-concurrent-requests 10 \
--results-dir "result_outputs"

Anthropic

export ANTHROPIC_API_KEY=secret_abcdefg

python llm_correctness.py \
--model "claude-2" \
--llm-api "anthropic" \
--max-num-completed-requests 5 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs"

TogetherAI

export TOGETHERAI_API_KEY="YOUR_TOGETHER_KEY"

python llm_correctness.py \
--model "together_ai/togethercomputer/CodeLlama-7b-Instruct" \
--llm-api "litellm" \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs"

Hugging Face

export HUGGINGFACE_API_KEY="YOUR_HUGGINGFACE_API_KEY"
export HUGGINGFACE_API_BASE="YOUR_HUGGINGFACE_API_ENDPOINT"

python llm_correctness.py \
--model "huggingface/meta-llama/Llama-2-7b-chat-hf" \
--llm-api "litellm" \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs"

LiteLLM

LLMPerf can use LiteLLM to send prompts to LLM APIs. For the environment variables to set for a given provider, and the values to pass for --model and --additional-sampling-params, see the LiteLLM provider documentation.

python llm_correctness.py \
--model "meta-llama/Llama-2-7b-chat-hf" \
--llm-api "litellm" \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs"

Run python llm_correctness.py --help to see more details on the arguments.

Vertex AI

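Here, --model is used for logging only, not for selecting the model. The model is determined by the Vertex AI endpoint ID.

GCLOUD_ACCESS_TOKEN needs to be set periodically by running gcloud auth print-access-token, because the generated token expires after about 15 minutes.

Vertex AI does not return the total number of tokens generated by its endpoint, so tokens are counted with the Llama tokenizer.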

gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID

export GCLOUD_ACCESS_TOKEN=$(gcloud auth print-access-token)
export GCLOUD_PROJECT_ID=YOUR_PROJECT_ID
export GCLOUD_REGION=YOUR_REGION
export VERTEXAI_ENDPOINT_ID=YOUR_ENDPOINT_ID

python llm_correctness.py \
--model "meta-llama/Llama-2-7b-chat-hf" \
--llm-api "vertexai" \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs"

SageMaker

SageMaker does not return the total number of tokens generated by its endpoint, so tokens are counted with the Llama tokenizer.

export AWS_ACCESS_KEY_ID="YOUR_ACCESS_KEY_ID"
export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_ACCESS_KEY"
export AWS_SESSION_TOKEN="YOUR_SESSION_TOKEN"
export AWS_REGION_NAME="YOUR_ENDPOINTS_REGION_NAME"

python llm_correctness.py \
--model "llama-2-7b" \
--llm-api "sagemaker" \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs"

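Saving Results

The results of the load test and the correctness test are saved in the results directory specified by the --results-dir argument. The results are saved in two files: one with the summary metrics of the test, and one with the metrics of each individual request that was returned.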
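After a run, the two result files can be inspected with a few lines of Python. The sketch below assumes the files are JSON. The file names and metric keys in the demonstration (demo_summary.json, demo_individual_responses.json, num_requests, latency_s) are made up for illustration and are not taken from LLMPerf.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def load_results(results_dir: str) -> Dict[str, Any]:
    """Map each JSON file name in results_dir to its parsed content."""
    return {
        path.name: json.loads(path.read_text())
        for path in sorted(Path(results_dir).glob("*.json"))
    }


# Demonstration with a temporary directory and hypothetical file contents.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "demo_summary.json").write_text(json.dumps({"num_requests": 2}))
    Path(tmp, "demo_individual_responses.json").write_text(
        json.dumps([{"latency_s": 1.2}, {"latency_s": 0.9}])
    )
    results = load_results(tmp)

for name, content in results.items():
    print(name, content)
```

For a real run, load_results("result_outputs") would read whatever JSON files the example commands wrote to that directory.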

Advanced Usage

The correctness test is implemented with the following workflow:

import ray
from transformers import LlamaTokenizerFast

from llmperf.ray_clients.openai_chat_completions_client import (
    OpenAIChatCompletionsClient,
)
from llmperf.models import RequestConfig
from llmperf.requests_launcher import RequestsLauncher

# Copying the environment variables and passing them to ray.init() is
# necessary for making any clients work.
ray.init(
    runtime_env={
        "env_vars": {
            "OPENAI_API_BASE": "https://api.endpoints.anyscale.com/v1",
            "OPENAI_API_KEY": "YOUR_API_KEY",
        }
    }
)

base_prompt = "hello_world"
tokenizer = LlamaTokenizerFast.from_pretrained("hf-internal-testing/llama-tokenizer")
base_prompt_len = len(tokenizer.encode(base_prompt))
prompt = (base_prompt, base_prompt_len)

# Create a client for spawning requests
clients = [OpenAIChatCompletionsClient.remote()]

req_launcher = RequestsLauncher(clients)

req_config = RequestConfig(
    model="meta-llama/Llama-2-7b-chat-hf",
    prompt=prompt,
)

req_launcher.launch_requests(req_config)
result = req_launcher.get_next_ready(block=True)
print(result)

Implementing New LLM Clients

To implement a new LLM client, you need to implement the base class llmperf.ray_llm_client.LLMClient and decorate it as a Ray actor.

from typing import Any, Dict, Tuple

import ray

from llmperf.models import RequestConfig
from llmperf.ray_llm_client import LLMClient

Metrics = Dict[str, Any]  # assumed alias; not defined in the original snippet

@ray.remote
class CustomLLMClient(LLMClient):
    def llm_request(self, request_config: RequestConfig) -> Tuple[Metrics, str, RequestConfig]:
        """Make a single completion request to a LLM API

        Returns:
            Metrics about the performance characteristics of the request.
            The text generated by the request to the LLM API.
            The request_config used to make the request. This is mainly for logging purposes.

        """
        ...

Legacy Codebase

The old LLMPerf codebase can be found in the llmperf-legacy repository.

Reposted from: https://blog.csdn.net/codelearning/article/details/138309110
Copyright belongs to the original author, lldhsds. In case of infringement, please contact us for removal.