Open-Source Models in Practice - GLM Trial - glm-4-9b-chat - vLLM Integration (Part 4)

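I. Introduction

  1. GLM-4 is a foundation model released by the Zhipu AI team on January 16, 2024. It is designed to understand and plan complex user instructions on its own, and it can call a web browser. Its capabilities include data analysis, chart creation, and PPT generation. It supports a 128K context window, which helps it perform well on long-text processing and recall accuracy, and it is reported to surpass GPT-4 in Chinese alignment. Compared with earlier GLM releases, GLM-4 is reported to improve overall performance by 60%, with notably stronger instruction following and multimodal capabilities, which makes it suitable for a wide range of applications. Although it still trails leading international models in some areas, its Chinese-language ability puts it among the leading large models in China. Development of the model line began in 2020 and went through several rounds of iteration and improvement.
  2. In "Open-Source Models in Practice - GLM Trial - glm-4-9b-chat - Quick Start (Part 1)", we covered the basics of **glm-4-9b-chat**.
  3. In "Open-Source Models in Practice - GLM Trial - glm-4-9b-chat - Batch Inference (Part 2)", we covered batch inference with **glm-4-9b-chat**.
  4. In "Open-Source Models in Practice - GLM Trial - glm-4-9b-chat - Gradio Integration (Part 3)", we covered how to integrate **Gradio** for a web UI.
  5. This part shows how to integrate **vLLM** to speed up inference.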

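II. Terminology

**2.1.** GLM-4-9B

  1. An open-source pretrained model from Zhipu AI in the GLM-4 series. It was released on June 6, 2024, is designed for high-performance language understanding and generation, and supports context inputs of up to 1M tokens (roughly two million Chinese characters) in its long-context variant. The model has stronger base capabilities, supports 26 languages, and is the first in the series to make significant progress on multimodal capabilities.

The base capabilities of GLM-4-9B include:

  • Overall Chinese and English performance improved by 40%, with notable gains in Chinese alignment, instruction following, and engineering code tasks

  • Better performance than Llama 3 8B, especially on complex tasks such as math problem solving and code writing

  • Stronger function calling, with a 40% performance improvement

  • Support for multi-turn dialogue as well as advanced features such as web browsing, code execution, and custom tool calling, so it can process large amounts of information quickly and give high-quality answers

**2.2.** GLM-4-9B-Chat

  1. The chat version of the GLM-4-9B series from Zhipu AI. It is designed for multi-turn dialogue and includes advanced features that make it more efficient and flexible on natural language processing tasks.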

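**2.3.** vLLM

  1. vLLM is an open-source inference acceleration framework for large models. It uses PagedAttention to manage the key and value tensors cached by attention efficiently, and is reported to reach 14-24x the throughput of HuggingFace Transformers.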
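To make the idea behind PagedAttention a little more concrete, here is a toy sketch of block-based KV-cache bookkeeping. It is not vLLM's implementation: the class `ToyBlockManager`, its methods, and the block size of 16 tokens are assumptions chosen for illustration. It only shows how a sequence's cache can grow one fixed-size block at a time instead of reserving memory for the maximum length up front.

```python
from typing import Dict, List


class ToyBlockManager:
    """Toy bookkeeping for a paged KV cache (illustrative only, not vLLM code)."""

    def __init__(self, num_blocks: int, block_size: int = 16) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        # Per-sequence block table: logical block index -> physical block id.
        self.block_tables: Dict[str, List[int]] = {}
        self.num_tokens: Dict[str, int] = {}

    def append_token(self, seq_id: str) -> None:
        """Account for one more cached token, allocating a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        count = self.num_tokens.get(seq_id, 0)
        if count == len(table) * self.block_size:  # all allocated blocks are full
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks left")
            table.append(self.free_blocks.pop())
        self.num_tokens[seq_id] = count + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.num_tokens.pop(seq_id, None)


manager = ToyBlockManager(num_blocks=8, block_size=16)
for _ in range(17):  # 17 tokens need two 16-token blocks
    manager.append_token("request-a")
print(len(manager.block_tables["request-a"]))  # 2
manager.free("request-a")
print(len(manager.free_blocks))  # 8
```

Because blocks are handed out on demand and returned as soon as a request finishes, short requests do not hold memory sized for the longest possible request, which is the intuition behind the throughput gain.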

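III. Prerequisites

3.1. Base environment and prerequisites

**1. Operating system: CentOS 7**

**2. GPU: NVIDIA Tesla V100 32GB, CUDA Version: 12.2**

**3. Minimum hardware requirements**

3.2. Download the model

Hugging Face:

https://huggingface.co/THUDM/glm-4-9b-chat/tree/main

ModelScope:

ModelScope community

Example of downloading with git-lfs: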
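The original example did not survive in this copy, so the following is only a minimal sketch of the usual git-lfs workflow, wrapped in Python to match the rest of the article. The target directory `/data/model/glm-4-9b-chat` is the path used by the server script later on; the helper names `git_lfs_clone_commands` and `download`, and the dry-run default, are assumptions.

```python
import shlex
import subprocess
from typing import List

REPO_URL = "https://huggingface.co/THUDM/glm-4-9b-chat"
TARGET_DIR = "/data/model/glm-4-9b-chat"  # same path the server script uses


def git_lfs_clone_commands(repo_url: str, target_dir: str) -> List[List[str]]:
    """Return the two commands typically used to fetch a model repo with git-lfs."""
    return [
        ["git", "lfs", "install"],               # enable the LFS filters once per user
        ["git", "clone", repo_url, target_dir],  # LFS-tracked weights are pulled on clone
    ]


def download(repo_url: str = REPO_URL, target_dir: str = TARGET_DIR, dry_run: bool = True) -> None:
    """Print the commands, and run them only when dry_run is False."""
    for cmd in git_lfs_clone_commands(repo_url, target_dir):
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


download()  # dry run: prints the commands without touching the network
```

Calling `download(dry_run=False)` would actually run the two commands; the weights are large, so make sure the target disk has enough free space first.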

3.3. Create a virtual environment

    conda create --name glm4 python=3.10
    conda activate glm4

3.4. Install dependencies

Quote each requirement so that the shell does not treat `>` as an output redirection.

    pip install "torch>=2.5.0"
    pip install "torchvision>=0.20.0"
    pip install "transformers>=4.46.0"
    pip install "huggingface-hub>=0.25.1"
    pip install "sentencepiece>=0.2.0"
    pip install "jinja2>=3.1.4"
    pip install "pydantic>=2.9.2"
    pip install "timm>=1.0.9"
    pip install "tiktoken>=0.7.0"
    pip install "numpy==1.26.4"
    pip install "accelerate>=1.0.1"
    pip install "sentence_transformers>=3.1.1"
    pip install "openai>=1.51.0"
    pip install "einops>=0.8.0"
    pip install "pillow>=10.4.0"
    pip install "sse-starlette>=2.1.3"
    pip install "bitsandbytes>=0.43.3"
    # using with the vLLM framework
    pip install "vllm>=0.6.3"
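After installation, it can be useful to confirm what actually got installed. The sketch below uses only the standard library; the helper name `installed_versions` and the selection of packages checked are illustrative assumptions.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# A few of the packages from the dependency list.
for name, version in installed_versions(["torch", "transformers", "vllm", "openai"]).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Run it inside the activated `glm4` environment so that it inspects the same interpreter the server will use.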

IV. Implementation

4.1. vLLM server implementation

    # -*- coding: utf-8 -*-
    import time
    import logging
    import re
    import uvicorn
    import gc
    import json
    import torch
    import random
    import string

    from vllm import SamplingParams, AsyncEngineArgs, AsyncLLMEngine
    from fastapi import FastAPI, HTTPException, Response
    from fastapi.middleware.cors import CORSMiddleware
    from contextlib import asynccontextmanager
    from typing import List, Literal, Optional, Union
    from pydantic import BaseModel, Field
    from transformers import AutoTokenizer, LogitsProcessor
    from sse_starlette.sse import EventSourceResponse

    logger = logging.getLogger(__name__)

    # Interval between SSE keep-alive pings, in seconds.
    EventSourceResponse.DEFAULT_PING_INTERVAL = 1000

    # Maximum sequence length (prompt plus completion) the engine is configured for.
    MAX_MODEL_LENGTH = 8192


    @asynccontextmanager
    async def lifespan(app: FastAPI):
        yield
        # On shutdown, release cached GPU memory.
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            torch.cuda.ipc_collect()


    app = FastAPI(lifespan=lifespan)

    app.add_middleware(
        CORSMiddleware,
        allow_origins=["*"],
        allow_credentials=True,
        allow_methods=["*"],
        allow_headers=["*"],
    )


    def generate_id(prefix: str, k=29) -> str:
        """Build an OpenAI-style identifier: prefix plus k random alphanumerics."""
        suffix = ''.join(random.choices(string.ascii_letters + string.digits, k=k))
        return f"{prefix}{suffix}"


    # ---- Request/response schemas mirroring the OpenAI chat completions API ----

    class ModelCard(BaseModel):
        id: str = ""
        object: str = "model"
        created: int = Field(default_factory=lambda: int(time.time()))
        owned_by: str = "owner"
        root: Optional[str] = None
        parent: Optional[str] = None
        permission: Optional[list] = None


    class ModelList(BaseModel):
        object: str = "list"
        data: List[ModelCard] = Field(default_factory=list)


    class FunctionCall(BaseModel):
        name: Optional[str] = None
        arguments: Optional[str] = None


    class ChoiceDeltaToolCallFunction(BaseModel):
        name: Optional[str] = None
        arguments: Optional[str] = None


    class UsageInfo(BaseModel):
        prompt_tokens: int = 0
        total_tokens: int = 0
        completion_tokens: Optional[int] = 0


    class ChatCompletionMessageToolCall(BaseModel):
        index: Optional[int] = 0
        id: Optional[str] = None
        function: FunctionCall
        type: Optional[Literal["function"]] = 'function'


    class ChatMessage(BaseModel):
        # About the "function" role:
        # with older OpenAI API versions, add "function" to the roles below and
        # map it to "observation" in process_messages.
        role: Literal["user", "assistant", "system", "tool"]
        content: Optional[str] = None
        function_call: Optional[ChoiceDeltaToolCallFunction] = None
        tool_calls: Optional[List[ChatCompletionMessageToolCall]] = None


    class DeltaMessage(BaseModel):
        role: Optional[Literal["user", "assistant", "system"]] = None
        content: Optional[str] = None
        function_call: Optional[ChoiceDeltaToolCallFunction] = None
        tool_calls: Optional[List[ChatCompletionMessageToolCall]] = None


    class ChatCompletionResponseChoice(BaseModel):
        index: int
        message: ChatMessage
        finish_reason: Literal["stop", "length", "tool_calls"]


    class ChatCompletionResponseStreamChoice(BaseModel):
        delta: DeltaMessage
        finish_reason: Optional[Literal["stop", "length", "tool_calls"]]
        index: int


    class ChatCompletionResponse(BaseModel):
        model: str
        id: Optional[str] = Field(default_factory=lambda: generate_id('chatcmpl-', 29))
        object: Literal["chat.completion", "chat.completion.chunk"]
        choices: List[Union[ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice]]
        created: Optional[int] = Field(default_factory=lambda: int(time.time()))
        system_fingerprint: Optional[str] = Field(default_factory=lambda: generate_id('fp_', 9))
        usage: Optional[UsageInfo] = None


    class ChatCompletionRequest(BaseModel):
        model: str
        messages: List[ChatMessage]
        temperature: Optional[float] = 0.8
        top_p: Optional[float] = 0.8
        max_tokens: Optional[int] = None
        stream: Optional[bool] = False
        tools: Optional[Union[dict, List[dict]]] = None
        tool_choice: Optional[Union[str, dict]] = None
        repetition_penalty: Optional[float] = 1.1


    class InvalidScoreLogitsProcessor(LogitsProcessor):
        """Replace NaN/Inf logits with a fixed distribution (not used by the vLLM path below)."""

        def __call__(
                self, input_ids: torch.LongTensor, scores: torch.FloatTensor
        ) -> torch.FloatTensor:
            if torch.isnan(scores).any() or torch.isinf(scores).any():
                scores.zero_()
                scores[..., 5] = 5e4
            return scores


    def process_response(output: str, tools: dict | List[dict] = None, use_tool: bool = False) -> Union[str, dict]:
        """Return a {"name", "arguments"} dict if the output is a tool call, else the stripped text."""
        lines = output.strip().split("\n")
        arguments_json = None
        special_tools = ["cogview", "simple_browser"]
        tools = {tool['function']['name'] for tool in tools} if tools else {}

        # A tool call puts the tool name on the first line and its arguments after it.
        if len(lines) >= 2 and lines[1].startswith("{"):
            function_name = lines[0].strip()
            arguments = "\n".join(lines[1:]).strip()
            if function_name in tools or function_name in special_tools:
                try:
                    arguments_json = json.loads(arguments)
                    is_tool_call = True
                except json.JSONDecodeError:
                    is_tool_call = function_name in special_tools

                if is_tool_call and use_tool:
                    content = {
                        "name": function_name,
                        "arguments": json.dumps(arguments_json if isinstance(arguments_json, dict) else arguments,
                                                ensure_ascii=False)
                    }
                    if function_name == "simple_browser":
                        search_pattern = re.compile(r'search\("(.+?)"\s*,\s*recency_days\s*=\s*(\d+)\)')
                        match = search_pattern.match(arguments)
                        if match:
                            content["arguments"] = json.dumps({
                                "query": match.group(1),
                                "recency_days": int(match.group(2))
                            }, ensure_ascii=False)
                    elif function_name == "cogview":
                        content["arguments"] = json.dumps({
                            "prompt": arguments
                        }, ensure_ascii=False)

                    return content
        return output.strip()


    @torch.inference_mode()
    async def generate_stream_glm4(params):
        """Run one request through the vLLM engine, yielding text, usage, and finish reason."""
        messages = params["messages"]
        tools = params["tools"]
        tool_choice = params["tool_choice"]
        temperature = float(params.get("temperature", 1.0))
        repetition_penalty = float(params.get("repetition_penalty", 1.0))
        top_p = float(params.get("top_p", 1.0))
        max_new_tokens = int(params.get("max_tokens", 8192))

        messages = process_messages(messages, tools=tools, tool_choice=tool_choice)
        # Render the chat template to a prompt string; vLLM tokenizes it itself.
        inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
        params_dict = {
            "n": 1,
            "best_of": 1,
            "presence_penalty": 1.0,
            "frequency_penalty": 0.0,
            "temperature": temperature,
            "top_p": top_p,
            "top_k": -1,
            "repetition_penalty": repetition_penalty,
            # GLM-4 stop tokens: <|endoftext|>, <|user|>, <|observation|>
            "stop_token_ids": [151329, 151336, 151338],
            "ignore_eos": False,
            "max_tokens": max_new_tokens,
            "logprobs": None,
            "prompt_logprobs": None,
            "skip_special_tokens": True,
        }
        sampling_params = SamplingParams(**params_dict)
        async for output in engine.generate(prompt=inputs, sampling_params=sampling_params, request_id=f"{time.time()}"):
            output_len = len(output.outputs[0].token_ids)
            input_len = len(output.prompt_token_ids)
            ret = {
                "text": output.outputs[0].text,
                "usage": {
                    "prompt_tokens": input_len,
                    "completion_tokens": output_len,
                    "total_tokens": output_len + input_len
                },
                "finish_reason": output.outputs[0].finish_reason,
            }
            yield ret
        gc.collect()
        torch.cuda.empty_cache()


    def process_messages(messages, tools=None, tool_choice="none"):
        """Convert OpenAI-style messages into the role/metadata format GLM-4 expects."""
        _messages = messages
        processed_messages = []
        msg_has_sys = False

        def filter_tools(tool_choice, tools):
            # Keep only the tool explicitly named in tool_choice.
            function_name = tool_choice.get('function', {}).get('name', None)
            if not function_name:
                return []
            filtered_tools = [
                tool for tool in tools
                if tool.get('function', {}).get('name') == function_name
            ]
            return filtered_tools

        if tool_choice != "none":
            if isinstance(tool_choice, dict):
                tools = filter_tools(tool_choice, tools)
            if tools:
                # Tool definitions are passed to the model through a system message.
                processed_messages.append(
                    {
                        "role": "system",
                        "content": None,
                        "tools": tools
                    }
                )
                msg_has_sys = True

        if isinstance(tool_choice, dict) and tools:
            processed_messages.append(
                {
                    "role": "assistant",
                    "metadata": tool_choice["function"]["name"],
                    "content": ""
                }
            )

        for m in _messages:
            role, content, func_call = m.role, m.content, m.function_call
            tool_calls = getattr(m, 'tool_calls', None)

            if role == "function":
                processed_messages.append(
                    {
                        "role": "observation",
                        "content": content
                    }
                )
            elif role == "tool":
                processed_messages.append(
                    {
                        "role": "observation",
                        "content": content,
                        "function_call": True
                    }
                )
            elif role == "assistant":
                if tool_calls:
                    for tool_call in tool_calls:
                        processed_messages.append(
                            {
                                "role": "assistant",
                                "metadata": tool_call.function.name,
                                "content": tool_call.function.arguments
                            }
                        )
                else:
                    for response in content.split("\n"):
                        if "\n" in response:
                            metadata, sub_content = response.split("\n", maxsplit=1)
                        else:
                            metadata, sub_content = "", response
                        processed_messages.append(
                            {
                                "role": role,
                                "metadata": metadata,
                                "content": sub_content.strip()
                            }
                        )
            else:
                # Skip the first user-supplied system message if the tools system message was added.
                if role == "system" and msg_has_sys:
                    msg_has_sys = False
                    continue
                processed_messages.append({"role": role, "content": content})

        if not tools or tool_choice == "none":
            # Without tools, put the first system message at the front of the list.
            for m in _messages:
                if m.role == 'system':
                    processed_messages.insert(0, {"role": m.role, "content": m.content})
                    break
        return processed_messages


    @app.get("/health")
    async def health() -> Response:
        """Health check."""
        return Response(status_code=200)


    @app.get("/v1/models", response_model=ModelList)
    async def list_models():
        model_card = ModelCard(id="glm-4")
        return ModelList(data=[model_card])


    @app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
    async def create_chat_completion(request: ChatCompletionRequest):
        if len(request.messages) < 1 or request.messages[-1].role == "assistant":
            raise HTTPException(status_code=400, detail="Invalid request")

        gen_params = dict(
            messages=request.messages,
            temperature=request.temperature,
            top_p=request.top_p,
            max_tokens=request.max_tokens or 1024,
            echo=False,
            stream=request.stream,
            repetition_penalty=request.repetition_penalty,
            tools=request.tools,
            tool_choice=request.tool_choice,
        )
        logger.debug(f"==== request ====\n{gen_params}")

        # Streaming response (server-sent events).
        if request.stream:
            predict_stream_generator = predict_stream(request.model, gen_params)
            # Pull the first chunk to decide how to respond.
            output = await anext(predict_stream_generator)
            if output:
                return EventSourceResponse(predict_stream_generator, media_type="text/event-stream")
            logger.debug(f"First result output:\n{output}")

            function_call = None
            if output and request.tools:
                try:
                    function_call = process_response(output, request.tools, use_tool=True)
                except Exception:
                    logger.warning("Failed to parse tool call")

            if isinstance(function_call, dict):
                function_call = ChoiceDeltaToolCallFunction(**function_call)
                generate = parse_output_text(request.model, output, function_call=function_call)
                return EventSourceResponse(generate, media_type="text/event-stream")
            else:
                return EventSourceResponse(predict_stream_generator, media_type="text/event-stream")

        # Non-streaming response: drain the generator and keep the last (complete) result.
        response = ""
        async for response in generate_stream_glm4(gen_params):
            pass

        if response["text"].startswith("\n"):
            response["text"] = response["text"][1:]
        response["text"] = response["text"].strip()

        usage = UsageInfo()

        function_call, finish_reason = None, "stop"
        tool_calls = None
        if request.tools:
            try:
                function_call = process_response(response["text"], request.tools, use_tool=True)
            except Exception as e:
                logger.warning(f"Failed to parse tool call: {e}")
            if isinstance(function_call, dict):
                finish_reason = "tool_calls"
                function_call_response = ChoiceDeltaToolCallFunction(**function_call)
                function_call_instance = FunctionCall(
                    name=function_call_response.name,
                    arguments=function_call_response.arguments
                )
                tool_calls = [
                    ChatCompletionMessageToolCall(
                        id=generate_id('call_', 24),
                        function=function_call_instance,
                        type="function")]

        message = ChatMessage(
            role="assistant",
            content=None if tool_calls else response["text"],
            function_call=None,
            tool_calls=tool_calls,
        )

        logger.debug(f"==== message ====\n{message}")

        choice_data = ChatCompletionResponseChoice(
            index=0,
            message=message,
            finish_reason=finish_reason,
        )
        task_usage = UsageInfo.model_validate(response["usage"])
        for usage_key, usage_value in task_usage.model_dump().items():
            setattr(usage, usage_key, getattr(usage, usage_key) + usage_value)

        return ChatCompletionResponse(
            model=request.model,
            choices=[choice_data],
            object="chat.completion",
            usage=usage
        )


    async def predict_stream(model_id, gen_params):
        """Yield OpenAI-style chat.completion.chunk payloads for a streaming request."""
        output = ""
        is_function_call = False
        has_send_first_chunk = False
        created_time = int(time.time())
        function_name = None
        response_id = generate_id('chatcmpl-', 29)
        system_fingerprint = generate_id('fp_', 9)
        tools = {tool['function']['name'] for tool in gen_params['tools']} if gen_params['tools'] else {}
        delta_text = ""
        async for new_response in generate_stream_glm4(gen_params):
            decoded_unicode = new_response["text"]
            # Keep only the text that was not seen in the previous iteration.
            delta_text += decoded_unicode[len(output):]
            output = decoded_unicode
            lines = output.strip().split("\n")

            # Check whether the output is a tool call.
            # This is a simple comparison and cannot catch every non-tool output,
            # for example when the arguments are malformed.
            # TODO: extend the logic here if you need stricter handling.
            if not is_function_call and len(lines) >= 2:
                first_line = lines[0].strip()
                if first_line in tools:
                    is_function_call = True
                    function_name = first_line
                    delta_text = lines[1]

            # Tool call response
            if is_function_call:
                if not has_send_first_chunk:
                    function_call = {"name": function_name, "arguments": ""}
                    tool_call = ChatCompletionMessageToolCall(
                        index=0,
                        id=generate_id('call_', 24),
                        function=FunctionCall(**function_call),
                        type="function"
                    )
                    message = DeltaMessage(
                        content=None,
                        role="assistant",
                        function_call=None,
                        tool_calls=[tool_call]
                    )
                    choice_data = ChatCompletionResponseStreamChoice(
                        index=0,
                        delta=message,
                        finish_reason=None
                    )
                    chunk = ChatCompletionResponse(
                        model=model_id,
                        id=response_id,
                        choices=[choice_data],
                        created=created_time,
                        system_fingerprint=system_fingerprint,
                        object="chat.completion.chunk"
                    )
                    yield ""
                    yield chunk.model_dump_json(exclude_unset=True)
                    has_send_first_chunk = True

                function_call = {"name": None, "arguments": delta_text}
                delta_text = ""
                tool_call = ChatCompletionMessageToolCall(
                    index=0,
                    id=None,
                    function=FunctionCall(**function_call),
                    type="function"
                )
                message = DeltaMessage(
                    content=None,
                    role=None,
                    function_call=None,
                    tool_calls=[tool_call]
                )
                choice_data = ChatCompletionResponseStreamChoice(
                    index=0,
                    delta=message,
                    finish_reason=None
                )
                chunk = ChatCompletionResponse(
                    model=model_id,
                    id=response_id,
                    choices=[choice_data],
                    created=created_time,
                    system_fingerprint=system_fingerprint,
                    object="chat.completion.chunk"
                )
                yield chunk.model_dump_json(exclude_unset=True)

            # The user requested function calling, but it is not yet clear whether this output is one.
            elif (gen_params["tools"] and gen_params["tool_choice"] != "none") or is_function_call:
                continue

            # Regular response
            else:
                finish_reason = new_response.get("finish_reason", None)
                if not has_send_first_chunk:
                    message = DeltaMessage(
                        content="",
                        role="assistant",
                        function_call=None,
                    )
                    choice_data = ChatCompletionResponseStreamChoice(
                        index=0,
                        delta=message,
                        finish_reason=finish_reason
                    )
                    chunk = ChatCompletionResponse(
                        model=model_id,
                        id=response_id,
                        choices=[choice_data],
                        created=created_time,
                        system_fingerprint=system_fingerprint,
                        object="chat.completion.chunk"
                    )
                    yield chunk.model_dump_json(exclude_unset=True)
                    has_send_first_chunk = True

                message = DeltaMessage(
                    content=delta_text,
                    role="assistant",
                    function_call=None,
                )
                delta_text = ""
                choice_data = ChatCompletionResponseStreamChoice(
                    index=0,
                    delta=message,
                    finish_reason=finish_reason
                )
                chunk = ChatCompletionResponse(
                    model=model_id,
                    id=response_id,
                    choices=[choice_data],
                    created=created_time,
                    system_fingerprint=system_fingerprint,
                    object="chat.completion.chunk"
                )
                yield chunk.model_dump_json(exclude_unset=True)

        # A tool call needs one extra chunk to match the OpenAI interface.
        if is_function_call:
            yield ChatCompletionResponse(
                model=model_id,
                id=response_id,
                system_fingerprint=system_fingerprint,
                choices=[
                    ChatCompletionResponseStreamChoice(
                        index=0,
                        delta=DeltaMessage(
                            content=None,
                            role=None,
                            function_call=None,
                        ),
                        finish_reason="tool_calls"
                    )],
                created=created_time,
                object="chat.completion.chunk",
                usage=None
            ).model_dump_json(exclude_unset=True)
        elif delta_text != "":
            # Text was held back while waiting for a possible tool call: flush it now.
            message = DeltaMessage(
                content="",
                role="assistant",
                function_call=None,
            )
            choice_data = ChatCompletionResponseStreamChoice(
                index=0,
                delta=message,
                finish_reason=None
            )
            chunk = ChatCompletionResponse(
                model=model_id,
                id=response_id,
                choices=[choice_data],
                created=created_time,
                system_fingerprint=system_fingerprint,
                object="chat.completion.chunk"
            )
            yield chunk.model_dump_json(exclude_unset=True)

            finish_reason = 'stop'
            message = DeltaMessage(
                content=delta_text,
                role="assistant",
                function_call=None,
            )
            delta_text = ""
            choice_data = ChatCompletionResponseStreamChoice(
                index=0,
                delta=message,
                finish_reason=finish_reason
            )
            chunk = ChatCompletionResponse(
                model=model_id,
                id=response_id,
                choices=[choice_data],
                created=created_time,
                system_fingerprint=system_fingerprint,
                object="chat.completion.chunk"
            )
            yield chunk.model_dump_json(exclude_unset=True)
            yield '[DONE]'
        else:
            yield '[DONE]'


    async def parse_output_text(model_id: str, value: str, function_call: ChoiceDeltaToolCallFunction = None):
        """Wrap an already complete output in a single streamed chunk followed by [DONE]."""
        delta = DeltaMessage(role="assistant", content=value)
        if function_call is not None:
            delta.function_call = function_call

        choice_data = ChatCompletionResponseStreamChoice(
            index=0,
            delta=delta,
            finish_reason=None
        )
        chunk = ChatCompletionResponse(
            model=model_id,
            choices=[choice_data],
            object="chat.completion.chunk"
        )
        yield "{}".format(chunk.model_dump_json(exclude_unset=True))
        yield '[DONE]'


    if __name__ == "__main__":
        MODEL_PATH = "/data/model/glm-4-9b-chat"
        tensor_parallel_size = 1
        tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
        engine_args = AsyncEngineArgs(
            model=MODEL_PATH,
            tokenizer=MODEL_PATH,
            # If you have several GPUs, set this to the number of GPUs.
            tensor_parallel_size=tensor_parallel_size,
            dtype=torch.float16,
            trust_remote_code=True,
            # Fraction of GPU memory to use; choose it to fit your card. For example,
            # to use 24 GB of an 80 GB card, set 24/80 = 0.3.
            gpu_memory_utilization=0.9,
            enforce_eager=True,
            worker_use_ray=False,
            disable_log_requests=True,
            max_model_len=MAX_MODEL_LENGTH,
        )
        engine = AsyncLLMEngine.from_engine_args(engine_args)
        uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)
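
One detail of `predict_stream` is worth isolating. As its slicing implies, each item coming back from the engine carries the full text generated so far, so the server cuts off what it has already seen to obtain the new fragment. The sketch below reproduces just that slicing on plain strings; the function name `iter_deltas` and the sample strings are made up for illustration.

```python
from typing import Iterable, Iterator


def iter_deltas(cumulative_texts: Iterable[str]) -> Iterator[str]:
    """Turn cumulative outputs ("Gu", "Guang", ...) into incremental fragments."""
    seen = ""
    for text in cumulative_texts:
        delta = text[len(seen):]  # same slicing as decoded_unicode[len(output):]
        seen = text
        if delta:  # skip iterations that produced no new text
            yield delta


snapshots = ["Gu", "Guang", "Guangzhou", "Guangzhou", "Guangzhou Tower"]
print(list(iter_deltas(snapshots)))  # ['Gu', 'ang', 'zhou', ' Tower']
```

Joining the fragments gives back the final text, which is what an OpenAI-compatible client does when it concatenates `delta.content` values.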

4.2. Starting the vLLM server

    (glm4) [root@gpu test]# python -u glm_server.py
    WARNING 11-06 12:11:19 config.py:1668] Casting torch.bfloat16 to torch.float16.
    WARNING 11-06 12:11:23 config.py:395] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
    INFO 11-06 12:11:23 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='/data/model/glm-4-9b-chat', speculative_config=None, tokenizer='/data/model/glm-4-9b-chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/model/glm-4-9b-chat, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, mm_processor_kwargs=None)
    WARNING 11-06 12:11:24 tokenizer.py:169] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
    INFO 11-06 12:11:24 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
    INFO 11-06 12:11:24 selector.py:115] Using XFormers backend.
    /usr/local/miniconda3/envs/glm4/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
      @torch.library.impl_abstract("xformers_flash::flash_fwd")
    /usr/local/miniconda3/envs/glm4/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
      @torch.library.impl_abstract("xformers_flash::flash_bwd")
    INFO 11-06 12:11:25 model_runner.py:1056] Starting to load model /data/model/glm-4-9b-chat...
    INFO 11-06 12:11:25 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
    INFO 11-06 12:11:25 selector.py:115] Using XFormers backend.
    Loading safetensors checkpoint shards: 0% Completed | 0/10 [00:00<?, ?it/s]
    Loading safetensors checkpoint shards: 10% Completed | 1/10 [00:00<00:08, 1.01it/s]
    Loading safetensors checkpoint shards: 20% Completed | 2/10 [00:01<00:07, 1.13it/s]
    Loading safetensors checkpoint shards: 30% Completed | 3/10 [00:02<00:06, 1.14it/s]
    Loading safetensors checkpoint shards: 40% Completed | 4/10 [00:03<00:05, 1.15it/s]
    Loading safetensors checkpoint shards: 50% Completed | 5/10 [00:04<00:04, 1.18it/s]
    Loading safetensors checkpoint shards: 60% Completed | 6/10 [00:05<00:03, 1.08it/s]
    Loading safetensors checkpoint shards: 70% Completed | 7/10 [00:06<00:02, 1.07it/s]
    Loading safetensors checkpoint shards: 80% Completed | 8/10 [00:07<00:01, 1.13it/s]
    Loading safetensors checkpoint shards: 90% Completed | 9/10 [00:08<00:00, 1.10it/s]
    Loading safetensors checkpoint shards: 100% Completed | 10/10 [00:08<00:00, 1.10it/s]
    Loading safetensors checkpoint shards: 100% Completed | 10/10 [00:08<00:00, 1.11it/s]
    INFO 11-06 12:11:35 model_runner.py:1067] Loading model weights took 17.5635 GB
    INFO 11-06 12:11:37 gpu_executor.py:122] # GPU blocks: 12600, # CPU blocks: 6553
    INFO 11-06 12:11:37 gpu_executor.py:126] Maximum concurrency for 8192 tokens per request: 24.61x
    INFO: Started server process [1627618]
    INFO: Waiting for application startup.
    INFO: Application startup complete.
    INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
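
The "Maximum concurrency" figure in the log can be reproduced from the reported block count. The sketch assumes vLLM's default KV-cache block size of 16 tokens, which the log itself does not print; under that assumption the numbers line up with the log.

```python
NUM_GPU_BLOCKS = 12600  # from the log: "# GPU blocks: 12600"
BLOCK_SIZE = 16         # assumption: default KV-cache block size, in tokens
MAX_MODEL_LEN = 8192    # MAX_MODEL_LENGTH in the server script

# Total number of tokens whose keys/values fit in the GPU cache.
cache_capacity_tokens = NUM_GPU_BLOCKS * BLOCK_SIZE
# How many full-length (8192-token) requests that capacity could hold at once.
max_concurrency = cache_capacity_tokens / MAX_MODEL_LEN

print(cache_capacity_tokens)      # 201600
print(f"{max_concurrency:.2f}x")  # 24.61x
```

Lowering `gpu_memory_utilization` or raising `max_model_len` would shrink this figure, since fewer blocks are available or each request may claim more of them.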

4.3. Client implementation

    # -*- coding: utf-8 -*-
    from openai import OpenAI

    base_url = "http://127.0.0.1:8000/v1/"
    # The server does not check the API key, so any placeholder works.
    client = OpenAI(api_key="EMPTY", base_url=base_url)
    MODEL_PATH = "/data/model/glm-4-9b-chat"


    def chat(use_stream=False):
        messages = [
            {
                "role": "system",
                # "You are a professional tour guide."
                "content": "你是一名专业的导游。",
            },
            {
                "role": "user",
                # "Please recommend some attractions typical of Guangzhou."
                "content": "请推荐一些广州特色的景点?",
            }
        ]
        response = client.chat.completions.create(
            model=MODEL_PATH,
            messages=messages,
            stream=use_stream,
            max_tokens=8192,
            temperature=0.4,
            # Not a field of the server's ChatCompletionRequest, so the server ignores it.
            presence_penalty=1.2,
            top_p=0.9,
        )
        if response:
            if use_stream:
                for chunk in response:
                    msg = chunk.choices[0].delta.content
                    # Some chunks carry no text (content is None); skip them.
                    if msg:
                        print(msg, end='', flush=True)
            else:
                print(response)
        else:
            print("Error: empty response")


    if __name__ == "__main__":
        chat(use_stream=True)
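
The client above does not exercise tool calling, although the server accepts a `tools` field. As a rough sketch of the shapes involved: the server reads each tool's name from `tool['function']['name']`, and treats a reply as a tool call when its first line is a known tool name and the following lines parse as JSON. The `get_weather` tool and the sample reply below are hypothetical, and `parse_tool_call` is a simplified stand-in for the server's `process_response`, not a copy of it.

```python
import json
from typing import List, Optional, Tuple

# Hypothetical tool definition in the OpenAI "tools" format.
tools: List[dict] = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]


def parse_tool_call(output: str, tools: List[dict]) -> Optional[Tuple[str, dict]]:
    """Return (tool_name, arguments) if the text looks like a tool call, else None."""
    names = {tool["function"]["name"] for tool in tools}
    lines = output.strip().split("\n")
    if len(lines) < 2 or lines[0].strip() not in names:
        return None
    try:
        arguments = json.loads("\n".join(lines[1:]))
    except json.JSONDecodeError:
        return None
    return lines[0].strip(), arguments


sample_reply = 'get_weather\n{"city": "Guangzhou"}'  # made-up model output
print(parse_tool_call(sample_reply, tools))    # ('get_weather', {'city': 'Guangzhou'})
print(parse_tool_call("Hello there!", tools))  # None
```

With a list like this, the client would pass `tools=tools` to `client.chat.completions.create(...)` and inspect `tool_calls` on the returned message instead of `content`.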

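4.4. Running the client

    (glm4) [root@gpu test]# python -u glm_client.py

The model replies in Chinese; an English translation of the output follows.

    Of course! Guangzhou is the capital of Guangdong Province, China. It has a long history and a rich cultural heritage, and it is also a modern metropolis. Here are some recommended attractions in Guangzhou:
    1. **Baiyun Mountain** - A famous scenic area in Guangzhou, known as the "First Beauty of the City of Rams". The air is fresh and the scenery is beautiful, making it a good place for hiking and for panoramic views of the city.
    2. **Pearl River Night Cruise** - Take a boat on the Pearl River to enjoy the night views on both banks, including landmarks such as Canton Tower and Haixinsha, as well as dazzling light shows.
    3. **Chimelong Tourist Resort** - Includes several theme parks such as Chimelong Safari Park, Chimelong Water Park, and Chimelong International Circus, and is well suited to family trips.
    4. **Chen Clan Ancestral Hall** - Also known as the Chen Clan Academy, a typical traditional Lingnan building famous for its exquisite wood, stone, and brick carvings.
    5. **Yuexiu Park** - Home to the Five Rams Sculpture, one of the symbols of Guangzhou, along with historic sites such as the Sun Yat-sen Monument and Zhenhai Tower.
    6. **Beijing Road Pedestrian Street** - Combines shopping, dining, and entertainment in one lively commercial district.
    7. **Shangxiajiu Pedestrian Street** - An old street known for its arcade buildings, lined with time-honored shops and snack stalls, and a good place to experience traditional Guangzhou culture.
    8. **Canton Tower ("Slim Waist")** - A landmark of Guangzhou from which visitors can look out over the whole city.
    9. **Nanyue King's Palace Museum** - Presents the history and culture of the Nanyue Kingdom from more than two thousand years ago, and houses a restored model of the palace.
    10. **Liwan Lake Park** - A park that combines natural scenery with cultural sights, with clear water and a pleasant environment.
    11. **Guangzhou Zoo** - One of the largest urban zoos in China, with many rare animals.
    12. **Guangzhou Museum of Art** - Holds a large collection of valuable artworks and historical relics, and is an important place to learn about the culture and art of Guangdong and South China.
    These attractions show not only the natural beauty of Guangzhou but also its rich cultural heritage and modern urban character. I hope you have a pleasant trip to Guangzhou!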

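This article is reposted from: https://blog.csdn.net/qq839019311/article/details/143567391
Copyright belongs to the original author, 开源技术探险家. If there is any infringement, please contact us for removal.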
