
[LLM] vLLM Basics


idiotyi · Published 2024-06-26 16:21:21

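Preface: vLLM is a high-throughput inference framework for large language models, designed to make LLM serving more efficient. Its main strength is memory management, built around the PagedAttention algorithm. It is designed for GPU acceleration rather than CPU inference.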

Contents

1. PagedAttention
2. Practice
3. Loading a model fine-tuned with LLaMA Factory

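1. PagedAttention

Core idea: split each sequence's KV cache (key-value cache) into blocks, each holding a fixed number of tokens.
Inspiration: virtual memory and paging in operating systems. GPU memory for the KV cache is allocated to requests dynamically, which raises memory utilization.
Evaluation: as reported by the vLLM authors, vLLM raises the throughput of popular LLMs by 2-4x compared with earlier serving systems.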
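
The paging idea above can be illustrated with a toy block table. This is only a sketch under stated assumptions, not vLLM's actual implementation: the names BlockAllocator, Sequence, and BLOCK_SIZE are hypothetical, and the block size of 4 tokens is chosen purely to keep the demo small. It shows that a sequence receives a new physical block only when its last block is full, and that the blocks of one sequence need not be contiguous.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block (hypothetical value, small for the demo)


class BlockAllocator:
    """Hands out physical block ids from a fixed-size pool (toy model)."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("no free KV blocks")
        return self.free_blocks.pop(0)

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


@dataclass
class Sequence:
    """One request: its token count and its logical-to-physical block table."""
    num_tokens: int = 0
    block_table: list = field(default_factory=list)

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def release(self, allocator: BlockAllocator) -> None:
        # Return every block to the pool once the request finishes.
        for block_id in self.block_table:
            allocator.free(block_id)
        self.block_table.clear()
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=8)
seq_a, seq_b = Sequence(), Sequence()

for _ in range(4):          # seq_a fills its first block
    seq_a.append_token(allocator)
for _ in range(3):          # seq_b starts and takes the next free block
    seq_b.append_token(allocator)
for _ in range(2):          # seq_a grows again and gets a non-adjacent block
    seq_a.append_token(allocator)

print("seq_a blocks:", seq_a.block_table)
print("seq_b blocks:", seq_b.block_table)
print("free blocks:", len(allocator.free_blocks))
```

Because memory is claimed one block at a time, at most one partially filled block per sequence is wasted, instead of a whole region reserved up front for the maximum length.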

2. Practice

2.1 Installation

pip install vllm

2.2 Offline inference

Example 1

from vllm import LLM  # the class name is uppercase

# tensor_parallel_size=4 shards the model across 4 GPUs
llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
output = llm.generate("San Francisco is a")

Example 2

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
 
# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("/data/weisx/model/Qwen1.5-4B-Chat")
 
# Pass the default decoding hyperparameters of Qwen1.5-4B-Chat
# max_tokens is for the maximum length for generation.
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512)
 
# Input the model name or path. Can be GPTQ or AWQ models.
# Use the same model as the tokenizer above so the chat template matches.
llm = LLM(model="Qwen/Qwen1.5-4B-Chat", trust_remote_code=True)
 
# Prepare your prompts
prompt = "Tell me something about large language models."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
 
# generate outputs
outputs = llm.generate([text], sampling_params)
 
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

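SamplingParams: controls the sampling process in vLLM. Sampling is the step in which the model picks the next token from the candidates, so it is central to how text is generated.
LLM's model argument is the model name or path. Other large language models can be passed as well, but note that vLLM does not support every LLM.
messages defines the content of the system role and of the user role.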
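
To make the role of temperature and top_p more concrete, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. It is an illustration of the general technique, not vLLM's sampler: the function names top_p_filter and sample and the toy logits are hypothetical, and repetition_penalty is left out for brevity.

```python
import math
import random


def top_p_filter(logits, temperature=0.7, top_p=0.8):
    """Return {token: probability} after temperature scaling and top-p filtering."""
    # Temperature scaling, then a numerically stable softmax.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())
    exp = {tok: math.exp(value - max_logit) for tok, value in scaled.items()}
    total = sum(exp.values())
    probs = {tok: value / total for tok, value in exp.items()}

    # Keep the smallest set of most likely tokens whose cumulative mass reaches top_p.
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    kept_total = sum(kept.values())
    return {tok: p / kept_total for tok, p in kept.items()}


def sample(dist, rng):
    """Draw one token from the filtered distribution."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


# Toy next-token logits (made-up values for illustration).
logits = {"city": 2.0, "place": 1.0, "bridge": 0.0, "banana": -3.0}
dist = top_p_filter(logits)
print(dist)  # only "city" and "place" survive with these settings
print(sample(dist, random.Random(0)))
```

A lower temperature sharpens the distribution toward the most likely token, and a lower top_p cuts off more of the unlikely tail before sampling.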

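2.3 OpenAI-compatible API

a. Start the server from the command line

python -m vllm.entrypoints.openai.api_server --model your_model_path --trust-remote-code

The server listens on port 8000 by default; the --host and --port arguments set the host and port.

b. Query Qwen with curl (command line)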

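curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "your_model_path",
        "prompt": "San Francisco is a",
        "use_beam_search": true,
        "n": 4,
        "temperature": 0
    }'

http://localhost:8000/v1/completions is the HTTP address of the server endpoint that the client calls. The OpenAI-compatible server started in step a serves /v1/completions and /v1/chat/completions; the /generate path belongs to the separate demo server vllm.entrypoints.api_server.
-H sets the JSON content type, and -d carries the request parameters, which can be adjusted as needed. "model" must match the value passed to --model when the server was started. use_beam_search is a vLLM-specific extra parameter, and support for it may vary between vLLM versions.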

c. Query Qwen with Python

from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
 
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
 
chat_response = client.chat.completions.create(
    model="Qwen/Qwen1.5-4B-Chat",  # must match the --model value used to start the server
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
    ]
)
print("Chat response:", chat_response)

3. Loading a model fine-tuned with LLaMA Factory

A model fine-tuned with LoRA must first go through merge-lora, which produces a complete checkpoint directory.

python src/export_model.py \
--model_name_or_path llm/baichuan-inc/Baichuan2-7B-Chat \
--template baichuan2 \
--finetuning_type lora \
--adapter_name_or_path /home/checkpoint-6600 \
--export_dir /home/exports/checkpoint-6600
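
To illustrate what merging does to each adapted weight matrix, here is a minimal sketch assuming the standard LoRA formulation W' = W + (alpha / r) * B A. It is not the code of the export script: the helpers matmul and merge_lora and the tiny matrices are hypothetical, written in pure Python so the arithmetic is easy to follow.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(weight, lora_a, lora_b, alpha):
    """Return W + (alpha / r) * (B @ A), the merged weight matrix."""
    rank = len(lora_a)               # A has shape (r, in_features)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)   # (out, r) @ (r, in) -> (out, in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy values: base weight of shape (out=2, in=3) and a rank-1 adapter.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]   # shape (r=1, in=3)
B = [[0.5], [-1.0]]     # shape (out=2, r=1)

merged = merge_lora(W, A, B, alpha=2.0)
print(merged)
```

After merging, the adapter no longer exists as a separate component, so the exported directory looks like an ordinary full checkpoint that an inference engine can load.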

A model trained with full-parameter fine-tuning can be used with vLLM directly for accelerated inference.
