update vllm-docker-quick-start for vllm0.6.2 (#12392)

* update vllm-docker-quick-start for vllm0.6.2

* [UPDATE] rm max-num-seqs parameter in vllm-serving script
Jun Wang 2024-11-27 08:47:03 +08:00 committed by GitHub
parent 7b40f9b372
commit cb7b08948b


@@ -242,15 +242,17 @@ python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--port 8000 \
--model $model \
--trust-remote-code \
--gpu-memory-utilization 0.75 \
--block-size 8 \
--gpu-memory-utilization 0.9 \
--device xpu \
--dtype float16 \
--enforce-eager \
--load-in-low-bit sym_int4 \
--max-model-len 4096 \
--max-num-batched-tokens 10240 \
--max-num-seqs 12 \
--tensor-parallel-size 1
--max-model-len 2048 \
--max-num-batched-tokens 4000 \
--tensor-parallel-size 1 \
--disable-async-output-proc \
--distributed-executor-backend ray
```
Then run the following command to start the vLLM service:
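The launch step itself sits outside this hunk; below is a minimal sketch of invoking the script, assuming it was saved as `/llm/start-vllm-service.sh` inside the container (the filename and path are assumptions, not taken from this diff).

```bash
# Assumed script path; use whatever name the script above was saved under.
bash /llm/start-vllm-service.sh
# Watch the log until the server reports it is listening on port 8000.
```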
@@ -678,8 +680,8 @@ python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--load-in-low-bit fp8 \
--max-model-len 4096 \
--max-num-batched-tokens 10240 \
--max-num-seqs 12 \
--tensor-parallel-size 1 \
--distributed-executor-backend ray \
--enable-lora \
--lora-modules sql-lora=$SQL_LOARA
```
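With `--enable-lora` set, the adapter registered above as `sql-lora` is selected per request through the `model` field of the OpenAI-compatible API; a minimal sketch (port, prompt, and token limit are assumptions):

```bash
# Request a completion from the LoRA adapter registered as "sql-lora".
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sql-lora",
    "prompt": "Write a SQL query that lists all customers:",
    "max_tokens": 64
  }'
```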
@@ -742,16 +744,37 @@ vLLM Serving can be deployed as a server that implements the OpenAI API protocol
```bash
#!/bin/bash
model="/llm/models/Meta-Llama-3.1-8B-Instruct"
served_model_name="llama-3.1-8b"
...
model="/llm/models/Qwen1.5-14B-Chat"
served_model_name="Qwen1.5-14B-Chat"
#export SYCL_CACHE_PERSISTENT=1
export CCL_WORKER_COUNT=4
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0
source /opt/intel/1ccl-wks/setvars.sh
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8000 \
--model $model \
...
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--device xpu \
--dtype float16 \
--enforce-eager \
--load-in-low-bit fp8 \
--max-model-len 2048 \
--max-num-batched-tokens 4000 \
--api-key <your-api-key> \
--tensor-parallel-size 2
--tensor-parallel-size 4 \
--distributed-executor-backend ray
```
2. Send an HTTP request with the `api-key` header to verify that the model has been deployed successfully.
@@ -761,7 +784,7 @@ curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <your-api-key>" \
-d '{
"model": "llama-3.1-8b",
"model": "Qwen1.5-14B-Chat",
"prompt": "San Francisco is a",
"max_tokens": 128
}'
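As an additional check, the OpenAI-compatible chat endpoint can be exercised the same way; a minimal sketch reusing the served model name and API key from above (the prompt is an assumption):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-api-key>" \
  -d '{
    "model": "Qwen1.5-14B-Chat",
    "messages": [{"role": "user", "content": "What is vLLM?"}],
    "max_tokens": 128
  }'
```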
@@ -793,7 +816,6 @@ docker run -itd \
We can set up model serving using `IPEX-LLM` as the backend with FastChat; the following steps give an example of how to deploy a demo using FastChat.
1. **Start the Docker Container**
Run the following command to launch a Docker container with device access:
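The full `docker run` command is collapsed out of this hunk; a minimal sketch, assuming the `intelanalytics/ipex-llm-serving-xpu` image and a standard Intel GPU device mapping (image tag, container name, and mount path are assumptions):

```bash
# Assumed image tag, container name, and host model path; adjust to your setup.
docker run -itd \
  --net=host \
  --device=/dev/dri \
  --name=fastchat-demo \
  -v /path/to/models:/llm/models \
  --shm-size="16g" \
  intelanalytics/ipex-llm-serving-xpu:latest
```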
@@ -819,6 +841,7 @@ We can set up model serving using `IPEX-LLM` as backend using FastChat, the foll
2. **Start the FastChat Service**
Enter the container and start the FastChat service:
```bash
#!/bin/bash