Refactor vLLM Documentation: Centralize Benchmarking and Improve Readability (#13141)

* update vllm doc

* update image name

* update

* update

* update

* update
Shaojun Liu 2025-05-09 10:19:42 +08:00 committed by GitHub
parent f5d9c49a2a
commit 45f7bf6688
3 changed files with 309 additions and 691 deletions


@ -1,293 +1,2 @@
# IPEX-LLM-serving XPU Image: Build and Usage Guide
This document outlines the steps to build and use the `IPEX-LLM-serving-xpu` Docker image, including inference, serving, and benchmarking functionalities for XPU.
---
## 1. Build the Image
To build the `IPEX-LLM-serving-xpu` Docker image, use the following command:
```bash
docker build \
--build-arg http_proxy=.. \
--build-arg https_proxy=.. \
--build-arg no_proxy=.. \
--rm --no-cache -t intelanalytics/ipex-llm-serving-xpu:latest .
```
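If your host already exports proxy variables, you can simply forward them; the `--build-arg` options can be omitted entirely if no proxy is needed. A sketch:
```bash
docker build \
  --build-arg http_proxy=$http_proxy \
  --build-arg https_proxy=$https_proxy \
  --build-arg no_proxy=$no_proxy \
  --rm --no-cache -t intelanalytics/ipex-llm-serving-xpu:latest .
```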
---
## 2. Using the Image for XPU Inference
To map the `XPU` into the container, you need to specify `--device=/dev/dri` when starting the container.
### Example:
```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest
sudo docker run -itd \
--net=host \
--device=/dev/dri \
--privileged \
--memory="32G" \
--name=CONTAINER_NAME \
--shm-size="16g" \
--entrypoint /bin/bash \
$DOCKER_IMAGE
```
Once the container is up and running, use `docker exec` to enter it.
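For example (using the container name chosen above):
```bash
docker exec -it CONTAINER_NAME /bin/bash
```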
To verify if the XPU device is successfully mapped into the container, run the following:
```bash
sycl-ls
```
For a machine with Arc A770, the output will be similar to:
```bash
root@arda-arc12:/# sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
```
For detailed instructions on running inference with `IPEX-LLM` on XPU, refer to this [documentation](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU).
---
## 3. Using the Image for XPU Serving
To run XPU serving, you need to map the XPU into the container by specifying `--device=/dev/dri` when booting the container.
### 3.1 **Start the Container and Automatically Launch the Service**
By default, the container is configured to automatically start the service when it is run. You can also specify the model path, model name, and tensor parallel size using environment variables (MODEL_PATH, SERVED_MODEL_NAME, and TENSOR_PARALLEL_SIZE). This allows the service to start with the specific model and tensor parallel configuration you want to use. Additionally, make sure to mount the model directory into the container using the `-v` option.
### Example:
```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest
sudo docker run -itd \
--net=host \
--device=/dev/dri \
--privileged \
--memory="32G" \
--name=CONTAINER_NAME \
--shm-size="16g" \
-e MODEL_PATH="/llm/models" \
-e SERVED_MODEL_NAME="my_model" \
-e TENSOR_PARALLEL_SIZE=4 \
-v /home/intel/LLM/:/llm/models/ \
$DOCKER_IMAGE
```
- This command will start the container and automatically launch the service with the specified model path (`/llm/models`), model name (`my_model`), and tensor parallel size (`4`).
- The `-e TENSOR_PARALLEL_SIZE=4` option specifies the number of GPUs (or cards) on which the service will run. You can adjust this value based on your parallelism needs.
- The `-v /home/intel/LLM/:/llm/models/` option mounts the model directory from the host (`/home/intel/LLM/`) to the container (`/llm/models/`).
Once the container is running, the service will be launched automatically based on the provided model or the default settings.
#### View Logs:
To view the logs of the container and monitor the service startup, you can use the following command:
```bash
docker logs CONTAINER_NAME
```
This will display the logs generated by the service, allowing you to check if everything is running as expected.
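Once the logs show that the server has started, you can optionally confirm that the API responds (this assumes the service listens on the default port 8000 used by the bundled start script):
```bash
curl http://localhost:8000/v1/models
```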
### 3.2 **Start the Container and Manually Launch the Service**
If you prefer to manually start the service or need to troubleshoot, you can override the entrypoint with `/bin/bash` when starting the container. This allows you to enter the container and run commands interactively. Use the following command:
### Example:
```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest
sudo docker run -itd \
--net=host \
--device=/dev/dri \
--privileged \
--memory="32G" \
--name=CONTAINER_NAME \
--shm-size="16g" \
--entrypoint /bin/bash \
-v /home/intel/LLM/:/llm/models/ \
$DOCKER_IMAGE
```
After running this command, the container starts with `/bin/bash` as its entrypoint. Enter it with `docker exec -it CONTAINER_NAME /bin/bash`, then manually start the service by running:
```bash
bash /llm/start-vllm-service.sh
```
This option provides more control over the container and allows you to start the service at your convenience.
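For example, once the service is up you can send a quick test request; replace `my_model` with the value passed to `--served-model-name` in the start script (a placeholder is shown here):
```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my_model", "prompt": "San Francisco is a", "max_tokens": 64}'
```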
To verify that the device is correctly mapped, run:
```bash
sycl-ls
```
The output will be similar to the example in the inference section above.
Currently, the image ships with multiple serving engines, including **FastChat** and **vLLM**.
### 3.3 Serving Engines
#### 3.3.1 Lightweight Serving Engine
For running lightweight serving on Intel GPUs using `IPEX-LLM` as the backend, refer to the [Lightweight-Serving README](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/Lightweight-Serving).
We have included a script `/llm/start-lightweight_serving-service` in the image. Make sure to install the correct `transformers` version before proceeding, like so:
```bash
pip install transformers==4.37.0
```
#### 3.3.2 Pipeline Parallel Serving Engine
To use the **Pipeline Parallel** serving engine with `IPEX-LLM` as the backend, refer to this [Pipeline-Parallel-FastAPI README](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/Pipeline-Parallel-FastAPI).
A convenience script `/llm/start-pp_serving-service.sh` is included in the image. Be sure to install the required version of `transformers`, like so:
```bash
pip install transformers==4.37.0
```
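After installing the required `transformers` version, the bundled script can be launched directly (adjust any model settings inside the script first):
```bash
bash /llm/start-pp_serving-service.sh
```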
#### 3.3.3 vLLM Serving Engine
For running the **vLLM engine** with `IPEX-LLM` as the backend, refer to this [vLLM Docker Quickstart Guide](https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/DockerGuides/vllm_docker_quickstart.md).
The following example files are available in `/llm/` within the container:
1. `vllm_offline_inference.py`: vLLM offline inference example
2. `benchmark_vllm_throughput.py`: Throughput benchmarking
3. `payload-1024.lua`: Request-per-second test (using requests with ~1024 input tokens and 128 output tokens)
4. `start-vllm-service.sh`: Template for starting the vLLM service
5. `vllm_offline_inference_vision_language.py`: vLLM offline inference for vision-based models
---
## 4. Benchmarking
> [!TIP]
> Before running benchmarks, it's recommended to lock the CPU and GPU frequencies to obtain more stable and reliable performance data.
>
> **Lock CPU Frequency:**
> Use the following command to set the minimum CPU frequency (adjust based on your CPU model):
>
> ```bash
> sudo cpupower frequency-set -d 3.8GHz
> ```
>
> **Lock GPU Frequencies:**
> Use these commands to lock GPU frequencies to 2400MHz:
>
> ```bash
> sudo xpu-smi config -d 0 -t 0 --frequencyrange 2400,2400
> sudo xpu-smi config -d 1 -t 0 --frequencyrange 2400,2400
> sudo xpu-smi config -d 2 -t 0 --frequencyrange 2400,2400
> sudo xpu-smi config -d 3 -t 0 --frequencyrange 2400,2400
> ```
### 4.1 Online Benchmark through API Server
To benchmark the API server and estimate TPS (transactions per second), follow these steps:
1. Start the service as per the instructions in this [section](https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/DockerGuides/vllm_docker_quickstart.md#Serving).
2. Run the benchmark using `vllm_online_benchmark.py`:
```bash
python vllm_online_benchmark.py $model_name $max_seqs $input_length $output_length
```
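For example, with hypothetical values matching the sample output below:
```bash
export model_name="Qwen1.5-14B-Chat"   # must match the served model name
export max_seqs=12                     # number of concurrent requests
python vllm_online_benchmark.py $model_name $max_seqs 1024 512
```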
If `input_length` and `output_length` are not provided, the script defaults to values of 1024 and 512 tokens, respectively. The output will look something like:
```bash
model_name: Qwen1.5-14B-Chat
max_seq: 12
Warm Up: 100%|█████████████████████████████████████████████████████| 24/24 [01:36<00:00, 4.03s/req]
Benchmarking: 100%|████████████████████████████████████████████████| 60/60 [04:03<00:00, 4.05s/req]
Total time for 60 requests with 12 concurrent requests: xxx seconds.
Average response time: xxx
Token throughput: xxx
Average first token latency: xxx milliseconds.
P90 first token latency: xxx milliseconds.
P95 first token latency: xxx milliseconds.
Average next token latency: xxx milliseconds.
P90 next token latency: xxx milliseconds.
P95 next token latency: xxx milliseconds.
```
### 4.2 Online Benchmark with Multimodal Input
After starting the vLLM service, you can benchmark multimodal inputs using `vllm_online_benchmark_multimodal.py`:
```bash
wget -O /llm/models/test.webp https://gd-hbimg.huaban.com/b7764d5f9c19b3e433d54ba66cce6a112050783e8182-Cjds3e_fw1200webp
export image_url="/llm/models/test.webp"
python vllm_online_benchmark_multimodal.py --model-name $model_name --image-url $image_url --port 8000 --max-seq 1 --input-length 512 --output-length 100
```
The `image_url` can be a local path (e.g., `/llm/xxx.jpg`) or an external URL (e.g., `http://xxx.jpg`).
The output will be similar to the example in the API benchmarking section.
### 4.3 Online Benchmark through wrk
In the container, modify the `payload-1024.lua` to ensure the "model" attribute is correct. By default, it uses a prompt of about 1024 tokens.
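For example, a quick (hypothetical) way to update the model name in place with `sed`, assuming the payload body contains a `"model"` field:
```bash
# Replace the served model name inside the wrk payload script (placeholder name shown)
sed -i 's/"model": *"[^"]*"/"model": "my_model"/' /llm/payload-1024.lua
```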
Then, start the benchmark using `wrk`:
```bash
cd /llm
wrk -t12 -c12 -d15m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h
```
### 4.4 Offline Benchmark through `benchmark_vllm_throughput.py`
To use the `benchmark_vllm_throughput.py` script, first download the test dataset:
```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
Then, run the benchmark:
```bash
cd /llm/
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
export MODEL="YOUR_MODEL"
python3 /llm/benchmark_vllm_throughput.py \
--backend vllm \
--dataset /llm/ShareGPT_V3_unfiltered_cleaned_split.json \
--model $MODEL \
--num-prompts 1000 \
--seed 42 \
--trust-remote-code \
--enforce-eager \
--dtype float16 \
--device xpu \
--load-in-low-bit sym_int4 \
--gpu-memory-utilization 0.85
```
---
> 💡 **Tip**: For a detailed and up-to-date guide on running `vLLM` serving with `IPEX-LLM` on Intel GPUs via Docker, please refer to our official documentation:
> [vllm_docker_quickstart](https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/DockerGuides/vllm_docker_quickstart.md)


@ -1,67 +1,93 @@
# vLLM Serving with IPEX-LLM on Intel GPUs via Docker
This guide provides step-by-step instructions for running `vLLM` serving with `IPEX-LLM` on Intel GPUs using Docker.
---
## 1. Install Docker
Follow the instructions in [this guide](./docker_windows_gpu.md#linux) to install Docker on Linux.
---
## 2. Prepare the Docker Image
You can either pull a prebuilt Docker image from DockerHub, depending on your hardware platform:
* **For Intel Arc A770 GPUs**, use:
```bash
docker pull intelanalytics/ipex-llm-serving-xpu:0.8.3-b19
```
* **For Intel Arc BMG GPUs**, use:
```bash
docker pull intelanalytics/multi-arc-serving:0.2.0-b1
```
Or **build the image locally** from source:
```bash
cd docker/llm/serving/xpu/docker
docker build \
  --build-arg http_proxy=... \
  --build-arg https_proxy=... \
  --build-arg no_proxy=... \
  --rm --no-cache -t vllm-serving:test .
```
---
## 3. Start the Docker Container
To enable GPU access, map the device by adding `--device=/dev/dri`. Replace `/path/to/models` with the path to your local model directory.
```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest
export CONTAINER_NAME=multi-arc-container

sudo docker run -itd \
        --net=host \
        --privileged \
        --device=/dev/dri \
        -v /path/to/models:/llm/models \
        -e no_proxy=localhost,127.0.0.1 \
        -e http_proxy=$HTTP_PROXY \
        -e https_proxy=$HTTPS_PROXY \
        --name=$CONTAINER_NAME \
        --shm-size="16g" \
        --entrypoint /bin/bash \
        $DOCKER_IMAGE
```
To enter the running container:
```bash
docker exec -it multi-arc-container /bin/bash
```
### Verify GPU Access
Run `sycl-ls` inside the container to confirm GPU devices are visible. A successful output on a machine with Intel Arc A770 GPUs may look like:
```bash
root@ws-arc-001:/llm# sycl-ls
[level_zero:gpu][level_zero:0] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Arc(TM) A770 Graphics 12.55.8 [1.6.32224.500000]
[level_zero:gpu][level_zero:1] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Arc(TM) A770 Graphics 12.55.8 [1.6.32224.500000]
[level_zero:gpu][level_zero:2] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Arc(TM) A770 Graphics 12.55.8 [1.6.32224.500000]
[level_zero:gpu][level_zero:3] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Arc(TM) A770 Graphics 12.55.8 [1.6.32224.500000]
[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Xeon(R) w5-3435X OpenCL 3.0 (Build 0) [2024.18.12.0.05_160000]
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [24.52.32224.5]
[opencl:gpu][opencl:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [24.52.32224.5]
[opencl:gpu][opencl:3] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [24.52.32224.5]
[opencl:gpu][opencl:4] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [24.52.32224.5]
```
---
## 4. Run vLLM Serving with IPEX-LLM
> [!TIP]
> Before running benchmarks, it's recommended to lock the CPU and GPU frequencies to obtain more stable and reliable performance data.
@ -83,461 +109,334 @@ root@arda-arc12:/# sycl-ls
> sudo xpu-smi config -d 3 -t 0 --frequencyrange 2400,2400
> ```
Before starting the service or running benchmarks, you can refer to this [section](../Quickstart/install_linux_gpu.md#runtime-configurations) to set up the recommended runtime configurations.
### Start the vLLM Service
Set the runtime environment variables and launch the OpenAI-compatible API server:
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0
export SYCL_CACHE_PERSISTENT=1

export FI_PROVIDER=shm
export CCL_WORKER_COUNT=2
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
export CCL_SAME_STREAM=1
export CCL_BLOCKING_WAIT=0

export VLLM_USE_V1=0
export IPEX_LLM_LOWBIT="fp8"

source /opt/intel/1ccl-wks/setvars.sh

numactl -C 0-11 python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --port 8000 \
  --model "/llm/models/Qwen2.5-7B-Instruct/" \
  --served-model-name "Qwen2.5-7B-Instruct" \
  --trust-remote-code \
  --gpu-memory-utilization "0.95" \
  --device xpu \
  --dtype float16 \
  --enforce-eager \
  --load-in-low-bit "fp8" \
  --max-model-len "2000" \
  --max-num-batched-tokens "3000" \
  --max-num-seqs "256" \
  --tensor-parallel-size "2" \
  --pipeline-parallel-size "1" \
  --block-size 8 \
  --distributed-executor-backend ray \
  --disable-async-output-proc
```
#### Parameter Descriptions
|parameters|explanation|
|:---|:---|
|`--model`| the model path in docker, for example `"/llm/models/Llama-2-7b-chat-hf"`|
|`--served-model-name`| the model name, for example `"Llama-2-7b-chat-hf"`|
|`--load-in-low-bit`| Low-bit quantization format. Accepted values: `sym_int4`, `asym_int4`, `fp6`, `fp8`, `fp8_e4m3`, `fp8_e5m2`, `fp16` (`sym_int4` means symmetric int4, `asym_int4` means asymmetric int4, and so on). The corresponding low-bit optimizations will be applied to the model. Default is `fp8`, which is equivalent to `fp8_e5m2`.|
|`--tensor-parallel-size`| number of tensor parallel replicas, default is `1`|
|`--pipeline-parallel-size`| number of pipeline stages, default is `1`|
|`--gpu-memory-utilization`| The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. For example, a value of 0.5 would imply 50% GPU memory utilization. If unspecified, will use the default value of 0.9.|
|`--max-model-len`| Model context length. If unspecified, will be automatically derived from the model config.|
|`--max-num-batched-tokens`| Maximum number of batched tokens per iteration.|
|`--max-num-seqs`| Maximum number of sequences per iteration. Default: 256|
|`--block-size`| vLLM block size. Set to 8 to achieve a performance boost.|
|`--quantization`| If you want to enable low-bit quantization (e.g., AWQ or GPTQ), you can add the `--quantization` argument. The available values include `awq` and `gptq`. Note that `load_in_low_bit` should be set to `asym_int4` instead of `int4`. |
The script above is included in the container as `/llm/start-vllm-service.sh`. Modify its parameters (for example `--model` and `--served-model-name`) to fit your requirements; to serve larger models on multiple GPUs, increase `--tensor-parallel-size` (e.g., `2` for two cards) and, if needed, `--pipeline-parallel-size`. Then start the service with the built-in script:
```bash
bash /llm/start-vllm-service.sh
```
A successful startup will show logs similar to the screenshot:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/start-vllm-service.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/start-vllm-service.png" width=100%; />
</a>
---
## 5. Test the vLLM Service
Send a test completion request using `curl`; set `"model"` to the value of `--served-model-name` used when starting the service:
```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama2-7b-chat",
        "prompt": "San Francisco is a",
        "max_tokens": 128
      }'
```
The expected output should be as follows:
```json
{
"id": "cmpl-0a86629065c3414396358743d7823385",
"object": "text_completion",
"created": 1727273935,
"model": "llama2-7b-chat",
"choices": [
{
"index": 0,
"text": "city that is known for its iconic landmarks, vibrant culture, and diverse neighborhoods. Here are some of the top things to do in San Francisco:. Visit Alcatraz Island: Take a ferry to the infamous former prison and experience the history of Alcatraz Island.2. Explore Golden Gate Park: This sprawling urban park is home to several museums, gardens, and the famous Japanese Tea Garden.3. Walk or Bike the Golden Gate Bridge: Take in the stunning views of the San Francisco Bay and the bridge from various v",
"logprobs": null,
"finish_reason": "length",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 5,
"total_tokens": 133,
"completion_tokens": 128
}
}
```
---
## 6. Benchmarking
If you want to run benchmarks, we recommend using the official vLLM benchmarking script to collect performance data. You can run the following command inside the container:
- `--num_prompt` specifies the number of concurrent requests
- `--random-input-len` sets the number of input tokens
- `--random-output-len` sets the number of output tokens
```bash
python /llm/vllm/benchmarks/benchmark_serving.py \
  --model "/llm/models/Qwen2.5-7B-Instruct" \
  --served-model-name "Qwen2.5-7B-Instruct" \
  --dataset-name random \
  --trust_remote_code \
  --ignore-eos \
  --num_prompt $batch_size \
  --random-input-len=$input_length \
  --random-output-len=$output_length
```
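The command above assumes `$batch_size`, `$input_length`, and `$output_length` are already set; for example (hypothetical values):
```bash
export batch_size=4        # number of concurrent requests
export input_length=1024   # input tokens per request
export output_length=512   # output tokens per request
```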
### Sample Benchmark Output
```bash
============ Serving Benchmark Result ============
Successful requests: 1
Benchmark duration (s): 17.06
Total input tokens: 1024
Total generated tokens: 512
Request throughput (req/s): 0.06
Output token throughput (tok/s): 30.01
Total Token throughput (tok/s): 90.02
---------------Time to First Token----------------
Mean TTFT (ms): 520.10
Median TTFT (ms): 520.10
P99 TTFT (ms): 520.10
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 32.37
Median TPOT (ms): 32.37
P99 TPOT (ms): 32.37
---------------Inter-token Latency----------------
Mean ITL (ms): 64.87
Median ITL (ms): 64.84
P99 ITL (ms): 66.34
==================================================
```
---
## 7. Miscellaneous Tools
### 7.1 Offline Inference and Model Quantization (Optional)
<details>

If real-time services are not required, you can use offline inference for testing or evaluation. In addition, IPEX-LLM supports various model quantization schemes (such as FP8, INT4, AWQ, GPTQ) to reduce memory usage and improve performance; quantizing a model from FP16 to INT4, for example, can shrink its GPU memory footprint by roughly 70%.

Edit the parameters of the `LLM` class in `/llm/vllm_offline_inference.py`, as shown in the examples below:
**Example 1:** Standard Model + IPEX-LLM Low-bit Format
```python
llm = LLM(model="/llm/models/Llama-2-7b-chat-hf",
          device="xpu",
          dtype="float16",
          enforce_eager=True,
          load_in_low_bit="sym_int4",  # Optional values: sym_int4, asym_int4, fp6, fp8, fp16
          tensor_parallel_size=1,
          trust_remote_code=True)
```
**Example 2:** AWQ Model (e.g., Llama-2-7B-Chat-AWQ)

AWQ-quantized models such as [Llama-2-7B-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7B-Chat-AWQ) can be downloaded from Hugging Face.
```python
llm = LLM(model="/llm/models/Llama-2-7B-Chat-AWQ/",
          quantization="AWQ",
          load_in_low_bit="asym_int4",  # Note: use asym_int4 here
          device="xpu",
          dtype="float16",
          enforce_eager=True,
          tensor_parallel_size=1)
```
**Example 3:** GPTQ Model (e.g., llama2-7b-chat-GPTQ)

GPTQ-quantized models are likewise available on Hugging Face (e.g., [Llama-2-13B-chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ)).
```python
llm = LLM(model="/llm/models/llama2-7b-chat-GPTQ/",
          quantization="GPTQ",
          load_in_low_bit="sym_int4",  # GPTQ recommends using sym_int4
          device="xpu",
          dtype="float16",
          enforce_eager=True,
          tensor_parallel_size=1)
```
Run the command:
```bash
python vllm_offline_inference.py
```
You should see output similar to the following:
```plaintext
Prompt: 'The capital of France is', Generated text: ' Paris.'
Prompt: 'The future of AI is', Generated text: ' promising and transformative...'
```
</details>
### 7.2 Benchmarking
<details>
#### 7.2.1 Online Benchmark through API Server
To benchmark the API server and estimate TPS (transactions per second), follow these steps:
1. Start the service as per the instructions in this [section](https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/DockerGuides/vllm_docker_quickstart.md#Serving).
2. Run the benchmark using `vllm_online_benchmark.py`:
```bash
python vllm_online_benchmark.py $model_name $max_seqs $input_length $output_length
```
If `input_length` and `output_length` are not provided, the script defaults to values of 1024 and 512 tokens, respectively. The output will look something like:
```bash
model_name: Qwen1.5-14B-Chat
max_seq: 12
Warm Up: 100%|█████████████████████████████████████████████████████| 24/24 [01:36<00:00, 4.03s/req]
Benchmarking: 100%|████████████████████████████████████████████████| 60/60 [04:03<00:00, 4.05s/req]
Total time for 60 requests with 12 concurrent requests: xxx seconds.
Average response time: xxx
Token throughput: xxx
Average first token latency: xxx milliseconds.
P90 first token latency: xxx milliseconds.
P95 first token latency: xxx milliseconds.
Average next token latency: xxx milliseconds.
P90 next token latency: xxx milliseconds.
P95 next token latency: xxx milliseconds.
```
#### 7.2.2 Online Benchmark with Multimodal Input
After starting the vLLM service, you can benchmark multimodal inputs using `vllm_online_benchmark_multimodal.py`:
```bash
wget -O /llm/models/test.webp https://gd-hbimg.huaban.com/b7764d5f9c19b3e433d54ba66cce6a112050783e8182-Cjds3e_fw1200webp
export image_url="/llm/models/test.webp"
python vllm_online_benchmark_multimodal.py --model-name $model_name --image-url $image_url --port 8000 --max-seq 1 --input-length 512 --output-length 100
```
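The command assumes `$model_name` is set to the served multi-modal model, for example (hypothetical, using the MiniCPM-V-2_6 model mentioned under Advanced Features):
```bash
export model_name="MiniCPM-V-2_6"
```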
The `image_url` can be a local path (e.g., `/llm/xxx.jpg`) or an external URL (e.g., `http://xxx.jpg`).
The output will be similar to the example in the API benchmarking section.
#### 7.2.3 Online Benchmark through wrk
In the container, modify `payload-1024.lua` to ensure the "model" attribute is correct. By default, it uses a prompt of about 1024 tokens.
Then, start the benchmark using `wrk`:
```bash
cd /llm
wrk -t12 -c12 -d15m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h
```
#### 7.2.4 Offline Benchmark through `benchmark_vllm_throughput.py`
To use the `benchmark_vllm_throughput.py` script, first download the test dataset:
```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
Then, run the benchmark:
```bash
cd /llm/
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

export MODEL="YOUR_MODEL"

python3 /llm/benchmark_vllm_throughput.py \
  --backend vllm \
  --dataset /llm/ShareGPT_V3_unfiltered_cleaned_split.json \
  --model $MODEL \
  --num-prompts 1000 \
  --seed 42 \
  --trust-remote-code \
  --enforce-eager \
  --dtype float16 \
  --device xpu \
  --load-in-low-bit sym_int4 \
  --gpu-memory-utilization 0.85
```
</details>
### 7.3 Automatically Launching the Service via Container
<details>
You can configure the container to automatically start the service with the desired model and parallel settings by using environment variables:
```bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest
sudo docker run -itd \
--net=host \
--device=/dev/dri \
--privileged \
--memory="32G" \
--name=CONTAINER_NAME \
--shm-size="16g" \
-e MODEL_PATH="/llm/models" \
-e SERVED_MODEL_NAME="my_model" \
-e TENSOR_PARALLEL_SIZE=4 \
-v /home/intel/LLM/:/llm/models/ \
$DOCKER_IMAGE
```
* `MODEL_PATH`, `SERVED_MODEL_NAME`, and `TENSOR_PARALLEL_SIZE` control the model used and degree of parallelism.
* Mount the model directory using `-v`.
To view logs and confirm service status:
```bash
docker logs CONTAINER_NAME
```
</details>
## 8. Advanced Features
#### Multi-modal Model
<details>
vLLM serving with IPEX-LLM supports multi-modal models, such as [MiniCPM-V-2_6](https://huggingface.co/openbmb/MiniCPM-V-2_6), which can accept image and text input at the same time and respond.
1. Start MiniCPM service: change the `model` and `served_model_name` value in `/llm/start-vllm-service.sh`
@ -575,9 +474,10 @@ curl http://localhost:8000/v1/chat/completions \
```bash
{"id":"chat-0c8ea64a2f8e42d9a8f352c160972455","object":"chat.completion","created":1728373105,"model":"MiniCPM-V-2_6","choices":[{"index":0,"message":{"role":"assistant","content":"这幅图片展示了一个小孩,可能是女孩,根据服装和发型来判断。她穿着一件有红色和白色条纹的连衣裙,一个可见的白色蝴蝶结,以及一个白色的 头饰,上面有红色的点缀。孩子右手拿着一个白色泰迪熊,泰迪熊穿着一个粉色的裙子,带有褶边,它的左脸颊上有一个红色的心形图案。背景模糊,但显示出一个自然户外的环境,可能是一个花园或庭院,有红花和石头墙。阳光照亮了整个场景,暗示这可能是正午或下午。整体氛围是欢乐和天真。","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":225,"total_tokens":353,"completion_tokens":128}}
```
</details>
#### Prefix Caching
<details>
Automatic Prefix Caching (APC for short) caches the KV cache of existing queries, so that a new query can directly reuse the KV cache if it shares the same prefix with one of the existing queries, allowing the new query to skip the computation of the shared part.
1. Set `enable_prefix_caching=True` in vLLM engine to enable APC. Here is an example python script to show the time reduce of APC:
@ -677,9 +577,10 @@ Processed prompts: 100%|██████████████████
Output: 30.
Generation time: 0.9657604694366455 seconds.
```
</details>
#### LoRA Adapter
<details>
This section shows how to use LoRA adapters with vLLM on top of a base model. Adapters can be efficiently served on a per-request basis with minimal overhead.
1. Download the adapter(s) and save them locally first, for example, for `llama-2-7b`:
@ -759,9 +660,10 @@ python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--lora-modules sql-lora-1=$SQL_LOARA_1 sql-lora-2=$SQL_LOARA_2
```
</details>
#### OpenAI API Backend
<details>
vLLM Serving can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as backend for web applications such as [open-webui](https://github.com/open-webui/open-webui/) using OpenAI API.
1. Start vLLM serving with an `api-key` by setting any string as the `api-key` in `start-vllm-service.sh`, then run it.
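For example (a sketch; `your-api-key-string` is a placeholder), clients then pass the same string as a bearer token:
```bash
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer your-api-key-string"
```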
@ -918,6 +820,7 @@ We can set up model serving using `IPEX-LLM` as backend using FastChat, the foll
```
This quick setup allows you to deploy FastChat with IPEX-LLM efficiently.
</details>
### Validated Models List


@ -1,3 +1,9 @@
> 💡 **Tip**:
> For a detailed and up-to-date guide on running `vLLM` serving with `IPEX-LLM` on Intel GPUs **via Docker**, please refer to our official documentation:
> [vllm_docker_quickstart.md](https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/DockerGuides/vllm_docker_quickstart.md)
>
> If you prefer to run **without Docker**, you can refer to this guide:
# vLLM continuous batching on Intel GPUs
This example demonstrates how to serve a LLaMA2-7B model using vLLM continuous batching on Intel GPUs (with IPEX-LLM low-bit optimizations).