IPEX-LLM-serving XPU Image: Build and Usage Guide

This document outlines the steps to build and use the IPEX-LLM-serving-xpu Docker image, including inference, serving, and benchmarking functionalities for XPU.

1. Build the Image

To build the IPEX-LLM-serving-xpu Docker image, use the following command:

docker build \
  --build-arg http_proxy=.. \
  --build-arg https_proxy=.. \
  --build-arg no_proxy=.. \
  --rm --no-cache -t intelanalytics/ipex-llm-serving-xpu:2.2.0-SNAPSHOT .
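Once the build completes, you can confirm the image is available locally:

docker images | grep ipex-llm-serving-xpu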

2. Using the Image for XPU Inference

To map the XPU into the container, you need to specify --device=/dev/dri when starting the container.

Example:

#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:2.2.0-SNAPSHOT

sudo docker run -itd \
        --net=host \
        --device=/dev/dri \
        --memory="32G" \
        --name=CONTAINER_NAME \
        --shm-size="16g" \
        $DOCKER_IMAGE

Once the container is up and running, use docker exec to enter it.
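For example, replacing CONTAINER_NAME with the name you chose above:

docker exec -it CONTAINER_NAME bash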

To verify if the XPU device is successfully mapped into the container, run the following:

sycl-ls

For a machine with Arc A770, the output will be similar to:

root@arda-arc12:/# sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]

For detailed instructions on running inference with IPEX-LLM on XPU, refer to the IPEX-LLM documentation.


3. Using the Image for XPU Serving

To run XPU serving, you need to map the XPU into the container by specifying --device=/dev/dri when starting the container.

Example:

#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:2.2.0-SNAPSHOT

sudo docker run -itd \
        --net=host \
        --device=/dev/dri \
        --name=CONTAINER_NAME \
        --shm-size="16g" \
        $DOCKER_IMAGE

After the container starts, access it using docker exec.

To verify that the device is correctly mapped, run:

sycl-ls

The output will be similar to the example in the inference section above.

Currently, the image supports three serving engines: Lightweight-Serving, Pipeline-Parallel serving, and vLLM.

Serving Engines

3.1 Lightweight Serving Engine

For running lightweight serving on Intel GPUs using IPEX-LLM as the backend, refer to the Lightweight-Serving README.

We have included a script, /llm/start-lightweight_serving-service.sh, in the image. Make sure to install the correct transformers version before proceeding, like so:

pip install transformers==4.37.0
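With the dependency installed, the service can be launched from the bundled script (a minimal sketch; we assume settings such as the model path are configured inside the script itself, so review it before starting):

# Review the script first: serving settings are assumed to live inside it.
bash /llm/start-lightweight_serving-service.sh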

3.2 Pipeline Parallel Serving Engine

To use the Pipeline Parallel serving engine with IPEX-LLM as the backend, refer to this Pipeline-Parallel-FastAPI README.

A convenience script /llm/start-pp_serving-service.sh is included in the image. Be sure to install the required version of transformers, like so:

pip install transformers==4.37.0
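Then launch the pipeline-parallel service from its script (likewise a sketch; values such as the model path and the number of GPUs are assumed to be variables inside the script, so edit them first):

# Edit the script's variables (model path and GPU count are assumptions) before running.
bash /llm/start-pp_serving-service.sh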

3.3 vLLM Serving Engine

For running the vLLM engine with IPEX-LLM as the backend, refer to this vLLM Docker Quickstart Guide.

The following example files are available in /llm/ within the container:

  1. vllm_offline_inference.py: vLLM offline inference example
  2. benchmark_vllm_throughput.py: Throughput benchmarking
  3. payload-1024.lua: wrk payload for requests-per-second testing (1024-token input, 128-token output requests)
  4. start-vllm-service.sh: Template for starting the vLLM service (see the example after this list)
  5. vllm_offline_inference_vision_language.py: vLLM offline inference for vision-based models
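For instance, to bring up the vLLM service from the bundled template (a sketch: the model path, served model name, and other settings are defined as variables inside start-vllm-service.sh, so edit them to match your setup before running):

cd /llm
# Edit the template first; it defines the model, port, and related options.
bash start-vllm-service.sh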

4. Benchmarking

4.1 Online Benchmark through API Server

To benchmark the API server and estimate TPS (transactions per second), follow these steps:

  1. Start the service as described in Section 3.3 above.
  2. Run the benchmark using vllm_online_benchmark.py:
python vllm_online_benchmark.py $model_name $max_seqs $input_length $output_length
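For example, a run matching the sample output below (model served as Qwen1.5-14B-Chat, 12 concurrent requests, the default 1024 input and 512 output tokens) would be:

python vllm_online_benchmark.py Qwen1.5-14B-Chat 12 1024 512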

If input_length and output_length are not provided, the script defaults to values of 1024 and 512 tokens, respectively. The output will look something like:

model_name: Qwen1.5-14B-Chat
max_seq: 12
Warm Up: 100%|█████████████████████████████████████████████████████| 24/24 [01:36<00:00,  4.03s/req]
Benchmarking: 100%|████████████████████████████████████████████████| 60/60 [04:03<00:00,  4.05s/req]
Total time for 60 requests with 12 concurrent requests: xxx seconds.
Average response time: xxx
Token throughput: xxx

Average first token latency: xxx milliseconds.
P90 first token latency: xxx milliseconds.
P95 first token latency: xxx milliseconds.

Average next token latency: xxx milliseconds.
P90 next token latency: xxx milliseconds.
P95 next token latency: xxx milliseconds.

4.2 Online Benchmark with Multimodal Input

After starting the vLLM service, you can benchmark multimodal inputs using vllm_online_benchmark_multimodal.py:

export image_url="http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg"
python vllm_online_benchmark_multimodal.py --model-name $model_name --image-url $image_url --prompt "What is in the image?" --port 8000

The image_url can be a local path (e.g., /llm/xxx.jpg) or an external URL (e.g., "http://xxx.jpg").

The output will be similar to the example in the API benchmarking section.

4.3 Online Benchmark through wrk

In the container, modify payload-1024.lua so that the "model" attribute matches the model your service is running. By default, the payload uses a prompt of about 1024 tokens.
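For example, you can check and update the model name in place (a sketch assuming the payload embeds a JSON "model" field; inspect the file first, as its exact layout may differ, and YOUR_MODEL is a placeholder):

cd /llm
# Show the current model field; the JSON layout below is an assumption.
grep '"model"' payload-1024.lua
sed -i 's/"model": "[^"]*"/"model": "YOUR_MODEL"/' payload-1024.lua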

Then, start the benchmark using wrk:

cd /llm
wrk -t12 -c12 -d15m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h

4.4 Offline Benchmark through benchmark_vllm_throughput.py

To use the benchmark_vllm_throughput.py script, first download the test dataset:

cd /llm
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

Then, run the benchmark:
export MODEL="YOUR_MODEL"

python3 /llm/benchmark_vllm_throughput.py \
    --backend vllm \
    --dataset /llm/ShareGPT_V3_unfiltered_cleaned_split.json \
    --model $MODEL \
    --num-prompts 1000 \
    --seed 42 \
    --trust-remote-code \
    --enforce-eager \
    --dtype float16 \
    --device xpu \
    --load-in-low-bit sym_int4 \
    --gpu-memory-utilization 0.85