# vLLM Serving with IPEX-LLM on Intel GPUs via Docker

This guide demonstrates how to run `vLLM` serving with `IPEX-LLM` on Intel GPUs via Docker.

## Install docker

Follow the instructions in this [guide](./docker_windows_gpu.md#linux) to install Docker on Linux.

## Pull the latest image

*Note: For running vLLM serving on Intel GPUs, you can currently use either the `intelanalytics/ipex-llm-serving-xpu:latest` or `intelanalytics/ipex-llm-serving-vllm-xpu:latest` Docker image.*

```bash
# This image will be updated every day
docker pull intelanalytics/ipex-llm-serving-xpu:latest
```

## Start Docker Container

To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container. Change `/path/to/models` to the host directory that contains your models so that it is mounted into the container.

```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest
export CONTAINER_NAME=ipex-llm-serving-xpu-container
sudo docker run -itd \
    --privileged \
    --net=host \
    --device=/dev/dri \
    -v /path/to/models:/llm/models \
    -e no_proxy=localhost,127.0.0.1 \
    --memory="32G" \
    --name=$CONTAINER_NAME \
    --shm-size="16g" \
    $DOCKER_IMAGE
```

After the container is booted, you can get into it through `docker exec`.

```bash
docker exec -it ipex-llm-serving-xpu-container /bin/bash
```

To verify that the device is successfully mapped into the container, run `sycl-ls` to check the result. On a machine with Arc A770 GPUs, the sample output is:

```bash
root@arda-arc12:/# sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2024.17.5.0.08_160000.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Xeon(R) w5-3435X OpenCL 3.0 (Build 0) [2024.17.5.0.08_160000.xmain-hotfix]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [23.35.27191.9]
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [23.35.27191.9]
[opencl:gpu:4] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [23.35.27191.9]
[opencl:gpu:5] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [23.35.27191.9]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.27191]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.27191]
[ext_oneapi_level_zero:gpu:2] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.27191]
[ext_oneapi_level_zero:gpu:3] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.27191]
```

## Running vLLM serving with IPEX-LLM on Intel GPU in Docker

We have included multiple vLLM-related files in `/llm/`:

1. `vllm_offline_inference.py`: used for the vLLM offline inference example (see the sketch after this list).

   1. Modify the following parameters in the `LLM` class (line 48):

      |parameters|explanation|
      |:---|:---|
      |`model="YOUR_MODEL"`| the model path in docker, for example `"/llm/models/Llama-2-7b-chat-hf"`|
      |`load_in_low_bit="fp8"`| model quantization accuracy; acceptable values are `'sym_int4'`, `'asym_int4'`, `'fp6'`, `'fp8'`, `'fp8_e4m3'`, `'fp8_e5m2'` and `'fp16'`. `'sym_int4'` means symmetric int4, `'asym_int4'` means asymmetric int4, etc. The relevant low-bit optimizations will be applied to the model. The default is `'fp8'`, which is the same as `'fp8_e5m2'`.|
      |`tensor_parallel_size=1`| number of tensor parallel replicas, default is `1`|
      |`pipeline_parallel_size=1`| number of pipeline stages, default is `1`|

   2. Run the python script:

      ```bash
      python vllm_offline_inference.py
      ```

   3. The expected output should be as follows:

      ```bash
      INFO 09-25 21:37:31 gpu_executor.py:108] # GPU blocks: 747, # CPU blocks: 512
      Processed prompts: 100%|█| 4/4 [00:22<00:00, 5.59s/it, est. speed input: 1.21 toks/s, output: 2.86 toks
      Prompt: 'Hello, my name is', Generated text: ' [Your Name], and I am a member of the [Your Group Name].'
      Prompt: 'The president of the United States is', Generated text: ' the head of the executive branch and the highest-ranking official in the federal'
      Prompt: 'The capital of France is', Generated text: " Paris. It is the country's largest city and is known for its icon"
      Prompt: 'The future of AI is', Generated text: ' vast and complex, with many different areas of research and application. Here are some'
      ```

2. `benchmark_vllm_throughput.py`: used for benchmarking throughput
3. `payload-1024.lua`: used for testing requests per second with 1k-128 requests
4. `start-vllm-service.sh`: used as a template for starting the vLLM service
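For reference, `vllm_offline_inference.py` essentially constructs the `LLM` object with the parameters listed above and calls `generate` on a few sample prompts. The snippet below is a minimal sketch of that flow rather than the exact bundled script; the import path of the IPEX-LLM-patched `LLM` class and the sampling settings are assumptions, and the model path is only an example.

```python
from vllm import SamplingParams
# Import path assumed to match the bundled script; adjust it if your image differs.
from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM

# Sample prompts to batch together.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# These arguments mirror the parameter table above.
llm = LLM(
    model="/llm/models/Llama-2-7b-chat-hf",  # example path under the mounted /llm/models
    device="xpu",
    dtype="float16",
    enforce_eager=True,
    load_in_low_bit="fp8",        # e.g. 'sym_int4', 'fp6', 'fp8', 'fp16'
    tensor_parallel_size=1,
    pipeline_parallel_size=1,
    trust_remote_code=True,
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
```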
Before performing benchmarking or starting the service, you can refer to this [section](../Quickstart/install_linux_gpu.md#runtime-configurations) to set up our recommended runtime configurations.

### Serving

A script named `/llm/start-vllm-service.sh` has been included in the image for starting the service conveniently. You can tune the service using the following arguments:

|parameters|explanation|
|:---|:---|
|`--gpu-memory-utilization`| The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. For example, a value of 0.5 would imply 50% GPU memory utilization. If unspecified, will use the default value of 0.9.|
|`--max-model-len`| Model context length. If unspecified, will be automatically derived from the model config.|
|`--max-num-batched-tokens`| Maximum number of batched tokens per iteration.|
|`--max-num-seqs`| Maximum number of sequences per iteration. Default: 256|
|`--block-size`| vLLM block size. Set to 8 to achieve a performance boost.|

#### Single card serving

Here are the steps to serve on a single card.

1. Modify `model` and `served_model_name` in the script so that they fit your requirement. The `served_model_name` indicates the model name used in the API, for example:

   ```bash
   model="/llm/models/Llama-2-7b-chat-hf"
   served_model_name="llama2-7b-chat"
   ```

2. Start the service using `bash /llm/start-vllm-service.sh`. If the service boots successfully, the startup log ends with a line such as `INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)`.

3. Use the following curl command to test the server (a Python equivalent is sketched after these steps):

   ```bash
   curl http://localhost:8000/v1/completions \
   -H "Content-Type: application/json" \
   -d '{"model": "llama2-7b-chat",
        "prompt": "San Francisco is a",
        "max_tokens": 128
   }'
   ```

   The expected output should be as follows:

   ```json
   {
     "id": "cmpl-0a86629065c3414396358743d7823385",
     "object": "text_completion",
     "created": 1727273935,
     "model": "llama2-7b-chat",
     "choices": [
       {
         "index": 0,
         "text": "city that is known for its iconic landmarks, vibrant culture, and diverse neighborhoods. Here are some of the top things to do in San Francisco:. Visit Alcatraz Island: Take a ferry to the infamous former prison and experience the history of Alcatraz Island.2. Explore Golden Gate Park: This sprawling urban park is home to several museums, gardens, and the famous Japanese Tea Garden.3. Walk or Bike the Golden Gate Bridge: Take in the stunning views of the San Francisco Bay and the bridge from various v",
         "logprobs": null,
         "finish_reason": "length",
         "stop_reason": null
       }
     ],
     "usage": {
       "prompt_tokens": 5,
       "total_tokens": 133,
       "completion_tokens": 128
     }
   }
   ```
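Because the endpoint is OpenAI-compatible, the same request can also be issued from Python. The snippet below is a minimal sketch that assumes the `openai` Python package is installed on the client side and that the service was started with `served_model_name="llama2-7b-chat"` as above.

```python
from openai import OpenAI

# No real key is needed unless the server was started with an API key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

completion = client.completions.create(
    model="llama2-7b-chat",
    prompt="San Francisco is a",
    max_tokens=128,
)
print(completion.choices[0].text)
```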
#### Multi-card serving

For larger models (greater than 10b), we need to use multiple graphics cards for deployment. In the above script (`/llm/start-vllm-service.sh`), we need to make some modifications to achieve multi-card serving.

1. **Tensor Parallel Serving**: modify the `--tensor-parallel-size` value. For example, to use 2 cards for TP serving, add the following parameter:

   ```bash
   --tensor-parallel-size 2
   ```

   or its short form:

   ```bash
   -tp 2
   ```

2. **Pipeline Parallel Serving**: modify the `--pipeline-parallel-size` value. For example, to use 2 cards for PP serving, add the following parameter:

   ```bash
   --pipeline-parallel-size 2
   ```

   or its short form:

   ```bash
   -pp 2
   ```

3. **TP+PP Serving**: tensor parallelism and pipeline parallelism can be mixed. For example, if you have 4 GPUs on 2 nodes (2 GPUs per node), you can set the tensor parallel size to 2 and the pipeline parallel size to 2:

   ```bash
   --pipeline-parallel-size 2 \
   --tensor-parallel-size 2
   ```

   or its short form:

   ```bash
   -pp 2 \
   -tp 2
   ```

### Quantization

Quantizing a model from FP16 to INT4 can effectively reduce the model size loaded into GPU memory by about 70%. The main benefits are lower latency and lower memory usage.

#### IPEX-LLM

Two scripts are provided in the docker image for model inference.

1. vLLM offline inference: `vllm_offline_inference.py`

   > Only the `load_in_low_bit` value needs to be changed to use a different quantization dtype. Commonly supported dtypes include `sym_int4`, `fp6`, `fp8` and `fp16`; for the full list of supported dtypes, refer to [load_in_low_bit](./vllm_docker_quickstart.md#running-vllm-serving-with-ipex-llm-on-intel-gpu-in-docker) in the `LLM` class parameter table.

   ```python
   llm = LLM(model="YOUR_MODEL",
             device="xpu",
             dtype="float16",
             enforce_eager=True,
             # Simply change here for the desired load_in_low_bit value
             load_in_low_bit="sym_int4",
             tensor_parallel_size=1,
             trust_remote_code=True)
   ```

   then run

   ```bash
   python vllm_offline_inference.py
   ```

2. vLLM online service: `start-vllm-service.sh`

   > To fully utilize the continuous batching feature of vLLM, you can send requests to the service using curl or other similar methods. Requests sent to the engine are batched at the token level: queries are executed in the same forward step of the LLM and removed when they finish, instead of waiting for all sequences to finish (see the concurrency sketch after this list).

   Modify the `--load-in-low-bit` value to `fp6`, `fp8`, `fp8_e4m3` or `fp16`:

   ```bash
   # Change the --load-in-low-bit value to [fp6, fp8, fp8_e4m3, fp16] to use different low-bit formats
   python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
     --served-model-name $served_model_name \
     --port 8000 \
     --model $model \
     --trust-remote-code \
     --gpu-memory-utilization 0.75 \
     --device xpu \
     --dtype float16 \
     --enforce-eager \
     --load-in-low-bit sym_int4 \
     --max-model-len 4096 \
     --max-num-batched-tokens 10240 \
     --max-num-seqs 12 \
     --tensor-parallel-size 1
   ```

   then run the following command to start the vLLM service:

   ```bash
   bash start-vllm-service.sh
   ```

   Lastly, use a curl command to send a request to the service; for example, `Qwen1.5-7B-Chat` served with the low-bit format `sym_int4` produces output in the same style as the single-card serving example above.
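To observe the continuous batching behaviour described above, you can send several requests concurrently; the engine interleaves them at the token level instead of serving them one at a time. The snippet below is a minimal sketch that assumes the `requests` package, a service listening on port 8000, and `Qwen1.5-7B-Chat` as the served model name; adjust the URL and model name to match your deployment.

```python
import concurrent.futures

import requests

URL = "http://localhost:8000/v1/completions"
MODEL = "Qwen1.5-7B-Chat"  # replace with your served_model_name

prompts = [
    "San Francisco is a",
    "The capital of France is",
    "The future of AI is",
    "Hello, my name is",
]

def complete(prompt: str) -> str:
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": 64}
    resp = requests.post(URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

# Sending the prompts concurrently lets vLLM batch them into the same forward steps.
with concurrent.futures.ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for prompt, text in zip(prompts, pool.map(complete, prompts)):
        print(f"{prompt!r} -> {text!r}")
```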
#### AWQ

Use AWQ as a way to reduce memory footprint. First, download an AWQ-quantized model, taking `Llama-2-7B-Chat-AWQ` as an example.

1. Offline inference usage with `/llm/vllm_offline_inference.py`

   1. Change the `LLM` class code block's parameters `model`, `quantization` and `load_in_low_bit` in `/llm/vllm_offline_inference.py`; note that `load_in_low_bit` should be set to `asym_int4` instead of `int4`:

      ```python
      llm = LLM(model="/llm/models/Llama-2-7B-chat-AWQ/",
                quantization="AWQ",
                load_in_low_bit="asym_int4",
                device="xpu",
                dtype="float16",
                enforce_eager=True,
                tensor_parallel_size=1)
      ```

      then run the following command:

      ```bash
      python vllm_offline_inference.py
      ```

   2. The expected result is as follows:

      ```bash
      2024-09-29 10:06:34,272 - INFO - Converting the current model to asym_int4 format......
      2024-09-29 10:06:34,272 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
      2024-09-29 10:06:40,080 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
      2024-09-29 10:06:41,258 - INFO - Loading model weights took 3.7381 GB
      WARNING 09-29 10:06:47 utils.py:564] Pin memory is not supported on XPU.
      INFO 09-29 10:06:47 gpu_executor.py:108] # GPU blocks: 1095, # CPU blocks: 512
      Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:22<00:00, 5.67s/it, est. speed input: 1.19 toks/s, output: 2.82 toks/s]
      Prompt: 'Hello, my name is', Generated text: ' [Your Name], and I am a resident of [Your City/Town'
      Prompt: 'The president of the United States is', Generated text: ' the head of the executive branch and is one of the most powerful political figures in'
      Prompt: 'The capital of France is', Generated text: ' Paris. It is the most populous urban agglomeration in the European'
      Prompt: 'The future of AI is', Generated text: ' vast and exciting, with many potential applications across various industries. Here are'
      ```

2. Online serving usage with `/llm/start-vllm-service.sh`

   1. In `/llm/start-vllm-service.sh`, set the `model` parameter to the AWQ model path and set `served_model_name`. Add `quantization` and `load_in_low_bit`; note that `load_in_low_bit` should be set to `asym_int4` instead of `int4`:

      ```bash
      #!/bin/bash
      model="/llm/models/Llama-2-7B-Chat-AWQ/"
      served_model_name="llama2-7b-awq"
      ...
      python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
        --served-model-name $served_model_name \
        --model $model \
        ...
        --quantization awq \
        --load-in-low-bit asym_int4 \
        ...
      ```

   2. Use `bash start-vllm-service.sh` to start the AWQ model online serving. A successful startup log looks like the following (a Python readiness check based on these routes is sketched after this list):

      ```bash
      2024-10-18 01:50:24,124 - INFO - Converting the current model to asym_int4 format......
      2024-10-18 01:50:24,124 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
      2024-10-18 01:50:29,812 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
      2024-10-18 01:50:30,880 - INFO - Loading model weights took 3.7381 GB
      WARNING 10-18 01:50:39 utils.py:564] Pin memory is not supported on XPU.
      INFO 10-18 01:50:39 gpu_executor.py:108] # GPU blocks: 2254, # CPU blocks: 1024
      WARNING 10-18 01:50:39 serving_embedding.py:171] embedding_mode is False. Embedding API will not work.
      INFO 10-18 01:50:39 launcher.py:14] Available routes are:
      INFO 10-18 01:50:39 launcher.py:22] Route: /openapi.json, Methods: HEAD, GET
      INFO 10-18 01:50:39 launcher.py:22] Route: /docs, Methods: HEAD, GET
      INFO 10-18 01:50:39 launcher.py:22] Route: /docs/oauth2-redirect, Methods: HEAD, GET
      INFO 10-18 01:50:39 launcher.py:22] Route: /redoc, Methods: HEAD, GET
      INFO 10-18 01:50:39 launcher.py:22] Route: /health, Methods: GET
      INFO 10-18 01:50:39 launcher.py:22] Route: /tokenize, Methods: POST
      INFO 10-18 01:50:39 launcher.py:22] Route: /detokenize, Methods: POST
      INFO 10-18 01:50:39 launcher.py:22] Route: /v1/models, Methods: GET
      INFO 10-18 01:50:39 launcher.py:22] Route: /version, Methods: GET
      INFO 10-18 01:50:39 launcher.py:22] Route: /v1/chat/completions, Methods: POST
      INFO 10-18 01:50:39 launcher.py:22] Route: /v1/completions, Methods: POST
      INFO 10-18 01:50:39 launcher.py:22] Route: /v1/embeddings, Methods: POST
      INFO:     Started server process [995]
      INFO:     Waiting for application startup.
      INFO:     Application startup complete.
      INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
      ```

   3. Inside the docker container, send a request to verify the serving status:

      ```bash
      curl http://localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "llama2-7b-awq",
           "prompt": "San Francisco is a",
           "max_tokens": 128
      }'
      ```

      and you should get the following output:

      ```json
      {
        "id": "cmpl-992e4c8463d24d0ab2e59e706123ef0d",
        "object": "text_completion",
        "created": 1729187735,
        "model": "llama2-7b-awq",
        "choices": [
          {
            "index": 0,
            "text": " food lover's paradise with a diverse array of culinary options to suit any taste and budget. Here are some of the top attractions when it comes to food and drink in San Francisco:\n\n1. Fisherman's Wharf: This bustling waterfront district is known for its fresh seafood, street performers, and souvenir shops. Be sure to try some of the local specialties like Dungeness crab, abalone, or sourdough bread.\n\n2. Chinatown: San Francisco's Chinatown is one of the largest and oldest",
            "logprobs": null,
            "finish_reason": "length",
            "stop_reason": null
          }
        ],
        "usage": {
          "prompt_tokens": 5,
          "total_tokens": 133,
          "completion_tokens": 128
        }
      }
      ```
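The startup log above lists the HTTP routes exposed by the server (for example `/health` and `/v1/models`). Instead of waiting a fixed amount of time before sending requests, you can poll these routes to detect when the service is ready. The snippet below is a minimal sketch assuming the `requests` package, the default port 8000, and the standard OpenAI-compatible response schema for `/v1/models`.

```python
import time

import requests

BASE_URL = "http://localhost:8000"

# Poll /health until the server reports ready (give up after roughly two minutes).
for _ in range(60):
    try:
        if requests.get(f"{BASE_URL}/health", timeout=5).status_code == 200:
            break
    except requests.exceptions.ConnectionError:
        pass  # server not up yet
    time.sleep(2)
else:
    raise RuntimeError("vLLM service did not become healthy in time")

# List the models registered under served_model_name.
models = requests.get(f"{BASE_URL}/v1/models", timeout=5).json()
print([m["id"] for m in models["data"]])
```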
#### GPTQ

Use GPTQ as a way to reduce memory footprint. First, download a GPTQ-quantized model, taking `Llama-2-13B-Chat-GPTQ` as an example.

1. Offline inference usage with `/llm/vllm_offline_inference.py`

   1. Change the `LLM` class code block's parameters `model`, `quantization` and `load_in_low_bit` in `/llm/vllm_offline_inference.py`; note that `load_in_low_bit` should be set to `asym_int4` instead of `int4`:

      ```python
      llm = LLM(model="/llm/models/Llama-2-7B-Chat-GPTQ/",
                quantization="GPTQ",
                load_in_low_bit="asym_int4",
                device="xpu",
                dtype="float16",
                enforce_eager=True,
                tensor_parallel_size=1)
      ```

      then run the following command:

      ```bash
      python vllm_offline_inference.py
      ```

   2. The expected result is as follows:

      ```bash
      2024-10-08 10:55:18,296 - INFO - Converting the current model to asym_int4 format......
      2024-10-08 10:55:18,296 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
      2024-10-08 10:55:23,478 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
      2024-10-08 10:55:24,581 - INFO - Loading model weights took 3.7381 GB
      WARNING 10-08 10:55:31 utils.py:564] Pin memory is not supported on XPU.
      INFO 10-08 10:55:31 gpu_executor.py:108] # GPU blocks: 1095, # CPU blocks: 512
      Processed prompts:   0%|          | 0/4 [00:00
      ```
#### Serving with Open WebUI

1. Start the vLLM service with an API key specified. The launch command follows the same pattern as `start-vllm-service.sh`, for example:

   ```bash
   python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
     --served-model-name llama-3.1-8b \
     --api-key <your-api-key> \
     ... \
     --tensor-parallel-size 2
   ```

2. Send an http request with the `api-key` header to verify that the model has been deployed successfully:

   ```bash
   curl http://localhost:8000/v1/completions \
   -H "Content-Type: application/json" \
   -H "Authorization: Bearer <your-api-key>" \
   -d '{
     "model": "llama-3.1-8b",
     "prompt": "San Francisco is a",
     "max_tokens": 128
   }'
   ```

3. Start open-webui serving with the following script. Note that `OPENAI_API_KEY` must be consistent with the backend value. The `<host-ip>` in `OPENAI_API_BASE_URL` is the IPv4 address of the host that runs the docker containers. For relevant details, please refer to the official open-webui [documentation](https://docs.openwebui.com/#installation-for-openai-api-usage-only).

   ```bash
   #!/bin/bash
   export DOCKER_IMAGE=ghcr.io/open-webui/open-webui:main
   export CONTAINER_NAME=<your-open-webui-container-name>

   docker rm -f $CONTAINER_NAME

   docker run -itd \
     -p 3000:8080 \
     -e OPENAI_API_KEY=<your-api-key> \
     -e OPENAI_API_BASE_URL=http://<host-ip>:8000/v1 \
     -v open-webui:/app/backend/data \
     --name $CONTAINER_NAME \
     --restart always $DOCKER_IMAGE
   ```

   Start this container on the host, and make sure it can reach the vLLM backend serving endpoint.

4. After installation, you can access Open WebUI at `http://localhost:3000`. Enjoy! 😄

#### Serving with FastChat

We can set up model serving with `IPEX-LLM` as the backend using FastChat. The following steps give an example of how to deploy a demo using FastChat.

1. **Start the Docker Container**

   Run the following command to launch a Docker container with device access:

   ```bash
   #!/bin/bash
   export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest

   # Map the host model directory into the container (example: /LLM_MODELS/),
   # and set the proxy variables if needed.
   sudo docker run -itd \
     --net=host \
     --device=/dev/dri \
     --name=demo-container \
     -v /LLM_MODELS/:/llm/models/ \
     --shm-size="16g" \
     -e http_proxy=... \
     -e https_proxy=... \
     -e no_proxy="127.0.0.1,localhost" \
     $DOCKER_IMAGE
   ```

2. **Start the FastChat Service**

   Enter the container and start the FastChat service:

   ```bash
   #!/bin/bash

   # This command assumes that you have mapped the host model directory into the container
   # and that the model directory is /llm/models/.
   # We take Yi-1.5-34B as an example; you can replace it with your own model.

   ps -ef | grep "fastchat" | awk '{print $2}' | xargs kill -9
   pip install -U gradio==4.43.0

   # start controller
   python -m fastchat.serve.controller &

   export USE_XETLA=OFF
   export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
   export TORCH_LLM_ALLREDUCE=0
   export CCL_DG2_ALLREDUCE=1

   # CCL needed environment variables
   export CCL_WORKER_COUNT=4
   # pin ccl worker to cores
   # export CCL_WORKER_AFFINITY=32,33,34,35
   export FI_PROVIDER=shm
   export CCL_ATL_TRANSPORT=ofi
   export CCL_ZE_IPC_EXCHANGE=sockets
   export CCL_ATL_SHM=1

   source /opt/intel/1ccl-wks/setvars.sh

   python -m ipex_llm.serving.fastchat.vllm_worker \
     --model-path /llm/models/Yi-1.5-34B \
     --device xpu \
     --enforce-eager \
     --disable-async-output-proc \
     --distributed-executor-backend ray \
     --dtype float16 \
     --load-in-low-bit fp8 \
     --tensor-parallel-size 4 \
     --gpu-memory-utilization 0.9 \
     --max-model-len 4096 \
     --max-num-batched-tokens 8000 &

   sleep 120

   python -m fastchat.serve.gradio_web_server &
   ```

   This quick setup allows you to deploy FastChat with IPEX-LLM efficiently.

### Validated Models List

| Models (fp8)     | GPUs |
| ---------------- | :--: |
| llama-3-8b       |  1   |
| Llama-2-7B       |  1   |
| Qwen2-7B         |  1   |
| Qwen1.5-7B       |  1   |
| GLM4-9B          |  1   |
| chatglm3-6b      |  1   |
| Baichuan2-7B     |  1   |
| Codegeex4-all-9b |  1   |
| Llama-2-13B      |  2   |
| Qwen1.5-14b      |  2   |
| TeleChat-13B     |  2   |
| Qwen1.5-32b      |  4   |
| Yi-1.5-34B       |  4   |
| CodeLlama-34B    |  4   |