Add serving docker quickstart (#11072)

* add temp file

* add initial docker readme

* temp

* done

* add fastchat service

* fix

* fix

* fix

* fix

* remove stale file
Guancheng Fu 2024-05-21 17:00:58 +08:00 committed by GitHub
parent f00625f9a4
commit f654f7e08c
6 changed files with 287 additions and 1 deletion


@@ -86,6 +86,12 @@
<li>
<a href="doc/LLM/DockerGuides/docker_cpp_xpu_quickstart.html">Run llama.cpp/Ollama/Open-WebUI on an Intel GPU via Docker</a>
</li>
<li>
<a href="doc/LLM/DockerGuides/fastchat_docker_quickstart.html">Run IPEX-LLM integrated FastChat on an Intel GPU via Docker</a>
</li>
<li>
<a href="doc/LLM/DockerGuides/vllm_docker_quickstart.html">Run IPEX-LLM integrated vLLM on an Intel GPU via Docker</a>
</li>
</ul>
</li>
<li>


@@ -0,0 +1,117 @@
# Serving using IPEX-LLM integrated FastChat on Intel GPUs via Docker
This guide demonstrates how to do LLM serving with `IPEX-LLM` integrated `FastChat` in Docker on Linux with Intel GPUs.
## Install Docker
Follow the instructions in this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/docker_windows_gpu.html#linux) to install Docker on Linux.
## Pull the latest image
```bash
# This image will be updated every day
docker pull intelanalytics/ipex-llm-serving-xpu:latest
```
## Start Docker Container
To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container. Change `/path/to/models` to the local folder containing your models so that they are mounted into the container.
```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest
export CONTAINER_NAME=ipex-llm-serving-xpu-container
sudo docker run -itd \
--net=host \
--device=/dev/dri \
-v /path/to/models:/llm/models \
-e no_proxy=localhost,127.0.0.1 \
--memory="32G" \
--name=$CONTAINER_NAME \
--shm-size="16g" \
$DOCKER_IMAGE
```
After the container is booted, you can get into the container through `docker exec`.
```bash
docker exec -it ipex-llm-serving-xpu-container /bin/bash
```
To verify that the device is successfully mapped into the container, run `sycl-ls` to check the result. On a machine with an Arc A770, a sample output is:
```bash
root@arda-arc12:/# sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
```
## Running FastChat serving with IPEX-LLM on Intel GPU in Docker
For convenience, we have provided a script named `/llm/start-fastchat-service.sh` for you to start the service.
However, the script only covers the most common scenarios. If it doesn't meet your needs, you can always find the complete guidance for FastChat at [Serving using IPEX-LLM and FastChat](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/fastchat_quickstart.html#start-the-service).
Before starting the service, you can refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#runtime-configurations) to set up our recommended runtime configurations.
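As a rough example, on an Intel Arc A-series GPU the recommended settings typically look like the following; these values are illustrative, so check the linked section for the configuration that matches your hardware:

```bash
# Typical runtime configuration for Intel Arc GPUs (verify against the linked guide)
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
```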
Now we can start the FastChat service. You can use the provided script `/llm/start-fastchat-service.sh` as follows:
```bash
# Only the MODEL_PATH needs to be set, other parameters have default values
export MODEL_PATH=YOUR_SELECTED_MODEL_PATH
export LOW_BIT_FORMAT=sym_int4
export CONTROLLER_HOST=localhost
export CONTROLLER_PORT=21001
export WORKER_HOST=localhost
export WORKER_PORT=21002
export API_HOST=localhost
export API_PORT=8000
# Use the default model_worker
bash /llm/start-fastchat-service.sh -w model_worker
```
If everything goes smoothly, the result should be similar to the following figure:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/start-fastchat.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/start-fastchat.png" width=100%; />
</a>
By default, the `ipex_llm_worker` is used as the backend engine. You can also use `vLLM` as the backend engine. Try the following example:
```bash
# Only the MODEL_PATH needs to be set, other parameters have default values
export MODEL_PATH=YOUR_SELECTED_MODEL_PATH
export LOW_BIT_FORMAT=sym_int4
export CONTROLLER_HOST=localhost
export CONTROLLER_PORT=21001
export WORKER_HOST=localhost
export WORKER_PORT=21002
export API_HOST=localhost
export API_PORT=8000
# Use the vllm_worker
bash /llm/start-fastchat-service.sh -w vllm_worker
```
The `vllm_worker` may start more slowly than the normal `ipex_llm_worker`. The booted service should be similar to the following figure:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/fastchat-vllm-worker.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/fastchat-vllm-worker.png" width=100%; />
</a>
```eval_rst
.. note::
To verify/use the service booted by the script, follow the instructions in `this guide <https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/fastchat_quickstart.html#launch-restful-api-serve>`_.
```
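For instance, once the `openai_api_server` started by the script above is listening on `API_PORT` (8000 in the example), a quick check with `curl` might look like the following. The model name to use is whatever FastChat registered for your `MODEL_PATH` (typically the folder name); the `/v1/models` endpoint lists it, and `YOUR_MODEL_NAME` below is a placeholder:

```bash
# List the models registered with the FastChat OpenAI-compatible API server
curl http://localhost:8000/v1/models

# Send a test chat completion request; replace YOUR_MODEL_NAME with a name
# returned by the call above (usually the basename of MODEL_PATH)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "YOUR_MODEL_NAME",
    "messages": [{"role": "user", "content": "What is IPEX-LLM?"}],
    "max_tokens": 64
  }'
```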
After a request has been sent to the `openai_api_server`, the corresponding inference latency can be found in the worker log, as shown below:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/fastchat-benchmark.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/fastchat-benchmark.png" width=100%; />
</a>


@@ -6,4 +6,6 @@ In this section, you will find guides related to using IPEX-LLM with Docker, cov
* `Overview of IPEX-LLM Containers for Intel GPU <./docker_windows_gpu.html>`_
* `Run PyTorch Inference on an Intel GPU via Docker <./docker_pytorch_inference_gpu.html>`_
* `Run llama.cpp/Ollama/open-webui with Docker on Intel GPU <./docker_cpp_xpu_quickstart.html>`_
* `Run IPEX-LLM integrated FastChat with Docker on Intel GPU <./fastchat_docker_quickstart.html>`_
* `Run IPEX-LLM integrated vLLM with Docker on Intel GPU <./vllm_docker_quickstart.html>`_


@@ -0,0 +1,145 @@
# Serving using IPEX-LLM integrated vLLM on Intel GPUs via Docker
This guide demonstrates how to do LLM serving with `IPEX-LLM` integrated `vLLM` in Docker on Linux with Intel GPUs.
## Install Docker
Follow the instructions in this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/docker_windows_gpu.html#linux) to install Docker on Linux.
## Pull the latest image
```bash
# This image will be updated every day
docker pull intelanalytics/ipex-llm-serving-xpu:latest
```
## Start Docker Container
To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container. Change `/path/to/models` to the local folder containing your models so that they are mounted into the container.
```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest
export CONTAINER_NAME=ipex-llm-serving-xpu-container
sudo docker run -itd \
--net=host \
--device=/dev/dri \
-v /path/to/models:/llm/models \
-e no_proxy=localhost,127.0.0.1 \
--memory="32G" \
--name=$CONTAINER_NAME \
--shm-size="16g" \
$DOCKER_IMAGE
```
After the container is booted, you can get into the container through `docker exec`.
```bash
docker exec -it ipex-llm-serving-xpu-container /bin/bash
```
To verify that the device is successfully mapped into the container, run `sycl-ls` to check the result. On a machine with an Arc A770, a sample output is:
```bash
root@arda-arc12:/# sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
```
## Running vLLM serving with IPEX-LLM on Intel GPU in Docker
We have included multiple vLLM-related files in `/llm/`:
1. `vllm_offline_inference.py`: an example of vLLM offline inference (see the sketch below)
2. `benchmark_vllm_throughput.py`: a script for benchmarking throughput
3. `payload-1024.lua`: used for testing requests per second with requests of roughly 1024 input tokens and 128 output tokens
4. `start-vllm-service.sh`: a template for starting the vLLM service
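
As a quick illustration of the first item (assuming the model path inside `vllm_offline_inference.py` has been edited to point at a model under `/llm/models`), the offline example is run directly from inside the container:

```bash
cd /llm
# Run the offline inference example (edit the model path inside the script first)
python vllm_offline_inference.py
```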
Before performing benchmarks or starting the service, you can refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#runtime-configurations) to set up our recommended runtime configurations.
### Service
#### Single card serving
A script named `/llm/start-vllm-service.sh` has been included in the image for starting the service conveniently.
Modify the `model` and `served_model_name` in the script so that they fit your requirements. The `served_model_name` indicates the model name used in the API.
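For illustration only (the path and name below are assumptions, not defaults of the image), the relevant lines in the script might look like this for a Qwen1.5-7B-Chat model mounted under `/llm/models`:

```bash
# Illustrative values -- point these at your own mounted model
model="/llm/models/Qwen1.5-7B-Chat"   # path of the model inside the container
served_model_name="Qwen1.5"           # model name exposed through the API
```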
Then start the service with `bash /llm/start-vllm-service.sh`. If the service has booted successfully, you should see output similar to the following figure:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/start-vllm-service.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/start-vllm-service.png" width=100%; />
</a>
#### Multi-card serving
vLLM supports utilizing multiple cards through tensor parallelism.
You can refer to this [documentation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#about-tensor-paralle) on how to use the tensor parallel feature and start the service.
#### Verify
After the service has been booted successfully, you can send a test request using `curl`. Here, `YOUR_MODEL` should be set to the `served_model_name` specified in your booting script, e.g. `Qwen1.5`.
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "YOUR_MODEL",
"prompt": "San Francisco is a",
"max_tokens": 128,
"temperature": 0
}' | jq '.choices[0].text'
```
Below is an example output using `Qwen1.5-7B-Chat` with the low-bit format `sym_int4`:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/vllm-curl-result.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/vllm-curl-result.png" width=100%; />
</a>
#### Tuning
You can tune the service using these four arguments:
- `--gpu-memory-utilization`
- `--max-model-len`
- `--max-num-batched-tokens`
- `--max-num-seqs`
You can refer to this [doc](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#service) for a detailed explanation of these parameters; a rough example follows.
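As a sketch only (the exact entrypoint and base arguments should be taken from `/llm/start-vllm-service.sh`; the module path, model path, and flag values below are assumptions, not recommendations), tuning the service means appending these arguments to the serving command:

```bash
# Sketch: adjust the four tuning flags below for your GPU memory and workload.
# The entrypoint and model paths are illustrative -- copy the real ones from
# /llm/start-vllm-service.sh.
python -m ipex_llm.vllm.entrypoints.openai.api_server \
  --model /llm/models/Qwen1.5-7B-Chat \
  --served-model-name Qwen1.5 \
  --port 8000 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 2048 \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 256
```

Lowering `--gpu-memory-utilization` or `--max-model-len` reduces memory pressure, while raising `--max-num-batched-tokens` and `--max-num-seqs` allows larger batches at the cost of more memory.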
### Benchmark
#### Online benchmark through api_server
We can benchmark the api_server to get an estimate of TPS (transactions per second). To do so, you need to start the service first according to the instructions mentioned above.
Then in the container, do the following:
1. Modify `/llm/payload-1024.lua` so that the `"model"` attribute is correct. By default, we use a prompt that is roughly 1024 tokens long; you can change it if needed.
2. Start the benchmark with `wrk` using the script below:
```bash
cd /llm
# warm-up run due to JIT compilation
wrk -t4 -c4 -d3m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h
# You can change -t and -c to control the concurrency.
# By default, we use 12 connections to benchmark the service.
wrk -t12 -c12 -d15m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h
```
The following figure shows the result of benchmarking `Llama-2-7b-chat-hf` using the above script:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/service-benchmark-result.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/service-benchmark-result.png" width=100%; />
</a>
#### Offline benchmark through benchmark_vllm_throughput.py
Please refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#performing-benchmark) on how to use `benchmark_vllm_throughput.py` for benchmarking.
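As a rough sketch (the argument names follow the upstream vLLM `benchmark_throughput.py` interface; the model path and dataset path are placeholders, and the linked section lists the exact IPEX-LLM-specific options):

```bash
cd /llm
# Download a ShareGPT-style dataset first and point --dataset at it (paths are illustrative)
python benchmark_vllm_throughput.py \
  --backend vllm \
  --model /llm/models/Llama-2-7b-chat-hf \
  --dataset /llm/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1000 \
  --trust-remote-code
```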


@@ -61,6 +61,15 @@ export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "xpu"
```
We have also provided an option `--load-low-bit-model` to load models that have been converted and saved to disk using the `save_low_bit` interface, as introduced in this [document](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/KeyFeatures/hugging_face_format.html#save-load).
Check the following example:
```bash
# Or --device "cpu"
python -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path /Low/Bit/Model/Path --trust-remote-code --device "xpu" --load-low-bit-model
```
#### Self-speculative decoding example
You can use IPEX-LLM to run the `self-speculative decoding` example. Refer to [here](https://github.com/intel-analytics/ipex-llm/tree/c9fac8c26bf1e1e8f7376fa9a62b32951dd9e85d/python/llm/example/GPU/Speculative-Decoding) for more details on Intel MAX GPUs. Refer to [here](https://github.com/intel-analytics/ipex-llm/tree/c9fac8c26bf1e1e8f7376fa9a62b32951dd9e85d/python/llm/example/GPU/Speculative-Decoding) for more details on Intel CPUs.


@@ -4,6 +4,13 @@ vLLM is a fast and easy-to-use library for LLM inference and serving. You can fi
IPEX-LLM can be integrated into vLLM so that users can use `IPEX-LLM` to boost the performance of the vLLM engine on Intel **GPUs** *(e.g., local PC with discrete GPU such as Arc, Flex and Max)*.
Currently, IPEX-LLM integrated vLLM only supports the following models:
- Qwen series models
- Llama series models
- ChatGLM series models
- Baichuan series models
## Quick Start