# FastChat Serving with IPEX-LLM on Intel GPUs via Docker

This guide demonstrates how to run `FastChat` serving with `IPEX-LLM` on Intel GPUs via Docker.

## Install Docker

Follow the instructions in this [guide](./docker_windows_gpu.md#linux) to install Docker on Linux.
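
After installation, you can quickly verify that Docker is working before pulling the image (a generic sanity check, not specific to IPEX-LLM):

```bash
# Check the installed version and confirm the daemon can run containers
sudo docker --version
sudo docker run --rm hello-world
```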

## Pull the latest image

```bash
# This image is updated daily
docker pull intelanalytics/ipex-llm-serving-xpu:latest
```

## Start Docker Container

To map the Intel GPU (`xpu`) into the container, specify `--device=/dev/dri` when booting the container. Change `/path/to/models` to the local directory that holds your models.

```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest
export CONTAINER_NAME=ipex-llm-serving-xpu-container
sudo docker run -itd \
        --net=host \
        --device=/dev/dri \
        -v /path/to/models:/llm/models \
        -e no_proxy=localhost,127.0.0.1 \
        --memory="32G" \
        --name=$CONTAINER_NAME \
        --shm-size="16g" \
        $DOCKER_IMAGE
```
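
You can confirm that the container is up before entering it (a quick check; the name matches the `CONTAINER_NAME` set above):

```bash
# The container should be listed with a status of "Up"
sudo docker ps --filter "name=ipex-llm-serving-xpu-container"
```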

After the container is booted, you can get into it through `docker exec`.

```bash
docker exec -it ipex-llm-serving-xpu-container /bin/bash
```

To verify that the device is successfully mapped into the container, run `sycl-ls` and check the result. On a machine with an Arc A770, a sample output is:

```bash
root@arda-arc12:/# sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
```
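
If no GPU entry shows up, first confirm that the render nodes exist on the host (run this outside the container; the exact device numbering may differ on your machine):

```bash
# These are the DRM devices that --device=/dev/dri maps into the container
ls -l /dev/dri
```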

## Running FastChat serving with IPEX-LLM on Intel GPU in Docker

For convenience, we have provided a script named `/llm/start-fastchat-service.sh` for you to start the service.

However, the script only covers the most common scenarios. If it doesn't meet your needs, you can always find the complete guidance for FastChat at [Serving using IPEX-LLM and FastChat](../Quickstart/fastchat_quickstart.md#2-start-the-service).

Before starting the service, you can refer to this [section](../Quickstart/install_linux_gpu.md#runtime-configurations) to set up our recommended runtime configurations.
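
For example, on Intel Arc A-series GPUs the recommended settings look roughly like the following (treat this as an assumption and confirm against the linked section for your specific hardware):

```bash
# Typical runtime configurations for Intel Arc A-series GPUs
# (other hardware may need different settings; see the linked section)
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
```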

Now start the FastChat service with the provided script `/llm/start-fastchat-service.sh` as follows:

```bash
# Only MODEL_PATH needs to be set; other parameters have default values
export MODEL_PATH=YOUR_SELECTED_MODEL_PATH
export LOW_BIT_FORMAT=sym_int4
export CONTROLLER_HOST=localhost
export CONTROLLER_PORT=21001
export WORKER_HOST=localhost
export WORKER_PORT=21002
export API_HOST=localhost
export API_PORT=8000

# Use the default model_worker
bash /llm/start-fastchat-service.sh -w model_worker
```

If everything goes smoothly, the result should be similar to the following figure:

<a href="https://llm-assets.readthedocs.io/en/latest/_images/start-fastchat.png" target="_blank">
  <img src="https://llm-assets.readthedocs.io/en/latest/_images/start-fastchat.png" width=100%; />
</a>

By default, the service uses `ipex_llm_worker` as the backend engine. You can also use `vLLM` as the backend engine. Try the following example:

```bash
# Only MODEL_PATH needs to be set; other parameters have default values
export MODEL_PATH=YOUR_SELECTED_MODEL_PATH
export LOW_BIT_FORMAT=sym_int4
export CONTROLLER_HOST=localhost
export CONTROLLER_PORT=21001
export WORKER_HOST=localhost
export WORKER_PORT=21002
export API_HOST=localhost
export API_PORT=8000

# Use the vLLM worker instead of the default model_worker
bash /llm/start-fastchat-service.sh -w vllm_worker
```

The `vllm_worker` may start more slowly than the normal `ipex_llm_worker`. The booted service should be similar to the following figure:

<a href="https://llm-assets.readthedocs.io/en/latest/_images/fastchat-vllm-worker.png" target="_blank">
  <img src="https://llm-assets.readthedocs.io/en/latest/_images/fastchat-vllm-worker.png" width=100%; />
</a>
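
Once the workers are registered, you can confirm which models the service exposes through the OpenAI-compatible API server (host and port assume the default `API_HOST`/`API_PORT` set above):

```bash
# List the models registered with the OpenAI-compatible API server
curl http://localhost:8000/v1/models
```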

> [!NOTE]
> To verify/use the service booted by the script, follow the instructions in [this guide](../Quickstart/fastchat_quickstart.md#launch-restful-api-server).
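
For a quick smoke test, you can also send a request to the chat completions endpoint directly (a minimal sketch; `YOUR_MODEL_NAME` is a placeholder for the model name FastChat registered, typically the base name of `MODEL_PATH`):

```bash
# Send a single chat request to the OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "YOUR_MODEL_NAME",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```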

After a request has been sent to the `openai_api_server`, the corresponding inference latency can be found in the worker log, as shown below:

<a href="https://llm-assets.readthedocs.io/en/latest/_images/fastchat-benchmark.png" target="_blank">
  <img src="https://llm-assets.readthedocs.io/en/latest/_images/fastchat-benchmark.png" width=100%; />
</a>
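
To follow the worker log from the host, you can tail the container's output (this assumes the service logs to the container's stdout; otherwise check the log files the script writes inside the container):

```bash
# Stream the container's stdout/stderr from the host
sudo docker logs -f ipex-llm-serving-xpu-container
```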