Update Readme for FastChat docker demo (#12354)

* update Readme for FastChat docker demo

* update readme

* add 'Serving with FastChat' part in docs

* polish docs

---------

Co-authored-by: ATMxsp01 <shou.xu@intel.com>
Xu, Shuo 2024-11-07 15:22:42 +08:00 committed by GitHub
parent d880e534d2
commit ce0c6ae423

@@ -61,11 +61,79 @@ For convenience, we have included a file `/llm/start-pp_serving-service.sh` in t
#### FastChat serving engine
To run model-serving using `IPEX-LLM` as backend using FastChat, you can refer to this [quickstart](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/fastchat_quickstart.html#).
To set up model serving with FastChat using `IPEX-LLM` as the backend, you can refer to this [quickstart](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/fastchat_quickstart.html#) or follow the quick steps below to deploy a demo.
For convenience, we have included a script `/llm/fastchat-examples/start-fastchat-service.sh` in the image. You can modify this script to run FastChat with either the `ipex_llm_worker` or the `vllm_worker`.
##### Quick Setup for FastChat with IPEX-LLM
1. **Start the Docker Container**
Run the following command to launch a Docker container with device access:
```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest

# -v maps the host model directory (here /LLM_MODELS/) to /llm/models/ inside the container.
# The http_proxy/https_proxy settings are only needed if your network requires a proxy.
sudo docker run -itd \
--net=host \
--device=/dev/dri \
--name=demo-container \
-v /LLM_MODELS/:/llm/models/ \
--shm-size="16g" \
-e http_proxy=... \
-e https_proxy=... \
-e no_proxy="127.0.0.1,localhost" \
$DOCKER_IMAGE
```
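After launching the container, you can optionally confirm that it is running and open a shell inside it, which you will need for step 2; `demo-container` is the container name used in the command above:
```bash
# Check that the container is up
sudo docker ps --filter "name=demo-container"

# Open an interactive shell inside the container
sudo docker exec -it demo-container bash
```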
2. **Start the FastChat Service**
Enter the container and start the FastChat service:
```bash
#!/bin/bash
# This script assumes that the host model directory has been mapped into the
# container at /llm/models/. We use Yi-1.5-34B as an example; replace it with
# your own model.

# Stop any FastChat processes that are already running
ps -ef | grep "fastchat" | awk '{print $2}' | xargs kill -9

pip install -U gradio==4.43.0

# Start the FastChat controller
python -m fastchat.serve.controller &

export TORCH_LLM_ALLREDUCE=0
export CCL_DG2_ALLREDUCE=1
# oneCCL environment variables needed for multi-GPU serving
export CCL_WORKER_COUNT=4
# Optionally pin CCL workers to specific cores
# export CCL_WORKER_AFFINITY=32,33,34,35
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
source /opt/intel/1ccl-wks/setvars.sh

# Start the vLLM worker with IPEX-LLM as the backend
python -m ipex_llm.serving.fastchat.vllm_worker \
--model-path /llm/models/Yi-1.5-34B \
--device xpu \
--enforce-eager \
--dtype float16 \
--load-in-low-bit fp8 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--max-model-len 4096 \
--max-num-batched-tokens 8000 &

# Wait for the worker to finish loading the model before starting the web UI
sleep 120

# Start the Gradio web UI
python -m fastchat.serve.gradio_web_server &
```
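The script above serves the model through the `vllm_worker`. If you prefer the plain `ipex_llm_worker` mentioned earlier, a minimal sketch of the replacement worker command is shown below; the flags follow the IPEX-LLM FastChat quickstart, and the low-bit format `sym_int4` is an assumption you can change to suit your model:
```bash
# Replace the vllm_worker command above with the ipex_llm_worker
# (a sketch; adjust --low-bit and the model path to your setup)
python -m ipex_llm.serving.fastchat.ipex_llm_worker \
--model-path /llm/models/Yi-1.5-34B \
--low-bit "sym_int4" \
--trust-remote-code \
--device "xpu" &
```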
This quick setup allows you to deploy FastChat with IPEX-LLM efficiently.
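Alternatively, the bundled script `/llm/fastchat-examples/start-fastchat-service.sh` wraps the same steps; assuming it is a plain shell script, you can edit it to pick your model and worker and then run it inside the container:
```bash
# Edit the script first to point at your model and preferred worker, then run it
bash /llm/fastchat-examples/start-fastchat-service.sh
```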
#### vLLM serving engine

@@ -789,6 +789,81 @@ docker run -itd \
4. After installation, you can access Open WebUI at <http://localhost:3000>. Enjoy! 😄
#### Serving with FastChat
You can set up model serving with FastChat using `IPEX-LLM` as the backend. The following steps give an example of how to deploy a demo.
1. **Start the Docker Container**
Run the following command to launch a Docker container with device access:
```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest

# -v maps the host model directory (here /LLM_MODELS/) to /llm/models/ inside the container.
# The http_proxy/https_proxy settings are only needed if your network requires a proxy.
sudo docker run -itd \
--net=host \
--device=/dev/dri \
--name=demo-container \
-v /LLM_MODELS/:/llm/models/ \
--shm-size="16g" \
-e http_proxy=... \
-e https_proxy=... \
-e no_proxy="127.0.0.1,localhost" \
$DOCKER_IMAGE
```
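Before starting the service, it can be useful to check from inside the container that the GPU device nodes passed through by `--device=/dev/dri` are visible; `demo-container` is the container name from the command above:
```bash
# List the Intel GPU device nodes visible inside the container
sudo docker exec demo-container ls -l /dev/dri
```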
2. **Start the FastChat Service**
Enter the container and start the FastChat service:
```bash
#!/bin/bash
# This script assumes that the host model directory has been mapped into the
# container at /llm/models/. We use Yi-1.5-34B as an example; replace it with
# your own model.

# Stop any FastChat processes that are already running
ps -ef | grep "fastchat" | awk '{print $2}' | xargs kill -9

pip install -U gradio==4.43.0

# Start the FastChat controller
python -m fastchat.serve.controller &

export TORCH_LLM_ALLREDUCE=0
export CCL_DG2_ALLREDUCE=1
# oneCCL environment variables needed for multi-GPU serving
export CCL_WORKER_COUNT=4
# Optionally pin CCL workers to specific cores
# export CCL_WORKER_AFFINITY=32,33,34,35
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
source /opt/intel/1ccl-wks/setvars.sh

# Start the vLLM worker with IPEX-LLM as the backend
python -m ipex_llm.serving.fastchat.vllm_worker \
--model-path /llm/models/Yi-1.5-34B \
--device xpu \
--enforce-eager \
--dtype float16 \
--load-in-low-bit fp8 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--max-model-len 4096 \
--max-num-batched-tokens 8000 &

# Wait for the worker to finish loading the model before starting the web UI
sleep 120

# Start the Gradio web UI
python -m fastchat.serve.gradio_web_server &
```
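The Gradio web UI is the quickest way to try the demo. If you also want an OpenAI-compatible REST endpoint, a minimal sketch is to additionally start FastChat's API server inside the container and query it with `curl`; the port `8000` and the model name `Yi-1.5-34B` (taken from the model directory name) are assumptions for this example:
```bash
# Start FastChat's OpenAI-compatible API server (port 8000 is an example choice)
python -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000 &

# Query the served model through the OpenAI-style chat completions API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Yi-1.5-34B",
    "messages": [{"role": "user", "content": "What can you do?"}],
    "max_tokens": 64
  }'
```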
This quick setup allows you to deploy FastChat with IPEX-LLM efficiently.
### Validated Models List
| models (fp8) | gpus |