Update Readme for FastChat docker demo (#12354)

* update Readme for FastChat docker demo
* update readme
* add 'Serving with FastChat' part in docs
* polish docs

Co-authored-by: ATMxsp01 <shou.xu@intel.com>

parent d880e534d2
commit ce0c6ae423

2 changed files with 146 additions and 3 deletions

@@ -61,11 +61,79 @@ For convenience, we have included a file `/llm/start-pp_serving-service.sh` in t

#### FastChat serving engine

To set up model serving with FastChat using `IPEX-LLM` as the backend, you can refer to this [quickstart](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/fastchat_quickstart.html#) or follow these quick steps to deploy a demo.

For convenience, we have included a file `/llm/fastchat-examples/start-fastchat-service.sh` in the image.

##### Quick Setup for FastChat with IPEX-LLM

You can modify this script to run FastChat with either the `ipex_llm_worker` or the `vllm_worker`; a sketch of the `ipex_llm_worker` variant is shown after the steps below.

1. **Start the Docker Container**

    Run the following command to launch a Docker container with device access:

    ```bash
    #!/bin/bash
    export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest

    # -v maps the host model directory into the container; adjust /LLM_MODELS/ to your own path.
    # The http_proxy/https_proxy variables are optional; set them only if your network requires a proxy.
    sudo docker run -itd \
            --net=host \
            --device=/dev/dri \
            --name=demo-container \
            -v /LLM_MODELS/:/llm/models/ \
            --shm-size="16g" \
            -e http_proxy=... \
            -e https_proxy=... \
            -e no_proxy="127.0.0.1,localhost" \
            $DOCKER_IMAGE
    ```
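
    If you want to confirm the container came up before moving on, a quick check like the following should work; it assumes the container name `demo-container` from the example above:

    ```bash
    # Verify the container is running
    sudo docker ps | grep demo-container

    # Open an interactive shell inside the container for the next step
    sudo docker exec -it demo-container bash
    ```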

2. **Start the FastChat Service**

    Enter the container and start the FastChat service:

    ```bash
    #!/bin/bash

    # This script assumes that you have mapped the host model directory to the container
    # and that the model directory is /llm/models/.
    # We take Yi-1.5-34B as an example; you can replace it with your own model.

    # Stop any FastChat processes that are already running
    ps -ef | grep "fastchat" | awk '{print $2}' | xargs kill -9
    pip install -U gradio==4.43.0

    # start controller
    python -m fastchat.serve.controller &

    export TORCH_LLM_ALLREDUCE=0
    export CCL_DG2_ALLREDUCE=1
    # CCL needed environment variables
    export CCL_WORKER_COUNT=4
    # pin ccl worker to cores
    # export CCL_WORKER_AFFINITY=32,33,34,35
    export FI_PROVIDER=shm
    export CCL_ATL_TRANSPORT=ofi
    export CCL_ZE_IPC_EXCHANGE=sockets
    export CCL_ATL_SHM=1

    source /opt/intel/1ccl-wks/setvars.sh

    # Start the IPEX-LLM vLLM worker: serves the model across 4 XPUs with fp8 low-bit weights
    python -m ipex_llm.serving.fastchat.vllm_worker \
    --model-path /llm/models/Yi-1.5-34B \
    --device xpu \
    --enforce-eager \
    --dtype float16 \
    --load-in-low-bit fp8 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4096 \
    --max-num-batched-tokens 8000 &

    # Give the worker time to load the model before starting the web UI
    sleep 120

    python -m fastchat.serve.gradio_web_server &
    ```

This quick setup allows you to deploy FastChat with IPEX-LLM efficiently.
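
If you would rather serve the model with the `ipex_llm_worker` instead of the `vllm_worker`, a minimal sketch of the replacement worker command is shown below; the flags (for example `--low-bit`) are assumptions based on the quickstart linked above, so check that page for the options supported by your ipex-llm version.

```bash
# Sketch: launch the ipex_llm_worker instead of the vllm_worker.
# The flags below are assumptions; verify them against the FastChat quickstart
# for your ipex-llm version before relying on this command.
source /opt/intel/1ccl-wks/setvars.sh

python -m ipex_llm.serving.fastchat.ipex_llm_worker \
--model-path /llm/models/Yi-1.5-34B \
--low-bit "sym_int4" \
--trust-remote-code \
--device "xpu" &
```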

#### vLLM serving engine

@@ -789,6 +789,81 @@ docker run -itd \

4. After installation, you can access Open WebUI at <http://localhost:3000>. Enjoy! 😄

#### Serving with FastChat

We can set up model serving with FastChat using `IPEX-LLM` as the backend. The following steps give an example of how to deploy a demo with FastChat.

1. **Start the Docker Container**

    Run the following command to launch a Docker container with device access:

    ```bash
    #!/bin/bash
    export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest

    # -v maps the host model directory into the container; adjust /LLM_MODELS/ to your own path.
    # The http_proxy/https_proxy variables are optional; set them only if your network requires a proxy.
    sudo docker run -itd \
            --net=host \
            --device=/dev/dri \
            --name=demo-container \
            -v /LLM_MODELS/:/llm/models/ \
            --shm-size="16g" \
            -e http_proxy=... \
            -e https_proxy=... \
            -e no_proxy="127.0.0.1,localhost" \
            $DOCKER_IMAGE
    ```
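
    If you want to confirm the container came up before moving on, a quick check like the following should work; it assumes the container name `demo-container` from the example above:

    ```bash
    # Verify the container is running
    sudo docker ps | grep demo-container

    # Open an interactive shell inside the container for the next step
    sudo docker exec -it demo-container bash
    ```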

2. **Start the FastChat Service**

    Enter the container and start the FastChat service:

    ```bash
    #!/bin/bash

    # This script assumes that you have mapped the host model directory to the container
    # and that the model directory is /llm/models/.
    # We take Yi-1.5-34B as an example; you can replace it with your own model.

    # Stop any FastChat processes that are already running
    ps -ef | grep "fastchat" | awk '{print $2}' | xargs kill -9
    pip install -U gradio==4.43.0

    # start controller
    python -m fastchat.serve.controller &

    export TORCH_LLM_ALLREDUCE=0
    export CCL_DG2_ALLREDUCE=1
    # CCL needed environment variables
    export CCL_WORKER_COUNT=4
    # pin ccl worker to cores
    # export CCL_WORKER_AFFINITY=32,33,34,35
    export FI_PROVIDER=shm
    export CCL_ATL_TRANSPORT=ofi
    export CCL_ZE_IPC_EXCHANGE=sockets
    export CCL_ATL_SHM=1

    source /opt/intel/1ccl-wks/setvars.sh

    # Start the IPEX-LLM vLLM worker: serves the model across 4 XPUs with fp8 low-bit weights
    python -m ipex_llm.serving.fastchat.vllm_worker \
    --model-path /llm/models/Yi-1.5-34B \
    --device xpu \
    --enforce-eager \
    --dtype float16 \
    --load-in-low-bit fp8 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4096 \
    --max-num-batched-tokens 8000 &

    # Give the worker time to load the model before starting the web UI
    sleep 120

    python -m fastchat.serve.gradio_web_server &
    ```

This quick setup allows you to deploy FastChat with IPEX-LLM efficiently.
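
If you would rather serve the model with the `ipex_llm_worker` instead of the `vllm_worker`, a minimal sketch of the replacement worker command is shown below; the flags (for example `--low-bit`) are assumptions based on the [FastChat quickstart](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/fastchat_quickstart.html), so check that page for the options supported by your ipex-llm version.

```bash
# Sketch: launch the ipex_llm_worker instead of the vllm_worker.
# The flags below are assumptions; verify them against the FastChat quickstart
# for your ipex-llm version before relying on this command.
source /opt/intel/1ccl-wks/setvars.sh

python -m ipex_llm.serving.fastchat.ipex_llm_worker \
--model-path /llm/models/Yi-1.5-34B \
--low-bit "sym_int4" \
--trust-remote-code \
--device "xpu" &
```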

### Validated Models List

| models (fp8)     | gpus  |