Docs: Fix CPU Serving Docker README (#11351)
Fix CPU Serving Docker README
parent c9b4cadd81
commit ef9f740801
2 changed files with 6 additions and 70 deletions
@@ -30,72 +30,8 @@ sudo docker run -itd \
 After the container is booted, you could get into the container through `docker exec`.
 
-To run model-serving using `IPEX-LLM` as backend, you can refer to this [document](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/src/ipex_llm/serving/fastchat).
-
-Also you can set environment variables and start arguments while running a container to get serving started initially. You may need to boot several containers to support. One controller container and at least one worker container are needed. The api server address(host and port) and controller address are set in controller container, and you need to set the same controller address as above, model path on your machine and worker address in worker container.
-
-To start a controller container:
-
-```bash
-#/bin/bash
-export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-cpu:2.1.0-SNAPSHOT
-
-controller_host=localhost
-controller_port=23000
-api_host=localhost
-api_port=8000
-
-sudo docker run -itd \
-    --net=host \
-    --privileged \
-    --cpuset-cpus="0-47" \
-    --cpuset-mems="0" \
-    --memory="64G" \
-    --name=serving-cpu-controller \
-    --shm-size="16g" \
-    -e ENABLE_PERF_OUTPUT="true" \
-    -e CONTROLLER_HOST=$controller_host \
-    -e CONTROLLER_PORT=$controller_port \
-    -e API_HOST=$api_host \
-    -e API_PORT=$api_port \
-    $DOCKER_IMAGE -m controller
-```
-
-To start a worker container:
-
-```bash
-#/bin/bash
-export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-cpu:2.1.0-SNAPSHOT
-export MODEL_PATH=YOUR_MODEL_PATH
-
-controller_host=localhost
-controller_port=23000
-worker_host=localhost
-worker_port=23001
-
-sudo docker run -itd \
-    --net=host \
-    --privileged \
-    --cpuset-cpus="0-47" \
-    --cpuset-mems="0" \
-    --memory="64G" \
-    --name="serving-cpu-worker" \
-    --shm-size="16g" \
-    -e ENABLE_PERF_OUTPUT="true" \
-    -e CONTROLLER_HOST=$controller_host \
-    -e CONTROLLER_PORT=$controller_port \
-    -e WORKER_HOST=$worker_host \
-    -e WORKER_PORT=$worker_port \
-    -e OMP_NUM_THREADS=48 \
-    -e MODEL_PATH=/llm/models/Llama-2-7b-chat-hf \
-    -v $MODEL_PATH:/llm/models/ \
-    $DOCKER_IMAGE -m worker -w vllm_worker # use -w model_worker if vllm worker is not needed
-```
-
-Then you can use `curl` for testing, an example could be:
-
-```bash
-curl -X POST -H "Content-Type: application/json" -d '{
-  "model": "YOUR_MODEL_NAME",
-  "prompt": "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun",
-  "n": 1,
-  "best_of": 1,
-  "use_beam_search": false,
-  "stream": false
-}' http://localhost:8000/v1/completions
-```
-
+#### FastChat serving engine
+
+To run FastChat-serving using `IPEX-LLM` as backend, you can refer to this [document](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/src/ipex_llm/serving/fastchat).
 
 #### vLLM serving engine
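For readers following the revised README, here is a minimal sketch of what the pointer to the FastChat document amounts to in practice. The container name is illustrative, the model path is the one mounted in the removed worker example above, and the module names and flags are assumptions based on FastChat and the linked IPEX-LLM serving document rather than anything shown in this diff.

```bash
# Enter the booted serving container (name is illustrative; use whatever --name you passed to docker run)
sudo docker exec -it ipex-llm-serving-cpu-container /bin/bash

# Inside the container, start FastChat serving roughly as the linked document describes:
# a controller, an IPEX-LLM worker, and an OpenAI-compatible API server (module names assumed).
python3 -m fastchat.serve.controller &
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker \
    --model-path /llm/models/Llama-2-7b-chat-hf --device cpu &
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```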
@@ -332,8 +332,8 @@ if __name__ == "__main__":
     parser.add_argument(
         "--device",
         type=str,
-        default="cuda",
-        choices=["cuda", "xpu"],
+        default="cpu",
+        choices=["cuda", "xpu", "cpu"],
         help='device type for vLLM execution, supporting CUDA only currently.')
     parser.add_argument(
         "--enable-prefix-caching",
@@ -342,8 +342,8 @@ if __name__ == "__main__":
     parser.add_argument(
         "--load-in-low-bit",
         type=str,
-        choices=["sym_int4", "fp6", "fp8", "fp16"],
-        default="sym_int4",
+        choices=["sym_int4", "fp6", "fp8", "bf16"],
+        default="bf16",
         help="Low-bit format quantization with IPEX-LLM")
     parser.add_argument('--max-num-batched-tokens',
                         type=int,
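As a usage note, here is a minimal sketch of what the new defaults mean when launching the patched script on CPU. The diff does not name the file, so `$VLLM_SCRIPT` is a stand-in, and the flags shown are simply the post-patch defaults.

```bash
# $VLLM_SCRIPT is a stand-in for the patched file, which this diff does not name.
VLLM_SCRIPT=/llm/vllm_offline_inference.py   # hypothetical path inside the serving image

# After this patch the script defaults to CPU execution with bf16, so these flags are
# optional on CPU; they are spelled out here only to show the new default values.
python3 "$VLLM_SCRIPT" --device cpu --load-in-low-bit bf16
```

Passing `--device cuda` or `--load-in-low-bit sym_int4` still works, since both values remain in the respective `choices` lists after the change.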