Docs: Fix CPU Serving Docker README (#11351)
Fix CPU Serving Docker README
parent c9b4cadd81
commit ef9f740801
2 changed files with 6 additions and 70 deletions
@@ -30,72 +30,8 @@ sudo docker run -itd \
 After the container is booted, you could get into the container through `docker exec`.
 
-To run model-serving using `IPEX-LLM` as backend, you can refer to this [document](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/src/ipex_llm/serving/fastchat).
-
-Also you can set environment variables and start arguments while running a container to get serving started initially. You may need to boot several containers to support. One controller container and at least one worker container are needed. The api server address(host and port) and controller address are set in controller container, and you need to set the same controller address as above, model path on your machine and worker address in worker container.
-
-To start a controller container:
-
-```bash
-#/bin/bash
-export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-cpu:2.1.0-SNAPSHOT
-
-controller_host=localhost
-controller_port=23000
-api_host=localhost
-api_port=8000
-
-sudo docker run -itd \
-    --net=host \
-    --privileged \
-    --cpuset-cpus="0-47" \
-    --cpuset-mems="0" \
-    --memory="64G" \
-    --name=serving-cpu-controller \
-    --shm-size="16g" \
-    -e ENABLE_PERF_OUTPUT="true" \
-    -e CONTROLLER_HOST=$controller_host \
-    -e CONTROLLER_PORT=$controller_port \
-    -e API_HOST=$api_host \
-    -e API_PORT=$api_port \
-    $DOCKER_IMAGE -m controller
-```
-
-To start a worker container:
-
-```bash
-#/bin/bash
-export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-cpu:2.1.0-SNAPSHOT
-export MODEL_PATH=YOUR_MODEL_PATH
-
-controller_host=localhost
-controller_port=23000
-worker_host=localhost
-worker_port=23001
-
-sudo docker run -itd \
-    --net=host \
-    --privileged \
-    --cpuset-cpus="0-47" \
-    --cpuset-mems="0" \
-    --memory="64G" \
-    --name="serving-cpu-worker" \
-    --shm-size="16g" \
-    -e ENABLE_PERF_OUTPUT="true" \
-    -e CONTROLLER_HOST=$controller_host \
-    -e CONTROLLER_PORT=$controller_port \
-    -e WORKER_HOST=$worker_host \
-    -e WORKER_PORT=$worker_port \
-    -e OMP_NUM_THREADS=48 \
-    -e MODEL_PATH=/llm/models/Llama-2-7b-chat-hf \
-    -v $MODEL_PATH:/llm/models/ \
-    $DOCKER_IMAGE -m worker -w vllm_worker # use -w model_worker if vllm worker is not needed
-```
-
-Then you can use `curl` for testing, an example could be:
-
-```bash
-curl -X POST -H "Content-Type: application/json" -d '{
-  "model": "YOUR_MODEL_NAME",
-  "prompt": "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun",
-  "n": 1,
-  "best_of": 1,
-  "use_beam_search": false,
-  "stream": false
-}' http://localhost:8000/v1/completions
-```
-
+#### FastChat serving engine
+
+To run FastChat-serving using `IPEX-LLM` as backend, you can refer to this [document](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/src/ipex_llm/serving/fastchat).
 
 #### vLLM serving engine
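For readers following the revised README, here is a minimal sketch of what the pointer to the FastChat document amounts to in practice. The container name is illustrative, the model path is the one mounted in the removed worker example above, and the module names and flags are assumptions based on FastChat and the linked IPEX-LLM serving document rather than anything shown in this diff.

```bash
# Enter the booted serving container (name is illustrative; use whatever --name you passed to docker run)
sudo docker exec -it ipex-llm-serving-cpu-container /bin/bash

# Inside the container, start FastChat serving roughly as the linked document describes:
# a controller, an IPEX-LLM worker, and an OpenAI-compatible API server (module names assumed).
python3 -m fastchat.serve.controller &
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker \
    --model-path /llm/models/Llama-2-7b-chat-hf --device cpu &
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```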
@@ -332,8 +332,8 @@ if __name__ == "__main__":
     parser.add_argument(
         "--device",
         type=str,
-        default="cuda",
-        choices=["cuda", "xpu"],
+        default="cpu",
+        choices=["cuda", "xpu", "cpu"],
         help='device type for vLLM execution, supporting CUDA only currently.')
     parser.add_argument(
         "--enable-prefix-caching",
@@ -342,8 +342,8 @@ if __name__ == "__main__":
     parser.add_argument(
         "--load-in-low-bit",
         type=str,
-        choices=["sym_int4", "fp6", "fp8", "fp16"],
-        default="sym_int4",
+        choices=["sym_int4", "fp6", "fp8", "bf16"],
+        default="bf16",
         help="Low-bit format quantization with IPEX-LLM")
     parser.add_argument('--max-num-batched-tokens',
                         type=int,
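As a usage note, here is a minimal sketch of what the new defaults mean when launching the patched script on CPU. The diff does not name the file, so `$VLLM_SCRIPT` is a stand-in, and the flags shown are simply the post-patch defaults.

```bash
# $VLLM_SCRIPT is a stand-in for the patched file, which this diff does not name.
VLLM_SCRIPT=/llm/vllm_offline_inference.py   # hypothetical path inside the serving image

# After this patch the script defaults to CPU execution with bf16, so these flags are
# optional on CPU; they are spelled out here only to show the new default values.
python3 "$VLLM_SCRIPT" --device cpu --load-in-low-bit bf16
```

Passing `--device cuda` or `--load-in-low-bit sym_int4` still works, since both values remain in the respective `choices` lists after the change.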