Add usage of vllm (#9564)
* add usage of vllm
This commit is contained in:
parent a0a80d232e
commit 2554ba0913
3 changed files with 77 additions and 7 deletions
@@ -9,7 +9,6 @@ docker build \
--rm --no-cache -t intelanalytics/bigdl-llm-serving-cpu:2.5.0-SNAPSHOT .
```

### Use the image for doing CPU serving
@@ -29,7 +28,70 @@ sudo docker run -itd \
$DOCKER_IMAGE
```

After the container is booted, you can get into the container through `docker exec`.
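For example (a minimal sketch; `my-serving-container` is a placeholder for whatever name you passed to `docker run --name`):

```bash
# open an interactive shell inside the running serving container
sudo docker exec -it my-serving-container bash
```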
To run model-serving using `BigDL-LLM` as the backend, you can refer to this [document](https://github.com/intel-analytics/BigDL/tree/main/python/llm/src/bigdl/llm/serving).

You can also set environment variables and startup arguments when running a container so that serving starts right away. You may need to boot several containers: one controller container and at least one worker container are required. The API server address (host and port) and the controller address are set in the controller container; in each worker container, set the same controller address as above, the model path on your machine, and the worker address.
To start a controller container:

```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/bigdl-llm-serving-cpu:2.5.0-SNAPSHOT
controller_host=localhost
controller_port=23000
api_host=localhost
api_port=8000
sudo docker run -itd \
--net=host \
--privileged \
--cpuset-cpus="0-47" \
--cpuset-mems="0" \
--memory="64G" \
--name=serving-cpu-controller \
--shm-size="16g" \
-e ENABLE_PERF_OUTPUT="true" \
-e CONTROLLER_HOST=$controller_host \
-e CONTROLLER_PORT=$controller_port \
-e API_HOST=$api_host \
-e API_PORT=$api_port \
$DOCKER_IMAGE -m controller
```
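Optionally, you can check that the controller came up by following its logs (a minimal sketch, assuming the container name used above):

```bash
# follow the controller container's logs to verify it started correctly
sudo docker logs -f serving-cpu-controller
```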
To start a worker container:

```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/bigdl-llm-serving-cpu:2.5.0-SNAPSHOT
export MODEL_PATH=YOUR_MODEL_PATH
controller_host=localhost
controller_port=23000
worker_host=localhost
worker_port=23001
sudo docker run -itd \
--net=host \
--privileged \
--cpuset-cpus="0-47" \
--cpuset-mems="0" \
--memory="64G" \
--name="serving-cpu-worker" \
--shm-size="16g" \
-e ENABLE_PERF_OUTPUT="true" \
-e CONTROLLER_HOST=$controller_host \
-e CONTROLLER_PORT=$controller_port \
-e WORKER_HOST=$worker_host \
-e WORKER_PORT=$worker_port \
-e OMP_NUM_THREADS=48 \
-e MODEL_PATH=/llm/models/Llama-2-7b-chat-hf \
-v $MODEL_PATH:/llm/models/ \
$DOCKER_IMAGE -m worker -w vllm_worker # use -w model_worker if the vllm worker is not needed
```
Then you can use `curl` for testing; for example:

```bash
curl -X POST -H "Content-Type: application/json" -d '{
"model": "YOUR_MODEL_NAME",
"prompt": "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun",
"n": 1,
"best_of": 1,
"use_beam_search": false,
"stream": false
}' http://localhost:8000/v1/completions
```
@@ -31,6 +31,10 @@ Set hyper-threading to off, ensure that only physical cores are used during depl

The entrypoint of the image will try to set `OMP_NUM_THREADS` to the correct number by reading configs from the `runtime`. However, this only happens correctly if the `core-binding` feature is enabled. If not, please set the environment variable `OMP_NUM_THREADS` manually in the yaml file.

### vLLM usage

If you want to use the vLLM `AsyncLLMEngine` for serving, set the args `-w vllm_worker` in the worker part of `deployment.yaml`.

### Controller
@@ -132,8 +136,8 @@ spec:
fieldPath: status.podIP
- name: WORKER_PORT # fixed
value: "21841"
- name: MODEL_PATH # Change this
value: "/llm/models/vicuna-7b-v1.5-bigdl/"
- name: MODEL_PATH
value: "/llm/models/vicuna-7b-v1.5-bigdl/" # change this to your model
- name: OMP_NUM_THREADS
value: "16"
resources:
@@ -143,7 +147,7 @@ spec:
limits:
memory: 32Gi
cpu: 16
args: ["-m", "worker"]
args: ["-m", "worker"] # add , "-w", "vllm_worker" if vllm_worker is expected
volumeMounts:
- name: llm-models
mountPath: /llm/models/
@@ -159,6 +163,10 @@ You may want to change the `MODEL_PATH` variable in the yaml. Also, please reme

### Testing

#### Check pod IP and port mappings

If you need to access the service from the host, you can use `kubectl get nodes -o wide` to get the node's internal IP and `kubectl get service` to get the port mappings.
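For example (a minimal sketch; the actual service name and NodePort depend on your deployment):

```bash
# look up a node's internal IP (see the INTERNAL-IP column)
kubectl get nodes -o wide
# look up the NodePort that maps to the API server port
kubectl get service
# requests can then be sent to http://<internal-ip>:<node-port>
```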
#### Using openai-python

First, install openai-python:
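A minimal sketch, assuming the standard PyPI package name:

```bash
# install the OpenAI Python client
pip install openai
```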
@@ -87,8 +87,8 @@ spec:
fieldPath: status.podIP
- name: WORKER_PORT # fixed
value: "21841"
- name: MODEL_PATH # Change this
value: "/llm/models/vicuna-7b-v1.5-bigdl/"
- name: MODEL_PATH
value: "/llm/models/vicuna-7b-v1.5-bigdl/" # change this to your model
- name: OMP_NUM_THREADS
value: "16"
resources: