Deploying the ipex-llm serving service in a K8S environment
Image
To deploy IPEX-LLM serving on CPU in a Kubernetes environment, please use this image: intelanalytics/ipex-llm-serving-cpu:2.5.0-SNAPSHOT
Before deployment
Models
In this document, we will use vicuna-7b-v1.5 as the deployment model.
After downloading the model, please rename the model directory from vicuna-7b-v1.5 to vicuna-7b-v1.5-ipex-llm so that ipex-llm is used as the backend. The ipex-llm backend is used whenever the model path contains ipex-llm; otherwise, the original transformers backend is used.
You can download the model from here.
Kubernetes config
We recommend setting up your Kubernetes cluster before deployment. Most importantly, please set the CPU management policy to static by following this tutorial. It is also recommended to set the topology management policy to single-numa-node.
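For reference, here is a minimal sketch of the relevant kubelet configuration fields, assuming the kubelet is configured through its config file (typically /var/lib/kubelet/config.yaml); the reserved CPU range is only an example and must be adapted to your own nodes:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static              # enables exclusive CPU assignment for guaranteed pods
topologyManagerPolicy: single-numa-node
reservedSystemCPUs: "0-1"             # example value; the static policy requires some CPUs reserved for system daemons

Note that changing cpuManagerPolicy on an existing node usually requires removing the kubelet's cpu_manager_state file before restarting the kubelet.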
Machine config
Turn hyper-threading off to ensure that only physical cores are used during deployment.
Deployment
Reminder on OMP_NUM_THREADS
The entrypoint of the image will try to set OMP_NUM_THREADS to the correct value by reading configs from the runtime. However, this only works correctly if the core-binding feature is enabled. If it is not, please set the OMP_NUM_THREADS environment variable manually in the yaml file.
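For example, a manual override in the worker container's env section would look like this (the worker yaml below sets it the same way, matching the 16 CPUs it requests):

env:
- name: OMP_NUM_THREADS
  value: "16" # should match the number of physical cores allocated to the worker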
vLLM usage
If you want to use the vLLM AsyncLLMEngine for serving, you should add -w vllm_worker to the args in the worker part of deployment.yaml.
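In that case the worker container's args would look like the following (the same change is noted in a comment in the worker yaml below):

args: ["-m", "worker", "-w", "vllm_worker"] # serve with the vLLM AsyncLLMEngine instead of the default worker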
PersistentVolume
We use the following yaml file for PersistentVolume deployment:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: models-pv
  labels:
    app: models
spec:
  capacity:
    storage: 10Gi # Modify according to model size
  accessModes:
  - ReadWriteMany
  storageClassName: models
  nfs:
    path: YOUR_NFS_PATH
    server: YOUR_NFS_SERVER
Then you should upload the model to YOUR_NFS_PATH.
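Note that the worker deployment below mounts a PersistentVolumeClaim named models-pvc, which is not listed in this document. A minimal sketch of a claim that binds to the models-pv volume above could look like this:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: models-pvc # referenced by the worker deployment below
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: models # matches the PersistentVolume above
  resources:
    requests:
      storage: 10Gi # Modify according to model size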
Controller
We use the following yaml file for controller deployment:
apiVersion: v1
kind: Pod
metadata:
  name: ipex-llm-fschat-a1234bd-controller
  labels:
    fastchat-appid: a1234bd
    fastchat-app-type: controller
spec:
  dnsPolicy: "ClusterFirst"
  containers:
  - name: fastchat-controller # fixed
    image: intelanalytics/ipex-llm-serving-cpu:2.5.0-SNAPSHOT
    imagePullPolicy: IfNotPresent
    env:
    - name: CONTROLLER_HOST # fixed
      value: "0.0.0.0"
    - name: CONTROLLER_PORT # fixed
      value: "21005"
    - name: API_HOST # fixed
      value: "0.0.0.0"
    - name: API_PORT # fixed
      value: "8000"
    - name: "GRADIO_PORT" # You can change this port
      value: "8002"
    ports:
    - containerPort: 21005
      name: con-port
    - containerPort: 8000
      name: api-port
    resources:
      requests:
        memory: 16Gi
        cpu: 4
      limits:
        memory: 16Gi
        cpu: 4
    args: ["-m", "controller"]
  restartPolicy: "Never"
---
# Service for the controller
apiVersion: v1
kind: Service
metadata:
  name: ipex-llm-a1234bd-fschat-controller-service
spec:
  # You may also want to change the service type to match your cluster's setup
  type: NodePort
  selector:
    fastchat-appid: a1234bd
    fastchat-app-type: controller
  ports:
  - name: cont-port
    protocol: TCP
    port: 21005
    targetPort: 21005
  - name: api-port
    protocol: TCP
    port: 8000
    targetPort: 8000
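If you want the controller's API to be exposed on a fixed node port rather than a randomly assigned one, you can pin nodePort on the api-port entry. This is only a sketch; the port number is an example and must fall within your cluster's NodePort range:

- name: api-port
  protocol: TCP
  port: 8000
  targetPort: 8000
  nodePort: 30800 # example port; must be within the cluster's NodePort range (30000-32767 by default)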
Worker
We use the following yaml file for worker deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ipex-llm-fschat-a1234bd-worker-deployment
spec:
  # Change this to the number you want
  replicas: 1
  selector:
    matchLabels:
      fastchat: worker
  template:
    metadata:
      labels:
        fastchat: worker
    spec:
      dnsPolicy: "ClusterFirst"
      containers:
      - name: fastchat-worker # fixed
        image: intelanalytics/ipex-llm-serving-cpu:2.5.0-SNAPSHOT
        imagePullPolicy: IfNotPresent
        env:
        - name: CONTROLLER_HOST # fixed
          value: ipex-llm-a1234bd-fschat-controller-service
        - name: CONTROLLER_PORT # fixed
          value: "21005"
        - name: WORKER_HOST # fixed
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: WORKER_PORT # fixed
          value: "21841"
        - name: MODEL_PATH
          value: "/llm/models/vicuna-7b-v1.5-ipex-llm/" # change this to your model
        - name: OMP_NUM_THREADS
          value: "16"
        resources:
          requests:
            memory: 32Gi
            cpu: 16
          limits:
            memory: 32Gi
            cpu: 16
        args: ["-m", "worker"] # add , "-w", "vllm_worker" if vllm_worker is expected
        volumeMounts:
        - name: llm-models
          mountPath: /llm/models/
      restartPolicy: "Always"
      volumes:
      - name: llm-models
        persistentVolumeClaim:
          claimName: models-pvc
You may want to change the MODEL_PATH variable in the yaml. If you do, please remember to keep it consistent with the volume mount path, as shown in the sketch below.
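For instance, assuming a hypothetical model directory named llama-2-7b-ipex-llm uploaded to the same NFS share, the relevant parts of the worker container would be:

env:
- name: MODEL_PATH
  value: "/llm/models/llama-2-7b-ipex-llm/" # hypothetical model directory; must contain "ipex-llm" and live under the volume mount
volumeMounts:
- name: llm-models
  mountPath: /llm/models/ # MODEL_PATH must point inside this mount path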
Use gradio_web_ui
We have set the Gradio port with the GRADIO_PORT environment variable in deployment.yaml. You can forward it to the host with this command:
kubectl port-forward ipex-llm-fschat-a1234bd-controller --address 0.0.0.0 8002:8002
Then visit http://YOUR_HOST_IP:8002 to access the UI.
Testing
Check pod ip and port mappings
If you need to access the service from the host, you can use kubectl get nodes -o wide to get the nodes' internal IPs and kubectl get service to get the port mappings.
Using openai-python
First, install openai-python. The example below uses the legacy pre-1.0 openai API, so pin the version accordingly:
pip install "openai<1.0"
Then, interact with the model vicuna-7b-v1.5-ipex-llm:
import openai
openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8000/v1"
model = "vicuna-7b-v1.5-ipex-llm"
prompt = "Once upon a time"
# create a completion
completion = openai.Completion.create(model=model, prompt=prompt, max_tokens=64)
# print the completion
print(prompt + completion.choices[0].text)
# create a chat completion
completion = openai.ChatCompletion.create(
    model=model,
    messages=[{"role": "user", "content": "Hello! What is your name?"}]
)
# print the completion
print(completion.choices[0].message.content)
cURL
cURL is another good tool for observing the output of the API.
Before using cURL, you should set your http_proxy and https_proxy to empty:
export http_proxy=
export https_proxy=
For the following examples, you may need to change localhost:8000 to the address where the service is actually exposed.
List Models:
curl http://localhost:8000/v1/models
If you have jq installed, you can use it to format the output like this:
curl http://localhost:8000/v1/models | jq
Chat Completions:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "YOUR_MODEL",
    "messages": [{"role": "user", "content": "Hello! What is your name?"}]
  }'
Text Completions:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "YOUR_MODEL",
    "prompt": "Once upon a time",
    "max_tokens": 41,
    "temperature": 0.5
  }'
Embeddings:
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "YOUR_MODEL",
    "input": "Hello world!"
  }'