separate trusted and native llm cpu finetune from lora (#9050)

* separate trusted-llm and bigdl from lora finetuning

* add k8s for trusted llm finetune

* refine

* refine

* rename cpu to tdx in trusted llm

* solve conflict

* fix typo

* resolving conflict

* Delete docker/llm/finetune/lora/README.md

* fix

---------

Co-authored-by: Uxito-Ada <seusunheyang@foxmail.com>
Co-authored-by: leonardozcm <leonardo1997zcm@gmail.com>
Heyang Sun 2023-10-07 15:26:59 +08:00 committed by GitHub
parent b773d67dd4
commit 0b40ef8261
19 changed files with 115 additions and 441 deletions


@ -109,3 +109,4 @@ Example response:
```json
{"quote_list":{"bigdl-lora-finetuning-job-worker-0":"BAACAIEAAAAAAA...","bigdl-lora-finetuning-job-worker-1":"BAACAIEAAAAAAA...","launcher":"BAACAIEAAAAAA..."}}
```


@ -0,0 +1,57 @@
## Run BF16-Optimized LoRA Finetuning on Kubernetes with OneCCL
[Alpaca Lora](https://github.com/tloen/alpaca-lora/tree/main) uses [low-rank adaptation](https://arxiv.org/pdf/2106.09685.pdf) to speed up the finetuning of the base model [Llama2-7b](https://huggingface.co/meta-llama/Llama-2-7b) and aims to reproduce the standard Alpaca, a general-purpose finetuned LLM. It is built on top of Hugging Face transformers with a PyTorch backend, which normally requires a number of expensive GPU resources and a significant amount of time.
By contrast, BigDL here provides a CPU optimization to accelerate the LoRA finetuning of Llama2-7b, powered by mixed-precision and distributed training. Specifically, [Intel OneCCL](https://www.intel.com/content/www/us/en/developer/tools/oneapi/oneccl.html), an available Hugging Face backend, speeds up the PyTorch computation with the BF16 data type on CPUs, while [Intel MPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/mpi-library.html) enables parallel processing on Kubernetes.
The architecture is illustrated below:
![image](https://github.com/Jasonzzt/BigDL/assets/60865256/b66416bc-ad07-49af-8cb0-8967dffb5f58)
As shown above, BigDL implements its MPI training with the [Kubeflow MPI operator](https://github.com/kubeflow/mpi-operator/tree/master), which encapsulates the deployment as an MPIJob CRD and helps users handle the construction of an MPI worker cluster on Kubernetes, including public key distribution, SSH connection, and log collection.
Now, let's deploy a LoRA finetuning job to create an LLM from Llama2-7b.
**Note: Please make sure you already have an available Kubernetes infrastructure and NFS shared storage, and have installed the [Helm CLI](https://helm.sh/docs/helm/helm_install/) for Kubernetes job submission.**
### 1. Install Kubeflow MPI Operator
Follow [here](https://github.com/kubeflow/mpi-operator/tree/master#installation) to install the Kubeflow MPI operator in your Kubernetes cluster, which will listen for and handle the MPIJob requests submitted in the following steps.
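Before moving on, you can optionally confirm the operator is ready. Below is a minimal sketch; the namespace assumes the default installation and may differ in your setup:
```bash
# verify the MPIJob CRD is registered
kubectl get crd mpijobs.kubeflow.org
# verify the operator pod is running (default install namespace is mpi-operator)
kubectl get pods -n mpi-operator
```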
### 2. Download Image, Base Model and Finetuning Data
Follow [here](https://github.com/intel-analytics/BigDL/tree/main/docker/llm/finetune/lora/docker#prepare-bigdl-image-for-lora-finetuning) to prepare the BigDL LoRA finetuning image in your cluster.
Since finetuning starts from a base model, first download the [Llama2-7b model from the public download site of Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b). Then, download the [cleaned alpaca data](https://raw.githubusercontent.com/tloen/alpaca-lora/main/alpaca_data_cleaned_archive.json), which contains all kinds of general knowledge and has already been cleaned. Finally, move the downloaded files to a shared directory on your NFS server, as sketched below.
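A minimal sketch of this step (the NFS mount point and the model directory name are placeholders; the model directory name should match the `modelSubPath` you set in the next step):
```bash
# download the cleaned alpaca data
wget https://raw.githubusercontent.com/tloen/alpaca-lora/main/alpaca_data_cleaned_archive.json
# copy the data file and the Llama2-7b model directory (downloaded from Hugging Face)
# to the NFS shared directory; /mnt/nfs is a placeholder for your mount point
cp alpaca_data_cleaned_archive.json /mnt/nfs/
cp -r ./Llama-2-7b-hf /mnt/nfs/llama-7b-hf
```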
### 3. Deploy through Helm Chart
You can edit and experiment with different parameters in `./kubernetes/values.yaml` to improve finetuning performance and accuracy. For example, you can adjust `trainerNum` and `cpuPerPod` according to the numbers of nodes and CPU cores in your cluster to make full use of these resources, and different `microBatchSize` values result in different training speed and loss (note that `microBatchSize`×`trainerNum` should not exceed 128, as that product is the global batch size).
**Note: `dataSubPath` and `modelSubPath` need to have the same names as the files under the NFS directory in step 2.**
After preparing the parameters in `./kubernetes/values.yaml`, submit the job as below:
```bash
cd ./kubernetes
helm install bigdl-lora-finetuning .
```
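Alternatively, individual parameters can be overridden on the command line at install time instead of editing `values.yaml`. For example (the values shown here are illustrative only):
```bash
cd ./kubernetes
helm install bigdl-lora-finetuning . \
  --set trainerNum=4 \
  --set cpuPerPod=48 \
  --set microBatchSize=8 \
  --set nfsServerIp=<your_nfs_server_ip> \
  --set nfsPath=<your_nfs_shared_folder_path>
```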
### 4. Check Deployment
```bash
kubectl get all -n bigdl-lora-finetuning # you will see launcher and worker pods running
```
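You can also inspect the MPIJob resource itself to see whether the launcher and workers have been created and the job has started. A sketch, using the job name from the chart (adjust if yours differs):
```bash
kubectl get mpijob -n bigdl-lora-finetuning
kubectl describe mpijob bigdl-lora-finetuning-job -n bigdl-lora-finetuning
```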
### 5. Check Finetuning Process
After deploying successfully, you can find a launcher pod, and then go inside this pod and check the logs collected from all workers.
```bash
kubectl get all -n bigdl-lora-finetuning # you will see a launcher pod
kubectl exec -it <launcher_pod_name> -n bigdl-lora-finetuning -- bash # enter launcher pod
cat launcher.log # display logs collected from other workers
```
From the log, you can see whether the finetuning process has been invoked successfully in all MPI worker pods; a progress bar with the finetuning speed and estimated time will be shown after some data preprocessing steps (this may take quite a while).
The fine-tuned model is written by worker 0 (which holds rank 0), so you can find the model output inside that pod and save it to the host with command-line tools like `kubectl cp` or `scp`.
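For example, a hedged sketch of copying the output back to the host with `kubectl cp` (the in-pod output path is an assumption here; check the launcher log or the entrypoint script for the actual save path):
```bash
# copy the finetuned model output from worker 0 (rank 0) to the current host directory
kubectl cp bigdl-lora-finetuning/bigdl-lora-finetuning-job-worker-0:/ppml/output ./finetuned-model-output
```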


@ -0,0 +1,52 @@
ARG http_proxy
ARG https_proxy
FROM mpioperator/intel as builder
ARG http_proxy
ARG https_proxy
ENV PIP_NO_CACHE_DIR=false
ADD ./requirements.txt /ppml/requirements.txt
RUN mkdir /ppml/data && mkdir /ppml/model && \
# install pytorch 2.0.1
apt-get update && \
apt-get install -y python3-pip python3.9-dev python3-wheel git software-properties-common && \
pip3 install --upgrade pip && \
pip install torch==2.0.1 && \
# install ipex and oneccl
pip install intel_extension_for_pytorch==2.0.100 && \
pip install oneccl_bind_pt -f https://developer.intel.com/ipex-whl-stable && \
# install transformers etc.
cd /ppml && \
git clone https://github.com/huggingface/transformers.git && \
cd transformers && \
git reset --hard 057e1d74733f52817dc05b673a340b4e3ebea08c && \
pip install . && \
pip install -r /ppml/requirements.txt && \
# install python
add-apt-repository ppa:deadsnakes/ppa -y && \
apt-get install -y python3.9 && \
rm /usr/bin/python3 && \
ln -s /usr/bin/python3.9 /usr/bin/python3 && \
ln -s /usr/bin/python3 /usr/bin/python && \
pip install --no-cache requests argparse cryptography==3.3.2 urllib3 && \
pip install --upgrade requests && \
pip install setuptools==58.4.0 && \
# Install OpenSSH for MPI to communicate between containers
apt-get install -y --no-install-recommends openssh-client openssh-server && \
mkdir -p /var/run/sshd && \
# Allow OpenSSH to talk to containers without asking for confirmation
# by disabling StrictHostKeyChecking.
# mpi-operator mounts the .ssh folder from a Secret. For that to work, we need
# to disable UserKnownHostsFile to avoid write permissions.
# Disabling StrictModes avoids directory and files read permission checks.
sed -i 's/[ #]\(.*StrictHostKeyChecking \).*/ \1no/g' /etc/ssh/ssh_config && \
echo " UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config && \
sed -i 's/#\(StrictModes \).*/\1no/g' /etc/ssh/sshd_config
ADD ./bigdl-lora-finetuing-entrypoint.sh /ppml/bigdl-lora-finetuing-entrypoint.sh
ADD ./lora_finetune.py /ppml/lora_finetune.py
RUN chown -R mpiuser /ppml
USER mpiuser


@ -3,7 +3,7 @@
You can download directly from Dockerhub like:
```bash
docker pull intelanalytics/bigdl-lora-finetuning:2.4.0-SNAPSHOT
docker pull intelanalytics/bigdl-llm-finetune-cpu:2.4.0-SNAPSHOT
```
Or build the image from source:
@ -13,8 +13,8 @@ export HTTP_PROXY=your_http_proxy
export HTTPS_PROXY=your_https_proxy
docker build \
--build-arg HTTP_PROXY=${HTTP_PROXY} \
--build-arg HTTPS_PROXY=${HTTPS_PROXY} \
-t intelanalytics/bigdl-lora-finetuning:2.4.0-SNAPSHOT \
--build-arg http_proxy=${HTTP_PROXY} \
--build-arg https_proxy=${HTTPS_PROXY} \
-t intelanalytics/bigdl-llm-finetune-cpu:2.4.0-SNAPSHOT \
-f ./Dockerfile .
```
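After the build (or pull) finishes, you can confirm the image is available locally before making it accessible to your cluster:
```bash
# list the image to verify the tag exists locally
docker images intelanalytics/bigdl-llm-finetune-cpu:2.4.0-SNAPSHOT
```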


@ -1,4 +1,3 @@
{{- if eq .Values.TEEMode "native" }}
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
@ -90,4 +89,3 @@ spec:
- name: nfs-storage
persistentVolumeClaim:
claimName: nfs-pvc
{{- end }}


@ -1,15 +1,9 @@
imageName: intelanalytics/bigdl-lora-finetuning:2.4.0-SNAPSHOT
imageName: intelanalytics/bigdl-llm-finetune-cpu:2.4.0-SNAPSHOT
trainerNum: 8
microBatchSize: 8
TEEMode: tdx # tdx or native
nfsServerIp: your_nfs_server_ip
nfsPath: a_nfs_shared_folder_path_on_the_server
dataSubPath: alpaca_data_cleaned_archive.json # a subpath of the data file under nfs directory
modelSubPath: llama-7b-hf # a subpath of the model file (dir) under nfs directory
ompNumThreads: 14
cpuPerPod: 42
attestionApiServicePort: 9870
enableTLS: false # true or false
base64ServerCrt: "your_base64_format_server_crt"
base64ServerKey: "your_base64_format_server_key"


@ -1,83 +0,0 @@
ARG HTTP_PROXY
ARG HTTPS_PROXY
FROM mpioperator/intel as builder
ARG HTTP_PROXY
ARG HTTPS_PROXY
ADD ./requirements.txt /ppml/requirements.txt
RUN mkdir /ppml/data && mkdir /ppml/model && mkdir /ppml/output && \
# install pytorch 2.0.1
export http_proxy=$HTTP_PROXY && \
export https_proxy=$HTTPS_PROXY && \
apt-get update && \
# Basic dependencies and DCAP
apt-get update && \
apt install -y build-essential apt-utils wget git sudo vim && \
mkdir -p /opt/intel/ && \
cd /opt/intel && \
wget https://download.01.org/intel-sgx/sgx-dcap/1.16/linux/distro/ubuntu20.04-server/sgx_linux_x64_sdk_2.19.100.3.bin && \
chmod a+x ./sgx_linux_x64_sdk_2.19.100.3.bin && \
printf "no\n/opt/intel\n"|./sgx_linux_x64_sdk_2.19.100.3.bin && \
. /opt/intel/sgxsdk/environment && \
cd /opt/intel && \
wget https://download.01.org/intel-sgx/sgx-dcap/1.16/linux/distro/ubuntu20.04-server/sgx_debian_local_repo.tgz && \
tar xzf sgx_debian_local_repo.tgz && \
echo 'deb [trusted=yes arch=amd64] file:///opt/intel/sgx_debian_local_repo focal main' | tee /etc/apt/sources.list.d/intel-sgx.list && \
wget -qO - https://download.01.org/intel-sgx/sgx_repo/ubuntu/intel-sgx-deb.key | apt-key add - && \
env DEBIAN_FRONTEND=noninteractive apt-get update && apt install -y libsgx-enclave-common-dev libsgx-qe3-logic libsgx-pce-logic libsgx-ae-qe3 libsgx-ae-qve libsgx-urts libsgx-dcap-ql libsgx-dcap-default-qpl libsgx-dcap-quote-verify-dev libsgx-dcap-ql-dev libsgx-dcap-default-qpl-dev libsgx-ra-network libsgx-ra-uefi libtdx-attest libtdx-attest-dev && \
apt-get install -y python3-pip python3.9-dev python3-wheel && \
pip3 install --upgrade pip && \
pip install torch==2.0.1 && \
# install ipex and oneccl
pip install intel_extension_for_pytorch==2.0.100 && \
pip install oneccl_bind_pt -f https://developer.intel.com/ipex-whl-stable && \
# install transformers etc.
cd /ppml && \
apt-get update && \
apt-get install -y git && \
git clone https://github.com/huggingface/transformers.git && \
cd transformers && \
git reset --hard 057e1d74733f52817dc05b673a340b4e3ebea08c && \
pip install . && \
pip install -r /ppml/requirements.txt && \
# install python
env DEBIAN_FRONTEND=noninteractive apt-get update && \
apt install software-properties-common -y && \
add-apt-repository ppa:deadsnakes/ppa -y && \
apt-get install -y python3.9 && \
rm /usr/bin/python3 && \
ln -s /usr/bin/python3.9 /usr/bin/python3 && \
ln -s /usr/bin/python3 /usr/bin/python && \
apt-get install -y python3-pip python3.9-dev python3-wheel && \
pip install --upgrade pip && \
pip install --no-cache requests argparse cryptography==3.3.2 urllib3 && \
pip install --upgrade requests && \
pip install setuptools==58.4.0 && \
# Install OpenSSH for MPI to communicate between containers
apt-get install -y --no-install-recommends openssh-client openssh-server && \
mkdir -p /var/run/sshd && \
# Allow OpenSSH to talk to containers without asking for confirmation
# by disabling StrictHostKeyChecking.
# mpi-operator mounts the .ssh folder from a Secret. For that to work, we need
# to disable UserKnownHostsFile to avoid write permissions.
# Disabling StrictModes avoids directory and files read permission checks.
sed -i 's/[ #]\(.*StrictHostKeyChecking \).*/ \1no/g' /etc/ssh/ssh_config && \
echo " UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config && \
sed -i 's/#\(StrictModes \).*/\1no/g' /etc/ssh/sshd_config && \
echo 'port=4050' | tee /etc/tdx-attest.conf && \
pip install flask && \
echo "mpiuser ALL = NOPASSWD:SETENV: /opt/intel/oneapi/mpi/2021.9.0/bin/mpirun\nmpiuser ALL = NOPASSWD:SETENV: /usr/bin/python" > /etc/sudoers.d/mpivisudo && \
chmod 440 /etc/sudoers.d/mpivisudo
ADD ./bigdl_aa.py /ppml/bigdl_aa.py
ADD ./quote_generator.py /ppml/quote_generator.py
ADD ./worker_quote_generate.py /ppml/worker_quote_generate.py
ADD ./get_worker_quote.sh /ppml/get_worker_quote.sh
ADD ./bigdl-lora-finetuing-entrypoint.sh /ppml/bigdl-lora-finetuing-entrypoint.sh
ADD ./lora_finetune.py /ppml/lora_finetune.py
RUN chown -R mpiuser /ppml
USER mpiuser


@ -1,58 +0,0 @@
import quote_generator
from flask import Flask, request
from configparser import ConfigParser
import ssl, os
import base64
import requests
import subprocess
app = Flask(__name__)
@app.route('/gen_quote', methods=['POST'])
def gen_quote():
data = request.get_json()
user_report_data = data.get('user_report_data')
try:
quote_b = quote_generator.generate_tdx_quote(user_report_data)
quote = base64.b64encode(quote_b).decode('utf-8')
return {'quote': quote}
except Exception as e:
return {'quote': "quote generation failed: %s" % (e)}
@app.route('/attest', methods=['POST'])
def get_cluster_quote_list():
data = request.get_json()
user_report_data = data.get('user_report_data')
quote_list = []
try:
quote_b = quote_generator.generate_tdx_quote(user_report_data)
quote = base64.b64encode(quote_b).decode("utf-8")
quote_list.append(("launcher", quote))
except Exception as e:
quote_list.append("launcher", "quote generation failed: %s" % (e))
command = "sudo -u mpiuser -E bash /ppml/get_worker_quote.sh %s" % (user_report_data)
output = subprocess.check_output(command, shell=True)
with open("/ppml/output/quote.log", "r") as quote_file:
for line in quote_file:
line = line.strip()
if line:
parts = line.split(":")
if len(parts) == 2:
quote_list.append((parts[0].strip(), parts[1].strip()))
return {"quote_list": dict(quote_list)}
if __name__ == '__main__':
print("BigDL-AA: Agent Started.")
port = int(os.environ.get('ATTESTATION_API_SERVICE_PORT'))
enable_tls = os.environ.get('ENABLE_TLS')
if enable_tls == 'true':
context = ssl.SSLContext(ssl.PROTOCOL_TLS)
context.load_cert_chain(certfile='/ppml/keys/server.crt', keyfile='/ppml/keys/server.key')
# https_key_store_token = os.environ.get('HTTPS_KEY_STORE_TOKEN')
# context.load_cert_chain(certfile='/ppml/keys/server.crt', keyfile='/ppml/keys/server.key', password=https_key_store_token)
app.run(host='0.0.0.0', port=port, ssl_context=context)
else:
app.run(host='0.0.0.0', port=port)


@ -1,17 +0,0 @@
#!/bin/bash
set -x
source /opt/intel/oneapi/setvars.sh
export CCL_WORKER_COUNT=$WORLD_SIZE
export CCL_WORKER_AFFINITY=auto
export SAVE_PATH="/ppml/output"
mpirun \
-n $WORLD_SIZE \
-ppn 1 \
-f /home/mpiuser/hostfile \
-iface eth0 \
-genv OMP_NUM_THREADS=$OMP_NUM_THREADS \
-genv KMP_AFFINITY="granularity=fine,none" \
-genv KMP_BLOCKTIME=1 \
-genv TF_ENABLE_ONEDNN_OPTS=1 \
sudo -E python /ppml/worker_quote_generate.py --user_report_data $1 > $SAVE_PATH/quote.log 2>&1


@ -1,88 +0,0 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import ctypes
import base64
import os
def generate_tdx_quote(user_report_data):
# Define the uuid data structure
TDX_UUID_SIZE = 16
class TdxUuid(ctypes.Structure):
_fields_ = [("d", ctypes.c_uint8 * TDX_UUID_SIZE)]
# Define the report data structure
TDX_REPORT_DATA_SIZE = 64
class TdxReportData(ctypes.Structure):
_fields_ = [("d", ctypes.c_uint8 * TDX_REPORT_DATA_SIZE)]
# Define the report structure
TDX_REPORT_SIZE = 1024
class TdxReport(ctypes.Structure):
_fields_ = [("d", ctypes.c_uint8 * TDX_REPORT_SIZE)]
# Load the library
tdx_attest = ctypes.cdll.LoadLibrary("/usr/lib/x86_64-linux-gnu/libtdx_attest.so.1")
# Set the argument and return types for the function
tdx_attest.tdx_att_get_report.argtypes = [ctypes.POINTER(TdxReportData), ctypes.POINTER(TdxReport)]
tdx_attest.tdx_att_get_report.restype = ctypes.c_uint16
tdx_attest.tdx_att_get_quote.argtypes = [ctypes.POINTER(TdxReportData), ctypes.POINTER(TdxUuid), ctypes.c_uint32, ctypes.POINTER(TdxUuid), ctypes.POINTER(ctypes.POINTER(ctypes.c_uint8)), ctypes.POINTER(ctypes.c_uint32), ctypes.c_uint32]
tdx_attest.tdx_att_get_quote.restype = ctypes.c_uint16
# Call the function and check the return code
byte_array_data = bytearray(user_report_data.ljust(64)[:64], "utf-8").replace(b' ', b'\x00')
report_data = TdxReportData()
report_data.d = (ctypes.c_uint8 * 64).from_buffer(byte_array_data)
report = TdxReport()
result = tdx_attest.tdx_att_get_report(ctypes.byref(report_data), ctypes.byref(report))
if result != 0:
print("Error: " + hex(result))
att_key_id_list = None
list_size = 0
att_key_id = TdxUuid()
p_quote = ctypes.POINTER(ctypes.c_uint8)()
quote_size = ctypes.c_uint32()
flags = 0
result = tdx_attest.tdx_att_get_quote(ctypes.byref(report_data), att_key_id_list, list_size, ctypes.byref(att_key_id), ctypes.byref(p_quote), ctypes.byref(quote_size), flags)
if result != 0:
print("Error: " + hex(result))
else:
quote = ctypes.string_at(p_quote, quote_size.value)
return quote
def generate_gramine_quote(user_report_data):
USER_REPORT_PATH = "/dev/attestation/user_report_data"
QUOTE_PATH = "/dev/attestation/quote"
if not os.path.isfile(USER_REPORT_PATH):
print(f"File {USER_REPORT_PATH} not found.")
return ""
if not os.path.isfile(QUOTE_PATH):
print(f"File {QUOTE_PATH} not found.")
return ""
with open(USER_REPORT_PATH, 'w') as out:
out.write(user_report_data)
with open(QUOTE_PATH, "rb") as f:
quote = f.read()
return quote
if __name__ == "__main__":
print(generate_tdx_quote("ppml"))


@ -1,20 +0,0 @@
import quote_generator
import argparse
import ssl, os
import base64
import requests
parser = argparse.ArgumentParser()
parser.add_argument("--user_report_data", type=str, default="ppml")
args = parser.parse_args()
host = os.environ.get('HYDRA_BSTRAP_LOCALHOST').split('.')[0]
user_report_data = args.user_report_data
try:
quote_b = quote_generator.generate_tdx_quote(user_report_data)
quote = base64.b64encode(quote_b).decode('utf-8')
except Exception as e:
quote = "quote generation failed: %s" % (e)
print("%s: %s"%(host, quote))


@ -1,162 +0,0 @@
{{- if eq .Values.TEEMode "tdx" }}
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
name: bigdl-lora-finetuning-job
namespace: bigdl-lora-finetuning
spec:
slotsPerWorker: 1
runPolicy:
cleanPodPolicy: Running
sshAuthMountPath: /home/mpiuser/.ssh
mpiImplementation: Intel
mpiReplicaSpecs:
Launcher:
replicas: 1
template:
spec:
volumes:
- name: nfs-storage
persistentVolumeClaim:
claimName: nfs-pvc
- name: dev
hostPath:
path: /dev
{{- if eq .Values.enableTLS true }}
- name: ssl-keys
secret:
secretName: ssl-keys
{{- end }}
runtimeClassName: kata-qemu-tdx
containers:
- image: {{ .Values.imageName }}
name: bigdl-ppml-finetuning-launcher
securityContext:
runAsUser: 0
privileged: true
command: ["/bin/sh", "-c"]
args:
- |
nohup python /ppml/bigdl_aa.py > /ppml/bigdl_aa.log 2>&1 &
sudo -E -u mpiuser bash /ppml/bigdl-lora-finetuing-entrypoint.sh
env:
- name: WORKER_ROLE
value: "launcher"
- name: WORLD_SIZE
value: "{{ .Values.trainerNum }}"
- name: MICRO_BATCH_SIZE
value: "{{ .Values.microBatchSize }}"
- name: MASTER_PORT
value: "42679"
- name: MASTER_ADDR
value: "bigdl-lora-finetuning-job-worker-0.bigdl-lora-finetuning-job-worker"
- name: DATA_SUB_PATH
value: "{{ .Values.dataSubPath }}"
- name: OMP_NUM_THREADS
value: "{{ .Values.ompNumThreads }}"
- name: LOCAL_POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: HF_DATASETS_CACHE
value: "/ppml/output/cache"
- name: ATTESTATION_API_SERVICE_PORT
value: "{{ .Values.attestionApiServicePort }}"
- name: ENABLE_TLS
value: "{{ .Values.enableTLS }}"
volumeMounts:
- name: nfs-storage
subPath: {{ .Values.modelSubPath }}
mountPath: /ppml/model
- name: nfs-storage
subPath: {{ .Values.dataSubPath }}
mountPath: "/ppml/data/{{ .Values.dataSubPath }}"
- name: dev
mountPath: /dev
{{- if eq .Values.enableTLS true }}
- name: ssl-keys
mountPath: /ppml/keys
{{- end }}
Worker:
replicas: {{ .Values.trainerNum }}
template:
spec:
runtimeClassName: kata-qemu-tdx
containers:
- image: {{ .Values.imageName }}
name: bigdl-ppml-finetuning-worker
securityContext:
runAsUser: 0
privileged: true
command: ["/bin/sh", "-c"]
args:
- |
chown nobody /home/mpiuser/.ssh/id_rsa &
sudo -E -u mpiuser bash /ppml/bigdl-lora-finetuing-entrypoint.sh
env:
- name: WORKER_ROLE
value: "trainer"
- name: WORLD_SIZE
value: "{{ .Values.trainerNum }}"
- name: MICRO_BATCH_SIZE
value: "{{ .Values.microBatchSize }}"
- name: MASTER_PORT
value: "42679"
- name: MASTER_ADDR
value: "bigdl-lora-finetuning-job-worker-0.bigdl-lora-finetuning-job-worker"
- name: LOCAL_POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
volumeMounts:
- name: nfs-storage
subPath: {{ .Values.modelSubPath }}
mountPath: /ppml/model
- name: nfs-storage
subPath: {{ .Values.dataSubPath }}
mountPath: "/ppml/data/{{ .Values.dataSubPath }}"
- name: dev
mountPath: /dev
resources:
requests:
cpu: {{ .Values.cpuPerPod }}
limits:
cpu: {{ .Values.cpuPerPod }}
volumes:
- name: nfs-storage
persistentVolumeClaim:
claimName: nfs-pvc
- name: dev
hostPath:
path: /dev
---
apiVersion: v1
kind: Service
metadata:
name: bigdl-lora-finetuning-launcher-attestation-api-service
namespace: bigdl-lora-finetuning
spec:
selector:
job-name: bigdl-lora-finetuning-job-launcher
training.kubeflow.org/job-name: bigdl-lora-finetuning-job
training.kubeflow.org/job-role: launcher
ports:
- name: launcher-attestation-api-service-port
protocol: TCP
port: {{ .Values.attestionApiServicePort }}
targetPort: {{ .Values.attestionApiServicePort }}
type: ClusterIP
---
{{- if eq .Values.enableTLS true }}
apiVersion: v1
kind: Secret
metadata:
name: ssl-keys
namespace: bigdl-lora-finetuning
type: Opaque
data:
server.crt: {{ .Values.base64ServerCrt }}
server.key: {{ .Values.base64ServerKey }}
{{- end }}
{{- end }}