separate trusted and native llm cpu finetune from lora (#9050)
* separate trusted-llm and bigdl from lora finetuning
* add k8s for trusted llm finetune
* refine
* refine
* rename cpu to tdx in trusted llm
* solve conflict
* fix typo
* resolving conflict
* Delete docker/llm/finetune/lora/README.md
* fix

---------

Co-authored-by: Uxito-Ada <seusunheyang@foxmail.com>
Co-authored-by: leonardozcm <leonardo1997zcm@gmail.com>
parent b773d67dd4
commit 0b40ef8261
19 changed files with 115 additions and 441 deletions
@@ -109,3 +109,4 @@ Example response:

```json
{"quote_list":{"bigdl-lora-finetuning-job-worker-0":"BAACAIEAAAAAAA...","bigdl-lora-finetuning-job-worker-1":"BAACAIEAAAAAAA...","launcher":"BAACAIEAAAAAA..."}}
```
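For context, a request that produces a response of this shape could look like the following (a sketch based on the `/attest` route in `bigdl_aa.py` and the attestation service defined in the Helm chart later in this diff; the service DNS name and port depend on your deployment):

```bash
# Ask the launcher's attestation agent for quotes from all pods
curl -X POST http://bigdl-lora-finetuning-launcher-attestation-api-service:9870/attest \
  -H "Content-Type: application/json" \
  -d '{"user_report_data": "ppml"}'
```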
57 docker/llm/finetune/lora/cpu/README.md Normal file
@@ -0,0 +1,57 @@
## Run BF16-Optimized LoRA Finetuning on Kubernetes with OneCCL

[Alpaca Lora](https://github.com/tloen/alpaca-lora/tree/main) uses [low-rank adaptation](https://arxiv.org/pdf/2106.09685.pdf) to speed up the finetuning of the base model [Llama2-7b](https://huggingface.co/meta-llama/Llama-2-7b), aiming to reproduce the standard Alpaca, a general-purpose finetuned LLM. It is built on top of Hugging Face Transformers with a PyTorch backend, which natively requires expensive GPU resources and significant training time.

By contrast, BigDL provides a CPU optimization that accelerates the LoRA finetuning of Llama2-7b through mixed-precision and distributed training. Specifically, [Intel OneCCL](https://www.intel.com/content/www/us/en/developer/tools/oneapi/oneccl.html), an available Hugging Face backend, speeds up PyTorch computation with the BF16 data type on CPUs, while [Intel MPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/mpi-library.html) enables parallel processing on Kubernetes.
The architecture is illustrated below:

![image](https://github.com/intel-analytics/BigDL/assets/60865256/b66416bc-ad07-49af-8cb0-8967dffb5f58)

As shown above, BigDL implements its MPI training with the [Kubeflow MPI operator](https://github.com/kubeflow/mpi-operator/tree/master), which encapsulates the deployment as an MPIJob CRD and handles the construction of the MPI worker cluster on Kubernetes, including public key distribution, SSH connection, and log collection.
Now, let's deploy a LoRA finetuning job to create an LLM from Llama2-7b.

**Note: Please make sure you already have an available Kubernetes infrastructure and NFS shared storage, and have installed the [Helm CLI](https://helm.sh/docs/helm/helm_install/) for Kubernetes job submission.**

### 1. Install Kubeflow MPI Operator

Follow [the instructions here](https://github.com/kubeflow/mpi-operator/tree/master#installation) to install the Kubeflow MPI operator in your Kubernetes cluster; it will listen for and handle the MPIJob requests submitted in the following steps.
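You can verify the installation before moving on (a quick sanity check; the namespace and CRD name below follow the operator's default deployment and may differ in your setup):

```bash
# Check that the MPI operator pod is running and its CRD is registered
kubectl get pods -n mpi-operator
kubectl get crd mpijobs.kubeflow.org
```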
### 2. Download Image, Base Model and Finetuning Data

Follow [the instructions here](https://github.com/intel-analytics/BigDL/tree/main/docker/llm/finetune/lora/docker#prepare-bigdl-image-for-lora-finetuning) to prepare the BigDL LoRA finetuning image in your cluster.

As finetuning starts from a base model, first download the [Llama2-7b model from the public download site of Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b). Then, download the [cleaned Alpaca data](https://raw.githubusercontent.com/tloen/alpaca-lora/main/alpaca_data_cleaned_archive.json), which covers general knowledge and has already been cleaned. Next, move the downloaded files to a shared directory on your NFS server.
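For example (a sketch; `/mnt/nfs` stands in for wherever your NFS export is mounted):

```bash
# Fetch the cleaned Alpaca dataset and stage it on the NFS share
wget https://raw.githubusercontent.com/tloen/alpaca-lora/main/alpaca_data_cleaned_archive.json
cp alpaca_data_cleaned_archive.json /mnt/nfs/
```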
### 3. Deploy through Helm Chart

You can edit and experiment with different parameters in `./kubernetes/values.yaml` to improve finetuning performance and accuracy. For example, adjust `trainerNum` and `cpuPerPod` according to the number of nodes and CPU cores in your cluster to make full use of these resources, and note that different values of `microBatchSize` result in different training speed and loss (`microBatchSize` × `trainerNum` should be no more than 128, as that product is the global batch size).

**Note: `dataSubPath` and `modelSubPath` need to match the names of the data and model files under the NFS directory from step 2.**
After preparing the parameters in `./kubernetes/values.yaml`, submit the job as below:

```bash
cd ./kubernetes
helm install bigdl-lora-finetuning .
```
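Helm also lets you override individual values at install time without editing the file (standard `--set` flags; the parameter names are the ones defined in `values.yaml`):

```bash
# Example override: fewer trainers, same per-pod CPU budget.
# Keep microBatchSize x trainerNum <= 128, as noted above.
cd ./kubernetes
helm install bigdl-lora-finetuning . \
  --set trainerNum=4 \
  --set microBatchSize=8 \
  --set cpuPerPod=42
```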
### 4. Check Deployment

```bash
kubectl get all -n bigdl-lora-finetuning # you will see launcher and worker pods running
```
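If any pod stays in `Pending` or keeps restarting, the scheduling events usually explain why (a standard troubleshooting step, not specific to this chart; `<worker_pod_name>` is a placeholder):

```bash
# Inspect recent events for a stuck pod
kubectl describe pod <worker_pod_name> -n bigdl-lora-finetuning
```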
### 5. Check Finetuning Process

After a successful deployment, you can find the launcher pod, then go inside it and check the logs collected from all workers.
```bash
kubectl get all -n bigdl-lora-finetuning # you will see a launcher pod
kubectl exec -it <launcher_pod_name> -n bigdl-lora-finetuning -- bash # enter launcher pod
cat launcher.log # display logs collected from all workers
```
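Alternatively, you can stream the launcher's output without entering the pod (plain `kubectl logs`; this assumes the launcher writes its training output to stdout, which is worth verifying in your deployment):

```bash
# Follow the launcher pod's stdout as the job runs
kubectl logs -f <launcher_pod_name> -n bigdl-lora-finetuning
```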
From the log, you can see whether the finetuning process has been invoked successfully in all MPI worker pods; a progress bar with finetuning speed and estimated time will be shown after some data preprocessing steps (which may take quite a while).
The fine-tuned model is written by worker 0 (which holds rank 0), so you can find the model output inside that pod and save it to the host with command-line tools like `kubectl cp` or `scp`.
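For example (a sketch; `/ppml/output` matches the `SAVE_PATH` used by the entrypoint scripts in this repo, but verify the actual output path in your run):

```bash
# Copy the fine-tuned model out of worker 0 to the local host
kubectl cp bigdl-lora-finetuning/bigdl-lora-finetuning-job-worker-0:/ppml/output ./finetuned-model
```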
52 docker/llm/finetune/lora/cpu/docker/Dockerfile Normal file

@@ -0,0 +1,52 @@
ARG http_proxy
ARG https_proxy

FROM mpioperator/intel as builder

ARG http_proxy
ARG https_proxy
ENV PIP_NO_CACHE_DIR=false
ADD ./requirements.txt /ppml/requirements.txt

RUN mkdir /ppml/data && mkdir /ppml/model && \
    # install pytorch 2.0.1
    apt-get update && \
    apt-get install -y python3-pip python3.9-dev python3-wheel git software-properties-common && \
    pip3 install --upgrade pip && \
    pip install torch==2.0.1 && \
    # install ipex and oneccl
    pip install intel_extension_for_pytorch==2.0.100 && \
    pip install oneccl_bind_pt -f https://developer.intel.com/ipex-whl-stable && \
    # install transformers etc.
    cd /ppml && \
    git clone https://github.com/huggingface/transformers.git && \
    cd transformers && \
    git reset --hard 057e1d74733f52817dc05b673a340b4e3ebea08c && \
    pip install . && \
    pip install -r /ppml/requirements.txt && \
    # install python
    add-apt-repository ppa:deadsnakes/ppa -y && \
    apt-get install -y python3.9 && \
    rm /usr/bin/python3 && \
    ln -s /usr/bin/python3.9 /usr/bin/python3 && \
    ln -s /usr/bin/python3 /usr/bin/python && \
    pip install --no-cache requests argparse cryptography==3.3.2 urllib3 && \
    pip install --upgrade requests && \
    pip install setuptools==58.4.0 && \
    # Install OpenSSH for MPI to communicate between containers
    apt-get install -y --no-install-recommends openssh-client openssh-server && \
    mkdir -p /var/run/sshd && \
    # Allow OpenSSH to talk to containers without asking for confirmation
    # by disabling StrictHostKeyChecking.
    # mpi-operator mounts the .ssh folder from a Secret. For that to work, we need
    # to disable UserKnownHostsFile to avoid write permissions.
    # Disabling StrictModes avoids directory and files read permission checks.
    sed -i 's/[ #]\(.*StrictHostKeyChecking \).*/ \1no/g' /etc/ssh/ssh_config && \
    echo "    UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config && \
    sed -i 's/#\(StrictModes \).*/\1no/g' /etc/ssh/sshd_config

ADD ./bigdl-lora-finetuing-entrypoint.sh /ppml/bigdl-lora-finetuing-entrypoint.sh
ADD ./lora_finetune.py /ppml/lora_finetune.py

RUN chown -R mpiuser /ppml
USER mpiuser
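Before pushing the image to your cluster, a quick local smoke test can confirm it starts (the tag comes from the build instructions below; this is just an illustrative check):

```bash
# Start a throwaway container and poke around as the mpiuser user
docker run -it --rm intelanalytics/bigdl-llm-finetune-cpu:2.4.0-SNAPSHOT bash
```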
@@ -3,7 +3,7 @@
You can download directly from Dockerhub like:

```bash
-docker pull intelanalytics/bigdl-lora-finetuning:2.4.0-SNAPSHOT
+docker pull intelanalytics/bigdl-llm-finetune-cpu:2.4.0-SNAPSHOT
```

Or build the image from source:
@@ -13,8 +13,8 @@ export HTTP_PROXY=your_http_proxy
export HTTPS_PROXY=your_https_proxy

docker build \
-    --build-arg HTTP_PROXY=${HTTP_PROXY} \
-    --build-arg HTTPS_PROXY=${HTTPS_PROXY} \
-    -t intelanalytics/bigdl-lora-finetuning:2.4.0-SNAPSHOT \
+    --build-arg http_proxy=${HTTP_PROXY} \
+    --build-arg https_proxy=${HTTPS_PROXY} \
+    -t intelanalytics/bigdl-llm-finetune-cpu:2.4.0-SNAPSHOT \
    -f ./Dockerfile .
```
@@ -1,4 +1,3 @@
-{{- if eq .Values.TEEMode "native" }}
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:

@@ -90,4 +89,3 @@ spec:
          - name: nfs-storage
            persistentVolumeClaim:
              claimName: nfs-pvc
-{{- end }}
@@ -1,15 +1,9 @@
-imageName: intelanalytics/bigdl-lora-finetuning:2.4.0-SNAPSHOT
+imageName: intelanalytics/bigdl-llm-finetune-cpu:2.4.0-SNAPSHOT
trainerNum: 8
microBatchSize: 8
-TEEMode: tdx # tdx or native
nfsServerIp: your_nfs_server_ip
nfsPath: a_nfs_shared_folder_path_on_the_server
dataSubPath: alpaca_data_cleaned_archive.json # a subpath of the data file under nfs directory
modelSubPath: llama-7b-hf # a subpath of the model file (dir) under nfs directory
ompNumThreads: 14
cpuPerPod: 42
-attestionApiServicePort: 9870
-
-enableTLS: false # true or false
-base64ServerCrt: "your_base64_format_server_crt"
-base64ServerKey: "your_base64_format_server_key"
@@ -1,83 +0,0 @@
ARG HTTP_PROXY
ARG HTTPS_PROXY

FROM mpioperator/intel as builder

ARG HTTP_PROXY
ARG HTTPS_PROXY
ADD ./requirements.txt /ppml/requirements.txt

RUN mkdir /ppml/data && mkdir /ppml/model && mkdir /ppml/output && \
    # install pytorch 2.0.1
    export http_proxy=$HTTP_PROXY && \
    export https_proxy=$HTTPS_PROXY && \
    apt-get update && \
    # Basic dependencies and DCAP
    apt-get update && \
    apt install -y build-essential apt-utils wget git sudo vim && \
    mkdir -p /opt/intel/ && \
    cd /opt/intel && \
    wget https://download.01.org/intel-sgx/sgx-dcap/1.16/linux/distro/ubuntu20.04-server/sgx_linux_x64_sdk_2.19.100.3.bin && \
    chmod a+x ./sgx_linux_x64_sdk_2.19.100.3.bin && \
    printf "no\n/opt/intel\n"|./sgx_linux_x64_sdk_2.19.100.3.bin && \
    . /opt/intel/sgxsdk/environment && \
    cd /opt/intel && \
    wget https://download.01.org/intel-sgx/sgx-dcap/1.16/linux/distro/ubuntu20.04-server/sgx_debian_local_repo.tgz && \
    tar xzf sgx_debian_local_repo.tgz && \
    echo 'deb [trusted=yes arch=amd64] file:///opt/intel/sgx_debian_local_repo focal main' | tee /etc/apt/sources.list.d/intel-sgx.list && \
    wget -qO - https://download.01.org/intel-sgx/sgx_repo/ubuntu/intel-sgx-deb.key | apt-key add - && \
    env DEBIAN_FRONTEND=noninteractive apt-get update && apt install -y libsgx-enclave-common-dev libsgx-qe3-logic libsgx-pce-logic libsgx-ae-qe3 libsgx-ae-qve libsgx-urts libsgx-dcap-ql libsgx-dcap-default-qpl libsgx-dcap-quote-verify-dev libsgx-dcap-ql-dev libsgx-dcap-default-qpl-dev libsgx-ra-network libsgx-ra-uefi libtdx-attest libtdx-attest-dev && \
    apt-get install -y python3-pip python3.9-dev python3-wheel && \
    pip3 install --upgrade pip && \
    pip install torch==2.0.1 && \
    # install ipex and oneccl
    pip install intel_extension_for_pytorch==2.0.100 && \
    pip install oneccl_bind_pt -f https://developer.intel.com/ipex-whl-stable && \
    # install transformers etc.
    cd /ppml && \
    apt-get update && \
    apt-get install -y git && \
    git clone https://github.com/huggingface/transformers.git && \
    cd transformers && \
    git reset --hard 057e1d74733f52817dc05b673a340b4e3ebea08c && \
    pip install . && \
    pip install -r /ppml/requirements.txt && \
    # install python
    env DEBIAN_FRONTEND=noninteractive apt-get update && \
    apt install software-properties-common -y && \
    add-apt-repository ppa:deadsnakes/ppa -y && \
    apt-get install -y python3.9 && \
    rm /usr/bin/python3 && \
    ln -s /usr/bin/python3.9 /usr/bin/python3 && \
    ln -s /usr/bin/python3 /usr/bin/python && \
    apt-get install -y python3-pip python3.9-dev python3-wheel && \
    pip install --upgrade pip && \
    pip install --no-cache requests argparse cryptography==3.3.2 urllib3 && \
    pip install --upgrade requests && \
    pip install setuptools==58.4.0 && \
    # Install OpenSSH for MPI to communicate between containers
    apt-get install -y --no-install-recommends openssh-client openssh-server && \
    mkdir -p /var/run/sshd && \
    # Allow OpenSSH to talk to containers without asking for confirmation
    # by disabling StrictHostKeyChecking.
    # mpi-operator mounts the .ssh folder from a Secret. For that to work, we need
    # to disable UserKnownHostsFile to avoid write permissions.
    # Disabling StrictModes avoids directory and files read permission checks.
    sed -i 's/[ #]\(.*StrictHostKeyChecking \).*/ \1no/g' /etc/ssh/ssh_config && \
    echo "    UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config && \
    sed -i 's/#\(StrictModes \).*/\1no/g' /etc/ssh/sshd_config && \
    echo 'port=4050' | tee /etc/tdx-attest.conf && \
    pip install flask && \
    echo "mpiuser ALL = NOPASSWD:SETENV: /opt/intel/oneapi/mpi/2021.9.0/bin/mpirun\nmpiuser ALL = NOPASSWD:SETENV: /usr/bin/python" > /etc/sudoers.d/mpivisudo && \
    chmod 440 /etc/sudoers.d/mpivisudo

ADD ./bigdl_aa.py /ppml/bigdl_aa.py
ADD ./quote_generator.py /ppml/quote_generator.py
ADD ./worker_quote_generate.py /ppml/worker_quote_generate.py
ADD ./get_worker_quote.sh /ppml/get_worker_quote.sh

ADD ./bigdl-lora-finetuing-entrypoint.sh /ppml/bigdl-lora-finetuing-entrypoint.sh
ADD ./lora_finetune.py /ppml/lora_finetune.py

RUN chown -R mpiuser /ppml
USER mpiuser
@@ -1,58 +0,0 @@
import quote_generator
from flask import Flask, request
from configparser import ConfigParser
import ssl, os
import base64
import requests
import subprocess

app = Flask(__name__)

@app.route('/gen_quote', methods=['POST'])
def gen_quote():
    data = request.get_json()
    user_report_data = data.get('user_report_data')
    try:
        quote_b = quote_generator.generate_tdx_quote(user_report_data)
        quote = base64.b64encode(quote_b).decode('utf-8')
        return {'quote': quote}
    except Exception as e:
        return {'quote': "quote generation failed: %s" % (e)}

@app.route('/attest', methods=['POST'])
def get_cluster_quote_list():
    data = request.get_json()
    user_report_data = data.get('user_report_data')
    quote_list = []

    try:
        quote_b = quote_generator.generate_tdx_quote(user_report_data)
        quote = base64.b64encode(quote_b).decode("utf-8")
        quote_list.append(("launcher", quote))
    except Exception as e:
        quote_list.append(("launcher", "quote generation failed: %s" % (e)))

    command = "sudo -u mpiuser -E bash /ppml/get_worker_quote.sh %s" % (user_report_data)
    output = subprocess.check_output(command, shell=True)

    with open("/ppml/output/quote.log", "r") as quote_file:
        for line in quote_file:
            line = line.strip()
            if line:
                parts = line.split(":")
                if len(parts) == 2:
                    quote_list.append((parts[0].strip(), parts[1].strip()))
    return {"quote_list": dict(quote_list)}

if __name__ == '__main__':
    print("BigDL-AA: Agent Started.")
    port = int(os.environ.get('ATTESTATION_API_SERVICE_PORT'))
    enable_tls = os.environ.get('ENABLE_TLS')
    if enable_tls == 'true':
        context = ssl.SSLContext(ssl.PROTOCOL_TLS)
        context.load_cert_chain(certfile='/ppml/keys/server.crt', keyfile='/ppml/keys/server.key')
        # https_key_store_token = os.environ.get('HTTPS_KEY_STORE_TOKEN')
        # context.load_cert_chain(certfile='/ppml/keys/server.crt', keyfile='/ppml/keys/server.key', password=https_key_store_token)
        app.run(host='0.0.0.0', port=port, ssl_context=context)
    else:
        app.run(host='0.0.0.0', port=port)
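For reference, the single-pod `/gen_quote` route above can be exercised the same way (a sketch; port 9870 is the chart's default `attestionApiServicePort`, and HTTPS applies instead when `enableTLS` is set):

```bash
# Request one TDX quote from the agent, embedding custom report data
curl -X POST http://localhost:9870/gen_quote \
  -H "Content-Type: application/json" \
  -d '{"user_report_data": "ppml"}'
```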
@@ -1,17 +0,0 @@
#!/bin/bash
set -x
source /opt/intel/oneapi/setvars.sh
export CCL_WORKER_COUNT=$WORLD_SIZE
export CCL_WORKER_AFFINITY=auto
export SAVE_PATH="/ppml/output"

mpirun \
  -n $WORLD_SIZE \
  -ppn 1 \
  -f /home/mpiuser/hostfile \
  -iface eth0 \
  -genv OMP_NUM_THREADS=$OMP_NUM_THREADS \
  -genv KMP_AFFINITY="granularity=fine,none" \
  -genv KMP_BLOCKTIME=1 \
  -genv TF_ENABLE_ONEDNN_OPTS=1 \
  sudo -E python /ppml/worker_quote_generate.py --user_report_data $1 > $SAVE_PATH/quote.log 2>&1
@@ -1,88 +0,0 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import ctypes
import base64
import os

def generate_tdx_quote(user_report_data):
    # Define the uuid data structure
    TDX_UUID_SIZE = 16
    class TdxUuid(ctypes.Structure):
        _fields_ = [("d", ctypes.c_uint8 * TDX_UUID_SIZE)]

    # Define the report data structure
    TDX_REPORT_DATA_SIZE = 64
    class TdxReportData(ctypes.Structure):
        _fields_ = [("d", ctypes.c_uint8 * TDX_REPORT_DATA_SIZE)]

    # Define the report structure
    TDX_REPORT_SIZE = 1024
    class TdxReport(ctypes.Structure):
        _fields_ = [("d", ctypes.c_uint8 * TDX_REPORT_SIZE)]

    # Load the library
    tdx_attest = ctypes.cdll.LoadLibrary("/usr/lib/x86_64-linux-gnu/libtdx_attest.so.1")

    # Set the argument and return types for the function
    tdx_attest.tdx_att_get_report.argtypes = [ctypes.POINTER(TdxReportData), ctypes.POINTER(TdxReport)]
    tdx_attest.tdx_att_get_report.restype = ctypes.c_uint16

    tdx_attest.tdx_att_get_quote.argtypes = [ctypes.POINTER(TdxReportData), ctypes.POINTER(TdxUuid), ctypes.c_uint32, ctypes.POINTER(TdxUuid), ctypes.POINTER(ctypes.POINTER(ctypes.c_uint8)), ctypes.POINTER(ctypes.c_uint32), ctypes.c_uint32]
    tdx_attest.tdx_att_get_quote.restype = ctypes.c_uint16

    # Call the function and check the return code
    byte_array_data = bytearray(user_report_data.ljust(64)[:64], "utf-8").replace(b' ', b'\x00')
    report_data = TdxReportData()
    report_data.d = (ctypes.c_uint8 * 64).from_buffer(byte_array_data)
    report = TdxReport()
    result = tdx_attest.tdx_att_get_report(ctypes.byref(report_data), ctypes.byref(report))
    if result != 0:
        print("Error: " + hex(result))

    att_key_id_list = None
    list_size = 0
    att_key_id = TdxUuid()
    p_quote = ctypes.POINTER(ctypes.c_uint8)()
    quote_size = ctypes.c_uint32()
    flags = 0

    result = tdx_attest.tdx_att_get_quote(ctypes.byref(report_data), att_key_id_list, list_size, ctypes.byref(att_key_id), ctypes.byref(p_quote), ctypes.byref(quote_size), flags)

    if result != 0:
        print("Error: " + hex(result))
    else:
        quote = ctypes.string_at(p_quote, quote_size.value)
        return quote

def generate_gramine_quote(user_report_data):
    USER_REPORT_PATH = "/dev/attestation/user_report_data"
    QUOTE_PATH = "/dev/attestation/quote"
    if not os.path.isfile(USER_REPORT_PATH):
        print(f"File {USER_REPORT_PATH} not found.")
        return ""
    if not os.path.isfile(QUOTE_PATH):
        print(f"File {QUOTE_PATH} not found.")
        return ""
    with open(USER_REPORT_PATH, 'w') as out:
        out.write(user_report_data)
    with open(QUOTE_PATH, "rb") as f:
        quote = f.read()
    return quote

if __name__ == "__main__":
    print(generate_tdx_quote("ppml"))
@@ -1,20 +0,0 @@
import quote_generator
import argparse
import ssl, os
import base64
import requests

parser = argparse.ArgumentParser()
parser.add_argument("--user_report_data", type=str, default="ppml")

args = parser.parse_args()

host = os.environ.get('HYDRA_BSTRAP_LOCALHOST').split('.')[0]
user_report_data = args.user_report_data
try:
    quote_b = quote_generator.generate_tdx_quote(user_report_data)
    quote = base64.b64encode(quote_b).decode('utf-8')
except Exception as e:
    quote = "quote generation failed: %s" % (e)

print("%s: %s" % (host, quote))
@@ -1,162 +0,0 @@
{{- if eq .Values.TEEMode "tdx" }}
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: bigdl-lora-finetuning-job
  namespace: bigdl-lora-finetuning
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  sshAuthMountPath: /home/mpiuser/.ssh
  mpiImplementation: Intel
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          volumes:
          - name: nfs-storage
            persistentVolumeClaim:
              claimName: nfs-pvc
          - name: dev
            hostPath:
              path: /dev
          {{- if eq .Values.enableTLS true }}
          - name: ssl-keys
            secret:
              secretName: ssl-keys
          {{- end }}
          runtimeClassName: kata-qemu-tdx
          containers:
          - image: {{ .Values.imageName }}
            name: bigdl-ppml-finetuning-launcher
            securityContext:
              runAsUser: 0
              privileged: true
            command: ["/bin/sh", "-c"]
            args:
            - |
              nohup python /ppml/bigdl_aa.py > /ppml/bigdl_aa.log 2>&1 &
              sudo -E -u mpiuser bash /ppml/bigdl-lora-finetuing-entrypoint.sh
            env:
            - name: WORKER_ROLE
              value: "launcher"
            - name: WORLD_SIZE
              value: "{{ .Values.trainerNum }}"
            - name: MICRO_BATCH_SIZE
              value: "{{ .Values.microBatchSize }}"
            - name: MASTER_PORT
              value: "42679"
            - name: MASTER_ADDR
              value: "bigdl-lora-finetuning-job-worker-0.bigdl-lora-finetuning-job-worker"
            - name: DATA_SUB_PATH
              value: "{{ .Values.dataSubPath }}"
            - name: OMP_NUM_THREADS
              value: "{{ .Values.ompNumThreads }}"
            - name: LOCAL_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: HF_DATASETS_CACHE
              value: "/ppml/output/cache"
            - name: ATTESTATION_API_SERVICE_PORT
              value: "{{ .Values.attestionApiServicePort }}"
            - name: ENABLE_TLS
              value: "{{ .Values.enableTLS }}"
            volumeMounts:
            - name: nfs-storage
              subPath: {{ .Values.modelSubPath }}
              mountPath: /ppml/model
            - name: nfs-storage
              subPath: {{ .Values.dataSubPath }}
              mountPath: "/ppml/data/{{ .Values.dataSubPath }}"
            - name: dev
              mountPath: /dev
            {{- if eq .Values.enableTLS true }}
            - name: ssl-keys
              mountPath: /ppml/keys
            {{- end }}
    Worker:
      replicas: {{ .Values.trainerNum }}
      template:
        spec:
          runtimeClassName: kata-qemu-tdx
          containers:
          - image: {{ .Values.imageName }}
            name: bigdl-ppml-finetuning-worker
            securityContext:
              runAsUser: 0
              privileged: true
            command: ["/bin/sh", "-c"]
            args:
            - |
              chown nobody /home/mpiuser/.ssh/id_rsa &
              sudo -E -u mpiuser bash /ppml/bigdl-lora-finetuing-entrypoint.sh
            env:
            - name: WORKER_ROLE
              value: "trainer"
            - name: WORLD_SIZE
              value: "{{ .Values.trainerNum }}"
            - name: MICRO_BATCH_SIZE
              value: "{{ .Values.microBatchSize }}"
            - name: MASTER_PORT
              value: "42679"
            - name: MASTER_ADDR
              value: "bigdl-lora-finetuning-job-worker-0.bigdl-lora-finetuning-job-worker"
            - name: LOCAL_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            volumeMounts:
            - name: nfs-storage
              subPath: {{ .Values.modelSubPath }}
              mountPath: /ppml/model
            - name: nfs-storage
              subPath: {{ .Values.dataSubPath }}
              mountPath: "/ppml/data/{{ .Values.dataSubPath }}"
            - name: dev
              mountPath: /dev
            resources:
              requests:
                cpu: {{ .Values.cpuPerPod }}
              limits:
                cpu: {{ .Values.cpuPerPod }}
          volumes:
          - name: nfs-storage
            persistentVolumeClaim:
              claimName: nfs-pvc
          - name: dev
            hostPath:
              path: /dev
---
apiVersion: v1
kind: Service
metadata:
  name: bigdl-lora-finetuning-launcher-attestation-api-service
  namespace: bigdl-lora-finetuning
spec:
  selector:
    job-name: bigdl-lora-finetuning-job-launcher
    training.kubeflow.org/job-name: bigdl-lora-finetuning-job
    training.kubeflow.org/job-role: launcher
  ports:
  - name: launcher-attestation-api-service-port
    protocol: TCP
    port: {{ .Values.attestionApiServicePort }}
    targetPort: {{ .Values.attestionApiServicePort }}
  type: ClusterIP
---
{{- if eq .Values.enableTLS true }}
apiVersion: v1
kind: Secret
metadata:
  name: ssl-keys
  namespace: bigdl-lora-finetuning
type: Opaque
data:
  server.crt: {{ .Values.base64ServerCrt }}
  server.key: {{ .Values.base64ServerKey }}
{{- end }}

{{- end }}