[PPML] Add attestation for LLM Finetuning (#8908)

Add TDX attestation for LLM Finetuning in TDX CoCo

---------

Co-authored-by: Heyang Sun <60865256+Uxito-Ada@users.noreply.github.com>
This commit is contained in:
Xiangyu Tian 2023-09-08 10:24:04 +08:00 committed by GitHub
parent ee98cdd85c
commit ea6d4148e9
9 changed files with 387 additions and 2 deletions


@ -28,7 +28,7 @@ As finetuning is from a base model, first download [Llama 7b hf model from the p
You are allowed to edit and experiment with different parameters in `./kubernetes/values.yaml` to improve finetuning performance and accuracy. For example, you can adjust `trainerNum` and `cpuPerPod` according to the node and CPU core counts in your cluster to make full use of these resources, and different `microBatchSize` values result in different training speed and loss (note that `microBatchSize`×`trainerNum` should not be more than 128, as it is the global batch size).
**Note: `dataSubPath`, `modelSubPath` and `outputPath` need to have the same names as files under the NFS directory in step 2.**
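The batch-size constraint above can be sanity-checked before submitting the job. Below is a minimal sketch (the helper name is hypothetical, not part of this chart) that validates a `microBatchSize`/`trainerNum` pair:

```python
# Sanity-check the constraint described above:
# microBatchSize * trainerNum must not exceed 128 (the global batch size).
def check_batch_config(micro_batch_size: int, trainer_num: int, max_batch: int = 128) -> bool:
    """Return True if the effective global batch size fits within the limit."""
    return micro_batch_size * trainer_num <= max_batch

assert check_batch_config(8, 8)       # 64 <= 128: OK
assert not check_batch_config(32, 8)  # 256 > 128: reduce microBatchSize or trainerNum
```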
After preparing parameters in `./kubernetes/values.yaml`, submit the job as below:
@ -53,3 +53,40 @@ cat launcher.log # display logs collected from other workers
```
From the log, you can see whether the finetuning process has been invoked successfully in all MPI worker pods; a progress bar with finetuning speed and estimated time will be shown after some data preprocessing steps (this may take quite a while).
## To run in TDX-CoCo and enable Remote Attestation API
You can deploy this workload in TDX CoCo and enable Remote Attestation API serving by setting `TEEMode` in `./kubernetes/values.yaml` to `tdx`. The main differences are that the pods need to run as root and mount the TDX device, and a Flask service is responsible for generating the launcher's quote and collecting the workers' quotes.
To use the RA REST API, you need to get the IP of the job launcher:
```bash
kubectl get all -n bigdl-lora-finetuning
```
You will find a line like:
```bash
service/bigdl-lora-finetuning-launcher-attestation-api-service ClusterIP 10.109.87.248 <none> 9870/TCP 17m
```
The IP and port of the Remote Attestation API service are shown in this line.
The RA REST APIs are listed below:
### 1. Generate launcher's quote
```bash
curl -X POST -H "Content-Type: application/json" -d '{"user_report_data": "<your_user_report_data>"}' http://<your_ra_api_service_ip>:<your_ra_api_service_port>/gen_quote
```
Example response:
```json
{"quote":"BAACAIEAAAAAAAA..."}
```
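The same endpoint can be called programmatically instead of with `curl`. Below is a minimal client sketch (the function name is hypothetical; the service IP and port are the ones found via `kubectl get all` above) that posts the report data and decodes the base64 quote back to bytes:

```python
import base64
import json
from urllib import request

def gen_quote(service_ip: str, service_port: int, user_report_data: str) -> bytes:
    """POST to the launcher's /gen_quote endpoint and return the raw TDX quote bytes."""
    body = json.dumps({"user_report_data": user_report_data}).encode("utf-8")
    req = request.Request(
        f"http://{service_ip}:{service_port}/gen_quote",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        payload = json.load(resp)
    # The service returns the quote base64-encoded; decode it back to bytes.
    return base64.b64decode(payload["quote"])

# The decoding step works the same way on a canned response:
sample = {"quote": base64.b64encode(b"BAACAIEA-demo").decode("utf-8")}
assert base64.b64decode(sample["quote"]) == b"BAACAIEA-demo"
```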
### 2. Collect all cluster components' quotes (launcher and workers)
```bash
curl -X POST -H "Content-Type: application/json" -d '{"user_report_data": "<your_user_report_data>"}' http://<your_ra_api_service_ip>:<your_ra_api_service_port>/attest
```
Example response:
```json
{"quote_list":{"bigdl-lora-finetuning-job-worker-0":"BAACAIEAAAAAAA...","bigdl-lora-finetuning-job-worker-1":"BAACAIEAAAAAAA...","launcher":"BAACAIEAAAAAA..."}}
```
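A client will usually want to base64-decode each entry of `quote_list` before sending it to a verifier. A minimal sketch (helper name is hypothetical) that maps each pod name in an `/attest` response to its decoded quote bytes:

```python
import base64

def split_quotes(attest_response: dict) -> dict:
    """Map each pod name in an /attest response to its decoded quote bytes."""
    decoded = {}
    for pod, quote_b64 in attest_response["quote_list"].items():
        decoded[pod] = base64.b64decode(quote_b64)
    return decoded

# Canned response shaped like the example above:
sample = {"quote_list": {
    "launcher": base64.b64encode(b"q1").decode("utf-8"),
    "bigdl-lora-finetuning-job-worker-0": base64.b64encode(b"q2").decode("utf-8"),
}}
quotes = split_quotes(sample)
assert quotes["launcher"] == b"q1"
```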


@ -12,6 +12,21 @@ RUN mkdir /ppml/data && mkdir /ppml/model && mkdir /ppml/output && \
export http_proxy=$HTTP_PROXY && \
export https_proxy=$HTTPS_PROXY && \
# Basic dependencies and DCAP
apt-get update && \
apt install -y build-essential apt-utils wget git sudo vim && \
mkdir -p /opt/intel/ && \
cd /opt/intel && \
wget https://download.01.org/intel-sgx/sgx-dcap/1.16/linux/distro/ubuntu20.04-server/sgx_linux_x64_sdk_2.19.100.3.bin && \
chmod a+x ./sgx_linux_x64_sdk_2.19.100.3.bin && \
printf "no\n/opt/intel\n"|./sgx_linux_x64_sdk_2.19.100.3.bin && \
. /opt/intel/sgxsdk/environment && \
cd /opt/intel && \
wget https://download.01.org/intel-sgx/sgx-dcap/1.16/linux/distro/ubuntu20.04-server/sgx_debian_local_repo.tgz && \
tar xzf sgx_debian_local_repo.tgz && \
echo 'deb [trusted=yes arch=amd64] file:///opt/intel/sgx_debian_local_repo focal main' | tee /etc/apt/sources.list.d/intel-sgx.list && \
wget -qO - https://download.01.org/intel-sgx/sgx_repo/ubuntu/intel-sgx-deb.key | apt-key add - && \
env DEBIAN_FRONTEND=noninteractive apt-get update && apt install -y libsgx-enclave-common-dev libsgx-qe3-logic libsgx-pce-logic libsgx-ae-qe3 libsgx-ae-qve libsgx-urts libsgx-dcap-ql libsgx-dcap-default-qpl libsgx-dcap-quote-verify-dev libsgx-dcap-ql-dev libsgx-dcap-default-qpl-dev libsgx-ra-network libsgx-ra-uefi libtdx-attest libtdx-attest-dev && \
apt-get install -y python3-pip python3.9-dev python3-wheel && \
pip3 install --upgrade pip && \
pip install torch==2.0.1 && \
@ -50,9 +65,19 @@ RUN mkdir /ppml/data && mkdir /ppml/model && mkdir /ppml/output && \
# Disabling StrictModes avoids directory and files read permission checks.
sed -i 's/[ #]\(.*StrictHostKeyChecking \).*/ \1no/g' /etc/ssh/ssh_config && \
echo " UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config && \
sed -i 's/#\(StrictModes \).*/\1no/g' /etc/ssh/sshd_config && \
echo 'port=4050' | tee /etc/tdx-attest.conf && \
pip install flask && \
echo "mpiuser ALL = NOPASSWD:SETENV: /opt/intel/oneapi/mpi/2021.9.0/bin/mpirun\nmpiuser ALL = NOPASSWD:SETENV: /usr/bin/python" > /etc/sudoers.d/mpivisudo && \
chmod 440 /etc/sudoers.d/mpivisudo
ADD ./bigdl_aa.py /ppml/bigdl_aa.py
ADD ./quote_generator.py /ppml/quote_generator.py
ADD ./worker_quote_generate.py /ppml/worker_quote_generate.py
ADD ./get_worker_quote.sh /ppml/get_worker_quote.sh
ADD ./bigdl-lora-finetuing-entrypoint.sh /ppml/bigdl-lora-finetuing-entrypoint.sh
ADD ./lora_finetune.py /ppml/lora_finetune.py
RUN chown -R mpiuser /ppml
USER mpiuser


@ -0,0 +1,51 @@
import base64
import os
import shlex
import subprocess

import quote_generator
from flask import Flask, request

app = Flask(__name__)


@app.route('/gen_quote', methods=['POST'])
def gen_quote():
    # Generate a TDX quote for the launcher pod only.
    data = request.get_json()
    user_report_data = data.get('user_report_data')
    try:
        quote_b = quote_generator.generate_tdx_quote(user_report_data)
        quote = base64.b64encode(quote_b).decode('utf-8')
        return {'quote': quote}
    except Exception as e:
        return {'quote': "quote generation failed: %s" % (e)}


@app.route('/attest', methods=['POST'])
def get_cluster_quote_list():
    # Collect quotes from the launcher and every worker pod.
    data = request.get_json()
    user_report_data = data.get('user_report_data')
    quote_list = []
    try:
        quote_b = quote_generator.generate_tdx_quote(user_report_data)
        quote = base64.b64encode(quote_b).decode("utf-8")
        quote_list.append(("launcher", quote))
    except Exception as e:
        quote_list.append(("launcher", "quote generation failed: %s" % (e)))
    # Quote the user-supplied report data before passing it to the shell.
    command = "sudo -u mpiuser -E bash /ppml/get_worker_quote.sh %s" % (shlex.quote(user_report_data))
    subprocess.check_output(command, shell=True)
    # Each worker writes "<pod name>: <base64 quote>" to the shared log.
    with open("/ppml/output/quote.log", "r") as quote_file:
        for line in quote_file:
            line = line.strip()
            if line:
                parts = line.split(":")
                if len(parts) == 2:
                    quote_list.append((parts[0].strip(), parts[1].strip()))
    return {"quote_list": dict(quote_list)}


if __name__ == '__main__':
    print("BigDL-AA: Agent Started.")
    port = int(os.environ.get('ATTESTATION_API_SERVICE_PORT'))
    app.run(host='0.0.0.0', port=port)


@ -0,0 +1,17 @@
#!/bin/bash
set -x
source /opt/intel/oneapi/setvars.sh
export CCL_WORKER_COUNT=$WORLD_SIZE
export CCL_WORKER_AFFINITY=auto
export SAVE_PATH="/ppml/output"
mpirun \
-n $WORLD_SIZE \
-ppn 1 \
-f /home/mpiuser/hostfile \
-iface eth0 \
-genv OMP_NUM_THREADS=$OMP_NUM_THREADS \
-genv KMP_AFFINITY="granularity=fine,none" \
-genv KMP_BLOCKTIME=1 \
-genv TF_ENABLE_ONEDNN_OPTS=1 \
sudo -E python /ppml/worker_quote_generate.py --user_report_data "$1" > "$SAVE_PATH"/quote.log 2>&1


@ -0,0 +1,88 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import ctypes
import base64
import os
def generate_tdx_quote(user_report_data):
    # Define the uuid data structure
    TDX_UUID_SIZE = 16
    class TdxUuid(ctypes.Structure):
        _fields_ = [("d", ctypes.c_uint8 * TDX_UUID_SIZE)]

    # Define the report data structure
    TDX_REPORT_DATA_SIZE = 64
    class TdxReportData(ctypes.Structure):
        _fields_ = [("d", ctypes.c_uint8 * TDX_REPORT_DATA_SIZE)]

    # Define the report structure
    TDX_REPORT_SIZE = 1024
    class TdxReport(ctypes.Structure):
        _fields_ = [("d", ctypes.c_uint8 * TDX_REPORT_SIZE)]

    # Load the library
    tdx_attest = ctypes.cdll.LoadLibrary("/usr/lib/x86_64-linux-gnu/libtdx_attest.so.1")

    # Set the argument and return types for the functions
    tdx_attest.tdx_att_get_report.argtypes = [ctypes.POINTER(TdxReportData), ctypes.POINTER(TdxReport)]
    tdx_attest.tdx_att_get_report.restype = ctypes.c_uint16
    tdx_attest.tdx_att_get_quote.argtypes = [ctypes.POINTER(TdxReportData), ctypes.POINTER(TdxUuid), ctypes.c_uint32, ctypes.POINTER(TdxUuid), ctypes.POINTER(ctypes.POINTER(ctypes.c_uint8)), ctypes.POINTER(ctypes.c_uint32), ctypes.c_uint32]
    tdx_attest.tdx_att_get_quote.restype = ctypes.c_uint16

    # Pad or truncate the user report data to exactly 64 bytes.
    byte_array_data = bytearray(user_report_data.ljust(64)[:64], "utf-8").replace(b' ', b'\x00')
    report_data = TdxReportData()
    report_data.d = (ctypes.c_uint8 * 64).from_buffer(byte_array_data)
    report = TdxReport()
    result = tdx_attest.tdx_att_get_report(ctypes.byref(report_data), ctypes.byref(report))
    if result != 0:
        raise RuntimeError("tdx_att_get_report failed: " + hex(result))

    att_key_id_list = None
    list_size = 0
    att_key_id = TdxUuid()
    p_quote = ctypes.POINTER(ctypes.c_uint8)()
    quote_size = ctypes.c_uint32()
    flags = 0
    result = tdx_attest.tdx_att_get_quote(ctypes.byref(report_data), att_key_id_list, list_size, ctypes.byref(att_key_id), ctypes.byref(p_quote), ctypes.byref(quote_size), flags)
    if result != 0:
        raise RuntimeError("tdx_att_get_quote failed: " + hex(result))
    quote = ctypes.string_at(p_quote, quote_size.value)
    return quote
def generate_gramine_quote(user_report_data):
    USER_REPORT_PATH = "/dev/attestation/user_report_data"
    QUOTE_PATH = "/dev/attestation/quote"
    if not os.path.isfile(USER_REPORT_PATH):
        print(f"File {USER_REPORT_PATH} not found.")
        return ""
    if not os.path.isfile(QUOTE_PATH):
        print(f"File {QUOTE_PATH} not found.")
        return ""
    with open(USER_REPORT_PATH, 'w') as out:
        out.write(user_report_data)
    with open(QUOTE_PATH, "rb") as f:
        quote = f.read()
    return quote


if __name__ == "__main__":
    print(generate_tdx_quote("ppml"))


@ -0,0 +1,20 @@
import argparse
import base64
import os

import quote_generator

parser = argparse.ArgumentParser()
parser.add_argument("--user_report_data", type=str, default="ppml")
args = parser.parse_args()

# The MPI hostname env var identifies which worker pod this quote came from.
host = os.environ.get('HYDRA_BSTRAP_LOCALHOST').split('.')[0]
user_report_data = args.user_report_data
try:
    quote_b = quote_generator.generate_tdx_quote(user_report_data)
    quote = base64.b64encode(quote_b).decode('utf-8')
except Exception as e:
    quote = "quote generation failed: %s" % (e)
print("%s: %s" % (host, quote))


@ -1,3 +1,4 @@
{{- if eq .Values.TEEMode "native" }}
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
@ -95,3 +96,4 @@ spec:
          - name: nfs-storage
            persistentVolumeClaim:
              claimName: nfs-pvc
{{- end }}


@ -0,0 +1,144 @@
{{- if eq .Values.TEEMode "tdx" }}
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: bigdl-lora-finetuning-job
  namespace: bigdl-lora-finetuning
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  sshAuthMountPath: /home/mpiuser/.ssh
  mpiImplementation: Intel
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          volumes:
          - name: nfs-storage
            persistentVolumeClaim:
              claimName: nfs-pvc
          - name: dev
            hostPath:
              path: /dev
          runtimeClassName: kata-qemu-tdx
          containers:
          - image: {{ .Values.imageName }}
            name: bigdl-ppml-finetuning-launcher
            securityContext:
              runAsUser: 0
              privileged: true
            command: ["/bin/sh", "-c"]
            args:
            - |
              nohup python /ppml/bigdl_aa.py > /ppml/bigdl_aa.log 2>&1 &
              sudo -E -u mpiuser bash /ppml/bigdl-lora-finetuing-entrypoint.sh
            env:
            - name: WORKER_ROLE
              value: "launcher"
            - name: WORLD_SIZE
              value: "{{ .Values.trainerNum }}"
            - name: MICRO_BATCH_SIZE
              value: "{{ .Values.microBatchSize }}"
            - name: MASTER_PORT
              value: "42679"
            - name: MASTER_ADDR
              value: "bigdl-lora-finetuning-job-worker-0.bigdl-lora-finetuning-job-worker"
            - name: DATA_SUB_PATH
              value: "{{ .Values.dataSubPath }}"
            - name: OMP_NUM_THREADS
              value: "{{ .Values.ompNumThreads }}"
            - name: LOCAL_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: HF_DATASETS_CACHE
              value: "/ppml/output/cache"
            - name: ATTESTATION_API_SERVICE_PORT
              value: "{{ .Values.attestionApiServicePort }}"
            volumeMounts:
            - name: nfs-storage
              subPath: {{ .Values.modelSubPath }}
              mountPath: /ppml/model
            - name: nfs-storage
              subPath: {{ .Values.dataSubPath }}
              mountPath: "/ppml/data/{{ .Values.dataSubPath }}"
            - name: nfs-storage
              subPath: {{ .Values.outputSubPath }}
              mountPath: "/ppml/output"
            - name: dev
              mountPath: /dev
    Worker:
      replicas: {{ .Values.trainerNum }}
      template:
        spec:
          runtimeClassName: kata-qemu-tdx
          containers:
          - image: {{ .Values.imageName }}
            name: bigdl-ppml-finetuning-worker
            securityContext:
              runAsUser: 0
              privileged: true
            command: ["/bin/sh", "-c"]
            args:
            - |
              chown nobody /home/mpiuser/.ssh/id_rsa &
              sudo -E -u mpiuser bash /ppml/bigdl-lora-finetuing-entrypoint.sh
            env:
            - name: WORKER_ROLE
              value: "trainer"
            - name: WORLD_SIZE
              value: "{{ .Values.trainerNum }}"
            - name: MICRO_BATCH_SIZE
              value: "{{ .Values.microBatchSize }}"
            - name: MASTER_PORT
              value: "42679"
            - name: MASTER_ADDR
              value: "bigdl-lora-finetuning-job-worker-0.bigdl-lora-finetuning-job-worker"
            - name: LOCAL_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            volumeMounts:
            - name: nfs-storage
              subPath: {{ .Values.modelSubPath }}
              mountPath: /ppml/model
            - name: nfs-storage
              subPath: {{ .Values.dataSubPath }}
              mountPath: "/ppml/data/{{ .Values.dataSubPath }}"
            - name: nfs-storage
              subPath: {{ .Values.outputSubPath }}
              mountPath: "/ppml/output"
            - name: dev
              mountPath: /dev
            resources:
              requests:
                cpu: {{ .Values.cpuPerPod }}
              limits:
                cpu: {{ .Values.cpuPerPod }}
          volumes:
          - name: nfs-storage
            persistentVolumeClaim:
              claimName: nfs-pvc
          - name: dev
            hostPath:
              path: /dev
---
apiVersion: v1
kind: Service
metadata:
  name: bigdl-lora-finetuning-launcher-attestation-api-service
  namespace: bigdl-lora-finetuning
spec:
  selector:
    job-name: bigdl-lora-finetuning-job-launcher
    training.kubeflow.org/job-name: bigdl-lora-finetuning-job
    training.kubeflow.org/job-role: launcher
  ports:
  - name: launcher-attestation-api-service-port
    protocol: TCP
    port: {{ .Values.attestionApiServicePort }}
    targetPort: {{ .Values.attestionApiServicePort }}
  type: ClusterIP
{{- end }}


@ -9,3 +9,4 @@ modelSubPath: llama-7b-hf # a subpath of the model file (dir) under nfs director
outputSubPath: output # a subpath of the empty directory under the nfs directory to save finetuned model, for example, if you make an empty dir named 'output' at the nfsPath, the value should be 'output'
ompNumThreads: 14
cpuPerPod: 42
attestionApiServicePort: 9870