[PPML] Add attestation for LLM Finetuning (#8908)

Add TDX attestation for LLM Finetuning in TDX CoCo

---------

Co-authored-by: Heyang Sun <60865256+Uxito-Ada@users.noreply.github.com>
This commit is contained in:
Xiangyu Tian 2023-09-08 10:24:04 +08:00 committed by GitHub
parent ee98cdd85c
commit ea6d4148e9
9 changed files with 387 additions and 2 deletions


@ -28,7 +28,7 @@ As finetuning is from a base model, first download [Llama 7b hf model from the p
You are allowed to edit and experiment with different parameters in `./kubernetes/values.yaml` to improve finetuning performance and accuracy. For example, you can adjust `trainerNum` and `cpuPerPod` according to the node and CPU core counts in your cluster to make full use of these resources, and different `microBatchSize` values result in different training speed and loss (note that `microBatchSize`×`trainerNum` should not be more than 128, as it is the global batch size).
**Note: `dataSubPath`, `modelSubPath` and `outputPath` need to have the same names as files under the NFS directory in step 2.**
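The batch-size constraint above can be sanity-checked before submitting the job. Below is a minimal sketch (the helper name is hypothetical, not part of this chart) that validates a `microBatchSize`/`trainerNum` pair:

```python
# Sanity-check the constraint described above:
# microBatchSize * trainerNum must not exceed 128 (the global batch size).
def check_batch_config(micro_batch_size: int, trainer_num: int, max_batch: int = 128) -> bool:
    """Return True if the effective global batch size fits within the limit."""
    return micro_batch_size * trainer_num <= max_batch

assert check_batch_config(8, 8)       # 64 <= 128: OK
assert not check_batch_config(32, 8)  # 256 > 128: reduce microBatchSize or trainerNum
```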
After preparing parameters in `./kubernetes/values.yaml`, submit the job as below:
@ -53,3 +53,40 @@ cat launcher.log # display logs collected from other workers
```
From the log, you can see whether the finetuning process has been invoked successfully in all MPI worker pods; a progress bar with finetuning speed and estimated time will be shown after some data preprocessing steps (this may take quite a while).
## To run in TDX-CoCo and enable Remote Attestation API
You can deploy this workload in TDX CoCo and enable Remote Attestation API serving by setting `TEEMode` in `./kubernetes/values.yaml` to `tdx`. The main differences are that the pods need to run as root and mount the TDX device, and a Flask service is responsible for generating the launcher's quote and collecting the workers' quotes.
To use the RA REST API, you need to get the IP of the job launcher:
```bash
kubectl get all -n bigdl-lora-finetuning
```
You will find a line like:
```bash
service/bigdl-lora-finetuning-launcher-attestation-api-service ClusterIP 10.109.87.248 <none> 9870/TCP 17m
```
The IP and port of the Remote Attestation API service are shown in this line.
The RA REST APIs are listed below:
### 1. Generate launcher's quote
```bash
curl -X POST -H "Content-Type: application/json" -d '{"user_report_data": "<your_user_report_data>"}' http://<your_ra_api_service_ip>:<your_ra_api_service_port>/gen_quote
```
Example response:
```json
{"quote":"BAACAIEAAAAAAAA..."}
```
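The same endpoint can be called programmatically instead of with `curl`. Below is a minimal client sketch (the function name is hypothetical; the service IP and port are the ones found via `kubectl get all` above) that posts the report data and decodes the base64 quote back to bytes:

```python
import base64
import json
from urllib import request

def gen_quote(service_ip: str, service_port: int, user_report_data: str) -> bytes:
    """POST to the launcher's /gen_quote endpoint and return the raw TDX quote bytes."""
    body = json.dumps({"user_report_data": user_report_data}).encode("utf-8")
    req = request.Request(
        f"http://{service_ip}:{service_port}/gen_quote",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        payload = json.load(resp)
    # The service returns the quote base64-encoded; decode it back to bytes.
    return base64.b64decode(payload["quote"])

# The decoding step works the same way on a canned response:
sample = {"quote": base64.b64encode(b"BAACAIEA-demo").decode("utf-8")}
assert base64.b64decode(sample["quote"]) == b"BAACAIEA-demo"
```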
### 2. Collect all cluster components' quotes (launcher and workers)
```bash
curl -X POST -H "Content-Type: application/json" -d '{"user_report_data": "<your_user_report_data>"}' http://<your_ra_api_service_ip>:<your_ra_api_service_port>/attest
```
Example response:
```json
{"quote_list":{"bigdl-lora-finetuning-job-worker-0":"BAACAIEAAAAAAA...","bigdl-lora-finetuning-job-worker-1":"BAACAIEAAAAAAA...","launcher":"BAACAIEAAAAAA..."}}
```
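A client will usually want to base64-decode each entry of `quote_list` before sending it to a verifier. A minimal sketch (helper name is hypothetical) that maps each pod name in an `/attest` response to its decoded quote bytes:

```python
import base64

def split_quotes(attest_response: dict) -> dict:
    """Map each pod name in an /attest response to its decoded quote bytes."""
    decoded = {}
    for pod, quote_b64 in attest_response["quote_list"].items():
        decoded[pod] = base64.b64decode(quote_b64)
    return decoded

# Canned response shaped like the example above:
sample = {"quote_list": {
    "launcher": base64.b64encode(b"q1").decode("utf-8"),
    "bigdl-lora-finetuning-job-worker-0": base64.b64encode(b"q2").decode("utf-8"),
}}
quotes = split_quotes(sample)
assert quotes["launcher"] == b"q1"
```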


@ -12,6 +12,21 @@ RUN mkdir /ppml/data && mkdir /ppml/model && mkdir /ppml/output && \
export http_proxy=$HTTP_PROXY && \
export https_proxy=$HTTPS_PROXY && \
# Basic dependencies and DCAP
apt-get update && \
apt install -y build-essential apt-utils wget git sudo vim && \
mkdir -p /opt/intel/ && \
cd /opt/intel && \
wget https://download.01.org/intel-sgx/sgx-dcap/1.16/linux/distro/ubuntu20.04-server/sgx_linux_x64_sdk_2.19.100.3.bin && \
chmod a+x ./sgx_linux_x64_sdk_2.19.100.3.bin && \
printf "no\n/opt/intel\n"|./sgx_linux_x64_sdk_2.19.100.3.bin && \
. /opt/intel/sgxsdk/environment && \
cd /opt/intel && \
wget https://download.01.org/intel-sgx/sgx-dcap/1.16/linux/distro/ubuntu20.04-server/sgx_debian_local_repo.tgz && \
tar xzf sgx_debian_local_repo.tgz && \
echo 'deb [trusted=yes arch=amd64] file:///opt/intel/sgx_debian_local_repo focal main' | tee /etc/apt/sources.list.d/intel-sgx.list && \
wget -qO - https://download.01.org/intel-sgx/sgx_repo/ubuntu/intel-sgx-deb.key | apt-key add - && \
env DEBIAN_FRONTEND=noninteractive apt-get update && apt install -y libsgx-enclave-common-dev libsgx-qe3-logic libsgx-pce-logic libsgx-ae-qe3 libsgx-ae-qve libsgx-urts libsgx-dcap-ql libsgx-dcap-default-qpl libsgx-dcap-quote-verify-dev libsgx-dcap-ql-dev libsgx-dcap-default-qpl-dev libsgx-ra-network libsgx-ra-uefi libtdx-attest libtdx-attest-dev && \
apt-get install -y python3-pip python3.9-dev python3-wheel && \
pip3 install --upgrade pip && \
pip install torch==2.0.1 && \
@ -50,9 +65,19 @@ RUN mkdir /ppml/data && mkdir /ppml/model && mkdir /ppml/output && \
# Disabling StrictModes avoids directory and files read permission checks.
sed -i 's/[ #]\(.*StrictHostKeyChecking \).*/ \1no/g' /etc/ssh/ssh_config && \
echo " UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config && \
sed -i 's/#\(StrictModes \).*/\1no/g' /etc/ssh/sshd_config && \
echo 'port=4050' | tee /etc/tdx-attest.conf && \
pip install flask && \
echo "mpiuser ALL = NOPASSWD:SETENV: /opt/intel/oneapi/mpi/2021.9.0/bin/mpirun\nmpiuser ALL = NOPASSWD:SETENV: /usr/bin/python" > /etc/sudoers.d/mpivisudo && \
chmod 440 /etc/sudoers.d/mpivisudo
ADD ./bigdl_aa.py /ppml/bigdl_aa.py
ADD ./quote_generator.py /ppml/quote_generator.py
ADD ./worker_quote_generate.py /ppml/worker_quote_generate.py
ADD ./get_worker_quote.sh /ppml/get_worker_quote.sh
ADD ./bigdl-lora-finetuing-entrypoint.sh /ppml/bigdl-lora-finetuing-entrypoint.sh
ADD ./lora_finetune.py /ppml/lora_finetune.py
RUN chown -R mpiuser /ppml
USER mpiuser


@ -0,0 +1,51 @@
import base64
import os
import shlex
import subprocess

import quote_generator
from flask import Flask, request

app = Flask(__name__)


@app.route('/gen_quote', methods=['POST'])
def gen_quote():
    # Generate a TDX quote for the launcher pod only.
    data = request.get_json()
    user_report_data = data.get('user_report_data')
    try:
        quote_b = quote_generator.generate_tdx_quote(user_report_data)
        quote = base64.b64encode(quote_b).decode('utf-8')
        return {'quote': quote}
    except Exception as e:
        return {'quote': "quote generation failed: %s" % (e)}


@app.route('/attest', methods=['POST'])
def get_cluster_quote_list():
    # Collect quotes from the launcher and every worker pod.
    data = request.get_json()
    user_report_data = data.get('user_report_data')
    quote_list = []
    try:
        quote_b = quote_generator.generate_tdx_quote(user_report_data)
        quote = base64.b64encode(quote_b).decode("utf-8")
        quote_list.append(("launcher", quote))
    except Exception as e:
        quote_list.append(("launcher", "quote generation failed: %s" % (e)))
    # Quote the user-supplied report data before passing it to the shell.
    command = "sudo -u mpiuser -E bash /ppml/get_worker_quote.sh %s" % (shlex.quote(user_report_data))
    subprocess.check_output(command, shell=True)
    # Each worker writes "<pod name>: <base64 quote>" to the shared log.
    with open("/ppml/output/quote.log", "r") as quote_file:
        for line in quote_file:
            line = line.strip()
            if line:
                parts = line.split(":")
                if len(parts) == 2:
                    quote_list.append((parts[0].strip(), parts[1].strip()))
    return {"quote_list": dict(quote_list)}


if __name__ == '__main__':
    print("BigDL-AA: Agent Started.")
    port = int(os.environ.get('ATTESTATION_API_SERVICE_PORT'))
    app.run(host='0.0.0.0', port=port)


@ -0,0 +1,17 @@
#!/bin/bash
set -x
source /opt/intel/oneapi/setvars.sh
export CCL_WORKER_COUNT=$WORLD_SIZE
export CCL_WORKER_AFFINITY=auto
export SAVE_PATH="/ppml/output"
mpirun \
-n $WORLD_SIZE \
-ppn 1 \
-f /home/mpiuser/hostfile \
-iface eth0 \
-genv OMP_NUM_THREADS=$OMP_NUM_THREADS \
-genv KMP_AFFINITY="granularity=fine,none" \
-genv KMP_BLOCKTIME=1 \
-genv TF_ENABLE_ONEDNN_OPTS=1 \
sudo -E python /ppml/worker_quote_generate.py --user_report_data "$1" > "$SAVE_PATH"/quote.log 2>&1


@ -0,0 +1,88 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import ctypes
import base64
import os
def generate_tdx_quote(user_report_data):
    # Define the uuid data structure
    TDX_UUID_SIZE = 16
    class TdxUuid(ctypes.Structure):
        _fields_ = [("d", ctypes.c_uint8 * TDX_UUID_SIZE)]

    # Define the report data structure
    TDX_REPORT_DATA_SIZE = 64
    class TdxReportData(ctypes.Structure):
        _fields_ = [("d", ctypes.c_uint8 * TDX_REPORT_DATA_SIZE)]

    # Define the report structure
    TDX_REPORT_SIZE = 1024
    class TdxReport(ctypes.Structure):
        _fields_ = [("d", ctypes.c_uint8 * TDX_REPORT_SIZE)]

    # Load the library
    tdx_attest = ctypes.cdll.LoadLibrary("/usr/lib/x86_64-linux-gnu/libtdx_attest.so.1")

    # Set the argument and return types for the functions
    tdx_attest.tdx_att_get_report.argtypes = [ctypes.POINTER(TdxReportData), ctypes.POINTER(TdxReport)]
    tdx_attest.tdx_att_get_report.restype = ctypes.c_uint16
    tdx_attest.tdx_att_get_quote.argtypes = [ctypes.POINTER(TdxReportData), ctypes.POINTER(TdxUuid), ctypes.c_uint32, ctypes.POINTER(TdxUuid), ctypes.POINTER(ctypes.POINTER(ctypes.c_uint8)), ctypes.POINTER(ctypes.c_uint32), ctypes.c_uint32]
    tdx_attest.tdx_att_get_quote.restype = ctypes.c_uint16

    # Pad or truncate the user report data to exactly 64 bytes.
    byte_array_data = bytearray(user_report_data.ljust(64)[:64], "utf-8").replace(b' ', b'\x00')
    report_data = TdxReportData()
    report_data.d = (ctypes.c_uint8 * 64).from_buffer(byte_array_data)
    report = TdxReport()
    result = tdx_attest.tdx_att_get_report(ctypes.byref(report_data), ctypes.byref(report))
    if result != 0:
        raise RuntimeError("tdx_att_get_report failed: " + hex(result))

    att_key_id_list = None
    list_size = 0
    att_key_id = TdxUuid()
    p_quote = ctypes.POINTER(ctypes.c_uint8)()
    quote_size = ctypes.c_uint32()
    flags = 0
    result = tdx_attest.tdx_att_get_quote(ctypes.byref(report_data), att_key_id_list, list_size, ctypes.byref(att_key_id), ctypes.byref(p_quote), ctypes.byref(quote_size), flags)
    if result != 0:
        raise RuntimeError("tdx_att_get_quote failed: " + hex(result))
    quote = ctypes.string_at(p_quote, quote_size.value)
    return quote
def generate_gramine_quote(user_report_data):
    USER_REPORT_PATH = "/dev/attestation/user_report_data"
    QUOTE_PATH = "/dev/attestation/quote"
    if not os.path.isfile(USER_REPORT_PATH):
        print(f"File {USER_REPORT_PATH} not found.")
        return ""
    if not os.path.isfile(QUOTE_PATH):
        print(f"File {QUOTE_PATH} not found.")
        return ""
    with open(USER_REPORT_PATH, 'w') as out:
        out.write(user_report_data)
    with open(QUOTE_PATH, "rb") as f:
        quote = f.read()
    return quote


if __name__ == "__main__":
    print(generate_tdx_quote("ppml"))


@ -0,0 +1,20 @@
import argparse
import base64
import os

import quote_generator

parser = argparse.ArgumentParser()
parser.add_argument("--user_report_data", type=str, default="ppml")
args = parser.parse_args()

# The MPI hostname env var identifies which worker pod this quote came from.
host = os.environ.get('HYDRA_BSTRAP_LOCALHOST').split('.')[0]
user_report_data = args.user_report_data
try:
    quote_b = quote_generator.generate_tdx_quote(user_report_data)
    quote = base64.b64encode(quote_b).decode('utf-8')
except Exception as e:
    quote = "quote generation failed: %s" % (e)
print("%s: %s" % (host, quote))


@ -1,3 +1,4 @@
{{- if eq .Values.TEEMode "native" }}
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
@ -95,3 +96,4 @@ spec:
          - name: nfs-storage
            persistentVolumeClaim:
              claimName: nfs-pvc
{{- end }}


@ -0,0 +1,144 @@
{{- if eq .Values.TEEMode "tdx" }}
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: bigdl-lora-finetuning-job
  namespace: bigdl-lora-finetuning
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  sshAuthMountPath: /home/mpiuser/.ssh
  mpiImplementation: Intel
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          volumes:
          - name: nfs-storage
            persistentVolumeClaim:
              claimName: nfs-pvc
          - name: dev
            hostPath:
              path: /dev
          runtimeClassName: kata-qemu-tdx
          containers:
          - image: {{ .Values.imageName }}
            name: bigdl-ppml-finetuning-launcher
            securityContext:
              runAsUser: 0
              privileged: true
            command: ["/bin/sh", "-c"]
            args:
            - |
              nohup python /ppml/bigdl_aa.py > /ppml/bigdl_aa.log 2>&1 &
              sudo -E -u mpiuser bash /ppml/bigdl-lora-finetuing-entrypoint.sh
            env:
            - name: WORKER_ROLE
              value: "launcher"
            - name: WORLD_SIZE
              value: "{{ .Values.trainerNum }}"
            - name: MICRO_BATCH_SIZE
              value: "{{ .Values.microBatchSize }}"
            - name: MASTER_PORT
              value: "42679"
            - name: MASTER_ADDR
              value: "bigdl-lora-finetuning-job-worker-0.bigdl-lora-finetuning-job-worker"
            - name: DATA_SUB_PATH
              value: "{{ .Values.dataSubPath }}"
            - name: OMP_NUM_THREADS
              value: "{{ .Values.ompNumThreads }}"
            - name: LOCAL_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: HF_DATASETS_CACHE
              value: "/ppml/output/cache"
            - name: ATTESTATION_API_SERVICE_PORT
              value: "{{ .Values.attestionApiServicePort }}"
            volumeMounts:
            - name: nfs-storage
              subPath: {{ .Values.modelSubPath }}
              mountPath: /ppml/model
            - name: nfs-storage
              subPath: {{ .Values.dataSubPath }}
              mountPath: "/ppml/data/{{ .Values.dataSubPath }}"
            - name: nfs-storage
              subPath: {{ .Values.outputSubPath }}
              mountPath: "/ppml/output"
            - name: dev
              mountPath: /dev
    Worker:
      replicas: {{ .Values.trainerNum }}
      template:
        spec:
          runtimeClassName: kata-qemu-tdx
          containers:
          - image: {{ .Values.imageName }}
            name: bigdl-ppml-finetuning-worker
            securityContext:
              runAsUser: 0
              privileged: true
            command: ["/bin/sh", "-c"]
            args:
            - |
              chown nobody /home/mpiuser/.ssh/id_rsa &
              sudo -E -u mpiuser bash /ppml/bigdl-lora-finetuing-entrypoint.sh
            env:
            - name: WORKER_ROLE
              value: "trainer"
            - name: WORLD_SIZE
              value: "{{ .Values.trainerNum }}"
            - name: MICRO_BATCH_SIZE
              value: "{{ .Values.microBatchSize }}"
            - name: MASTER_PORT
              value: "42679"
            - name: MASTER_ADDR
              value: "bigdl-lora-finetuning-job-worker-0.bigdl-lora-finetuning-job-worker"
            - name: LOCAL_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            volumeMounts:
            - name: nfs-storage
              subPath: {{ .Values.modelSubPath }}
              mountPath: /ppml/model
            - name: nfs-storage
              subPath: {{ .Values.dataSubPath }}
              mountPath: "/ppml/data/{{ .Values.dataSubPath }}"
            - name: nfs-storage
              subPath: {{ .Values.outputSubPath }}
              mountPath: "/ppml/output"
            - name: dev
              mountPath: /dev
            resources:
              requests:
                cpu: {{ .Values.cpuPerPod }}
              limits:
                cpu: {{ .Values.cpuPerPod }}
          volumes:
          - name: nfs-storage
            persistentVolumeClaim:
              claimName: nfs-pvc
          - name: dev
            hostPath:
              path: /dev
---
apiVersion: v1
kind: Service
metadata:
  name: bigdl-lora-finetuning-launcher-attestation-api-service
  namespace: bigdl-lora-finetuning
spec:
  selector:
    job-name: bigdl-lora-finetuning-job-launcher
    training.kubeflow.org/job-name: bigdl-lora-finetuning-job
    training.kubeflow.org/job-role: launcher
  ports:
  - name: launcher-attestation-api-service-port
    protocol: TCP
    port: {{ .Values.attestionApiServicePort }}
    targetPort: {{ .Values.attestionApiServicePort }}
  type: ClusterIP
{{- end }}


@ -9,3 +9,4 @@ modelSubPath: llama-7b-hf # a subpath of the model file (dir) under nfs director
outputSubPath: output # a subpath of the empty directory under the nfs directory to save finetuned model, for example, if you make an empty dir named 'output' at the nfsPath, the value should be 'output'
ompNumThreads: 14
cpuPerPod: 42
attestionApiServicePort: 9870