diff --git a/docs/readthedocs/source/doc/PPML/QuickStart/tpc-h_with_sparksql_on_k8s.md b/docs/readthedocs/source/doc/PPML/QuickStart/tpc-h_with_sparksql_on_k8s.md
index e70c35bf..141cfce6 100644
--- a/docs/readthedocs/source/doc/PPML/QuickStart/tpc-h_with_sparksql_on_k8s.md
+++ b/docs/readthedocs/source/doc/PPML/QuickStart/tpc-h_with_sparksql_on_k8s.md
@@ -8,198 +8,198 @@
 ### Prepare TPC-H kit and data ###
 1. Generate data
-Go to [TPC Download](https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp) site, choose `TPC-H` source code, then download the TPC-H toolkits. **Follow the download instructions carefully.**
-After you download the tpc-h tools zip and uncompressed the zip file. Go to `dbgen` directory, and create `makefile` based on `makefile.suite`, and modify `makefile` according to the prompts inside, and run `make`.
+   Go to the [TPC Download](https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp) site, choose the `TPC-H` source code, then download the TPC-H toolkit. **Follow the download instructions carefully.**
+   After you download and uncompress the TPC-H tools zip file, go to the `dbgen` directory, create `makefile` based on `makefile.suite`, modify `makefile` according to the prompts inside, and run `make`.

-This should generate an executable called `dbgen`
-```
-./dbgen -h
-```
+   This should generate an executable called `dbgen`. Running:
+   ```
+   ./dbgen -h
+   ```

-gives you the various options for generating the tables. The simplest case is running:
-```
-./dbgen
-```
-which generates tables with extension `.tbl` with scale 1 (default) for a total of rougly 1GB size across all tables. For different size tables you can use the `-s` option:
-```
-./dbgen -s 10
-```
-will generate roughly 10GB of input data.
+   gives you the various options for generating the tables. The simplest case is running:
+   ```
+   ./dbgen
+   ```
+   which generates tables with the extension `.tbl` at scale factor 1 (the default), for a total of roughly 1GB across all tables. For a different size, use the `-s` option:
+   ```
+   ./dbgen -s 10
+   ```
+   which generates roughly 10GB of input data.

-You need to move all .tbl files to a new directory as raw data.
+   You need to move all `.tbl` files to a new directory as raw data.

-You can then either upload your data to remote file system or read them locally.
+   You can then either upload the data to a remote file system or read it locally.
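+
+   For example, here is a minimal sketch for collecting the generated tables (the `dbgen-input` directory name is only an assumption, chosen to match the encryption example below):
+   ```bash
+   # run inside the dbgen directory after ./dbgen has finished;
+   # gather the generated TPC-H tables into one raw-data directory
+   mkdir -p ../dbgen-input
+   mv ./*.tbl ../dbgen-input/
+   ```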

 2. Encrypt Data
-Encrypt data with specified Key Management Service (`SimpleKeyManagementService`, or `EHSMKeyManagementService` , or `AzureKeyManagementService`). Details can be found here: https://github.com/intel-analytics/BigDL/tree/main/ppml/services/kms-utils/docker
+   Encrypt the data with a specified Key Management Service (`SimpleKeyManagementService`, `EHSMKeyManagementService`, or `AzureKeyManagementService`). Details can be found here: https://github.com/intel-analytics/BigDL/tree/main/ppml/services/kms-utils/docker

-The example code of encrypt data with `SimpleKeyManagementService` is like below:
-```
-java -cp "$BIGDL_HOME/jars/bigdl-ppml-spark_3.1.2-2.1.0-SNAPSHOT.jar:$SPARK_HOME/conf/:$SPARK_HOME/jars/*:$BIGDL_HOME/jars/*" \
-  -Xmx10g \
-  com.intel.analytics.bigdl.ppml.examples.tpch.EncryptFiles \
-  --inputPath xxx/dbgen-input \
-  --outputPath xxx/dbgen-encrypted
-  --kmsType SimpleKeyManagementService
-  --simpleAPPID xxxxxxxxxxxx \
-  --simpleAPPKEY xxxxxxxxxxxx \
-  --primaryKeyPath /path/to/simple_encrypted_primary_key \
-  --dataKeyPath /path/to/simple_encrypted_data_key
-```
+   Example code for encrypting data with `SimpleKeyManagementService` is shown below:
+   ```
+   java -cp "$BIGDL_HOME/jars/bigdl-ppml-spark_3.1.2-2.1.0-SNAPSHOT.jar:$SPARK_HOME/conf/:$SPARK_HOME/jars/*:$BIGDL_HOME/jars/*" \
+     -Xmx10g \
+     com.intel.analytics.bigdl.ppml.examples.tpch.EncryptFiles \
+     --inputPath xxx/dbgen-input \
+     --outputPath xxx/dbgen-encrypted \
+     --kmsType SimpleKeyManagementService \
+     --simpleAPPID xxxxxxxxxxxx \
+     --simpleAPPKEY xxxxxxxxxxxx \
+     --primaryKeyPath /path/to/simple_encrypted_primary_key \
+     --dataKeyPath /path/to/simple_encrypted_data_key
+   ```

 ### Deploy PPML TPC-H on Kubernetes ###
 1. Pull docker image
-```
-sudo docker pull intelanalytics/bigdl-ppml-trusted-big-data-ml-python-graphene:2.1.0-SNAPSHOT
-```
+   ```
+   sudo docker pull intelanalytics/bigdl-ppml-trusted-big-data-ml-python-graphene:2.1.0-SNAPSHOT
+   ```

-2. Prepare SGX keys (following instructions [here](https://github.com/intel-analytics/BigDL/tree/main/ppml/trusted-big-data-ml/python/docker-graphene#11-prepare-the-keyspassworddataenclave-keypem "here")), make sure keys and tpch-spark can be accessed on each K8S node
+2. Prepare the SGX keys (following the instructions [here](https://github.com/intel-analytics/BigDL/tree/main/ppml/trusted-big-data-ml/python/docker-graphene#11-prepare-the-keyspassworddataenclave-keypem "here")), and make sure the keys and `tpch-spark` can be accessed on each K8s node.
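+
+   For example, the enclave signing key can be generated with `openssl`; this is a sketch following the Graphene convention (a 3072-bit RSA key with public exponent 3), and the linked instructions cover the rest of the keys/password/data preparation:
+   ```bash
+   # generate the SGX enclave signing key used to sign the Graphene enclave
+   openssl genrsa -3 -out enclave-key.pem 3072
+   ```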

-3. Start a bigdl-ppml enabled Spark K8S client container with configured local IP, key, tpch and kuberconfig path
+3. Start a BigDL PPML enabled Spark K8S client container, configuring the local IP and the paths of the keys, `tpch-spark`, and the kubeconfig
-```
-export ENCLAVE_KEY=/path/to/enclave-key.pem
-export SECURE_PASSWORD_PATH=/path/to/password
-export DATA_PATH=/path/to/data
-export KEYS_PATH=/path/to/keys
-export KUBERCONFIG_PATH=/path/to/kuberconfig
-export LOCAL_IP=$local_ip
-export DOCKER_IMAGE=intelanalytics/bigdl-ppml-trusted-big-data-ml-python-graphene:2.1.0-SNAPSHOT
-sudo docker run -itd \
-    --privileged \
-    --net=host \
-    --name=spark-local-k8s-client \
-    --oom-kill-disable \
-    --device=/dev/sgx/enclave \
-    --device=/dev/sgx/provision \
-    -v /var/run/aesmd/aesm.socket:/var/run/aesmd/aesm.socket \
-    -v $SECURE_PASSWORD_PATH:/ppml/trusted-big-data-ml/work/password \
-    -v $ENCLAVE_KEY:/graphene/Pal/src/host/Linux-SGX/signer/enclave-key.pem \
-    -v $DATA_PATH:/ppml/trusted-big-data-ml/work/data \
-    -v $KEYS_PATH:/ppml/trusted-big-data-ml/work/keys \
-    -v $KUBERCONFIG_PATH:/root/.kube/config \
-    -e RUNTIME_SPARK_MASTER=k8s://https://$LOCAL_IP:6443 \
-    -e RUNTIME_K8S_SERVICE_ACCOUNT=spark \
-    -e RUNTIME_K8S_SPARK_IMAGE=$DOCKER_IMAGE \
-    -e RUNTIME_DRIVER_HOST=$LOCAL_IP \
-    -e RUNTIME_DRIVER_PORT=54321 \
-    -e RUNTIME_EXECUTOR_INSTANCES=1 \
-    -e RUNTIME_EXECUTOR_CORES=4 \
-    -e RUNTIME_EXECUTOR_MEMORY=20g \
-    -e RUNTIME_TOTAL_EXECUTOR_CORES=4 \
-    -e RUNTIME_DRIVER_CORES=4 \
-    -e RUNTIME_DRIVER_MEMORY=10g \
-    -e SGX_MEM_SIZE=64G \
-    -e SGX_LOG_LEVEL=error \
-    -e LOCAL_IP=$LOCAL_IP \
-    $DOCKER_IMAGE bash
-```
+   ```
+   export ENCLAVE_KEY=/path/to/enclave-key.pem
+   export SECURE_PASSWORD_PATH=/path/to/password
+   export DATA_PATH=/path/to/data
+   export KEYS_PATH=/path/to/keys
+   export KUBECONFIG_PATH=/path/to/kubeconfig
+   export LOCAL_IP=$local_ip
+   export DOCKER_IMAGE=intelanalytics/bigdl-ppml-trusted-big-data-ml-python-graphene:2.1.0-SNAPSHOT
+   sudo docker run -itd \
+       --privileged \
+       --net=host \
+       --name=spark-local-k8s-client \
+       --oom-kill-disable \
+       --device=/dev/sgx/enclave \
+       --device=/dev/sgx/provision \
+       -v /var/run/aesmd/aesm.socket:/var/run/aesmd/aesm.socket \
+       -v $SECURE_PASSWORD_PATH:/ppml/trusted-big-data-ml/work/password \
+       -v $ENCLAVE_KEY:/graphene/Pal/src/host/Linux-SGX/signer/enclave-key.pem \
+       -v $DATA_PATH:/ppml/trusted-big-data-ml/work/data \
+       -v $KEYS_PATH:/ppml/trusted-big-data-ml/work/keys \
+       -v $KUBECONFIG_PATH:/root/.kube/config \
+       -e RUNTIME_SPARK_MASTER=k8s://https://$LOCAL_IP:6443 \
+       -e RUNTIME_K8S_SERVICE_ACCOUNT=spark \
+       -e RUNTIME_K8S_SPARK_IMAGE=$DOCKER_IMAGE \
+       -e RUNTIME_DRIVER_HOST=$LOCAL_IP \
+       -e RUNTIME_DRIVER_PORT=54321 \
+       -e RUNTIME_EXECUTOR_INSTANCES=1 \
+       -e RUNTIME_EXECUTOR_CORES=4 \
+       -e RUNTIME_EXECUTOR_MEMORY=20g \
+       -e RUNTIME_TOTAL_EXECUTOR_CORES=4 \
+       -e RUNTIME_DRIVER_CORES=4 \
+       -e RUNTIME_DRIVER_MEMORY=10g \
+       -e SGX_MEM_SIZE=64G \
+       -e SGX_LOG_LEVEL=error \
+       -e LOCAL_IP=$LOCAL_IP \
+       $DOCKER_IMAGE bash
+   ```

 4. Attach to the client container
-```
-sudo docker exec -it spark-local-k8s-client bash
-```
+   ```
+   sudo docker exec -it spark-local-k8s-client bash
+   ```
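+
+   Optionally, sanity-check the container before going further (a hypothetical check, assuming `kubectl` is available in the image):
+   ```bash
+   # the mounted kubeconfig should let the container reach the cluster
+   kubectl get nodes
+   # the SGX device nodes passed through via --device should be present
+   ls /dev/sgx/enclave /dev/sgx/provision
+   ```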

-5. Modify `spark-executor-template.yaml`, add path of `enclave-key`, `tpch-spark` and `kuberconfig` on host
+5. Modify `spark-executor-template.yaml` to add the host paths of the `enclave-key`, `tpch-spark`, and the kubeconfig
-```
-apiVersion: v1
-kind: Pod
-spec:
-  containers:
-  - name: spark-executor
-    securityContext:
-      privileged: true
-    volumeMounts:
-      ...
-      - name: tpch
-        mountPath: /ppml/trusted-big-data-ml/work/tpch-spark
-      - name: kubeconf
-        mountPath: /root/.kube/config
-  volumes:
-  - name: enclave-key
-    hostPath:
-      path: /root/keys/enclave-key.pem
-  ...
-  - name: tpch
-    hostPath:
-      path: /path/to/tpch-spark
-  - name: kubeconf
-    hostPath:
-      path: /path/to/kuberconfig
-```
+   ```
+   apiVersion: v1
+   kind: Pod
+   spec:
+     containers:
+     - name: spark-executor
+       securityContext:
+         privileged: true
+       volumeMounts:
+         ...
+         - name: tpch
+           mountPath: /ppml/trusted-big-data-ml/work/tpch-spark
+         - name: kubeconf
+           mountPath: /root/.kube/config
+     volumes:
+     - name: enclave-key
+       hostPath:
+         path: /root/keys/enclave-key.pem
+     ...
+     - name: tpch
+       hostPath:
+         path: /path/to/tpch-spark
+     - name: kubeconf
+       hostPath:
+         path: /path/to/kubeconfig
+   ```

 6. Run PPML TPC-H
-```bash
-secure_password=`openssl rsautl -inkey /ppml/trusted-big-data-ml/work/password/key.txt -decrypt </ppml/trusted-big-data-ml/work/password/output.bin`
-...
-```
-The result is in OUTPUT_DIR. There should be a file called TIMES.TXT with content formatted like:
->Q01 39.80204010
\ No newline at end of file
+   ```bash
+   secure_password=`openssl rsautl -inkey /ppml/trusted-big-data-ml/work/password/key.txt -decrypt </ppml/trusted-big-data-ml/work/password/output.bin`
+   ...
+   ```
+   The result is in `OUTPUT_DIR`. There should be a file called `TIMES.TXT` with content formatted like:
+   >Q01 39.80204010
\ No newline at end of file