update tpch doc (#4979)

This commit is contained in:
Le-Zheng 2022-06-30 12:59:31 +08:00 committed by GitHub
parent 26ec59edf7
commit cef5359c54


@@ -6,38 +6,52 @@
- Intel SGX Device Plugin to use SGX in K8S cluster (install following instructions [here](https://bigdl.readthedocs.io/en/latest/doc/PPML/QuickStart/deploy_intel_sgx_device_plugin_for_kubernetes.html "here"))
### Prepare TPC-H kit and data ###
1. Generate data

Go to the [TPC Download](https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp) site, choose the `TPC-H` source code, and download the TPC-H toolkit. After you download and uncompress the zip file, go to the `dbgen` directory, create a makefile based on `makefile.suite`, and run `make`.

This should generate an executable called `dbgen`.
```bash
./dbgen -h
```
gives you the various options for generating the tables. The simplest case is running:
```bash
./dbgen
```
which generates tables with the extension `.tbl` at scale 1 (the default), for a total of roughly 1GB across all tables. For different table sizes you can use the `-s` option:
```bash
./dbgen -s 10
```
will generate roughly 10GB of input data.

You can then either upload your data to a remote file system or read it locally.
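For a larger benchmark, a minimal sketch of generating ~100GB of data and uploading it to HDFS (assuming an HDFS cluster reachable at `$HDFS_HOST:$HDFS_PORT`; both are placeholders for your own cluster):
```bash
# Generate ~100GB of input data; adjust -s to the size you need.
./dbgen -s 100

# Copy the generated .tbl files to HDFS so all K8S nodes can read them.
hdfs dfs -mkdir -p hdfs://$HDFS_HOST:$HDFS_PORT/tpc-h/dbgen
hdfs dfs -put ./*.tbl hdfs://$HDFS_HOST:$HDFS_PORT/tpc-h/dbgen/
```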
2. Encrypt Data

Encrypt the data with a specified Key Management Service (`SimpleKeyManagementService`, `EHSMKeyManagementService`, or `AzureKeyManagementService`). Example of encrypting data with `SimpleKeyManagementService`:
```bash
java -cp '/ppml/trusted-big-data-ml/work/bigdl-2.1.0-SNAPSHOT/lib/bigdl-ppml-spark_3.1.2-2.1.0-SNAPSHOT-jar-with-dependencies.jar:/ppml/trusted-big-data-ml/work/spark-3.1.2/conf/:/ppml/trusted-big-data-ml/work/spark-3.1.2/jars/*' \
  -Xmx10g \
  com.intel.analytics.bigdl.ppml.examples.tpch.EncryptFiles \
  --inputPath xxx/dbgen \
  --outputPath xxx/dbgen-encrypted
```
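As a quick sanity check (a sketch; `xxx/` is the same placeholder base path used above), the encrypted copies of the `.tbl` files should now be present in the output directory:
```bash
# Each table file should have an AES/CBC/PKCS5Padding-encrypted
# counterpart, matching the aes_cbc_pkcs5padding input mode passed
# to TpchQuery in the final step.
ls -lh xxx/dbgen-encrypted
```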
### Deploy PPML TPC-H on Kubernetes ###
1. Pull docker image
```bash
sudo docker pull intelanalytics/bigdl-ppml-trusted-big-data-ml-python-graphene:2.1.0-SNAPSHOT
```
2. Prepare SGX keys (following the instructions [here](https://github.com/intel-analytics/BigDL/tree/main/ppml/trusted-big-data-ml/python/docker-graphene#11-prepare-the-keyspassworddataenclave-keypem "here")) and make sure the keys and tpch-spark can be accessed on each K8S node
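For the enclave signing key specifically, a minimal sketch (Graphene-SGX expects an RSA-3072 key with public exponent 3; `/root/keys` matches the `$ENCLAVE_KEY` path exported below):
```bash
# Generate the enclave signing key expected by Graphene-SGX:
# RSA-3072 with public exponent 3.
openssl genrsa -3 -out /root/keys/enclave-key.pem 3072
```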
3. Start a bigdl-ppml enabled Spark K8S client container with configured local IP, key, tpch and kuberconfig path
```bash
export ENCLAVE_KEY=/root/keys/enclave-key.pem
export DATA_PATH=/root/zoo-tutorials/tpch-spark
export KEYS_PATH=/root/keys
export KUBERCONFIG_PATH=/root/kuberconfig
export LOCAL_IP=$local_ip
export DOCKER_IMAGE=intelanalytics/bigdl-ppml-trusted-big-data-ml-python-graphene:2.1.0-SNAPSHOT
sudo docker run -itd \
@@ -51,7 +65,6 @@ sudo docker run -itd \
    -v $ENCLAVE_KEY:/graphene/Pal/src/host/Linux-SGX/signer/enclave-key.pem \
    -v $DATA_PATH:/ppml/trusted-big-data-ml/work/tpch-spark \
    -v $KEYS_PATH:/ppml/trusted-big-data-ml/work/keys \
    -v $KUBERCONFIG_PATH:/root/.kube/config \
    -e RUNTIME_SPARK_MASTER=k8s://https://$LOCAL_IP:6443 \
    -e RUNTIME_K8S_SERVICE_ACCOUNT=spark \
@@ -101,17 +114,14 @@ spec:
path: /path/to/kuberconfig
```
6. Run PPML TPC-H
```bash
secure_password=`openssl rsautl -inkey /ppml/trusted-big-data-ml/work/password/key.txt -decrypt </ppml/trusted-big-data-ml/work/password/output.bin` && \
export TF_MKL_ALLOC_MAX_BYTES=10737418240 && \
export SPARK_LOCAL_IP=$LOCAL_IP && \
export INPUT_DIR=xxx/dbgen && \
export OUTPUT_DIR=xxx/output && \
/opt/jdk8/bin/java \
  -cp '/ppml/trusted-big-data-ml/work/bigdl-2.1.0-SNAPSHOT/lib/bigdl-ppml-spark_3.1.2-2.1.0-SNAPSHOT-jar-with-dependencies.jar:/ppml/trusted-big-data-ml/work/spark-3.1.2/conf/:/ppml/trusted-big-data-ml/work/spark-3.1.2/jars/*' \
  -Xmx10g \
  -Dbigdl.mklNumThreads=1 \
  org.apache.spark.deploy.SparkSubmit \
@@ -169,8 +179,13 @@ export OUTPUT_DIR=hdfs://$HDFS_HOST:$HDFS_PORT/tpc-h/output \
  --conf spark.ssl.trustStore=/ppml/trusted-big-data-ml/work/keys/keystore.jks \
  --conf spark.ssl.trustStorePassword=$secure_password \
  --conf spark.ssl.trustStoreType=JKS \
  --conf spark.bigdl.kms.type=SimpleKeyManagementService \
  --conf spark.bigdl.kms.simple.id=simpleAPPID \
  --conf spark.bigdl.kms.simple.key=simpleAPPKEY \
  --conf spark.bigdl.kms.key.primary=xxxx/primaryKey \
  --conf spark.bigdl.kms.key.data=xxxx/dataKey \
  --class com.intel.analytics.bigdl.ppml.examples.tpch.TpchQuery \
  --verbose \
  /ppml/trusted-big-data-ml/work/bigdl-2.1.0-SNAPSHOT/lib/bigdl-ppml-spark_3.1.2-2.1.0-SNAPSHOT-jar-with-dependencies.jar \
  $INPUT_DIR $OUTPUT_DIR aes_cbc_pkcs5padding plain_text [QUERY]
```
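The trailing arguments after `$INPUT_DIR $OUTPUT_DIR` select the input decryption mode (`aes_cbc_pkcs5padding`, matching the encryption step above) and the output format (`plain_text`); `[QUERY]` is an optional query number, and omitting it presumably runs all 22 TPC-H queries. Once the job finishes, a minimal sketch for inspecting the results (assuming `$OUTPUT_DIR` points at HDFS; use plain `ls` for a local path):
```bash
# List the per-query output written by TpchQuery.
hdfs dfs -ls $OUTPUT_DIR
```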