[PPML] TPC-DS doc update (#6238)

* ppml tpcds doc update

* fix

* update data generation step
CharleneHu94 2022-10-21 17:01:37 +08:00 committed by GitHub
parent f7e07ecc69
commit 5e4c269a49

### Prepare TPC-DS kit and data
1. Download and compile TPC-DS kit
```bash
git clone --recursive https://github.com/intel-analytics/zoo-tutorials.git
cd zoo-tutorials/tpcds-spark
git clone https://github.com/databricks/tpcds-kit.git
cd tpcds-kit/tools
make OS=LINUX
# build the TPC-DS benchmark jar used in the later steps
cd ../..
sbt package
```
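If you want to confirm the kit compiled correctly, a quick sanity check (not part of the original steps) is to look for the generator binary:
```bash
# Optional sanity check: make should have produced the dsdgen binary under tpcds-kit/tools.
ls -l tpcds-kit/tools/dsdgen
```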
2. Generate data
```bash
cd /path/to/zoo-tutorials/tpcds-spark/spark-sql-perf
sbt "test:runMain com.databricks.spark.sql.perf.tpcds.GenTPCDSData -d <dsdgenDir> -s <scaleFactor> -l <dataDir> -f parquet"
```
`dsdgenDir` is the path to `tpcds-kit/tools`, `scaleFactor` specifies the data size (for example, `-s 1` generates data at a 1 GB scale factor), and `dataDir` is the path where the generated data will be stored.
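For instance, a generation run at scale factor 1 might look like the sketch below; the paths are placeholders and should be replaced with your own locations:
```bash
# Hypothetical example: generate ~1 GB of TPC-DS data as Parquet under /path/to/data.
cd /path/to/zoo-tutorials/tpcds-spark/spark-sql-perf
sbt "test:runMain com.databricks.spark.sql.perf.tpcds.GenTPCDSData -d /path/to/zoo-tutorials/tpcds-spark/tpcds-kit/tools -s 1 -l /path/to/data -f parquet"
```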
### Deploy PPML TPC-DS on Kubernetes
1. Pull docker image
```bash
sudo docker pull intelanalytics/bigdl-ppml-trusted-big-data-ml-python-graphene:2.1.0-SNAPSHOT
```
2. Prepare keys, password and K8S configurations (follow the instructions [here](https://github.com/intel-analytics/BigDL/tree/main/ppml/trusted-big-data-ml/python/docker-graphene#11-prepare-the-keyspassworddataenclave-keypem "here")). Make sure the keys, `tpcds-spark` and the generated TPC-DS data can be accessed on each K8S node, e.g. by placing them on distributed storage such as NFS or HDFS (see the sketch below).
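As one possible approach, the generated data directory can be uploaded to HDFS with standard HDFS commands; the target path below is a placeholder:
```bash
# Hypothetical example: make the generated data readable from every K8S node via HDFS.
hdfs dfs -mkdir -p /tpc-ds
hdfs dfs -put /path/to/data /tpc-ds/data
```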
3. Start a bigdl-ppml enabled Spark K8S client container with the local IP, keys, tpc-ds and kubeconfig paths configured; also configure the data path if your data is stored on the local file system
```bash
export ENCLAVE_KEY=/YOUR_DIR/keys/enclave-key.pem
export TPCDS_PATH=/YOUR_DIR/zoo-tutorials/tpcds-spark
export DATA_PATH=/YOUR_DIR/data
export KEYS_PATH=/YOUR_DIR/keys
export SECURE_PASSWORD_PATH=/YOUR_DIR/password
export KUBECONFIG_PATH=/YOUR_DIR/kubeconfig
export LOCAL_IP=$local_ip
export DOCKER_IMAGE=intelanalytics/bigdl-ppml-trusted-big-data-ml-python-graphene:2.1.0-SNAPSHOT
sudo docker run -itd \
--privileged \
--net=host \
--name=spark-k8s-client \
--oom-kill-disable \
--device=/dev/sgx/enclave \
--device=/dev/sgx/provision \
-v /var/run/aesmd/aesm.socket:/var/run/aesmd/aesm.socket \
-v $ENCLAVE_KEY:/graphene/Pal/src/host/Linux-SGX/signer/enclave-key.pem \
-v $TPCDS_PATH:/ppml/trusted-big-data-ml/work/tpcds-spark \
-v $DATA_PATH:/ppml/trusted-big-data-ml/work/data \
-v $KEYS_PATH:/ppml/trusted-big-data-ml/work/keys \
-v $SECURE_PASSWORD_PATH:/ppml/trusted-big-data-ml/work/password \
-v $KUBECONFIG_PATH:/root/.kube/config \
-e RUNTIME_SPARK_MASTER=k8s://https://$LOCAL_IP:6443 \
-e RUNTIME_K8S_SERVICE_ACCOUNT=spark \
-e RUNTIME_K8S_SPARK_IMAGE=$DOCKER_IMAGE \
$DOCKER_IMAGE bash
```
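You can optionally confirm that the client container started, for example:
```bash
# Optional check: the client container should be listed as running.
sudo docker ps --filter name=spark-k8s-client
```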
4. Attach to the client container
```bash
sudo docker exec -it spark-k8s-client bash
```
5. Create external tables
```bash
cd /ppml/trusted-big-data-ml/work/tpcds-spark
$SPARK_HOME/bin/spark-submit \
--class "createTables" \
--master <spark-master> \
--driver-memory 20G \
--executor-cores <executor-cores> \
--total-executor-cores <total-cores> \
--executor-memory 20G \
--jars spark-sql-perf/target/scala-2.12/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar \
target/scala-2.12/tpcds-benchmark_2.12-0.1.jar <dataDir> <dsdgenDir> <scaleFactor>
```
`<dataDir>` and `<dsdgenDir>` are the generated data path and the `tpcds-kit/tools` path respectively; both must be accessible from inside the container. After the tables are created successfully, a `metastore_db` directory should appear in the current working directory.
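For reference, a filled-in invocation might look like the following sketch; the master URL, resource sizes and container paths are illustrative assumptions (they assume the data was mounted at `/ppml/trusted-big-data-ml/work/data` and that `tpcds-kit` lives under the mounted `tpcds-spark` directory):
```bash
# Hypothetical example only; replace master, resources and paths with your own values.
$SPARK_HOME/bin/spark-submit \
  --class "createTables" \
  --master local[4] \
  --driver-memory 20G \
  --executor-cores 4 \
  --total-executor-cores 8 \
  --executor-memory 20G \
  --jars spark-sql-perf/target/scala-2.12/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar \
  target/scala-2.12/tpcds-benchmark_2.12-0.1.jar \
  /ppml/trusted-big-data-ml/work/data \
  /ppml/trusted-big-data-ml/work/tpcds-spark/tpcds-kit/tools \
  1
```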
6. Modify `/ppml/trusted-big-data-ml/spark-executor-template.yaml`: add the host paths of `enclave-key`, `tpcds-spark` and `kubeconfig`. If the data is not stored on HDFS, also configure a `data` mount volume and make sure its `mountPath` is the same as the `<dataDir>` used in the create-tables step.
```yaml
apiVersion: v1
...
securityContext:
  privileged: true
volumeMounts:
- name: enclave-key
  mountPath: /graphene/Pal/src/host/Linux-SGX/signer/enclave-key.pem
...
- name: tpcds
  mountPath: /ppml/trusted-big-data-ml/work/tpcds-spark
- name: data
  mountPath: /mounted/path/to/data
- name: kubeconf
  mountPath: /root/.kube/config
volumes:
- name: enclave-key
  hostPath:
    path: /path/to/keys/enclave-key.pem
...
- name: tpcds
  hostPath:
    path: /path/to/tpcds-spark
- name: data
  hostPath:
    path: /path/to/data
- name: kubeconf
  hostPath:
    path: /path/to/kubeconfig
```
7. Execute TPC-DS queries
Optional argument `QUERY` is the query number to run. Multiple query numbers should be separated by spaces, e.g. `1 2 3`. If no query number is specified, all 99 queries (1-99) will be executed. Configure `$HDFS_HOST` and `$HDFS_PORT` if the output is to be stored on HDFS.
```bash
cd /ppml/trusted-big-data-ml/work/tpcds-spark
secure_password=`openssl rsautl -inkey /ppml/trusted-big-data-ml/work/password/key.txt -decrypt </ppml/trusted-big-data-ml/work/password/output.bin` && \
export TF_MKL_ALLOC_MAX_BYTES=10737418240 && \
export SPARK_LOCAL_IP=$LOCAL_IP && \
export OUTPUT_DIR=hdfs://$HDFS_HOST:$HDFS_PORT/tpc-ds/output \
export QUERY=3
/opt/jdk8/bin/java \
-cp '$TPCDS_DIR/target/scala-2.12/tpcds-benchmark_2.12-0.1.jar:$TPCDS_DIR/spark-sql-perf/target/scala-2.12/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar:/ppml/trusted-big-data-ml/work/spark-3.1.2/conf/:/ppml/trusted-big-data-ml/work/spark-3.1.2/jars/*' \
-Xmx10g \
-Dbigdl.mklNumThreads=1 \
org.apache.spark.deploy.SparkSubmit \
--conf spark.ssl.trustStorePassword=$secure_password \
--conf spark.ssl.trustStoreType=JKS \
--class "TPCDSBenchmark" \
--jars $TPCDS_DIR/spark-sql-perf/target/scala-2.12/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar \
--verbose \
$TPCDS_DIR/target/scala-2.12/tpcds-benchmark_2.12-0.1.jar \
$OUTPUT_DIR $QUERY
```
Note: For Spark cluster mode, the `metastore_db` directory generated in the create-tables step needs to be mounted into the driver pod, and its path inside the container must be specified by adding `--conf spark.hadoop.javax.jdo.option.ConnectionURL="jdbc:derby:;databaseName=/path/to/metastore_db;create=true"` to the `spark-submit` command.
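For illustration, assuming `metastore_db` ends up under the mounted `tpcds-spark` directory inside the driver pod (an assumption; adjust to where you actually mount it), the extra flag would look like this:
```bash
# Hypothetical cluster-mode flag; the databaseName path is an assumed container path.
# Append it to the spark-submit arguments shown above.
--conf spark.hadoop.javax.jdo.option.ConnectionURL="jdbc:derby:;databaseName=/ppml/trusted-big-data-ml/work/tpcds-spark/metastore_db;create=true" \
```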
After the benchmark finishes, the performance results are saved as `part-*.csv` files under the `<OUTPUT_DIR>/performance` directory.
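For example, when `OUTPUT_DIR` points to HDFS as configured above, the results can be listed and viewed with standard HDFS commands:
```bash
# Inspect the benchmark output (OUTPUT_DIR as exported before running the queries).
hdfs dfs -ls $OUTPUT_DIR/performance
hdfs dfs -cat $OUTPUT_DIR/performance/part-*.csv
```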