[PPML] TPC-DS doc upadte (#6238)
* ppml tpcds doc update * fix * update data generation step
		
parent f7e07ecc69
commit 5e4c269a49
1 changed file with 189 additions and 182 deletions
				
			
		| 
						 | 
					@ -8,79 +8,60 @@
 | 
				
			||||||
 | 
					
 | 
				
			||||||
### Prepare TPC-DS kit and data
 | 
					### Prepare TPC-DS kit and data
 | 
				
			||||||
 | 
					
 | 
				
			||||||
1. Download and compile tpc-ds
 | 
					1. Download and compile TPC-DS kit
 | 
				
			||||||
 | 
					
 | 
				
			||||||
   ```bash
 | 
					```bash
 | 
				
			||||||
   git clone --recursive https://github.com/intel-analytics/zoo-tutorials.git
 | 
					git clone --recursive https://github.com/intel-analytics/zoo-tutorials.git
 | 
				
			||||||
   cd /path/to/zoo-tutorials
 | 
					cd zoo-tutorials/tpcds-spark
 | 
				
			||||||
   git clone https://github.com/databricks/tpcds-kit.git
 | 
					git clone https://github.com/databricks/tpcds-kit.git
 | 
				
			||||||
   cd tpcds-kit/tools
 | 
					cd tpcds-kit/tools
 | 
				
			||||||
   make OS=LINUX
 | 
					make OS=LINUX
 | 
				
			||||||
   ```
 | 
					cd ../../
 | 
				
			||||||
 | 
					sbt package
 | 
				
			||||||
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
2. Generate data
 | 
					2. Generate data
 | 
				
			||||||
 | 
					
 | 
				
			||||||
   ```bash
 | 
					```bash
 | 
				
			||||||
   cd /path/to/zoo-tutorials
 | 
					cd /path/to/zoo-tutorials/tpcds-spark/spark-sql-perf
 | 
				
			||||||
   cd tpcds-spark/spark-sql-perf
 | 
					sbt "test:runMain com.databricks.spark.sql.perf.tpcds.GenTPCDSData -d <dsdgenDir> -s <scaleFactor> -l <dataDir> -f parquet"
 | 
				
			||||||
   sbt "test:runMain com.databricks.spark.sql.perf.tpcds.GenTPCDSData -d <dsdgenDir> -s <scaleFactor> -l <dataDir> -f parquet"
 | 
					```
 | 
				
			||||||
   ```
 | 
					 | 
				
			||||||
 | 
					
 | 
				
			||||||
   `dsdgenDir` is the path of `tpcds-kit/tools`, `scaleFactor` is the size of the data, for example `-s 1` will generate 1G data, `dataDir` is the path to store generated data.
 | 
					`dsdgenDir` is the path to `tpcds-kit/tools`; `scaleFactor` sets the data size (for example, `-s 1` generates data at scale factor 1, roughly 1 GB); `dataDir` is the path where the generated data is stored.
 | 
				
			||||||
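As a minimal sketch of how the three placeholders fit together, the snippet below assembles the generation command from shell variables and prints it for review instead of running it. The variable names and the example paths are assumptions for illustration only; substitute your own locations.

```bash
# Hypothetical values; adjust to your environment.
DSDGEN_DIR=/path/to/zoo-tutorials/tpcds-spark/tpcds-kit/tools   # <dsdgenDir>
SCALE_FACTOR=1                                                   # <scaleFactor>
DATA_DIR=/path/to/data                                           # <dataDir>

# Assemble the sbt task string used in the step above.
GEN_CMD="test:runMain com.databricks.spark.sql.perf.tpcds.GenTPCDSData -d $DSDGEN_DIR -s $SCALE_FACTOR -l $DATA_DIR -f parquet"

# Dry run: print the full command. To execute it, run: sbt "$GEN_CMD"
echo "sbt \"$GEN_CMD\""
```

Printing the command first makes it easy to confirm that `-d` points at the compiled `tpcds-kit/tools` directory before starting a long generation run.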
 | 
					
 | 
				
			||||||
### Deploy PPML TPC-DS on Kubernetes
 | 
					### Deploy PPML TPC-DS on Kubernetes
 | 
				
			||||||
 | 
					1. Pull docker image
 | 
				
			||||||
 | 
					
 | 
				
			||||||
1. Compile Kit
 | 
					```bash
 | 
				
			||||||
 | 
					sudo docker pull intelanalytics/bigdl-ppml-trusted-big-data-ml-python-graphene:2.1.0-SNAPSHOT
 | 
				
			||||||
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
   ```bash
 | 
					2. Prepare keys, password and K8S configurations (follow the instructions [here](https://github.com/intel-analytics/BigDL/tree/main/ppml/trusted-big-data-ml/python/docker-graphene#11-prepare-the-keyspassworddataenclave-keypem "here")). Make sure the keys, `tpcds-spark` and the generated TPC-DS data can be accessed on each K8S node, e.g. by placing them on distributed storage such as NFS or HDFS.
 | 
				
			||||||
   cd zoo-tutorials/tpcds-spark
 | 
					3. Start a bigdl-ppml enabled Spark K8S client container with the local IP, key, tpc-ds and kubeconfig paths configured. Also configure the data path if your data is stored on the local file system.
 | 
				
			||||||
   sbt package
 | 
					 | 
				
			||||||
   ```
 | 
					 | 
				
			||||||
 | 
					
 | 
				
			||||||
2. Create external tables
 | 
					```bash
 | 
				
			||||||
 | 
					export ENCLAVE_KEY=/YOUR_DIR/keys/enclave-key.pem
 | 
				
			||||||
   ```bash
 | 
					export TPCDS_PATH=/YOUR_DIR/zoo-tutorials/tpcds-spark
 | 
				
			||||||
   $SPARK_HOME/bin/spark-submit \
 | 
					export DATA_PATH=/YOUR_DIR/data
 | 
				
			||||||
           --class "createTables" \
 | 
					export KEYS_PATH=/YOUR_DIR/keys
 | 
				
			||||||
           --master <spark-master> \
 | 
					export SECURE_PASSWORD_PATH=/YOUR_DIR/password
 | 
				
			||||||
           --driver-memory 20G \
 | 
					export KUBECONFIG_PATH=/YOUR_DIR/kubeconfig
 | 
				
			||||||
           --executor-cores <executor-cores> \
 | 
					export LOCAL_IP=$local_ip
 | 
				
			||||||
           --total-executor-cores <total-cores> \
 | 
					export DOCKER_IMAGE=intelanalytics/bigdl-ppml-trusted-big-data-ml-python-graphene:2.1.0-SNAPSHOT
 | 
				
			||||||
           --executor-memory 20G \
 | 
					sudo docker run -itd \
 | 
				
			||||||
           --jars spark-sql-perf/target/scala-2.12/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar \
 | 
					 | 
				
			||||||
           target/scala-2.12/tpcds-benchmark_2.12-0.1.jar <dataDir> <dsdgenDir> <scaleFactor>
 | 
					 | 
				
			||||||
   ```
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
3. Pull docker image
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
   ```bash
 | 
					 | 
				
			||||||
   sudo docker pull intelanalytics/bigdl-ppml-trusted-big-data-ml-python-graphene:2.1.0-SNAPSHOT
 | 
					 | 
				
			||||||
   ```
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
4. Prepare SGX keys (following instructions [here](https://github.com/intel-analytics/BigDL/tree/main/ppml/trusted-big-data-ml/python/docker-graphene#11-prepare-the-keyspassworddataenclave-keypem "here")), make sure keys and tpcds-spark can be accessed on each K8S node
 | 
					 | 
				
			||||||
5. Start a bigdl-ppml enabled Spark K8S client container with configured local IP, key, tpc-ds and kuberconfig path
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
   ```bash
 | 
					 | 
				
			||||||
   export ENCLAVE_KEY=/YOUR_DIR/keys/enclave-key.pem
 | 
					 | 
				
			||||||
   export DATA_PATH=/YOUR_DIR/zoo-tutorials/tpcds-spark
 | 
					 | 
				
			||||||
   export KEYS_PATH=/YOUR_DIR/keys
 | 
					 | 
				
			||||||
   export SECURE_PASSWORD_PATH=/YOUR_DIR/password
 | 
					 | 
				
			||||||
   export KUBERCONFIG_PATH=/YOUR_DIR/kuberconfig
 | 
					 | 
				
			||||||
   export LOCAL_IP=$local_ip
 | 
					 | 
				
			||||||
   export DOCKER_IMAGE=intelanalytics/bigdl-ppml-trusted-big-data-ml-python-graphene:2.1.0-SNAPSHOT
 | 
					 | 
				
			||||||
   sudo docker run -itd \
 | 
					 | 
				
			||||||
        --privileged \
 | 
					        --privileged \
 | 
				
			||||||
        --net=host \
 | 
					        --net=host \
 | 
				
			||||||
           --name=spark-local-k8s-client \
 | 
					        --name=spark-k8s-client \
 | 
				
			||||||
        --oom-kill-disable \
 | 
					        --oom-kill-disable \
 | 
				
			||||||
        --device=/dev/sgx/enclave \
 | 
					        --device=/dev/sgx/enclave \
 | 
				
			||||||
        --device=/dev/sgx/provision \
 | 
					        --device=/dev/sgx/provision \
 | 
				
			||||||
        -v /var/run/aesmd/aesm.socket:/var/run/aesmd/aesm.socket \
 | 
					        -v /var/run/aesmd/aesm.socket:/var/run/aesmd/aesm.socket \
 | 
				
			||||||
        -v $ENCLAVE_KEY:/graphene/Pal/src/host/Linux-SGX/signer/enclave-key.pem \
 | 
					        -v $ENCLAVE_KEY:/graphene/Pal/src/host/Linux-SGX/signer/enclave-key.pem \
 | 
				
			||||||
           -v $DATA_PATH:/ppml/trusted-big-data-ml/work/tpcds-spark \
 | 
					        -v $TPCDS_PATH:/ppml/trusted-big-data-ml/work/tpcds-spark \
 | 
				
			||||||
 | 
					        -v $DATA_PATH:/ppml/trusted-big-data-ml/work/data \
 | 
				
			||||||
        -v $KEYS_PATH:/ppml/trusted-big-data-ml/work/keys \
 | 
					        -v $KEYS_PATH:/ppml/trusted-big-data-ml/work/keys \
 | 
				
			||||||
        -v $SECURE_PASSWORD_PATH:/ppml/trusted-big-data-ml/work/password \
 | 
					        -v $SECURE_PASSWORD_PATH:/ppml/trusted-big-data-ml/work/password \
 | 
				
			||||||
           -v $KUBERCONFIG_PATH:/root/.kube/config \
 | 
					        -v $KUBECONFIG_PATH:/root/.kube/config \
 | 
				
			||||||
        -e RUNTIME_SPARK_MASTER=k8s://https://$LOCAL_IP:6443 \
 | 
					        -e RUNTIME_SPARK_MASTER=k8s://https://$LOCAL_IP:6443 \
 | 
				
			||||||
        -e RUNTIME_K8S_SERVICE_ACCOUNT=spark \
 | 
					        -e RUNTIME_K8S_SERVICE_ACCOUNT=spark \
 | 
				
			||||||
        -e RUNTIME_K8S_SPARK_IMAGE=$DOCKER_IMAGE \
 | 
					        -e RUNTIME_K8S_SPARK_IMAGE=$DOCKER_IMAGE \
 | 
				
			||||||
| 
						 | 
					@ -96,58 +77,82 @@
 | 
				
			||||||
        -e SGX_LOG_LEVEL=error \
 | 
					        -e SGX_LOG_LEVEL=error \
 | 
				
			||||||
        -e LOCAL_IP=$LOCAL_IP \
 | 
					        -e LOCAL_IP=$LOCAL_IP \
 | 
				
			||||||
        $DOCKER_IMAGE bash
 | 
					        $DOCKER_IMAGE bash
 | 
				
			||||||
   ```
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
6. Attach to the client container
 | 
					4. Attach to the client container
 | 
				
			||||||
 | 
					
 | 
				
			||||||
   ```bash
 | 
					```bash
 | 
				
			||||||
   sudo docker exec -it spark-local-k8s-client bash
 | 
					sudo docker exec -it spark-k8s-client bash
 | 
				
			||||||
   ```
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
7. Modify `spark-executor-template.yaml`, add path of `enclave-key`, `tpcds-spark` and `kuberconfig` on host
 | 
					5. Create external tables
 | 
				
			||||||
 | 
					
 | 
				
			||||||
   ```yaml
 | 
					```bash
 | 
				
			||||||
   apiVersion: v1
 | 
					cd /ppml/trusted-big-data-ml/work/tpcds-spark
 | 
				
			||||||
   kind: Pod
 | 
					$SPARK_HOME/bin/spark-submit \
 | 
				
			||||||
   spec:
 | 
					        --class "createTables" \
 | 
				
			||||||
 | 
					        --master <spark-master> \
 | 
				
			||||||
 | 
					        --driver-memory 20G \
 | 
				
			||||||
 | 
					        --executor-cores <executor-cores> \
 | 
				
			||||||
 | 
					        --total-executor-cores <total-cores> \
 | 
				
			||||||
 | 
					        --executor-memory 20G \
 | 
				
			||||||
 | 
					        --jars spark-sql-perf/target/scala-2.12/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar \
 | 
				
			||||||
 | 
					        target/scala-2.12/tpcds-benchmark_2.12-0.1.jar <dataDir> <dsdgenDir> <scaleFactor>
 | 
				
			||||||
 | 
					```
 | 
				
			||||||
 | 
					`<dataDir>` and `<dsdgenDir>` are the generated data path and `tpcds-kit/tools` path, both should be accessible in the container. After successfully creating tables, there should be a directory `metastore_db` in the current working path. 
 | 
				
			||||||
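The check described above (a `metastore_db` directory appearing in the working path after table creation) can be scripted. This is a small sketch; the function name `check_metastore` is hypothetical and not part of the repository.

```bash
# Returns 0 if a metastore_db directory exists under the given directory
# (defaults to the current working directory), 1 otherwise.
check_metastore() {
  workdir="${1:-.}"
  if [ -d "$workdir/metastore_db" ]; then
    echo "metastore_db found in $workdir"
  else
    echo "metastore_db not found in $workdir" >&2
    return 1
  fi
}
```

For example, `check_metastore /ppml/trusted-big-data-ml/work/tpcds-spark` could be run inside the client container after the `createTables` job completes.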
 | 
					
 | 
				
			||||||
 | 
					6. Modify `/ppml/trusted-big-data-ml/spark-executor-template.yaml` and add the paths of `enclave-key`, `tpcds-spark` and `kubeconfig`. If the data is not stored on HDFS, also configure the mount volume `data` and make sure its `mountPath` is the same as the `<dataDir>` used in the create tables step.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					```yaml
 | 
				
			||||||
 | 
					apiVersion: v1
 | 
				
			||||||
 | 
					kind: Pod
 | 
				
			||||||
 | 
					spec:
 | 
				
			||||||
  containers:
 | 
					  containers:
 | 
				
			||||||
  - name: spark-executor
 | 
					  - name: spark-executor
 | 
				
			||||||
    securityContext:
 | 
					    securityContext:
 | 
				
			||||||
      privileged: true
 | 
					      privileged: true
 | 
				
			||||||
    volumeMounts:
 | 
					    volumeMounts:
 | 
				
			||||||
 | 
					      - name: enclave-key
 | 
				
			||||||
 | 
					        mountPath: /graphene/Pal/src/host/Linux-SGX/signer/enclave-key.pem
 | 
				
			||||||
    ...
 | 
					    ...
 | 
				
			||||||
      - name: tpcds
 | 
					      - name: tpcds
 | 
				
			||||||
        mountPath: /ppml/trusted-big-data-ml/work/tpcds-spark
 | 
					        mountPath: /ppml/trusted-big-data-ml/work/tpcds-spark
 | 
				
			||||||
 | 
					      - name: data
 | 
				
			||||||
 | 
					        mountPath: /mounted/path/to/data
 | 
				
			||||||
      - name: kubeconf
 | 
					      - name: kubeconf
 | 
				
			||||||
        mountPath: /root/.kube/config
 | 
					        mountPath: /root/.kube/config
 | 
				
			||||||
  volumes:
 | 
					  volumes:
 | 
				
			||||||
    - name: enclave-key
 | 
					    - name: enclave-key
 | 
				
			||||||
      hostPath:
 | 
					      hostPath:
 | 
				
			||||||
           path:  /root/keys/enclave-key.pem
 | 
					        path:  /path/to/keys/enclave-key.pem
 | 
				
			||||||
    ...
 | 
					    ...
 | 
				
			||||||
    - name: tpcds
 | 
					    - name: tpcds
 | 
				
			||||||
      hostPath:
 | 
					      hostPath:
 | 
				
			||||||
        path: /path/to/tpcds-spark
 | 
					        path: /path/to/tpcds-spark
 | 
				
			||||||
 | 
					    - name: data
 | 
				
			||||||
 | 
					      hostPath:
 | 
				
			||||||
 | 
					        path: /path/to/data
 | 
				
			||||||
    - name: kubeconf
 | 
					    - name: kubeconf
 | 
				
			||||||
      hostPath:
 | 
					      hostPath:
 | 
				
			||||||
           path: /path/to/kuberconfig
 | 
					        path: /path/to/kubeconfig
 | 
				
			||||||
   ```
 | 
					```
 | 
				
			||||||
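Since the `data` volume's `mountPath` has to match the `<dataDir>` passed when creating tables, a quick consistency check may help. The sketch below is an assumption-laden helper (the name `template_has_mount` is hypothetical) that looks for an exact `mountPath:` value in a template file using `awk`.

```bash
# Succeeds if the file ($1) contains a line of the form "mountPath: <path>"
# whose value is exactly the expected path ($2).
template_has_mount() {
  awk -v p="$2" '$1 == "mountPath:" && $2 == p { found = 1 } END { exit !found }' "$1"
}
```

A call such as `template_has_mount /ppml/trusted-big-data-ml/spark-executor-template.yaml /mounted/path/to/data` returns a non-zero status when the template and the data path have drifted apart.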
 | 
					
 | 
				
			||||||
8. Execute TPC-DS queries
 | 
					7. Execute TPC-DS queries
 | 
				
			||||||
 | 
					
 | 
				
			||||||
   Optional argument `QUERY` is the query number to run. Multiple query numbers should be separated by space, e.g. `1 2 3`. If no query number is specified, all 1-99 queries would be executed.
 | 
					The optional argument `QUERY` is the query number to run. Multiple query numbers should be separated by spaces, e.g. `1 2 3`. If no query number is specified, all queries 1-99 are executed. Configure `$hdfs_host_ip` and `$hdfs_port` if the output is stored on HDFS. 
 | 
				
			||||||
 | 
					
 | 
				
			||||||
   ```bash
 | 
					```bash
 | 
				
			||||||
   secure_password=`openssl rsautl -inkey /ppml/trusted-big-data-ml/work/password/key.txt -decrypt </ppml/trusted-big-data-ml/work/password/output.bin` && \
 | 
					cd /ppml/trusted-big-data-ml/work/tpcds-spark
 | 
				
			||||||
   export TF_MKL_ALLOC_MAX_BYTES=10737418240 && \
 | 
					secure_password=`openssl rsautl -inkey /ppml/trusted-big-data-ml/work/password/key.txt -decrypt </ppml/trusted-big-data-ml/work/password/output.bin` && \
 | 
				
			||||||
   export SPARK_LOCAL_IP=$LOCAL_IP && \
 | 
					export TF_MKL_ALLOC_MAX_BYTES=10737418240 && \
 | 
				
			||||||
   export HDFS_HOST=$hdfs_host_ip && \
 | 
					export SPARK_LOCAL_IP=$LOCAL_IP && \
 | 
				
			||||||
   export HDFS_PORT=$hdfs_port && \
 | 
					export HDFS_HOST=$hdfs_host_ip && \
 | 
				
			||||||
   export TPCDS_DIR=/ppml/trusted-big-data-ml/work/tpcds-spark \
 | 
					export HDFS_PORT=$hdfs_port && \
 | 
				
			||||||
   export OUTPUT_DIR=hdfs://$HDFS_HOST:$HDFS_PORT/tpc-ds/output \
 | 
					export TPCDS_DIR=/ppml/trusted-big-data-ml/work/tpcds-spark && \
 | 
				
			||||||
   export QUERY=3
 | 
					export OUTPUT_DIR=hdfs://$HDFS_HOST:$HDFS_PORT/tpc-ds/output && \
 | 
				
			||||||
 | 
					export QUERY=3
 | 
				
			||||||
  /opt/jdk8/bin/java \
 | 
					  /opt/jdk8/bin/java \
 | 
				
			||||||
       -cp '$TPCDS_DIR/target/scala-2.12/tpcds-benchmark_2.12-0.1.jar:/ppml/trusted-big-data-ml/work/spark-3.1.2/conf/:/ppml/trusted-big-data-ml/work/spark-3.1.2/jars/*' \
 | 
					    -cp "$TPCDS_DIR/target/scala-2.12/tpcds-benchmark_2.12-0.1.jar:$TPCDS_DIR/spark-sql-perf/target/scala-2.12/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar:/ppml/trusted-big-data-ml/work/spark-3.1.2/conf/:/ppml/trusted-big-data-ml/work/spark-3.1.2/jars/*" \
 | 
				
			||||||
    -Xmx10g \
 | 
					    -Xmx10g \
 | 
				
			||||||
    -Dbigdl.mklNumThreads=1 \
 | 
					    -Dbigdl.mklNumThreads=1 \
 | 
				
			||||||
    org.apache.spark.deploy.SparkSubmit \
 | 
					    org.apache.spark.deploy.SparkSubmit \
 | 
				
			||||||
| 
						 | 
					@ -206,9 +211,11 @@
 | 
				
			||||||
    --conf spark.ssl.trustStorePassword=$secure_password \
 | 
					    --conf spark.ssl.trustStorePassword=$secure_password \
 | 
				
			||||||
    --conf spark.ssl.trustStoreType=JKS \
 | 
					    --conf spark.ssl.trustStoreType=JKS \
 | 
				
			||||||
    --class "TPCDSBenchmark" \
 | 
					    --class "TPCDSBenchmark" \
 | 
				
			||||||
 | 
					    --jars $TPCDS_DIR/spark-sql-perf/target/scala-2.12/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar \
 | 
				
			||||||
    --verbose \
 | 
					    --verbose \
 | 
				
			||||||
    $TPCDS_DIR/target/scala-2.12/tpcds-benchmark_2.12-0.1.jar \
 | 
					    $TPCDS_DIR/target/scala-2.12/tpcds-benchmark_2.12-0.1.jar \
 | 
				
			||||||
    $OUTPUT_DIR $QUERY
 | 
					    $OUTPUT_DIR $QUERY
 | 
				
			||||||
   ```
 | 
					```
 | 
				
			||||||
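Because the command above passes `$QUERY` unquoted, a space-separated value is split by the shell into one argument per query number. The sketch below illustrates that behaviour with a hypothetical helper, `count_args`, that only counts its arguments; it assumes the default `IFS`.

```bash
# Print how many arguments were received.
count_args() { echo "$#"; }

QUERY="1 2 3"          # quote the assignment so all three numbers are kept
count_args $QUERY      # unquoted: word splitting yields one argument per query number
count_args "$QUERY"    # quoted: a single argument containing "1 2 3"
```

So to run several queries, `export QUERY="1 2 3"` with quotes on the assignment, and leave `$QUERY` unquoted at the end of the submit command as shown in the step above.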
 | 
					Note: For Spark cluster mode, the `metastore_db` directory generated in the create tables step needs to be mounted into the driver pod, and its path in the container needs to be specified by adding `--conf spark.hadoop.javax.jdo.option.ConnectionURL="jdbc:derby:;databaseName=/path/to/metastore_db;create=true" \` to the `spark-submit` command.
 | 
				
			||||||
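One way to avoid typos in the Derby connection URL from the note above is to build it from a single variable. This is a sketch only; `METASTORE_DB` and `METASTORE_CONF` are hypothetical variable names, and the path is the same placeholder used in the note.

```bash
# Path of the mounted metastore_db inside the driver container (placeholder).
METASTORE_DB=/path/to/metastore_db

# Build the property exactly as given in the note; the semicolons are safe inside double quotes.
METASTORE_CONF="spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=${METASTORE_DB};create=true"

# Print the option to add to the spark-submit command.
echo "--conf \"$METASTORE_CONF\""
```

Keeping the value quoted when it is passed on matters, since an unquoted `;` would otherwise end the shell command.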
 | 
					
 | 
				
			||||||
   After benchmark is finished, the performance result is saved as `part-*.csv` file under `<OUTPUT_DIR>/performance` directory.
 | 
					After the benchmark finishes, the performance results are saved as `part-*.csv` files under the `<OUTPUT_DIR>/performance` directory.
 | 
				
			||||||
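To locate the result files, a helper along these lines may be convenient. It is a sketch that assumes the output directory is on a local or mounted file system; the function name `list_results` is hypothetical. For the HDFS output path used in the example above, `hdfs dfs -ls "$OUTPUT_DIR/performance"` would be the likely equivalent, assuming an HDFS client is available.

```bash
# List the benchmark result files under <output_dir>/performance.
# $1: output directory on a local or mounted file system.
list_results() {
  ls "$1"/performance/part-*.csv
}
```

If no `part-*.csv` file exists yet, `ls` reports an error and the function returns a non-zero status, which can serve as a simple "benchmark not finished" signal in scripts.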
| 
						 | 
					
 | 
				
			||||||