update k8s docs (#3461)

---
### **1. Pull `bigdl-k8s` Docker Image**
You may pull the prebuilt BigDL `bigdl-k8s` image from [Docker Hub](https://hub.docker.com/r/intelanalytics/bigdl-k8s/tags) as follows:
```bash
sudo docker pull intelanalytics/bigdl-k8s:latest
```
Note: if you would like to run a TensorFlow 2.x application, pull the image "bigdl-k8s:latest-tf2" with `sudo docker pull intelanalytics/bigdl-k8s:latest-tf2`. The two images differ in the TensorFlow version installed in the Python environment.
**Speed up pulling the image by adding mirrors**
To speed up pulling the image from Docker Hub, you may add the `registry-mirrors` key and value by editing `daemon.json` (located in the `/etc/docker/` folder on Linux):
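The original registry-mirrors snippet is not shown in this excerpt; below is a minimal sketch, assuming a placeholder mirror URL that you would replace with your own registry mirror:
```bash
# Write a daemon.json containing the registry-mirrors key
# (this overwrites any existing file; merge by hand if you already have one);
# the mirror URL below is a placeholder, not a real endpoint.
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "registry-mirrors": ["https://<your-mirror-host>"]
}
EOF
# Restart docker so the new mirror configuration takes effect
sudo systemctl daemon-reload
sudo systemctl restart docker
```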
### **2. Launch a Client Container**
You can submit BigDL applications from a client container that provides the required environment.
```bash
sudo docker run -itd --net=host \
--... ...\
-e https_proxy=https://your-proxy-host:your-proxy-port \
-e RUNTIME_SPARK_MASTER=k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
-e RUNTIME_K8S_SERVICE_ACCOUNT=account \
-e RUNTIME_K8S_SPARK_IMAGE=intelanalytics/bigdl-k8s:latest \
-e RUNTIME_PERSISTENT_VOLUME_CLAIM=myvolumeclaim \
-e RUNTIME_DRIVER_HOST=x.x.x.x \
-e RUNTIME_DRIVER_PORT=54321 \
--... ...\
-e RUNTIME_TOTAL_EXECUTOR_CORES=4 \
-e RUNTIME_DRIVER_CORES=4 \
-e RUNTIME_DRIVER_MEMORY=10g \
intelanalytics/bigdl-k8s:latest bash
```
- NOTEBOOK_PORT value 12345 is a user-specified port number.
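Once the client container is up, you can enter it to check the environment. This is a small sketch, not part of the original text; the container ID comes from `docker ps`:
```bash
# Find the ID of the client container started above
sudo docker ps
# Open a shell inside it; you should land in /opt/spark/work-dir
sudo docker exec -it <containerID> bash
```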
The `/opt` directory contains:
- download-bigdl.sh is used for downloading BigDL distributions.
- start-notebook-spark.sh is used for starting the Jupyter Notebook on a standard Spark cluster.
- start-notebook-k8s.sh is used for starting the Jupyter Notebook on a k8s cluster.
- bigdl-x.x-SNAPSHOT is `BIGDL_HOME`, which is the home of the BigDL distribution.
- the bigdl-examples directory contains the downloaded Python example code.
- install-conda-env.sh is used for installing the conda environment and Python dependencies.
- jdk is the JDK home.
- spark is the Spark home.
- redis is the Redis home.
### **3. Run BigDL Examples on k8s**
_**Note**: Please make sure `kubectl` has appropriate permissions to create, list and delete pods._
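If the service account does not have these permissions yet, one way to set it up is sketched below, following the Spark on Kubernetes documentation; the account name `account` matches `RUNTIME_K8S_SERVICE_ACCOUNT` used in this guide, and the `default` namespace is an assumption:
```bash
# Create the service account and give it permission to manage pods in its namespace
kubectl create serviceaccount account --namespace default
kubectl create clusterrolebinding account-role \
  --clusterrole=edit \
  --serviceaccount=default:account \
  --namespace=default
```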
_**Note**: Please refer to section 4 for some known issues._
#### **3.1 K8s client mode**
We recommend using `init_orca_context` at the very beginning of your code (e.g. in script.py) to initialize and run BigDL on standard K8s clusters in [client mode](http://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode).
```python
from bigdl.orca import init_orca_context
init_orca_context(cluster_mode="k8s", master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
container_image="intelanalytics/hyper-zoo:latest",
container_image="intelanalytics/bigdl-k8s:latest",
num_nodes=2, cores=2,
conf={"spark.driver.host": "x.x.x.x",
"spark.driver.port": "x"})
```
Execute `python script.py` to run your program on the k8s cluster directly.
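While the application is running, you can check the executor pods that Spark creates for it. This is a sketch, not from the original docs; the pod name below is a placeholder taken from the `kubectl get pods` output:
```bash
# List the pods in the namespace; executor pods are created per application
kubectl get pods
# Inspect the log of one executor pod for debugging
kubectl logs <executor-pod-name>
```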
#### **3.2 K8s cluster mode**
For k8s [cluster mode](https://spark.apache.org/docs/3.1.2/running-on-kubernetes.html#cluster-mode), you can call `init_orca_context` and specify cluster_mode to be "spark-submit" in your python script (e.g. in script.py):
```python
from bigdl.orca import init_orca_context
init_orca_context(cluster_mode="spark-submit")
```
Use spark-submit to submit your BigDL program:
```bash
${SPARK_HOME}/bin/spark-submit \
--master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
--deploy-mode cluster \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=account \
--name bigdl \
--conf spark.kubernetes.container.image="intelanalytics/bigdl-k8s:latest" \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.pyspark.driver.python=./environment/bin/python \
--conf spark.pyspark.python=./environment/bin/python \
--conf spark.executor.instances=1 \
--executor-memory 10g \
--driver-memory 10g \
--executor-cores 8 \
--num-executors 2 \
--properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
--py-files local://${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,local:///path/script.py \
--conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
--conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
local:///path/script.py
```
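The `spark.pyspark.driver.python` and `spark.pyspark.python` settings above point at `./environment/bin/python`, which assumes a packed conda environment is shipped with the application. A minimal sketch of one way to produce and attach such an archive is shown below; the archive name `environment` is an assumption that must match those paths, and `conda-pack` must be installed:
```bash
# Pack the currently activated conda environment into an archive
conda pack -o environment.tar.gz
# Ship it with spark-submit so it is unpacked as ./environment on the driver and executors,
# e.g. by adding an option like the following to the command above (the path is a placeholder):
#   --archives local:///path/environment.tar.gz#environment
```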
#### **3.3 Run Jupyter Notebooks**
In the `/opt` directory, run the command that starts the Jupyter Notebook service.
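The exact command is elided in this excerpt; a minimal sketch, assuming the `start-notebook-k8s.sh` script listed under `/opt` above is the intended entry point:
```bash
cd /opt
# Start the Jupyter Notebook service on the k8s cluster
bash start-notebook-k8s.sh
```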
You will see output messages like the ones below. This means the Jupyter Notebook service has started successfully within the container.
```
[I 23:51:08.456 NotebookApp] Serving notebooks from local directory: /opt/bigdl-0.14.0-SNAPSHOT/apps
[I 23:51:08.456 NotebookApp] Jupyter Notebook 6.2.0 is running at:
[I 23:51:08.456 NotebookApp] http://xxxx:12345/?token=...
[I 23:51:08.457 NotebookApp] or http://127.0.0.1:12345/?token=...
```
Then, refer to the [docker guide](./docker.md) for how to open the Jupyter Notebook service.
#### **3.4 Run Scala programs**
Use spark-submit to submit your BigDL program, e.g., run the [nnframes imageInference](../../../../../../scala/dllib/src/main/scala/com/intel/analytics/bigdl/dllib/example/nnframes/imageInference) example (in either local mode or cluster mode) as follows:
```bash
${SPARK_HOME}/bin/spark-submit \
--... ...\
--conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
--conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
--name bigdl \
--conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
--conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--... ...\
--total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
--driver-cores ${RUNTIME_DRIVER_CORES} \
--driver-memory ${RUNTIME_DRIVER_MEMORY} \
--properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
--conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
--conf spark.sql.catalogImplementation='in-memory' \
--conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
--conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
--class com.intel.analytics.bigdl.dllib.examples.nnframes.imageInference.ImageTransferLearning \
${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip \
--inputDir /path
```
Options:
- --properties-file: the customized conf properties file.
- --py-files: the extra Python packages needed.
- --class: the Scala example class name.
- --inputDir: the input data path of the nnframes example. The data path is on the mounted filesystem of the host. Refer to [Kubernetes Volumes](https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes) for more details.
### **4 Known issues**
This section covers some common issues for both client mode and cluster mode.
#### **4.1 How to specify python environment**
The k8s image provides a conda Python environment. The image "intelanalytics/bigdl-k8s:latest" installs its Python environment at "/usr/local/envs/pytf1/bin/python", and the image "intelanalytics/bigdl-k8s:latest-tf2" installs it at "/usr/local/envs/pytf2/bin/python".
In client mode, activate the Python environment and run the application:
```bash
source activate pytf1
python script.py
```
In cluster mode, specify the Python environment for both the driver and executors:
```bash
${SPARK_HOME}/bin/spark-submit \
--... ...\
--conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
--conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
file:///path/script.py
```
#### **4.2 How to retain executor logs for debugging?**
K8s deletes the pod once an executor fails, in both client mode and cluster mode. If you want to keep the executor logs, you can set "temp-dir" to a mounted network file system (NFS) storage so that the logs are written there instead of the default location. In this case, you may meet `JSONDecodeError` because multiple executors would write logs to the same physical folder and cause conflicts; the solutions are described in the next section.
```python
init_orca_context(..., extra_params = {"temp-dir": "/zoo/"})
init_orca_context(..., extra_params = {"temp-dir": "/bigdl/"})
```
#### **4.3 How to deal with "JSONDecodeError"?**
If you set `temp-dir` to mounted NFS storage and use multiple executors, you may meet `JSONDecodeError` since multiple executors would write to the same physical folder and cause conflicts. One option to avoid conflicts is not to mount `temp-dir` on shared storage. However, if you debug Ray on k8s, you need to output logs to shared storage; in this case, you could set num-nodes to 1. After testing, you can remove the `temp-dir` setting and run multiple executors again.
#### **4.4 How to use NFS?**
If you want to keep some files beyond the pod's lifecycle, such as logging callbacks or TensorBoard callbacks, you need to set the output directory to a mounted persistent volume directory. Take NFS as a simple example.
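The examples below assume a PersistentVolumeClaim named `nfsvolumeclaim` already exists in the cluster. A minimal sketch of how such a claim backed by an NFS server might be created is shown here as an assumption, not part of the original guide; the server address, export path and storage size are placeholders:
```bash
# Create an NFS-backed PersistentVolume and a matching PersistentVolumeClaim
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfsvolume
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: <nfs-server-ip>
    path: /exported/path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfsvolumeclaim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 100Gi
EOF
```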
Use NFS in client mode:
```python
init_orca_context(cluster_mode="k8s", ...,
conf={...,
"spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName":"nfsvolumeclaim",
"spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path": "/zoo"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path": "/bigdl"
})
```
Use NFS in cluster mode:
```bash
${SPARK_HOME}/bin/spark-submit \
--... ...\
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName="nfsvolumeclaim" \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path="/zoo" \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path="/bigdl" \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName="nfsvolumeclaim" \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path="/zoo" \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path="/bigdl" \
file:///path/script.py
```
#### **4.5 How to deal with "RayActorError"?**
"RayActorError" may caused by running out of the ray memory. If you meet this error, try to increase the memory for ray.
```python
init_orca_context(..., extra_executor_memory_for_ray="100g")
```
#### **4.6 How to set proper "steps_per_epoch" and "validation_steps"?**
`steps_per_epoch` and `validation_steps` should equal the number of samples in the dataset divided by the batch size if you want to train on the whole dataset. They do not depend on `num_nodes` when the total dataset size and batch size are fixed. For example, if you set `num_nodes` to 1 and `steps_per_epoch` to 6, then after changing `num_nodes` to 3, `steps_per_epoch` should still be 6.
#### **4.7 Others**
`spark.kubernetes.container.image.pullPolicy` needs to be specified as `Always` if you need to update your Spark executor image for k8s.