update k8s docs (#3461)

Le-Zheng 2021-11-19 17:17:05 +08:00 committed by GitHub
parent 30b4c3d9d5
commit bee5879b24


@@ -2,14 +2,16 @@
---

### **1. Pull `bigdl-k8s` Docker Image**

You may pull the prebuilt BigDL `bigdl-k8s` image from [Docker Hub](https://hub.docker.com/r/intelanalytics/bigdl-k8s/tags) as follows:
```bash
sudo docker pull intelanalytics/bigdl-k8s:latest
```
Note: if you would like to run a TensorFlow 2.x application, pull the "bigdl-k8s:latest-tf2" image with `sudo docker pull intelanalytics/bigdl-k8s:latest-tf2`. The two images differ in the TensorFlow version installed in the Python environment.
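If you are not sure which image a running container is based on, checking the installed TensorFlow version (the difference between the two images) is a quick way to tell; a trivial sketch:
```python
# Run inside the container's conda environment (pytf1 or pytf2, see section 4.1).
import tensorflow as tf

print(tf.__version__)  # 1.x in the bigdl-k8s:latest image, 2.x in latest-tf2
```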
**Speed up pulling the image by adding mirrors**

To speed up pulling the image from Docker Hub, you may add the registry-mirrors key and value by editing `daemon.json` (located in the `/etc/docker/` folder on Linux):
@@ -34,7 +36,7 @@ sudo systemctl restart docker
### **2. Launch a Client Container**

You can submit BigDL applications from a client container that provides the required environment.
```bash
sudo docker run -itd --net=host \
@@ -57,7 +59,7 @@ sudo docker run -itd --net=host \
-e https_proxy=https://your-proxy-host:your-proxy-port \
-e RUNTIME_SPARK_MASTER=k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
-e RUNTIME_K8S_SERVICE_ACCOUNT=account \
-e RUNTIME_K8S_SPARK_IMAGE=intelanalytics/bigdl-k8s:latest \
-e RUNTIME_PERSISTENT_VOLUME_CLAIM=myvolumeclaim \
-e RUNTIME_DRIVER_HOST=x.x.x.x \
-e RUNTIME_DRIVER_PORT=54321 \
@@ -67,7 +69,7 @@ sudo docker run -itd --net=host \
-e RUNTIME_TOTAL_EXECUTOR_CORES=4 \
-e RUNTIME_DRIVER_CORES=4 \
-e RUNTIME_DRIVER_MEMORY=10g \
intelanalytics/bigdl-k8s:latest bash
```
- NOTEBOOK_PORT value 12345 is a user-specified port number.
@@ -96,16 +98,17 @@ root@[hostname]:/opt/spark/work-dir#

The `/opt` directory contains:
- download-bigdl.sh is used for downloading BigDL distributions.
- start-notebook-spark.sh is used for starting the jupyter notebook on a standard spark cluster.
- start-notebook-k8s.sh is used for starting the jupyter notebook on a k8s cluster.
- bigdl-x.x-SNAPSHOT is `BIGDL_HOME`, which is the home of the BigDL distribution.
- bigdl-examples directory contains the downloaded python example code.
- install-conda-env.sh shows how the conda env and python dependencies are installed.
- jdk is the jdk home.
- spark is the spark home.
- redis is the redis home.
### **3. Run BigDL Examples on k8s**

_**Note**: Please make sure `kubectl` has appropriate permissions to create, list and delete pods._

@@ -113,13 +116,13 @@ _**Note**: Please refer to section 4 for some know issues._
#### **3.1 K8s client mode**

We recommend using `init_orca_context` at the very beginning of your code (e.g. in script.py) to initiate and run BigDL on standard K8s clusters in [client mode](http://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode).
```python
from bigdl.orca import init_orca_context

init_orca_context(cluster_mode="k8s",
                  master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
                  container_image="intelanalytics/bigdl-k8s:latest",
                  num_nodes=2, cores=2,
                  conf={"spark.driver.host": "x.x.x.x",
                        "spark.driver.port": "x"})
```
@@ -129,28 +132,36 @@ Execute `python script.py` to run your program on k8s cluster directly.

#### **3.2 K8s cluster mode**

For k8s [cluster mode](https://spark.apache.org/docs/3.1.2/running-on-kubernetes.html#cluster-mode), you can call `init_orca_context` and specify cluster_mode to be "spark-submit" in your python script (e.g. in script.py):
```python
from bigdl.orca import init_orca_context

init_orca_context(cluster_mode="spark-submit")
```

Use spark-submit to submit your BigDL program:
```bash
${SPARK_HOME}/bin/spark-submit \
--master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
--deploy-mode cluster \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=account \
--name bigdl \
--conf spark.kubernetes.container.image="intelanalytics/bigdl-k8s:latest" \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.pyspark.driver.python=./environment/bin/python \
--conf spark.pyspark.python=./environment/bin/python \
--conf spark.executor.instances=1 \
--executor-memory 10g \
--driver-memory 10g \
--executor-cores 8 \
--num-executors 2 \
--properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
--py-files local://${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,local:///path/script.py \
--conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
--conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
local:///path/script.py
```
#### **3.3 Run Jupyter Notebooks**

@@ -164,7 +175,7 @@ In the `/opt` directory, run this command line to start the Jupyter Notebook ser

You will see output messages like the ones below. This means the Jupyter Notebook service has started successfully within the container.
```
[I 23:51:08.456 NotebookApp] Serving notebooks from local directory: /opt/bigdl-0.14.0-SNAPSHOT/apps
[I 23:51:08.456 NotebookApp] Jupyter Notebook 6.2.0 is running at:
[I 23:51:08.456 NotebookApp] http://xxxx:12345/?token=...
[I 23:51:08.457 NotebookApp] or http://127.0.0.1:12345/?token=...
```
@@ -175,7 +186,7 @@ Then, refer [docker guide](./docker.md) to open Jupyter Notebook service from a

#### **3.4 Run Scala programs**

Use spark-submit to submit your BigDL program, e.g., run the [nnframes imageInference](../../../../../../scala/dllib/src/main/scala/com/intel/analytics/bigdl/dllib/example/nnframes/imageInference) example (in either local mode or cluster mode) as follows:
```bash
${SPARK_HOME}/bin/spark-submit \
@@ -184,7 +195,7 @@ ${SPARK_HOME}/bin/spark-submit \
--conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
--conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
--name bigdl \
--conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
--conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
@@ -198,14 +209,13 @@ ${SPARK_HOME}/bin/spark-submit \
--total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
--driver-cores ${RUNTIME_DRIVER_CORES} \
--driver-memory ${RUNTIME_DRIVER_MEMORY} \
--properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
--conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
--conf spark.sql.catalogImplementation='in-memory' \
--conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
--conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
--class com.intel.analytics.bigdl.dllib.examples.nnframes.imageInference.ImageTransferLearning \
${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip \
--inputDir /path
```
@@ -218,25 +228,42 @@ Options:
- --properties-file: the customized conf properties.
- --py-files: the extra python packages that are needed.
- --class: the scala example class name.
- --inputDir: the input data path of the nnframes example. The data path is on the mounted filesystem of the host. Refer to [Kubernetes Volumes](https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes) for more details.
### **4. Known issues**

This section covers some common topics for both client mode and cluster mode.

#### **4.1 How to specify the python environment?**

The k8s image provides a conda python environment. The image "intelanalytics/bigdl-k8s:latest" installs its python environment at "/usr/local/envs/pytf1/bin/python", and "intelanalytics/bigdl-k8s:latest-tf2" installs its python environment at "/usr/local/envs/pytf2/bin/python".

In client mode, activate the python env and run the application:
```bash
source activate pytf1
python script.py
```
In cluster mode, specify the python environment for both the driver and the executors:
```bash
${SPARK_HOME}/bin/spark-submit \
--... ...\
--conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
--conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
file:///path/script.py
```
#### **4.2 How to retain executor logs for debugging?**

In both client mode and cluster mode, k8s deletes the pod once an executor fails. If you want to keep the contents of the executor logs, you can set "temp-dir" to a mounted network file system (NFS) storage so that the logs are written there instead of the default log dir. In this case, you may meet `JSONDecodeError` because multiple executors would write logs to the same physical folder and cause conflicts; the solutions are in the next section.
```python
init_orca_context(..., extra_params={"temp-dir": "/bigdl/"})
```
#### **4.3 How to deal with "JSONDecodeError"?**

If you set `temp-dir` to a mounted NFS storage and use multiple executors, you may meet `JSONDecodeError` since multiple executors would write to the same physical folder and cause conflicts. Not mounting `temp-dir` on shared storage is one way to avoid conflicts. But if you are debugging Ray on k8s, you need to output the logs to a shared storage; in that case, set num-nodes to 1. After debugging, you can remove the `temp-dir` setting and run with multiple executors again.
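For example, while debugging you could combine `num_nodes=1` with the `temp-dir` setting from section 4.2; this is a sketch reusing the client-mode parameters from section 3.1, with placeholder addresses and image tag:
```python
from bigdl.orca import init_orca_context

# A single executor writes to the shared temp-dir, avoiding the conflicting
# writes that trigger JSONDecodeError; scale num_nodes back up after debugging.
init_orca_context(cluster_mode="k8s",
                  master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
                  container_image="intelanalytics/bigdl-k8s:latest",
                  num_nodes=1, cores=2,
                  extra_params={"temp-dir": "/bigdl/"})
```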
#### **4.4 How to use NFS?**

If you want to save files beyond the pod's lifecycle, such as the output of logging callbacks or tensorboard callbacks, you need to set the output dir to a mounted persistent volume dir. Take NFS as a simple example.
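For instance, once the persistent volume is mounted (at `/bigdl` in the snippets below), Keras-style callbacks can simply write under that path; this is a minimal sketch assuming a TensorFlow/Keras workload, with hypothetical sub-directories:
```python
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard

# Checkpoints and TensorBoard logs go to the NFS-backed mount, so they
# survive after the driver and executor pods are deleted.
callbacks = [
    ModelCheckpoint(filepath="/bigdl/checkpoints/model-{epoch:02d}.h5"),
    TensorBoard(log_dir="/bigdl/tensorboard"),
]
```
Pass these callbacks to your `fit` call as usual.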
@@ -246,23 +273,23 @@ Use NFS in client mode:
```python
init_orca_context(cluster_mode="k8s", ...,
                  conf={...,
                        "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName": "nfsvolumeclaim",
                        "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path": "/bigdl"
                        })
```
Use NFS in cluster mode:
```bash
${SPARK_HOME}/bin/spark-submit \
--... ...\
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName="nfsvolumeclaim" \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path="/bigdl" \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName="nfsvolumeclaim" \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path="/bigdl" \
file:///path/script.py
```
#### **4.5 How to deal with "RayActorError"?**

"RayActorError" may be caused by running out of Ray memory. If you meet this error, try to increase the memory for Ray.
@@ -270,11 +297,11 @@ ${ANALYTICS_ZOO_HOME}/bin/spark-submit-python-with-zoo.sh \
```python
init_orca_context(..., extra_executor_memory_for_ray="100g")
```
#### **4.6 How to set proper "steps_per_epoch" and "validation_steps"?**

`steps_per_epoch` and `validation_steps` should equal the number of samples in the dataset divided by the batch size if you want to train on the whole dataset. They do not depend on `num_nodes` when the total dataset and batch size are fixed. For example, if you set `num_nodes` to 1 and `steps_per_epoch` to 6, then after changing `num_nodes` to 3 the `steps_per_epoch` should still be 6.
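As a concrete illustration of the arithmetic (the sizes are made up):
```python
# Hypothetical dataset sizes, chosen only to show the calculation.
train_size, val_size, batch_size = 1200, 600, 200

steps_per_epoch = train_size // batch_size   # 6: one pass over the training set
validation_steps = val_size // batch_size    # 3: one pass over the validation set

# These values stay the same whether num_nodes is 1 or 3, because the total
# dataset size and the batch size have not changed.
```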
#### **4.7 Others**

`spark.kubernetes.container.image.pullPolicy` needs to be specified as `Always` if you need to update your spark executor image for k8s.
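In cluster mode this is the `--conf spark.kubernetes.container.image.pullPolicy=Always` option already shown in section 3.2; in client mode the same conf can be passed through `init_orca_context`, e.g. (a sketch mirroring the snippet in section 3.1):
```python
from bigdl.orca import init_orca_context

# Re-pull the container image on every pod start so an updated image is used.
init_orca_context(cluster_mode="k8s",
                  master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
                  container_image="intelanalytics/bigdl-k8s:latest",
                  num_nodes=2, cores=2,
                  conf={"spark.kubernetes.container.image.pullPolicy": "Always"})
```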