update k8s docs (#3461)
parent 30b4c3d9d5
commit bee5879b24
1 changed file with 69 additions and 42 deletions
@@ -2,14 +2,16 @@
---

### **1. Pull `hyper-zoo` Docker Image**
### **1. Pull `bigdl-k8s` Docker Image**

You may pull the prebuilt Analytics Zoo `hyper-zoo` image from [Docker Hub](https://hub.docker.com/r/intelanalytics/hyper-zoo/tags) as follows:
You may pull the prebuilt BigDL `bigdl-k8s` image from [Docker Hub](https://hub.docker.com/r/intelanalytics/bigdl-k8s/tags) as follows:

```bash
sudo docker pull intelanalytics/hyper-zoo:latest
sudo docker pull intelanalytics/bigdl-k8s:latest
```

Note: if you would like to run a TensorFlow 2.x application, pull the "bigdl-k8s:latest-tf2" image with `sudo docker pull intelanalytics/bigdl-k8s:latest-tf2`. The two images are distinguished by the TensorFlow version installed in the Python environment.

**Speed up pulling image by adding mirrors**

To speed up pulling the image from Docker Hub, you may add the registry-mirrors key and value by editing `daemon.json` (located in the `/etc/docker/` folder on Linux):
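A minimal sketch of adding such an entry and restarting Docker is shown here; the mirror URL is a placeholder, and if `daemon.json` already exists you should merge the key into it rather than overwrite the file:

```bash
# write a registry-mirrors entry (the mirror URL below is a placeholder; use one reachable from your network)
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "registry-mirrors": ["https://<your-mirror-host>"]
}
EOF

# restart Docker so the new mirror configuration takes effect
sudo systemctl restart docker
```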
@@ -34,7 +36,7 @@ sudo systemctl restart docker

### **2. Launch a Client Container**

You can submit Analytics Zoo applications from a client container that provides the required environment.
You can submit BigDL applications from a client container that provides the required environment.

```bash
sudo docker run -itd --net=host \

@@ -57,7 +59,7 @@ sudo docker run -itd --net=host \
-e https_proxy=https://your-proxy-host:your-proxy-port \
-e RUNTIME_SPARK_MASTER=k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
-e RUNTIME_K8S_SERVICE_ACCOUNT=account \
-e RUNTIME_K8S_SPARK_IMAGE=intelanalytics/hyper-zoo:latest \
-e RUNTIME_K8S_SPARK_IMAGE=intelanalytics/bigdl-k8s:latest \
-e RUNTIME_PERSISTENT_VOLUME_CLAIM=myvolumeclaim \
-e RUNTIME_DRIVER_HOST=x.x.x.x \
-e RUNTIME_DRIVER_PORT=54321 \

@@ -67,7 +69,7 @@ sudo docker run -itd --net=host \
-e RUNTIME_TOTAL_EXECUTOR_CORES=4 \
-e RUNTIME_DRIVER_CORES=4 \
-e RUNTIME_DRIVER_MEMORY=10g \
intelanalytics/hyper-zoo:latest bash
intelanalytics/bigdl-k8s:latest bash
```

- NOTEBOOK_PORT value 12345 is a user-specified port number.
@@ -96,16 +98,17 @@ root@[hostname]:/opt/spark/work-dir#

The `/opt` directory contains:

- download-analytics-zoo.sh is used for downloading Analytics Zoo distributions.
- download-bigdl.sh is used for downloading BigDL distributions.
- start-notebook-spark.sh is used for starting the Jupyter notebook on a standard Spark cluster.
- start-notebook-k8s.sh is used for starting the Jupyter notebook on a k8s cluster.
- analytics-zoo-x.x-SNAPSHOT is `ANALYTICS_ZOO_HOME`, which is the home of the Analytics Zoo distribution.
- analytics-zoo-examples directory contains downloaded Python example code.
- bigdl-x.x-SNAPSHOT is `BIGDL_HOME`, which is the home of the BigDL distribution.
- bigdl-examples directory contains downloaded Python example code.
- install-conda-env.sh installs the conda environment and Python dependencies.
- jdk is the JDK home.
- spark is the Spark home.
- redis is the Redis home.

### **3. Run Analytics Zoo Examples on k8s**
### **3. Run BigDL Examples on k8s**

_**Note**: Please make sure `kubectl` has appropriate permission to create, list and delete pods._
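If the service account used for submission does not exist yet or lacks these permissions, a minimal sketch of creating one follows; the account name `account` and the `default` namespace are examples, adjust them to your cluster:

```bash
# create a service account for the Spark driver (name is an example)
kubectl create serviceaccount account

# grant it permission to create, list and delete pods in the default namespace
kubectl create clusterrolebinding account-role \
  --clusterrole=edit \
  --serviceaccount=default:account \
  --namespace=default
```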
@@ -113,13 +116,13 @@ _**Note**: Please refer to section 4 for some known issues._

#### **3.1 K8s client mode**

We recommend using `init_orca_context` at the very beginning of your code (e.g. in script.py) to initiate and run Analytics Zoo on standard K8s clusters in [client mode](http://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode).
We recommend using `init_orca_context` at the very beginning of your code (e.g. in script.py) to initiate and run BigDL on standard K8s clusters in [client mode](http://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode).

```python
from zoo.orca import init_orca_context
from bigdl.orca import init_orca_context

init_orca_context(cluster_mode="k8s", master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
                  container_image="intelanalytics/hyper-zoo:latest",
                  container_image="intelanalytics/bigdl-k8s:latest",
                  num_nodes=2, cores=2,
                  conf={"spark.driver.host": "x.x.x.x",
                        "spark.driver.port": "x"})
@@ -129,28 +132,36 @@ Execute `python script.py` to run your program on the k8s cluster directly.

#### **3.2 K8s cluster mode**

For k8s [cluster mode](https://spark.apache.org/docs/2.4.5/running-on-kubernetes.html#cluster-mode), you can call `init_orca_context` and specify cluster_mode to be "spark-submit" in your Python script (e.g. in script.py):
For k8s [cluster mode](https://spark.apache.org/docs/3.1.2/running-on-kubernetes.html#cluster-mode), you can call `init_orca_context` and specify cluster_mode to be "spark-submit" in your Python script (e.g. in script.py):

```python
from zoo.orca import init_orca_context
from bigdl.orca import init_orca_context

init_orca_context(cluster_mode="spark-submit")
```

Use spark-submit to submit your Analytics Zoo program:
Use spark-submit to submit your BigDL program:

```bash
${ANALYTICS_ZOO_HOME}/bin/spark-submit-python-with-zoo.sh \
${SPARK_HOME}/bin/spark-submit \
--master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
--deploy-mode cluster \
--name analytics-zoo \
--conf spark.kubernetes.container.image="intelanalytics/hyper-zoo:latest" \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=account \
--name bigdl \
--conf spark.kubernetes.container.image="intelanalytics/bigdl-k8s:latest" \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.pyspark.driver.python=./environment/bin/python \
--conf spark.pyspark.python=./environment/bin/python \
--conf spark.executor.instances=1 \
--executor-memory 10g \
--driver-memory 10g \
--executor-cores 8 \
--num-executors 2 \
file:///path/script.py
--properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
--py-files local://${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,local:///path/script.py \
--conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
--conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
local:///path/script.py
```
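In cluster mode the driver itself runs in a pod, so one way to follow the submitted job is with `kubectl`; the pod name below is illustrative, the actual names are generated by Spark:

```bash
# list the driver and executor pods created for the application
kubectl get pods

# follow the driver log (replace the pod name with the one reported by `kubectl get pods`)
kubectl logs -f <your-driver-pod-name>

# inspect scheduling or image-pull problems for a pod
kubectl describe pod <your-driver-pod-name>
```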
#### **3.3 Run Jupyter Notebooks**
@@ -164,7 +175,7 @@ In the `/opt` directory, run this command line to start the Jupyter Notebook ser

You will see output messages like those below. This means the Jupyter Notebook service has started successfully within the container.
```
[I 23:51:08.456 NotebookApp] Serving notebooks from local directory: /opt/analytics-zoo-0.11.0-SNAPSHOT/apps
[I 23:51:08.456 NotebookApp] Serving notebooks from local directory: /opt/bigdl-0.14.0-SNAPSHOT/apps
[I 23:51:08.456 NotebookApp] Jupyter Notebook 6.2.0 is running at:
[I 23:51:08.456 NotebookApp] http://xxxx:12345/?token=...
[I 23:51:08.457 NotebookApp] or http://127.0.0.1:12345/?token=...
@@ -175,7 +186,7 @@ Then, refer [docker guide](./docker.md) to open Jupyter Notebook service from a

#### **3.4 Run Scala programs**

Use spark-submit to submit your Analytics Zoo program, e.g., run the [anomalydetection](https://github.com/intel-analytics/analytics-zoo/tree/master/zoo/src/main/scala/com/intel/analytics/zoo/examples/anomalydetection) example (in either local mode or cluster mode) as follows:
Use spark-submit to submit your BigDL program, e.g., run the [nnframes imageInference](../../../../../../scala/dllib/src/main/scala/com/intel/analytics/bigdl/dllib/example/nnframes/imageInference) example (in either local mode or cluster mode) as follows:

```bash
${SPARK_HOME}/bin/spark-submit \
@@ -184,7 +195,7 @@ ${SPARK_HOME}/bin/spark-submit \
--conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
--conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
--name analytics-zoo \
--name bigdl \
--conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
--conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
@@ -198,14 +209,13 @@ ${SPARK_HOME}/bin/spark-submit \
--total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
--driver-cores ${RUNTIME_DRIVER_CORES} \
--driver-memory ${RUNTIME_DRIVER_MEMORY} \
--properties-file ${ANALYTICS_ZOO_HOME}/conf/spark-analytics-zoo.conf \
--py-files ${ANALYTICS_ZOO_HOME}/lib/analytics-zoo-bigdl_${BIGDL_VERSION}-spark_${SPARK_VERSION}-${ANALYTICS_ZOO_VERSION}-python-api.zip \
--properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
--conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
--conf spark.sql.catalogImplementation='in-memory' \
--conf spark.driver.extraClassPath=${ANALYTICS_ZOO_HOME}/lib/analytics-zoo-bigdl_${BIGDL_VERSION}-spark_${SPARK_VERSION}-${ANALYTICS_ZOO_VERSION}-jar-with-dependencies.jar \
--conf spark.executor.extraClassPath=${ANALYTICS_ZOO_HOME}/lib/analytics-zoo-bigdl_${BIGDL_VERSION}-spark_${SPARK_VERSION}-${ANALYTICS_ZOO_VERSION}-jar-with-dependencies.jar \
--class com.intel.analytics.zoo.examples.anomalydetection.AnomalyDetection \
${ANALYTICS_ZOO_HOME}/lib/analytics-zoo-bigdl_${BIGDL_VERSION}-spark_${SPARK_VERSION}-${ANALYTICS_ZOO_VERSION}-python-api.zip \
--conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
--conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
--class com.intel.analytics.bigdl.dllib.examples.nnframes.imageInference.ImageTransferLearning \
${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip \
--inputDir /path
```
@@ -218,25 +228,42 @@ Options:
- --properties-file: the customized conf properties.
- --py-files: the extra Python packages needed.
- --class: the Scala example class name.
- --inputDir: input data path of the anomaly detection example. The data path is on the mounted filesystem of the host. Refer to [Kubernetes Volumes](https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes) for more details.
- --inputDir: input data path of the nnframes example. The data path is on the mounted filesystem of the host. Refer to [Kubernetes Volumes](https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes) for more details.

### **4 Known issues**

This section covers some common topics for both client mode and cluster mode.

#### **4.1 How to retain executor logs for debugging?**
#### **4.1 How to specify the Python environment**

The k8s image provides a conda Python environment. The image "intelanalytics/bigdl-k8s:latest" installs its Python environment at "/usr/local/envs/pytf1/bin/python", while "intelanalytics/bigdl-k8s:latest-tf2" installs it at "/usr/local/envs/pytf2/bin/python".

In client mode, activate the Python environment and run the application:
```bash
source activate pytf1
python script.py
```
In cluster mode, specify the Python environment on both the driver and executor:
```bash
${SPARK_HOME}/bin/spark-submit \
--... ...\
--conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
--conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
file:///path/script.py
```
#### **4.2 How to retain executor logs for debugging?**

K8s deletes the pod once an executor fails, in both client mode and cluster mode. If you want to keep the executor logs, you can set "temp-dir" to a mounted network file system (NFS) path so that the log directory survives the pod. In this case, you may meet `JSONDecodeError` because multiple executors would write logs to the same physical folder and cause conflicts. The solutions are in the next section.

```python
init_orca_context(..., extra_params = {"temp-dir": "/zoo/"})
init_orca_context(..., extra_params = {"temp-dir": "/bigdl/"})
```

#### **4.2 How to deal with "JSONDecodeError"?**
#### **4.3 How to deal with "JSONDecodeError"?**

If you set `temp-dir` to mounted NFS storage and use multiple executors, you may meet `JSONDecodeError` since multiple executors would write to the same physical folder and cause conflicts. Not mounting `temp-dir` on shared storage is one option to avoid the conflicts. But if you debug Ray on k8s, you need to output the logs to shared storage; in this case, you could set num-nodes to 1. After testing, you can remove the `temp-dir` setting and run multiple executors.

#### **4.3 How to use NFS?**
#### **4.4 How to use NFS?**

If you want to keep some files beyond the pod's lifecycle, such as logging callbacks or TensorBoard callbacks, you need to set the output directory to a mounted persistent volume directory. Take NFS as a simple example.
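The examples below mount a PersistentVolumeClaim named `nfsvolumeclaim`. If such a claim does not exist yet, a minimal sketch of creating one is shown here; the claim name, the `nfs-client` storage class and the size are assumptions, so use whatever your cluster's NFS provisioner expects:

```bash
# create an NFS-backed PersistentVolumeClaim (name, storage class and size are examples)
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfsvolumeclaim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client
  resources:
    requests:
      storage: 10Gi
EOF

# verify the claim is bound before referencing it from Spark
kubectl get pvc nfsvolumeclaim
```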
@@ -246,23 +273,23 @@ Use NFS in client mode:
init_orca_context(cluster_mode="k8s", ...,
                  conf={...,
                        "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName": "nfsvolumeclaim",
                        "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path": "/zoo"
                        "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path": "/bigdl"
                  })
```

Use NFS in cluster mode:

```bash
${ANALYTICS_ZOO_HOME}/bin/spark-submit-python-with-zoo.sh \
${SPARK_HOME}/bin/spark-submit \
--... ...\
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName="nfsvolumeclaim" \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path="/zoo" \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path="/bigdl" \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName="nfsvolumeclaim" \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path="/zoo" \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path="/bigdl" \
file:///path/script.py
```

#### **4.4 How to deal with "RayActorError"?**
#### **4.5 How to deal with "RayActorError"?**

"RayActorError" may be caused by running out of Ray memory. If you meet this error, try to increase the memory for Ray.
@@ -270,11 +297,11 @@ ${ANALYTICS_ZOO_HOME}/bin/spark-submit-python-with-zoo.sh \
init_orca_context(..., extra_executor_memory_for_ray="100g")
```

#### **4.5 How to set proper "steps_per_epoch" and "validation_steps"?**
#### **4.6 How to set proper "steps_per_epoch" and "validation_steps"?**

`steps_per_epoch` and `validation_steps` should equal the number of samples in the dataset divided by the batch size if you want to train on the whole dataset. They do not depend on `num_nodes` when the total dataset and batch size are fixed. For example, if you set `num_nodes` to 1 and `steps_per_epoch` to 6, then after changing `num_nodes` to 3, `steps_per_epoch` should still be 6.

#### **4.6 Others**
#### **4.7 Others**

`spark.kubernetes.container.image.pullPolicy` needs to be specified as `Always` if you need to update your Spark executor image for k8s.
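A minimal sketch of passing this flag at submission time, with the other options abbreviated in the same way as the examples above:

```bash
${SPARK_HOME}/bin/spark-submit \
--... ...\
--conf spark.kubernetes.container.image.pullPolicy=Always \
file:///path/script.py
```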