Update k8s user guide for run from remote (#3606)

* update doc

* minor

* update

* fix

* minor

* minor

* fix
Kai Huang 2021-11-26 17:54:13 +08:00 committed by GitHub
parent 6c1c0e46b1
commit dffeda5e1e


@@ -108,13 +108,29 @@ The `/opt` directory contains:
- spark is the spark home.
- redis is the redis home.
### **3. Submit to k8s from remote**
Instead of launching a client container, you can also submit BigDL applications from a remote node with the following steps:
1. Check the [prerequisites](https://spark.apache.org/docs/latest/running-on-kubernetes.html#prerequisites) of running Spark on Kubernetes.
- The remote node needs to properly set up the configurations and authentication of the k8s cluster (e.g. the `config` file under `~/.kube`, especially the server address in the `config`).
- Install `kubectl` on the remote node and run some sample commands for verification, for example `kubectl auth can-i <list|create|edit|delete> pods`.
Note that installing `kubectl` is not strictly required on the remote node, but it is a useful tool to verify whether the node has access to the k8s cluster.
- The environment variables `http_proxy` and `https_proxy` may interfere with the `kubectl` connection. If you get errors when executing `kubectl` commands on the remote node, check and unset these environment variables.
2. Follow the steps in the [Python User Guide](./python.html#install) to install BigDL in a conda environment.
### **4. Run BigDL on k8s**
_**Note**: Please make sure `kubectl` has appropriate permission to create, list and delete pods._
You may refer to [Section 5](#known-issues) for some known issues when running BigDL on k8s.
#### **4.1 K8s client mode**
We recommend using `init_orca_context` at the very beginning of your code (e.g. in script.py) to initiate and run BigDL on standard K8s clusters in [client mode](http://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode).
@@ -123,14 +139,14 @@ from bigdl.orca import init_orca_context
init_orca_context(cluster_mode="k8s", master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
                  container_image="intelanalytics/bigdl-k8s:latest",
                  num_nodes=2, cores=2, memory="2g")
```
Remark: If necessary, you may specify the Spark driver host and port by adding the argument: `conf={"spark.driver.host": "x.x.x.x", "spark.driver.port": "x"}`.
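For instance, combining the above, a client-mode call with an explicit driver host and port might look like the following sketch (`x.x.x.x` and `"x"` are placeholders for the client machine's address and an open port):

```python
from bigdl.orca import init_orca_context

# Client-mode sketch; the driver host/port values are placeholders.
init_orca_context(cluster_mode="k8s",
                  master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
                  container_image="intelanalytics/bigdl-k8s:latest",
                  num_nodes=2, cores=2, memory="2g",
                  conf={"spark.driver.host": "x.x.x.x",
                        "spark.driver.port": "x"})
```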
Execute `python script.py` to run your program on the k8s cluster directly.
#### **4.2 K8s cluster mode**
For k8s [cluster mode](https://spark.apache.org/docs/3.1.2/running-on-kubernetes.html#cluster-mode), you can call `init_orca_context` and specify cluster_mode to be "spark-submit" in your python script (e.g. in script.py):
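A minimal sketch of such a script might look like the following (in this mode the Spark and K8s resources are configured on the spark-submit command line rather than in the code):

```python
from bigdl.orca import init_orca_context, stop_orca_context

# Cluster-mode sketch: only the cluster_mode is declared here; the master,
# container image and resources come from the spark-submit arguments.
sc = init_orca_context(cluster_mode="spark-submit")

# your Orca program goes here

stop_orca_context()
```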
@@ -164,7 +180,7 @@ ${SPARK_HOME}/bin/spark-submit \
local:///path/script.py
```
#### **4.3 Run Jupyter Notebooks**
After a Docker container is launched and the user has logged into the container, you can start the Jupyter Notebook service inside the container.
@@ -184,7 +200,7 @@ You will see the output message like below. This means the Jupyter Notebook serv
Then, refer to the [docker guide](./docker.md) to open the Jupyter Notebook service from a browser and run notebooks.
#### **4.4 Run Scala programs**
Use spark-submit to submit your BigDL program, e.g., run the [nnframes imageInference](../../../../../../scala/dllib/src/main/scala/com/intel/analytics/bigdl/dllib/example/nnframes/imageInference) example (in either local mode or cluster mode) as follows:
@@ -230,11 +246,11 @@ Options:
- --class: the Scala example class name.
- --inputDir: the input data path of the nnframes example. The data path is the mounted filesystem of the host. Refer to [Kubernetes Volumes](https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes) for more details.
### **5. Known Issues**
This section covers some common issues for both client mode and cluster mode.
#### **5.1 How to specify the Python environment?**
The k8s images provide conda Python environments. The image `intelanalytics/bigdl-k8s:latest` installs its Python environment at `/usr/local/envs/pytf1/bin/python`, while `intelanalytics/bigdl-k8s:latest-tf2` installs it at `/usr/local/envs/pytf2/bin/python`.
@@ -251,7 +267,7 @@ ${SPARK_HOME}/bin/spark-submit \
--conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
file:///path/script.py
```
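If you launch your program with `init_orca_context` instead of spark-submit, the same setting can be passed through `conf` (a sketch assuming the pytf1 environment of the `latest` image):

```python
# Point the executors at the conda Python inside the container image.
init_orca_context(cluster_mode="k8s",
                  master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
                  container_image="intelanalytics/bigdl-k8s:latest",
                  num_nodes=2, cores=2, memory="2g",
                  conf={"spark.pyspark.python": "/usr/local/envs/pytf1/bin/python"})
```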
#### **5.2 How to retain executor logs for debugging?**
In both client mode and cluster mode, k8s deletes the pod once an executor fails. If you want to keep the executor logs, you can set `temp-dir` to a mounted network file system (NFS) storage so that the logs are written there instead of to the default log directory. In this case, you may meet `JSONDecodeError` because multiple executors would write logs to the same physical folder and cause conflicts. The solutions are in the next section.
@@ -259,11 +275,11 @@ The k8s would delete the pod once the executor failed in client mode and cluster
init_orca_context(..., extra_params={"temp-dir": "/bigdl/"})
```
#### **5.3 How to deal with "JSONDecodeError"?**
If you set `temp-dir` to a mounted NFS storage and use multiple executors, you may meet `JSONDecodeError` since multiple executors would write to the same physical folder and cause conflicts. Not mounting `temp-dir` on shared storage is one option to avoid conflicts. But if you debug ray on k8s, you need to output logs to a shared storage; in this case, you could set `num_nodes` to 1. After debugging, you can remove the `temp-dir` setting and run with multiple executors again.
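A debugging sketch along those lines (the master, image and resource values just mirror the earlier examples):

```python
# Use a single executor while ray logs go to the shared NFS path, so that
# only one process writes to the temp-dir folder.
init_orca_context(cluster_mode="k8s",
                  master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
                  container_image="intelanalytics/bigdl-k8s:latest",
                  num_nodes=1, cores=2, memory="2g",
                  extra_params={"temp-dir": "/bigdl/"})
```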
#### **5.4 How to use NFS?**
If you want to save some files beyond the pod's lifecycle, such as logging callbacks or tensorboard callbacks, you need to set the output directory to a mounted persistent volume. Take NFS as a simple example.
@@ -289,23 +305,23 @@ ${SPARK_HOME}/bin/spark-submit \
file:///path/script.py
```
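If you use `init_orca_context` rather than spark-submit, the same NFS-backed volume can be mounted by passing the Kubernetes volume configurations through `conf`. A sketch (the claim name `nfsvolumeclaim` and the mount path `/bigdl` are assumptions to adapt to your cluster):

```python
# Mount an existing persistent volume claim (backed by NFS) into both the
# driver and executor pods; write outputs such as tensorboard logs under it.
init_orca_context(cluster_mode="k8s",
                  master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
                  container_image="intelanalytics/bigdl-k8s:latest",
                  num_nodes=2, cores=2, memory="2g",
                  conf={"spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName": "nfsvolumeclaim",
                        "spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path": "/bigdl",
                        "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName": "nfsvolumeclaim",
                        "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path": "/bigdl"})
```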
#### **5.5 How to deal with "RayActorError"?**
"RayActorError" may be caused by running out of ray memory. If you meet this error, try to increase the memory for ray.
```python
init_orca_context(..., extra_executor_memory_for_ray="100g")
```
#### **5.6 How to set proper "steps_per_epoch" and "validation_steps"?**
`steps_per_epoch` and `validation_steps` should equal the number of samples in the dataset divided by the batch size if you want to train on the whole dataset in each epoch. They do not depend on `num_nodes` when the total dataset and batch size are fixed. For example, if you set `num_nodes` to 1 and `steps_per_epoch` to 6, then after changing `num_nodes` to 3 the `steps_per_epoch` should still be 6.
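As a small illustration (the dataset and batch sizes below are made-up numbers):

```python
# The step counts depend only on dataset size and batch size, not on num_nodes.
train_size, val_size, batch_size = 600, 200, 100

steps_per_epoch = train_size // batch_size    # 6, whether num_nodes is 1 or 3
validation_steps = val_size // batch_size     # 2
```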
#### **5.7 Others**
`spark.kubernetes.container.image.pullPolicy` needs to be specified as `Always` if you need to update your spark executor image for k8s.
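With `init_orca_context`, this could be passed as follows (a sketch; the other arguments mirror the earlier examples):

```python
# Force k8s to re-pull the container image so that an updated executor
# image is actually used.
init_orca_context(cluster_mode="k8s",
                  master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
                  container_image="intelanalytics/bigdl-k8s:latest",
                  num_nodes=2, cores=2, memory="2g",
                  conf={"spark.kubernetes.container.image.pullPolicy": "Always"})
```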
### **6. Access logs and clear pods**
When the application is running, it's possible to stream logs on the driver pod: