Update k8s user guide for run from remote (#3606)
* update doc * minor * update * fix * minor * minor * fix
Parent: 6c1c0e46b1
Commit: dffeda5e1e
1 changed file with 35 additions and 19 deletions
@@ -108,13 +108,29 @@ The `/opt` directory contains:

- spark is the spark home.
- redis is the redis home.

### **3. Submit to k8s from remote**

Instead of launching a client container, you can also submit BigDL applications from a remote node with the following steps:

1. Check the [prerequisites](https://spark.apache.org/docs/latest/running-on-kubernetes.html#prerequisites) of running Spark on Kubernetes.
   - The remote node needs to properly set up the configurations and authentications of the k8s cluster (e.g. the `config` file under `~/.kube`, especially the server address in the `config`).
   - Install `kubectl` on the remote node and run some sample commands for verification, for example `kubectl auth can-i <list|create|edit|delete> pods` (see the sketch after this list).
     Note that installing `kubectl` is not a must for the remote node, but it is a useful tool to verify whether the remote node has access to the k8s cluster.
   - The environment variables `http_proxy` and `https_proxy` may affect the connection when using `kubectl`. You may check and unset these environment variables if you get errors when executing `kubectl` commands on the remote node.
2. Follow the steps in the [Python User Guide](./python.html#install) to install BigDL in a conda environment.
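
Below is a minimal sketch of the verification commands mentioned in step 1 (assuming `kubectl` is installed and `~/.kube/config` points at your cluster; the exact permissions you need depend on your setup):

```bash
kubectl cluster-info             # check that the API server is reachable
kubectl auth can-i create pods   # expected output: yes
kubectl auth can-i list pods     # expected output: yes
kubectl auth can-i delete pods   # expected output: yes

# If the commands hang or fail, try unsetting the proxy variables first:
unset http_proxy https_proxy
```
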
### **4. Run BigDL on k8s**
_**Note**: Please make sure `kubectl` has appropriate permissions to create, list and delete pods._
You may refer to [Section 5](#known-issues) for some known issues when running BigDL on k8s.
#### **4.1 K8s client mode**
We recommend using `init_orca_context` at the very beginning of your code (e.g. in script.py) to initiate and run BigDL on standard K8s clusters in [client mode](http://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode).
@@ -123,14 +139,14 @@ from bigdl.orca import init_orca_context
init_orca_context(cluster_mode="k8s", master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
                  container_image="intelanalytics/bigdl-k8s:latest",
                  num_nodes=2, cores=2, memory="2g")
```
Remark: If necessary, you can specify the Spark driver host and port by adding the argument `conf={"spark.driver.host": "x.x.x.x", "spark.driver.port": "x"}`.
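
For instance, a minimal sketch with the driver host and port set explicitly (the IP and port are placeholders; use values reachable from the k8s cluster):

```python
from bigdl.orca import init_orca_context

init_orca_context(cluster_mode="k8s",
                  master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
                  container_image="intelanalytics/bigdl-k8s:latest",
                  num_nodes=2, cores=2, memory="2g",
                  conf={"spark.driver.host": "x.x.x.x",  # IP of the node running the driver
                        "spark.driver.port": "x"})       # a port reachable on that node
```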
Execute `python script.py` to run your program on the k8s cluster directly.
#### **4.2 K8s cluster mode**
For k8s [cluster mode](https://spark.apache.org/docs/3.1.2/running-on-kubernetes.html#cluster-mode), you can call `init_orca_context` and specify `cluster_mode` to be "spark-submit" in your Python script (e.g. in script.py):
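
A minimal sketch of how the script can start under this mode:

```python
from bigdl.orca import init_orca_context

# In cluster mode, the resource-related settings (image, nodes, cores, memory)
# are passed on the spark-submit command line, so only the cluster_mode
# argument is needed here.
init_orca_context(cluster_mode="spark-submit")
```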
@@ -164,7 +180,7 @@ ${SPARK_HOME}/bin/spark-submit \
  local:///path/script.py
```

#### **4.3 Run Jupyter Notebooks**
After the Docker container is launched and you have logged in to the container, you can start the Jupyter Notebook service inside the container.
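
For example, a hypothetical sketch of starting the service (the notebook directory and port below are assumptions; adjust them to your image):

```bash
# Listen on all interfaces so the notebook can be reached from outside
# the container.
jupyter notebook --notebook-dir=/opt/work --ip=0.0.0.0 --port=12345 \
    --no-browser --allow-root
```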
@@ -184,7 +200,7 @@ You will see the output message like below. This means the Jupyter Notebook serv
Then, refer to the [docker guide](./docker.md) to open the Jupyter Notebook service from a browser and run the notebook.

#### **4.4 Run Scala programs**
Use spark-submit to submit your BigDL program, e.g. run the [nnframes imageInference](../../../../../../scala/dllib/src/main/scala/com/intel/analytics/bigdl/dllib/example/nnframes/imageInference) example (in either local mode or cluster mode) as follows:
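
As a rough, hypothetical sketch of the shape of such a submission (the class name, jar path and image tag below are placeholders, not the exact values from this guide):

```bash
${SPARK_HOME}/bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
  --deploy-mode cluster \
  --class <scala-example-class-name> \
  --conf spark.kubernetes.container.image=intelanalytics/bigdl-k8s:latest \
  local:///path/to/bigdl-dllib-assembly.jar --inputDir /path/to/mounted/data
```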
@@ -230,11 +246,11 @@ Options:

- `--class`: the Scala example class name.
- `--inputDir`: the input data path of the nnframes example. The data path is on the mounted filesystem of the host. Refer to [Kubernetes Volumes](https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes) for more details.

### **5. Known Issues**
This section covers some common issues for both client mode and cluster mode.
#### **5.1 How to specify the Python environment?**
The k8s images provide conda Python environments. The image `intelanalytics/bigdl-k8s:latest` installs its Python environment at `/usr/local/envs/pytf1/bin/python`, while `intelanalytics/bigdl-k8s:latest-tf2` installs it at `/usr/local/envs/pytf2/bin/python`.
@@ -251,7 +267,7 @@ ${SPARK_HOME}/bin/spark-submit \
  --conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
  file:///path/script.py
```

#### **5.2 How to retain executor logs for debugging?**
K8s deletes the pod once the executor fails, in both client mode and cluster mode. If you want to retain the executor logs, you can set `temp-dir` to a mounted network file system (NFS) storage so that the logs are written there instead of to the pod's local directory. In this case, you may encounter `JSONDecodeError`, because multiple executors would write logs to the same physical folder and cause conflicts; the solutions are in the next section.
@@ -259,11 +275,11 @@ The k8s would delete the pod once the executor failed in client mode and cluster
```python
init_orca_context(..., extra_params={"temp-dir": "/bigdl/"})
```
#### **5.3 How to deal with "JSONDecodeError"?**
If you set `temp-dir` to mounted NFS storage and use multiple executors, you may encounter `JSONDecodeError`, since multiple executors would write to the same physical folder and cause conflicts. Not mounting `temp-dir` on shared storage is one option to avoid the conflicts. However, if you are debugging Ray on k8s, you need to output the logs to shared storage; in this case, you can set `num_nodes` to 1. After testing, you can remove the `temp-dir` setting and run with multiple executors again.
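
A minimal sketch of this debugging workaround (assuming the NFS share is mounted at `/bigdl/` as in the previous section):

```python
from bigdl.orca import init_orca_context

# With a single executor, only one process writes logs to the shared NFS
# folder, so the JSONDecodeError conflicts cannot occur.
init_orca_context(cluster_mode="k8s",
                  master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
                  container_image="intelanalytics/bigdl-k8s:latest",
                  num_nodes=1, cores=2, memory="2g",
                  extra_params={"temp-dir": "/bigdl/"})
```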
#### **5.4 How to use NFS?**
If you want to save files beyond the pod's lifecycle, such as logging callbacks or TensorBoard callbacks, you need to set the output directory to a mounted persistent volume directory. Take NFS as a simple example.
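
As a sketch of what such a mount can look like using Spark's [Kubernetes Volumes](https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes) options (the volume name `nfsvolume`, server address and paths below are placeholder assumptions; the `nfs` volume type requires Spark 3.1+):

```python
from bigdl.orca import init_orca_context

# Mount the same NFS share into both the driver and executor pods, then point
# callbacks (logging, TensorBoard, ...) at a path under the mount point.
nfs_conf = {}
for role in ("driver", "executor"):
    prefix = "spark.kubernetes.%s.volumes.nfs.nfsvolume" % role
    nfs_conf[prefix + ".options.server"] = "<nfs-server-ip>"  # placeholder
    nfs_conf[prefix + ".options.path"] = "/disk/nfs"          # exported path, placeholder
    nfs_conf[prefix + ".mount.path"] = "/bigdl"               # mount point inside the pod

init_orca_context(cluster_mode="k8s",
                  master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
                  container_image="intelanalytics/bigdl-k8s:latest",
                  num_nodes=2, cores=2, memory="2g",
                  conf=nfs_conf)
```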
@@ -289,23 +305,23 @@ ${SPARK_HOME}/bin/spark-submit \
  file:///path/script.py
```

#### **5.5 How to deal with "RayActorError"?**
"RayActorError" may caused by running out of the ray memory. If you meet this error, try to increase the memory for ray.
```python
init_orca_context(..., extra_executor_memory_for_ray="100g")
```

#### **5.6 How to set proper "steps_per_epoch" and "validation_steps"?**
The `steps_per_epoch` and `validation_steps` should equal the number of samples in the dataset divided by the batch size if you want to train on the whole dataset. They do not depend on `num_nodes` when the total dataset and batch size are fixed. For example, if you set `num_nodes` to 1 and `steps_per_epoch` to 6, then after changing `num_nodes` to 3, `steps_per_epoch` should still be 6.
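
A minimal sketch of the arithmetic (the dataset and batch sizes are made-up numbers for illustration):

```python
num_samples = 1200   # hypothetical size of the training set
batch_size = 200     # global batch size, summed across all workers

# 1200 / 200 = 6 steps per epoch, regardless of whether num_nodes is 1 or 3.
steps_per_epoch = num_samples // batch_size
```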
#### **5.7 Others**
`spark.kubernetes.container.image.pullPolicy` needs to be specified as `Always` if you need to update your Spark executor image for k8s.
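
For instance, a sketch of passing this option through `init_orca_context` (any of the `conf` mechanisms shown earlier works the same way):

```python
from bigdl.orca import init_orca_context

# Force k8s to re-pull the image on every pod start, so that updates pushed
# under the same tag (e.g. "latest") are picked up.
init_orca_context(cluster_mode="k8s",
                  master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
                  container_image="intelanalytics/bigdl-k8s:latest",
                  num_nodes=2, cores=2, memory="2g",
                  conf={"spark.kubernetes.container.image.pullPolicy": "Always"})
```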
### **6. Access logs and clear pods**
When the application is running, it's possible to stream the logs on the driver pod:
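
A sketch of the usual command (find the actual pod name with `kubectl get pods`; `<driver-pod-name>` below is a placeholder):

```bash
# Follow the driver logs as the application runs.
kubectl logs -f <driver-pod-name>
```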