# K8s User Guide

---

### **1. Pull `bigdl-k8s` Docker Image**

You may pull the prebuilt BigDL `bigdl-k8s` image from [Docker Hub](https://hub.docker.com/r/intelanalytics/bigdl-k8s/tags) as follows:

```bash
sudo docker pull intelanalytics/bigdl-k8s:latest
```
**Speed up pulling image by adding mirrors**

To speed up pulling the image from Docker Hub, you may add the `registry-mirrors` key and value by editing `daemon.json` (located in the `/etc/docker/` folder on Linux):

```
{
  "registry-mirrors": ["https://<my-docker-mirror-host>"]
}
```

For instance, users in China may add the USTC mirror as follows:

```
{
  "registry-mirrors": ["https://docker.mirrors.ustc.edu.cn"]
}
```

After that, flush the changes and restart Docker:

```
sudo systemctl daemon-reload
sudo systemctl restart docker
```
### **2. Launch a Client Container**

You can submit BigDL applications from a client container that provides the required environment.

```bash
sudo docker run -itd --net=host \
    -v /etc/kubernetes:/etc/kubernetes \
    -v /root/.kube:/root/.kube \
    intelanalytics/bigdl-k8s:latest bash
```

**Note:** to create the client container, `-v /etc/kubernetes:/etc/kubernetes` and `-v /root/.kube:/root/.kube` are required to specify the paths of the kube config and installation.

You can specify more arguments:
```bash
sudo docker run -itd --net=host \
    -v /etc/kubernetes:/etc/kubernetes \
    -v /root/.kube:/root/.kube \
    -e http_proxy=http://your-proxy-host:your-proxy-port \
    -e https_proxy=https://your-proxy-host:your-proxy-port \
    -e RUNTIME_SPARK_MASTER=k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    -e RUNTIME_K8S_SERVICE_ACCOUNT=account \
    -e RUNTIME_K8S_SPARK_IMAGE=intelanalytics/bigdl-k8s:latest \
    -e RUNTIME_PERSISTENT_VOLUME_CLAIM=myvolumeclaim \
    -e RUNTIME_DRIVER_HOST=x.x.x.x \
    -e RUNTIME_DRIVER_PORT=54321 \
    -e RUNTIME_EXECUTOR_INSTANCES=1 \
    -e RUNTIME_EXECUTOR_CORES=4 \
    -e RUNTIME_EXECUTOR_MEMORY=20g \
    -e RUNTIME_TOTAL_EXECUTOR_CORES=4 \
    -e RUNTIME_DRIVER_CORES=4 \
    -e RUNTIME_DRIVER_MEMORY=10g \
    intelanalytics/bigdl-k8s:latest bash
```
- `http_proxy`/`https_proxy` is used to specify the HTTP/HTTPS proxy.
- `RUNTIME_SPARK_MASTER` is used to specify the Spark master, which should be `k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>` or `spark://<spark-master-host>:<spark-master-port>`.
- `RUNTIME_K8S_SERVICE_ACCOUNT` is the service account for the driver pod. Please refer to k8s [RBAC](https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac).
- `RUNTIME_K8S_SPARK_IMAGE` is the k8s image.
- `RUNTIME_PERSISTENT_VOLUME_CLAIM` is used to specify the [Kubernetes volume](https://spark.apache.org/docs/latest/running-on-kubernetes.html#volume-mounts) mount. You are supposed to use a volume mount to store or receive data.
- `RUNTIME_DRIVER_HOST`/`RUNTIME_DRIVER_PORT` is used to specify the driver host and port number (only required when submitting jobs in Kubernetes client mode).
- Other environment variables are for Spark configuration settings. The default values in this image are listed above. Replace the values as needed.
Once the container is created, enter the container:

```bash
sudo docker exec -it <containerID> bash
```

You will log in to the container and see the following prompt:

```
root@[hostname]:/opt/spark/work-dir#
```

`/opt/spark/work-dir` is the Spark work path.

The `/opt` directory contains:

- `download-bigdl.sh` is used for downloading BigDL distributions.
- `start-notebook-spark.sh` is used for starting the Jupyter Notebook on a standard Spark cluster.
- `start-notebook-k8s.sh` is used for starting the Jupyter Notebook on a k8s cluster.
- `bigdl-x.x-SNAPSHOT` is `BIGDL_HOME`, which is the home of the BigDL distribution.
- `bigdl-examples` is the directory containing the downloaded Python example code.
- `install-conda-env.sh` is used for installing the conda environment and Python dependencies.
- `jdk` is the JDK home.
- `spark` is the Spark home.
- `redis` is the Redis home.
### **3. Submit to k8s from remote**

Instead of launching a client container, you can also submit BigDL applications from a remote node with the following steps:

1. Check the [prerequisites](https://spark.apache.org/docs/latest/running-on-kubernetes.html#prerequisites) of running Spark on Kubernetes.

    - The remote node needs to properly set up the configurations and authentications of the k8s cluster (e.g. the `config` file under `~/.kube`, especially the server address in the `config`).

    - Install `kubectl` on the remote node and run some sample commands for verification, for example `kubectl auth can-i <list|create|edit|delete> pods` (see the sketch after this list).
    Note that installing `kubectl` is not a must for the remote node, but it is a useful tool to verify whether the remote node has access to the k8s cluster.

    - The environment variables `http_proxy` and `https_proxy` may affect the connection using `kubectl`. You may check and unset these environment variables in case you get errors when executing the `kubectl` commands on the remote node.

2. Follow the steps in the [Python User Guide](./python.html#install) to install BigDL in a conda environment.
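Though not required, a quick way to sanity-check the remote node's access is to run a few `kubectl` commands; a minimal sketch, assuming `kubectl` is installed and configured against the cluster's `~/.kube/config`:

```bash
# Confirm the remote node can reach the k8s API server
kubectl cluster-info

# Verify the permissions needed for driver and executor pods
kubectl auth can-i create pods
kubectl auth can-i list pods
kubectl auth can-i delete pods
```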
### **4. Run BigDL on k8s**

_**Note**: Please make sure `kubectl` has appropriate permission to create, list and delete pods._

You may refer to [Section 5](#known-issues) for some known issues when running BigDL on k8s.

#### **4.1 K8s client mode**

We recommend using `init_orca_context` at the very beginning of your code (e.g. in `script.py`) to initiate and run BigDL on standard K8s clusters in [client mode](http://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode).

```python
from bigdl.orca import init_orca_context

init_orca_context(cluster_mode="k8s", master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
                  container_image="intelanalytics/bigdl-k8s:latest",
                  num_nodes=2, cores=2, memory="2g")
```

Remark: If necessary, you may specify the Spark driver host and port by adding the argument `conf={"spark.driver.host": "x.x.x.x", "spark.driver.port": "x"}`, as shown below.
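A minimal sketch of the same call with the driver host and port passed through `conf` (the IP address and port below are placeholders):

```python
from bigdl.orca import init_orca_context

# Same call as in the snippet above, plus explicit Spark driver host/port
init_orca_context(cluster_mode="k8s",
                  master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
                  container_image="intelanalytics/bigdl-k8s:latest",
                  num_nodes=2, cores=2, memory="2g",
                  conf={"spark.driver.host": "x.x.x.x",
                        "spark.driver.port": "54321"})
```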

Execute `python script.py` to run your program on the k8s cluster directly.
#### **4.2 K8s cluster mode**

For k8s [cluster mode](https://spark.apache.org/docs/3.1.2/running-on-kubernetes.html#cluster-mode), you can call `init_orca_context` and specify `cluster_mode` to be "spark-submit" in your Python script (e.g. in `script.py`):

```python
from bigdl.orca import init_orca_context

init_orca_context(cluster_mode="spark-submit")
```

Use spark-submit to submit your BigDL program:
```bash
${SPARK_HOME}/bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
  --deploy-mode cluster \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=account \
  --name bigdl \
  --conf spark.kubernetes.container.image="intelanalytics/bigdl-k8s:latest" \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.pyspark.driver.python=./env/bin/python \
  --conf spark.pyspark.python=./env/bin/python \
  --archives path/to/environment.tar.gz#env \
  --conf spark.executor.instances=1 \
  --executor-memory 10g \
  --driver-memory 10g \
  --executor-cores 8 \
  --num-executors 2 \
  --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
  --py-files local://${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,local:///path/script.py \
  --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
  --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
  local:///path/script.py
```
#### **4.3 Run Jupyter Notebooks**

After the Docker container is launched and you have logged in to it, you can start the Jupyter Notebook service inside the container.

In the `/opt` directory, run this command line to start the Jupyter Notebook service:
```
./start-notebook-k8s.sh
```

You will see output like the following, which means the Jupyter Notebook service has started successfully within the container.
```
[I 23:51:08.456 NotebookApp] Serving notebooks from local directory: /opt/bigdl-2.1.0-SNAPSHOT/apps
[I 23:51:08.456 NotebookApp] Jupyter Notebook 6.2.0 is running at:
[I 23:51:08.456 NotebookApp] http://xxxx:12345/?token=...
[I 23:51:08.457 NotebookApp]  or http://127.0.0.1:12345/?token=...
[I 23:51:08.457 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
```

Then, refer to the [docker guide](./docker.md) to open the Jupyter Notebook service from a browser and run a notebook.
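If the container runs on a remote machine whose notebook port is not directly reachable from your browser, one generic workaround (not BigDL-specific; the user, host, and port 12345 below are placeholders based on the sample output above) is an SSH tunnel:

```bash
# Forward the remote notebook port to the local machine, then open http://localhost:12345
ssh -N -L 12345:localhost:12345 <user>@<notebook-host>
```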
#### **4.4 Run Scala programs**

Use spark-submit to submit your BigDL program, e.g., run the [nnframes imageInference](../../../../../../scala/dllib/src/main/scala/com/intel/analytics/bigdl/dllib/example/nnframes/imageInference) example (running in either local mode or cluster mode) as follows:
```bash
${SPARK_HOME}/bin/spark-submit \
  --master ${RUNTIME_SPARK_MASTER} \
  --deploy-mode client \
  --conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
  --conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
  --name bigdl \
  --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
  --conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/path \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/path \
  --conf spark.kubernetes.driver.label.<your-label>=true \
  --conf spark.kubernetes.executor.label.<your-label>=true \
  --executor-cores ${RUNTIME_EXECUTOR_CORES} \
  --executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
  --total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
  --driver-cores ${RUNTIME_DRIVER_CORES} \
  --driver-memory ${RUNTIME_DRIVER_MEMORY} \
  --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
  --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
  --conf spark.sql.catalogImplementation='in-memory' \
  --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
  --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
  --class com.intel.analytics.bigdl.dllib.examples.nnframes.imageInference.ImageTransferLearning \
  ${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip \
  --inputDir /path
```
Options:

- `--master`: the Spark master, which must be a URL with the format `k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>`.
- `--deploy-mode`: submit the application in client or cluster mode.
- `--name`: the Spark application name.
- `--conf`: used to specify the k8s service account, the container image to use for the Spark application, driver volume names and paths, pod labels, Spark driver and executor configurations, etc. You can refer to [spark configuration](https://spark.apache.org/docs/latest/configuration.html) and [spark on k8s configuration](https://spark.apache.org/docs/latest/running-on-kubernetes.html#configuration) for more details.
- `--properties-file`: the customized configuration properties file.
- `--py-files`: the extra Python packages needed.
- `--class`: the Scala example class name.
- `--inputDir`: the input data path of the nnframes example. The data path is on the mounted filesystem of the host. Refer to [Kubernetes Volumes](https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes) for more details.
### **5. Known Issues**

This section covers some common issues for both client mode and cluster mode.

#### **5.1 How to specify the Python environment?**

In client mode, follow the [Python user guide](./python.md) to install conda and BigDL, and run the application:
```bash
python script.py
```
In cluster mode, install conda, pack the environment, and use it on both the driver and executor.
- Pack the current conda environment to `environment.tar.gz` (you can use any name you like):
  ```bash
  conda pack -o environment.tar.gz
  ```
- Run spark-submit with `--archives` and specify the Python environment for the driver and executor:
  ```bash
  --conf spark.pyspark.driver.python=./env/bin/python \
  --conf spark.pyspark.python=./env/bin/python \
  --archives local:///bigdl2.0/data/environment.tar.gz#env \ # this path should be accessible by the k8s pod
  ```
#### **5.2 How to retain executor logs for debugging?**

K8s deletes the pod once the executor fails, in both client mode and cluster mode. If you want to keep the executor logs, you can set `temp-dir` to a mounted network file system (NFS) storage to change the log directory. In this case, you may encounter `JSONDecodeError` because multiple executors would write logs to the same physical folder and cause conflicts. The solutions are described in the next section.

```python
init_orca_context(..., extra_params = {"temp-dir": "/bigdl/"})
```
#### **5.3 How to deal with "JSONDecodeError"?**

If you set `temp-dir` to a mounted NFS storage and use multiple executors, you may encounter `JSONDecodeError` since multiple executors would write to the same physical folder and cause conflicts. One option to avoid conflicts is not to mount `temp-dir` on shared storage. But if you debug Ray on k8s, you need to output logs to a shared storage; in this case, you could set `num_nodes` to 1 (see the sketch below). After testing, you can remove the `temp-dir` setting and run multiple executors.
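A minimal sketch of this debugging setup, combining the `num_nodes` argument from Section 4.1 with the `temp-dir` setting from Section 5.2 (the master, image, and mount path are placeholders):

```python
from bigdl.orca import init_orca_context

# Use a single executor while debugging so only one process writes to the shared log folder
init_orca_context(cluster_mode="k8s",
                  master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
                  container_image="intelanalytics/bigdl-k8s:latest",
                  num_nodes=1, cores=2, memory="2g",
                  extra_params={"temp-dir": "/bigdl/"})
```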
#### **5.4 How to use NFS?**

If you want to save some files beyond the pod's lifecycle, such as logging callbacks or TensorBoard callbacks, you need to set the output directory to a mounted persistent volume directory. Take NFS as a simple example.

Use NFS in client mode:

```python
init_orca_context(cluster_mode="k8s", ...,
                  conf={...,
                  "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName":"nfsvolumeclaim",
                  "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path": "/bigdl"
                  })
```

Use NFS in cluster mode:

```bash
${SPARK_HOME}/bin/spark-submit \
  --... ...\
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName="nfsvolumeclaim" \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path="/bigdl" \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName="nfsvolumeclaim" \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path="/bigdl" \
  file:///path/script.py
```
#### **5.5 How to deal with "RayActorError"?**

"RayActorError" may be caused by running out of Ray memory. If you encounter this error, try to increase the memory for Ray.

```python
init_orca_context(..., extra_executor_memory_for_ray="100g")
```
#### **5.6 How to set proper "steps_per_epoch" and "validation_steps"?**

If you want to train on the whole dataset, `steps_per_epoch` and `validation_steps` should equal the number of samples in the dataset divided by the batch size. They do not depend on `num_nodes` when the total dataset size and batch size are fixed. For example, suppose you set `num_nodes` to 1 and `steps_per_epoch` to 6. If you change `num_nodes` to 3, `steps_per_epoch` should still be 6. The arithmetic is sketched below.
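A minimal sketch of the arithmetic (the dataset and batch sizes below are made-up numbers for illustration):

```python
import math

num_train_samples = 1200   # hypothetical number of training samples
num_val_samples = 300      # hypothetical number of validation samples
batch_size = 200           # global batch size (fixed regardless of num_nodes)

# Steps depend only on dataset size and batch size, not on num_nodes
steps_per_epoch = math.ceil(num_train_samples / batch_size)   # -> 6
validation_steps = math.ceil(num_val_samples / batch_size)    # -> 2
```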
#### **5.7 Others**

`spark.kubernetes.container.image.pullPolicy` needs to be specified as `Always` if you need to update your Spark executor image for k8s.
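For instance, the spark-submit command in Section 4.2 sets this via:

```bash
  --conf spark.kubernetes.container.image.pullPolicy=Always \
```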
### **6. Access logs and clear pods**

While the application is running, you can stream the logs of the driver pod:

```bash
$ kubectl logs <spark-driver-pod>
```

To check the pod status or get some basic information about the pod, use:

```bash
$ kubectl describe pod <spark-driver-pod>
```

You can also check other pods in the same way.
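For example, to list the pods of your application (the label below is a placeholder matching the `spark.kubernetes.driver.label.<your-label>`/`spark.kubernetes.executor.label.<your-label>` configuration shown in Section 4.4):

```bash
# List pods in the current namespace, optionally filtered by the label set at submission time
kubectl get pods
kubectl get pods -l <your-label>=true
```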

After the application finishes running, delete the driver pod:

```bash
$ kubectl delete pod <spark-driver-pod>
```

Or clean up the entire Spark application by pod label:

```bash
$ kubectl delete pod -l <pod label>
```