Update K8s tutorial (part4) (#6724)

* update python

* update spark-submit
Kai Huang 2022-11-22 19:51:04 +08:00 committed by GitHub
parent 7f2beefb7b
commit d0388cb8d8
2 changed files with 187 additions and 261 deletions


@ -2,7 +2,7 @@
This tutorial provides a step-by-step guide on how to run BigDL-Orca programs on Kubernetes (K8s) clusters, using a [PyTorch Fashion-MNIST program](https://github.com/intel-analytics/BigDL/tree/main/python/orca/tutorial/pytorch/FashionMNIST) as a working example.
The **Client Container** that appears in this tutorial refers to the docker container where you launch or submit your applications. The __Develop Node__ is the host machine where you launch the Client Container.
---
## 1. Basic Concepts
@ -50,10 +50,48 @@ For k8s-client, the Spark driver runs in the client process (outside the K8s clu
Please see more details in [K8s-Cluster](https://spark.apache.org/docs/latest/running-on-kubernetes.html#cluster-mode) and [K8s-Client](https://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode).
For **k8s-cluster** mode, a `driver pod name` will be returned when the application is completed. You can retrieve the results on the __Develop Node__ following the commands below:
* Retrieve the logs on the driver pod:
```bash
kubectl logs <driver-pod-name>
```
* Check the pod status or get basic information of the driver pod:
```bash
kubectl describe pod <driver-pod-name>
```
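If the application is still running, you can also locate the driver pod, stream its logs, and delete the pod afterwards to clean up; a sketch using standard `kubectl` commands:
```bash
# Find the driver pod among the running pods
kubectl get pods

# Stream the driver logs while the application is running
kubectl logs -f <driver-pod-name>

# Delete the completed driver pod once you no longer need its logs
kubectl delete pod <driver-pod-name>
```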
### 1.3 Load Data from Volumes
When you are running programs on K8s, please load data from [Volumes](https://kubernetes.io/docs/concepts/storage/volumes/) accessible to all K8s pods. We use Network File System (NFS) with the path `/bigdl/nfsdata` in this tutorial as an example.
To load data from Volumes, please set the corresponding Volume configurations for Spark using the `--conf` option in the spark-submit script or by specifying `conf` in `init_orca_context`. Here we list the configurations for using NFS as the Volume.
For **k8s-client** mode:
* `spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName`: specify the claim name of the `persistentVolumeClaim` with volumeName `nfsvolumeclaim` to mount into executor pods.
* `spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path`: specify the NFS path to be mounted as `nfsvolumeclaim` into executor pods.
Besides the above two configurations, you need to additionally set the following configurations for **k8s-cluster** mode:
* `spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName`: specify the claim name of the `persistentVolumeClaim` with volumeName `nfsvolumeclaim` to mount into the driver pod.
* `spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path`: specify the NFS path to be mounted as `nfsvolumeclaim` into the driver pod.
* `spark.kubernetes.authenticate.driver.serviceAccountName`: the service account for the driver pod.
* `spark.kubernetes.file.upload.path`: the path to store files uploaded from the spark-submit side in k8s-cluster mode.
A sample `conf` for using NFS in the Fashion-MNIST example of this tutorial is shown below:
```python
{
# For k8s-client mode
"spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName": "nfsvolumeclaim",
"spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path": "/bigdl/nfsdata",
# Additionally for k8s-cluster mode
"spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName": "nfsvolumeclaim",
"spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path": "/bigdl/nfsdata",
"spark.kubernetes.authenticate.driver.serviceAccountName": "spark",
"spark.kubernetes.file.upload.path": "/bigdl/nfsdata/"
}
```
After mounting the Volume (NFS) into the BigDL container (see __[Section 2.2](#create-a-k8s-client-container)__ for more details), the Fashion-MNIST example can load data from NFS as local storage.
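For example, a sketch of passing these configurations to `init_orca_context` for k8s-client mode (replace the master URL, image tag and file paths with your own):
```python
from bigdl.orca import init_orca_context

conf = {
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName": "nfsvolumeclaim",
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path": "/bigdl/nfsdata",
}

init_orca_context(cluster_mode="k8s-client", num_nodes=2, cores=2, memory="2g",
                  master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
                  container_image="intelanalytics/bigdl-k8s:2.1.0",
                  extra_python_lib="/path/to/model.py", conf=conf)
```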
@ -91,24 +129,6 @@ sudo docker pull intelanalytics/bigdl-k8s:2.1.0
### 2.2 Create a K8s Client Container
Please create the __Client Container__ using the script below:
```bash
sudo docker run -itd --net=host \
-v /etc/kubernetes:/etc/kubernetes \
-v /root/.kube:/root/.kube \
@ -124,23 +144,26 @@ sudo docker run -itd --net=host \
-e RUNTIME_DRIVER_HOST=x.x.x.x \
-e RUNTIME_DRIVER_PORT=54321 \
-e RUNTIME_EXECUTOR_INSTANCES=2 \
-e RUNTIME_EXECUTOR_CORES=4 \
-e RUNTIME_EXECUTOR_MEMORY=2g \
-e RUNTIME_TOTAL_EXECUTOR_CORES=8 \
-e RUNTIME_DRIVER_CORES=2 \
-e RUNTIME_DRIVER_MEMORY=2g \
intelanalytics/bigdl-k8s:latest bash
```
In the script:
* **Please switch the tag according to the BigDL image you pull.**
* **Please make sure you are mounting the correct Volume path (e.g. NFS) into the container.**
* `--net=host`: use the host network stack for the Docker container.
* `-v /etc/kubernetes:/etc/kubernetes`: specify the path of Kubernetes configurations to mount into the Docker container.
* `-v /root/.kube:/root/.kube`: specify the path of Kubernetes installation to mount into the Docker container.
* `-v /path/to/nfsdata:/bigdl/nfsdata`: mount the NFS path on the host into the container as the specified path (e.g. "/bigdl/nfsdata").
* `NOTEBOOK_PORT`: an integer that specifies the port number for the Notebook (only required if you use notebook).
* `NOTEBOOK_TOKEN`: a string that specifies the token for the Notebook (only required if you use notebook).
* `RUNTIME_SPARK_MASTER`: a URL format that specifies the Spark master: k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>.
* `RUNTIME_K8S_SERVICE_ACCOUNT`: a string that specifies the service account for the driver pod.
* `RUNTIME_K8S_SPARK_IMAGE`: the name of the BigDL K8s docker image.
* `RUNTIME_PERSISTENT_VOLUME_CLAIM`: a string that specifies the Kubernetes volumeName (e.g. "nfsvolumeclaim").
* `RUNTIME_DRIVER_HOST`: a URL format that specifies the driver localhost (only required by k8s-client mode).
* `RUNTIME_DRIVER_PORT`: a string that specifies the driver port (only required by k8s-client mode).
@ -151,6 +174,10 @@ In the script:
* `RUNTIME_DRIVER_CORES`: an integer that specifies the number of cores for the driver node.
* `RUNTIME_DRIVER_MEMORY`: a string that specifies the memory for the driver node.
__Notes:__
* The __Client Container__ contains all the required environment except K8s configurations.
* You don't need to create Spark executor containers manually; they are scheduled by K8s at runtime.
### 2.3 Launch the K8s Client Container
Once the container is created, a `containerID` will be returned, with which you can enter the container using the command below:
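A sketch of entering the container (the `<containerID>` below is the ID printed by `docker run`; you can also look it up with `sudo docker ps`):
```bash
sudo docker exec -it <containerID> bash
```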
@ -180,7 +207,7 @@ pip install torch torchvision
## 4. Prepare Dataset
To run the Fashion-MNIST example provided by this tutorial on K8s, you should upload the dataset to a K8s Volume (e.g. NFS).
Please download the Fashion-MNIST dataset manually on your __Develop Node__ and put the data into the Volume. Note that PyTorch `FashionMNIST Dataset` requires unzipped files located in `FashionMNIST/raw/` under the root folder.
```bash
# PyTorch official dataset download link
@ -193,6 +220,8 @@ mv /path/to/fashion-mnist/data/fashion /bigdl/nfsdata/dataset/FashionMNIST/raw
gzip -dk /bigdl/nfsdata/dataset/FashionMNIST/raw/*
```
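Before moving on, you may want to verify the dataset layout on NFS; the raw folder should contain the four standard Fashion-MNIST files (both the `.gz` archives and the unzipped files, since `gzip -dk` keeps the originals). A quick check:
```bash
ls /bigdl/nfsdata/dataset/FashionMNIST/raw
# Expected (among others): train-images-idx3-ubyte, train-labels-idx1-ubyte,
# t10k-images-idx3-ubyte, t10k-labels-idx1-ubyte
```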
In the given example, you can specify the argument `--remote_dir` to be the directory on NFS for the Fashion-MNIST dataset.
---
## 5. Prepare Custom Modules
@ -251,129 +280,83 @@ In the following part, we will illustrate four ways to submit and run BigDL Orca
You can choose one of them based on your preference or cluster settings.
We provide the running command for the [Fashion-MNIST example](https://github.com/intel-analytics/BigDL/blob/main/python/orca/tutorial/pytorch/FashionMNIST/) in the __Client Container__ in this section.
### 6.1 Use `python` command
This is the easiest and most recommended way to run BigDL Orca on K8s as a normal Python program.
See [here](#init-orca-context) for the runtime configurations.
#### 6.1.1 K8s-Client
Run the example with the following command by setting the cluster_mode to "k8s-client":
```bash
python train.py --cluster_mode k8s-client --remote_dir file:///bigdl/nfsdata/dataset
```
In the script:
* `cluster_mode`: set the cluster_mode in `init_orca_context`.
* `remote_dir`: directory on NFS for loading the dataset.
#### 6.1.2 K8s-Cluster
Before running the example on `k8s-cluster` mode, you should:
* In the __Client Container__:
Pack the currently activated conda environment into an archive:
```bash
conda pack -o environment.tar.gz
```
* On the __Develop Node__:
1. Upload the conda archive to NFS.
```bash
docker cp <containerID>:/path/to/environment.tar.gz /bigdl/nfsdata
```
2. Upload the example Python file to NFS.
```bash
cp /path/to/train.py /bigdl/nfsdata
```
3. Upload the extra Python dependency files to NFS.
```bash
cp /path/to/model.py /bigdl/nfsdata
```
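Note that for k8s-cluster mode, `init_orca_context` in the program needs the additional driver Volume, service account and upload-path configurations listed in Section 1.3, and should point to the conda archive uploaded to NFS; a sketch based on this tutorial's paths:
```python
from bigdl.orca import init_orca_context

conf = {
    "spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName": "nfsvolumeclaim",
    "spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path": "/bigdl/nfsdata",
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName": "nfsvolumeclaim",
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path": "/bigdl/nfsdata",
    "spark.kubernetes.authenticate.driver.serviceAccountName": "spark",
    "spark.kubernetes.file.upload.path": "/bigdl/nfsdata/",
}

init_orca_context(cluster_mode="k8s-cluster", num_nodes=2, cores=2, memory="2g",
                  master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
                  container_image="intelanalytics/bigdl-k8s:2.1.0",
                  penv_archive="file:///bigdl/nfsdata/environment.tar.gz",
                  extra_python_lib="/bigdl/nfsdata/model.py", conf=conf)
```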
Run the example with the following command by setting the cluster_mode to "k8s-cluster":
```bash
python /bigdl/nfsdata/train.py --cluster_mode k8s-cluster --remote_dir /bigdl/nfsdata/dataset
```
In the script:
* `cluster_mode`: set the cluster_mode in `init_orca_context`.
* `remote_dir`: directory on NFS for loading the dataset.
__Note:__ For k8s-cluster mode, a `driver pod name` will be returned when the application completes. You can retrieve the training logs and check the status of the driver pod on the __Develop Node__ as described in Section 1.2.
### 6.2 Use `spark-submit`
Set the cluster_mode to "spark-submit" in `init_orca_context`.
```python
from bigdl.orca import init_orca_context

init_orca_context(cluster_mode="spark-submit")
```
Pack the currently activated conda environment into an archive in the __Client Container__:
```bash
conda pack -o environment.tar.gz
```
Some runtime configurations for Spark are as follows:
* `--master`: a URL format that specifies the Spark master: k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>.
* `--name`: the name of the Spark application.
* `--conf spark.kubernetes.container.image`: the name of the BigDL K8s docker image.
* `--conf spark.kubernetes.authenticate.driver.serviceAccountName`: the service account for the driver pod.
* `--conf spark.executor.instances`: the number of executors.
* `--executor-memory`: the memory for each executor.
* `--driver-memory`: the memory for the driver node.
* `--executor-cores`: the number of cores for each executor.
* `--total-executor-cores`: the total number of executor cores.
* `--properties-file`: the BigDL configuration properties to be uploaded to K8s.
* `--py-files`: the extra Python dependency files to be uploaded to K8s.
* `--archives`: the conda archive to be uploaded to K8s.
* `--conf spark.driver.extraClassPath`: upload and register the BigDL jar files to the driver's classpath.
* `--conf spark.executor.extraClassPath`: upload and register the BigDL jar files to the executors' classpath.
* `--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName`: specify the claim name of `persistentVolumeClaim` to mount `persistentVolume` into executor pods.
* `--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path`: specify the path to be mounted as `persistentVolumeClaim` to executor pods.
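The `${RUNTIME_*}` variables referenced in the scripts below are the ones injected into the __Client Container__ through `docker run -e` in Section 2.2; an optional sanity check before submitting:
```bash
# Run inside the Client Container; these should print the values set in Section 2.2
echo $RUNTIME_SPARK_MASTER
echo $RUNTIME_K8S_SPARK_IMAGE
echo $RUNTIME_PERSISTENT_VOLUME_CLAIM
```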
#### 6.2.1 K8s Client
Submit and run the program for `k8s-client` mode following the `spark-submit` script below:
```bash
${SPARK_HOME}/bin/spark-submit \
--master ${RUNTIME_SPARK_MASTER} \
@ -390,68 +373,41 @@ ${SPARK_HOME}/bin/spark-submit \
--total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
--properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
--conf spark.pyspark.driver.python=python \
--conf spark.pyspark.python=./environment/bin/python \
--archives /path/to/environment.tar.gz#environment \
--py-files ${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/path/to/train.py,/path/to/model.py \
--conf spark.driver.extraClassPath=${BIGDL_HOME}/jars/* \
--conf spark.executor.extraClassPath=${BIGDL_HOME}/jars/* \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl/nfsdata \
train.py --cluster_mode spark-submit --remote_dir /bigdl/nfsdata/dataset
```
In the `spark-submit` script:
* `--deploy-mode`: set it to `client` when running programs on k8s-client mode.
* `--conf spark.driver.host`: the localhost for the driver pod.
* `--conf spark.pyspark.driver.python`: set the active Python location in the __Client Container__ as the driver's Python environment.
* `--conf spark.pyspark.python`: set the Python location in conda archive as each executor's Python environment.
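While the application is running, Spark creates the executor pods on the K8s cluster; you can watch them from the __Develop Node__ to check that they are scheduled correctly (pod names will differ in your cluster):
```bash
# List all pods once, or keep watching with -w
kubectl get pods
kubectl get pods -w
```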
#### 6.2.2 K8s Cluster
* On the __Develop Node__:
1. Upload the conda archive to NFS.
```bash
docker cp <containerID>:/path/to/environment.tar.gz /bigdl/nfsdata
```
2. Upload the example Python file to NFS.
```bash
cp /path/to/train.py /bigdl/nfsdata
```
3. Upload the extra Python dependency files to NFS.
```bash
cp /path/to/model.py /bigdl/nfsdata
```
Submit and run the program for `k8s-cluster` mode following the `spark-submit` script below:
```bash
${SPARK_HOME}/bin/spark-submit \
--master ${RUNTIME_SPARK_MASTER} \
@ -460,9 +416,9 @@ ${SPARK_HOME}/bin/spark-submit \
--conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
--conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
--archives file:///bigdl/nfsdata/environment.tar.gz#environment \
--conf spark.pyspark.python=environment/bin/python \
--conf spark.executorEnv.PYTHONHOME=environment \
--conf spark.kubernetes.file.upload.path=/bigdl/nfsdata \
--executor-cores ${RUNTIME_EXECUTOR_CORES} \
--executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
@ -477,41 +433,14 @@ ${SPARK_HOME}/bin/spark-submit \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl/nfsdata \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl/nfsdata \
file:///bigdl/nfsdata/train.py --cluster_mode spark-submit --remote_dir /bigdl/nfsdata/dataset
```
In the `spark-submit` script:
* `--deploy-mode`: set it to `cluster` when running programs on k8s-cluster mode.
* `--conf spark.pyspark.python`: set the Python location in conda archive as each executor's Python environment.
* `--conf spark.executorEnv.PYTHONHOME`: the search path of Python libraries on executor pods.
* `--conf spark.kubernetes.file.upload.path`: the path to store files uploaded from the spark-submit side in k8s-cluster mode.
Please retrieve the training stats on the __Develop Node__ following the commands below:
* Retrieve the training logs on the driver pod:
```bash
kubectl logs orca-k8s-cluster-tutorial-driver
```
* Check the pod status or get basic information of the driver pod:
```bash
kubectl describe pod orca-k8s-cluster-tutorial-driver
```
### 6.3 Use Kubernetes Deployment (with Conda Archive)


@ -178,7 +178,7 @@ In the following part, we will illustrate three ways to submit and run BigDL Orc
You can choose one of them based on your preference or cluster settings.
We provide the running command for the [Fashion-MNIST example](https://github.com/intel-analytics/BigDL/blob/main/python/orca/tutorial/pytorch/FashionMNIST/) on the __Client Node__ in this section.
### 5.1 Use `python` Command
This is the easiest and most recommended way to run BigDL Orca on YARN as a normal Python program. In this way, you only need to prepare the environment on the __Client Node__ and the environment will be automatically packaged and distributed to the YARN cluster.
@ -208,8 +208,8 @@ jupyter notebook --notebook-dir=/path/to/notebook/directory --ip=* --no-browser
You can copy the code in [train.py](https://github.com/intel-analytics/BigDL/blob/main/python/orca/tutorial/pytorch/FashionMNIST/train.py) to the notebook and run the cells. Set the cluster_mode to "yarn-client" in `init_orca_context`.
```python
sc = init_orca_context(cluster_mode="yarn-client", cores=4, memory="2g", num_nodes=2,
                       driver_cores=2, driver_memory="2g",
                       extra_python_lib="model.py")
```
Note that Jupyter Notebook cannot run on `yarn-cluster` mode, as the driver is not running on the __Client Node__ (where you run the notebook).
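When you finish the program in the notebook, remember to stop the OrcaContext so that the YARN resources are released; a minimal sketch:
```python
from bigdl.orca import stop_orca_context

stop_orca_context()
```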
@ -243,8 +243,8 @@ Submit and run the example for `yarn-client` mode following the `bigdl-submit` s
bigdl-submit \
--master yarn \
--deploy-mode client \
--executor-memory 2g \
--driver-memory 2g \
--executor-cores 4 \
--num-executors 2 \
--py-files model.py \
@ -255,7 +255,7 @@ bigdl-submit \
```
In the `bigdl-submit` script:
* `--master`: the spark master, set it to "yarn".
* `--deploy-mode`: set it to `client` when running programs on yarn-client mode.
* `--conf spark.pyspark.driver.python`: set the active Python location on the __Client Node__ as the driver's Python environment. You can find it by running `which python`.
* `--conf spark.pyspark.python`: set the Python location in conda archive as each executor's Python environment.
@ -266,8 +266,8 @@ Submit and run the program for `yarn-cluster` mode following the `bigdl-submit`
bigdl-submit \
--master yarn \
--deploy-mode cluster \
--executor-memory 2g \
--driver-memory 2g \
--executor-cores 4 \
--num-executors 2 \
--py-files model.py \
@ -278,7 +278,7 @@ bigdl-submit \
```
In the `bigdl-submit` script:
* `--master`: the spark master, set it to "yarn".
* `--deploy-mode`: set it to `cluster` when running programs on yarn-cluster mode.
* `--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON`: set the Python location in conda archive as the Python environment of the Application Master.
* `--conf spark.executorEnv.PYSPARK_PYTHON`: also set the Python location in conda archive as each executor's Python environment. The Application Master and the executors will all use the archive for the Python environment.
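Since the driver runs inside the YARN cluster in this mode, you can fetch the full application logs with the YARN CLI after the application finishes (a sketch; the application ID is printed when you submit, and log aggregation needs to be enabled on the cluster):
```bash
yarn logs -applicationId <application-id>
```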
@ -323,43 +323,17 @@ Some runtime configurations for Spark are as follows:
* `--num-executors`: the number of executors.
* `--py-files`: the extra Python dependency files to be uploaded to YARN.
* `--archives`: the conda archive to be uploaded to YARN.
* `--properties-file`: the BigDL configuration properties to be uploaded to YARN.
* `--jars`: upload and register BigDL jars to YARN.
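The `spark-submit` scripts below also assume that the Spark and BigDL related environment variables are set on the __Client Node__; a sketch with placeholder values (adjust them to your own Spark and BigDL installation):
```bash
export SPARK_HOME=/path/to/spark      # your Spark installation directory
export BIGDL_HOME=/path/to/bigdl      # your BigDL installation directory
export SPARK_VERSION=3.1.2            # example value, match your Spark version
export BIGDL_VERSION=2.1.0            # example value, match your BigDL version
```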
#### 5.3.1 Yarn Cluster
Submit and run the program for `yarn-cluster` mode following the `spark-submit` script below:
```bash
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--executor-memory 2g \
--driver-memory 2g \
--executor-cores 4 \
--num-executors 2 \
--archives /path/to/environment.tar.gz#environment \
@ -372,8 +346,31 @@ ${SPARK_HOME}/bin/spark-submit \
```
In the `spark-submit` script:
* `--master`: the spark master, set it to "yarn".
* `--deploy-mode`: set it to `cluster` when running programs on yarn-cluster mode.
* `--properties-file`: the BigDL configuration properties to be uploaded to YARN.
* `--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON`: set the Python location in conda archive as the Python environment of the Application Master.
* `--conf spark.executorEnv.PYSPARK_PYTHON`: also set the Python location in conda archive as each executor's Python environment. The Application Master and the executors will all use the archive for the Python environment.
* `--jars`: upload and register BigDL jars to YARN.
#### 5.3.2 Yarn Client
Submit and run the program for `yarn-client` mode following the `spark-submit` script below:
```bash
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode client \
--executor-memory 2g \
--driver-memory 2g \
--executor-cores 4 \
--num-executors 2 \
--archives /path/to/environment.tar.gz#environment \
--properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
--conf spark.pyspark.driver.python=/path/to/python \
--conf spark.pyspark.python=environment/bin/python \
--py-files ${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,model.py \
--jars ${BIGDL_HOME}/jars/bigdl-assembly-spark_${SPARK_VERSION}-${BIGDL_VERSION}-jar-with-dependencies.jar \
train.py --cluster_mode spark-submit --remote_dir hdfs://path/to/remote/data
```
In the `spark-submit` script:
* `--master`: the spark master, set it to "yarn".
* `--deploy-mode`: set it to `client` when running programs on yarn-client mode.
* `--conf spark.pyspark.driver.python`: set the active Python location on the __Client Node__ as the driver's Python environment. You can find the location by running `which python`.
* `--conf spark.pyspark.python`: set the Python location in conda archive as each executor's Python environment.