	Run on Kubernetes Clusters
This tutorial provides a step-by-step guide on how to run BigDL-Orca programs on Kubernetes (K8s) clusters, using a PyTorch Fashion-MNIST program as a working example.
The Client Container that appears in this tutorial refers to the Docker container where you launch or submit your applications.
1. Basic Concepts
1.1 init_orca_context
A BigDL Orca program usually starts with the initialization of OrcaContext. For every BigDL Orca program, you should call init_orca_context at the beginning of the program as below:
from bigdl.orca import init_orca_context
init_orca_context(cluster_mode, master, container_image, 
                  cores, memory, num_nodes, driver_cores, driver_memory, 
                  extra_python_lib, penv_archive, conf)
In init_orca_context, you may specify necessary runtime configurations for running the example on K8s, including:
- cluster_mode: one of "k8s-client", "k8s-cluster" or "spark-submit" when you run on K8s clusters.
- master: a URL format to specify the master address of the K8s cluster.
- container_image: a string that specifies the name of the Docker container image for executors.
- cores: an integer that specifies the number of cores for each executor (default to be 2).
- memory: a string that specifies the memory for each executor (default to be "2g").
- num_nodes: an integer that specifies the number of executors (default to be 1).
- driver_cores: an integer that specifies the number of cores for the driver node (default to be 4).
- driver_memory: a string that specifies the memory for the driver node (default to be "1g").
- extra_python_lib: a string that specifies the path to extra Python packages, separated by comma (default to be None). .py, .zip or .egg files are supported.
- penv_archive: a string that specifies the path to a packed conda archive (default to be None).
- conf: a dictionary to append extra conf for Spark (default to be None).
Note:
- All arguments except cluster_mode will be ignored when using spark-submit or Kubernetes deployment to submit and run Orca programs, in which case you are supposed to specify these configurations via the submit command or the YAML file.
After Orca programs finish, you should always call stop_orca_context at the end of the program to release resources and shut down the underlying distributed runtime engine (such as Spark or Ray).
from bigdl.orca import stop_orca_context
stop_orca_context()
For more details, please see OrcaContext.
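Putting the two calls together, a minimal skeleton of an Orca program looks roughly like the following (a hedged sketch; the body in between is a placeholder for your own training code, and local mode is used here only to keep the example short):
from bigdl.orca import init_orca_context, stop_orca_context

# Initialize the distributed runtime; on K8s, pass the arguments described in Section 6 instead.
init_orca_context(cluster_mode="local", cores=2, memory="2g")

try:
    # ... define data creators, build an Orca Estimator and train here ...
    pass
finally:
    # Always release resources and shut down the underlying engine.
    stop_orca_context()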
1.2 K8s-Client & K8s-Cluster
The difference between k8s-client mode and k8s-cluster mode is where you run your Spark driver.
For k8s-client, the Spark driver runs in the client process (outside the K8s cluster), while for k8s-cluster the Spark driver runs inside the K8s cluster.
Please see more details in K8s-Cluster and K8s-Client.
1.3 Load Data from Volumes
When you are running programs on K8s, please load data from Volumes accessible to all K8s pods. We use Network File Systems (NFS) in this tutorial as an example.
After mounting the Volume (NFS) into the BigDL container (see Section 2.2 for more details), the Fashion-MNIST example can load data from NFS as local storage.
import torch
import torchvision
import torchvision.transforms as transforms
def train_data_creator(config, batch_size):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (0.5,))])
    trainset = torchvision.datasets.FashionMNIST(root="/bigdl/nfsdata/dataset", train=True, 
                                                 download=False, transform=transform)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                              shuffle=True, num_workers=0)
    return trainloader
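For context, a data creator function like the one above is typically passed to Orca's PyTorch Estimator. Below is a minimal, hedged sketch of how the Fashion-MNIST example might wire things together; model_creator and optimizer_creator are assumed to come from the model.py described in Section 5, and the tutorial's actual train.py may differ in its arguments and structure:
import torch.nn as nn

from bigdl.orca import init_orca_context, stop_orca_context
from bigdl.orca.learn.pytorch import Estimator
from bigdl.orca.learn.metrics import Accuracy

from model import model_creator, optimizer_creator  # assumed helpers from model.py (see Section 5)

# Local mode keeps the sketch short; on K8s, pass the arguments described in Section 6.
init_orca_context(cluster_mode="local", cores=2, memory="2g")

est = Estimator.from_torch(model=model_creator,
                           optimizer=optimizer_creator,
                           loss=nn.CrossEntropyLoss(),
                           metrics=[Accuracy()])

# train_data_creator is the function defined above; Orca invokes it on each worker.
est.fit(data=train_data_creator, epochs=1, batch_size=32)

stop_orca_context()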
2. Create BigDL K8s Container
2.1 Pull Docker Image
Please pull the BigDL bigdl-k8s image (built on top of Spark 3.1.3) from Docker Hub as follows:
# For the latest nightly build version
sudo docker pull intelanalytics/bigdl-k8s:latest
# For the release version, e.g. 2.1.0
sudo docker pull intelanalytics/bigdl-k8s:2.1.0
2.2 Create a K8s Client Container
Please create the Client Container using the script below:
sudo docker run -itd --net=host \
    -v /etc/kubernetes:/etc/kubernetes \
    -v /root/.kube:/root/.kube \
    intelanalytics/bigdl-k8s:latest bash
In the script:
- Please switch the tag according to the BigDL image you pull.
- --net=host: use the host network stack for the Docker container.
- -v /etc/kubernetes:/etc/kubernetes: specify the path of Kubernetes configurations to mount into the Docker container.
- -v /root/.kube:/root/.kube: specify the path of the Kubernetes installation to mount into the Docker container.
Notes:
- The Client Container contains all the required environment except K8s configurations.
- You don't need to create Spark executor containers manually; they are scheduled by K8s at runtime.
We recommend specifying more arguments when creating the Client Container:
sudo docker run -itd --net=host \
    -v /etc/kubernetes:/etc/kubernetes \
    -v /root/.kube:/root/.kube \
    -v /path/to/nfsdata:/bigdl/nfsdata \
    -e NOTEBOOK_PORT=12345 \
    -e NOTEBOOK_TOKEN="your-token" \
    -e http_proxy=http://your-proxy-host:your-proxy-port \
    -e https_proxy=https://your-proxy-host:your-proxy-port \
    -e RUNTIME_SPARK_MASTER=k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    -e RUNTIME_K8S_SERVICE_ACCOUNT=spark \
    -e RUNTIME_K8S_SPARK_IMAGE=intelanalytics/bigdl-k8s:latest \
    -e RUNTIME_PERSISTENT_VOLUME_CLAIM=nfsvolumeclaim \
    -e RUNTIME_DRIVER_HOST=x.x.x.x \
    -e RUNTIME_DRIVER_PORT=54321 \
    -e RUNTIME_EXECUTOR_INSTANCES=2 \
    -e RUNTIME_EXECUTOR_CORES=2 \
    -e RUNTIME_EXECUTOR_MEMORY=20g \
    -e RUNTIME_TOTAL_EXECUTOR_CORES=4 \
    -e RUNTIME_DRIVER_CORES=4 \
    -e RUNTIME_DRIVER_MEMORY=10g \
    intelanalytics/bigdl-k8s:latest bash
In the script:
- Please switch the tag according to the BigDL image you pull.
- Please make sure you are mounting the correct Volume path (e.g. NFS) into the container.
- -v /path/to/nfsdata:/bigdl/nfsdata: mount the NFS path on the host into the container as the specified path (e.g. "/bigdl/nfsdata").
- NOTEBOOK_PORT: an integer that specifies the port number for the Notebook (only required if you use a notebook).
- NOTEBOOK_TOKEN: a string that specifies the token for the Notebook (only required if you use a notebook).
- RUNTIME_SPARK_MASTER: a URL format that specifies the Spark master.
- RUNTIME_K8S_SERVICE_ACCOUNT: a string that specifies the service account for the driver pod.
- RUNTIME_K8S_SPARK_IMAGE: the launched K8s image for Spark.
- RUNTIME_PERSISTENT_VOLUME_CLAIM: a string that specifies the Kubernetes volumeName (e.g. "nfsvolumeclaim").
- RUNTIME_DRIVER_HOST: a URL format that specifies the driver localhost (only required by k8s-client mode).
- RUNTIME_DRIVER_PORT: a string that specifies the driver port (only required by k8s-client mode).
- RUNTIME_EXECUTOR_INSTANCES: an integer that specifies the number of executors.
- RUNTIME_EXECUTOR_CORES: an integer that specifies the number of cores for each executor.
- RUNTIME_EXECUTOR_MEMORY: a string that specifies the memory for each executor.
- RUNTIME_TOTAL_EXECUTOR_CORES: an integer that specifies the number of cores for all executors.
- RUNTIME_DRIVER_CORES: an integer that specifies the number of cores for the driver node.
- RUNTIME_DRIVER_MEMORY: a string that specifies the memory for the driver node.
2.3 Launch the K8s Client Container
Once the container is created, a containerID will be returned, with which you can enter the container using the command below:
sudo docker exec -it <containerID> bash
In the remaining part of this tutorial, you are supposed to operate and run commands inside this Client Container.
3. Prepare Environment
In the launched BigDL K8s Client Container, please set up the environment following the steps below:
- See here to install conda and prepare the Python environment.
- See here to install BigDL Orca in the created conda environment.
- You should also install all the other Python libraries that your program needs in the conda environment. torch and torchvision are needed to run the Fashion-MNIST example:
pip install torch torchvision
- For more details, please see the Python User Guide.
4. Prepare Dataset
To run the Fashion-MNIST example provided by this tutorial on K8s, you should upload the dataset to a K8s Volume (e.g. NFS).
Please download the Fashion-MNIST dataset manually on your Develop Node (where you launch the container image) and put the data into the Volume. Note that the PyTorch FashionMNIST Dataset requires the unzipped files to be located in FashionMNIST/raw/ under the root folder.
# Download the Fashion-MNIST dataset from its official repository
git clone https://github.com/zalandoresearch/fashion-mnist.git
# Move the dataset to NFS under the folder FashionMNIST/raw
mv /path/to/fashion-mnist/data/fashion /bigdl/nfsdata/dataset/FashionMNIST/raw
# Extract FashionMNIST archives
gzip -dk /bigdl/nfsdata/dataset/FashionMNIST/raw/*
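As a quick sanity check before submitting a job, you can verify the dataset layout with a short Python snippet like the one below (a hedged sketch that assumes the /bigdl/nfsdata mount path used throughout this tutorial and can be run from any environment that mounts the Volume):
import torchvision
import torchvision.transforms as transforms

# download=False makes torchvision raise an error if FashionMNIST/raw/ is missing or incomplete
dataset = torchvision.datasets.FashionMNIST(root="/bigdl/nfsdata/dataset", train=True,
                                             download=False,
                                             transform=transforms.ToTensor())
print(f"Loaded {len(dataset)} training samples from NFS")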
5. Prepare Custom Modules
Spark allows you to upload Python files (.py) and zipped Python packages (.zip) to the cluster by setting the --py-files option in Spark scripts or specifying extra_python_lib in init_orca_context.
The FashionMNIST example needs to import modules from model.py.
Note: Please upload the extra Python dependency files to the Volume (e.g. NFS) when running the program on k8s-cluster mode (see Section 6.2.2 for more details).
- When using the python command, please specify extra_python_lib in init_orca_context.
init_orca_context(..., extra_python_lib="/bigdl/nfsdata/model.py")
For more details, please see BigDL Python Dependencies.
- When using spark-submit, please specify the --py-files option in the submit command.
spark-submit
    ...
    --py-files file:///bigdl/nfsdata/model.py
    ...
For more details, please see Spark Python Dependencies.
- After uploading model.py to K8s, you can import this custom module as follows (see the sketch of a possible model.py below):
from model import model_creator, optimizer_creator
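For reference, below is a minimal, hedged sketch of what such a model.py might contain for Fashion-MNIST; the network architecture and optimizer settings here are illustrative assumptions and may differ from the tutorial's actual model.py:
# model.py (illustrative sketch)
import torch
import torch.nn as nn

def model_creator(config):
    # A small CNN for 28x28 grayscale Fashion-MNIST images with 10 classes
    model = nn.Sequential(
        nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(32 * 14 * 14, 128), nn.ReLU(),
        nn.Linear(128, 10)
    )
    return model

def optimizer_creator(model, config):
    # The learning rate could also be read from the config dict passed by Orca
    return torch.optim.Adam(model.parameters(), lr=config.get("lr", 1e-3))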
Note:
If your program depends on a nested directory of Python files, we recommend following the steps below to use a zipped package instead.
- Compress the directory into a zipped package.
zip -q -r FashionMNIST_zipped.zip FashionMNIST
- Upload the zipped package (FashionMNIST_zipped.zip) to K8s by setting --py-files or specifying extra_python_lib as discussed above.
- You can then import the custom modules from the unzipped package in your program as follows:
from FashionMNIST.model import model_creator, optimizer_creator
6. Run Jobs on K8s
In the following part, we will illustrate four ways to submit and run BigDL Orca applications on K8s.
- Use python command
- Use spark-submit
- Use Kubernetes Deployment (with Conda Archive)
- Use Kubernetes Deployment (with Integrated Image)
You can choose one of them based on your preference or cluster settings.
We provide the running command for the Fashion-MNIST example in this section.
6.1 Use python command
6.1.1 K8s-Client
Before running the example on k8s-client mode, you should:
- On the Client Container:
- Please call init_orca_context at the very beginning of each Orca program.
from bigdl.orca import init_orca_context, stop_orca_context

conf = {
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName": "nfsvolumeclaim",
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path": "/bigdl/nfsdata",
}

init_orca_context(cluster_mode="k8s-client", num_nodes=2, cores=2, memory="2g",
                  master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
                  container_image="intelanalytics/bigdl-k8s:2.1.0",
                  extra_python_lib="/path/to/model.py",
                  conf=conf)
To load data from NFS, please use the following configuration properties:
- spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName: specify the claim name of the persistentVolumeClaim with volumeName nfsvolumeclaim to mount the persistentVolume into executor pods;
- spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path: mount the volumeName nfsvolumeclaim of volumeType persistentVolumeClaim into executor pods at the NFS path specified in the value.
- Use Conda to install BigDL and the needed Python dependency libraries (see Section 3).
Please run the Fashion-MNIST example following the command below:
python train.py --cluster_mode k8s-client --remote_dir file:///bigdl/nfsdata/dataset
In the script:
- cluster_mode: set the cluster_mode in init_orca_context.
- remote_dir: directory on NFS for loading the dataset.
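The exact contents of train.py live in the tutorial repository; as a rough, hedged sketch, the two command-line flags above could be wired into init_orca_context along the following lines (the argument names, defaults and K8s settings here are assumptions for illustration):
import argparse
from bigdl.orca import init_orca_context, stop_orca_context

parser = argparse.ArgumentParser(description="Orca Fashion-MNIST example")
parser.add_argument("--cluster_mode", type=str, default="local",
                    help="The cluster mode, such as local, k8s-client, k8s-cluster or spark-submit.")
parser.add_argument("--remote_dir", type=str, required=True,
                    help="Directory on NFS (or other shared storage) for loading the dataset.")
args = parser.parse_args()

if args.cluster_mode == "k8s-client":
    init_orca_context(cluster_mode="k8s-client", num_nodes=2, cores=2, memory="2g",
                      master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
                      container_image="intelanalytics/bigdl-k8s:2.1.0")
else:
    init_orca_context(cluster_mode=args.cluster_mode)

# ... build the Estimator, load data from args.remote_dir and train ...

stop_orca_context()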
6.1.2 K8s-Cluster
Before running the example on k8s-cluster mode, you should:
- On the Client Container:
- Please call init_orca_context at the very beginning of each Orca program.
from bigdl.orca import init_orca_context, stop_orca_context

conf = {
    "spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName": "nfsvolumeclaim",
    "spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path": "/bigdl/nfsdata",
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName": "nfsvolumeclaim",
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path": "/bigdl/nfsdata",
    "spark.kubernetes.authenticate.driver.serviceAccountName": "spark",
    "spark.kubernetes.file.upload.path": "/bigdl/nfsdata/"
}

init_orca_context(cluster_mode="k8s-cluster", num_nodes=2, cores=2, memory="2g",
                  master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
                  container_image="intelanalytics/bigdl-k8s:2.1.0",
                  penv_archive="file:///bigdl/nfsdata/environment.tar.gz",
                  extra_python_lib="/bigdl/nfsdata/model.py",
                  conf=conf)
When running Orca programs in k8s-cluster mode, please use the following additional configuration properties:
- spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName: specify the claim name of the persistentVolumeClaim with volumeName nfsvolumeclaim to mount the persistentVolume into the driver pod;
- spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path: mount the volumeName nfsvolumeclaim of volumeType persistentVolumeClaim into the driver pod at the NFS path specified in the value;
- spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName: specify the claim name of the persistentVolumeClaim with volumeName nfsvolumeclaim to mount the persistentVolume into executor pods;
- spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path: mount the volumeName nfsvolumeclaim of volumeType persistentVolumeClaim into executor pods at the NFS path specified in the value;
- spark.kubernetes.authenticate.driver.serviceAccountName: the service account for the driver pod;
- spark.kubernetes.file.upload.path: the path to store files on the spark-submit side in cluster mode.
- Use Conda to install BigDL and the needed Python dependency libraries (see Section 3), then pack the currently activated Conda environment into an archive.
conda pack -o /path/to/environment.tar.gz
- On the Develop Node:
- Upload the Conda archive to NFS.
docker cp <containerID>:/opt/spark/work-dir/environment.tar.gz /bigdl/nfsdata
- Upload the example Python file to NFS.
mv /path/to/train.py /bigdl/nfsdata
- Upload the extra Python dependency file to NFS.
mv /path/to/model.py /bigdl/nfsdata
Please run the Fashion-MNIST example in the Client Container following the command below:
python /bigdl/nfsdata/train.py --cluster_mode k8s-cluster --remote_dir /bigdl/nfsdata/dataset
In the script:
- cluster_mode: set the cluster_mode in init_orca_context.
- remote_dir: directory on NFS for loading the dataset.
Note: A driver pod name will be returned when the application completes.
Please retrieve the training stats on the Develop Node following the commands below:
- Retrieve training logs from the driver pod:
kubectl logs <driver-pod-name>
- Check the pod status or get basic information about the pod:
kubectl describe pod <driver-pod-name>
6.2 Use spark-submit
6.2.1 K8s Client
Before submitting the example on k8s-client mode, you should:
- On the Client Container:
- Please call init_orca_context at the very beginning of each Orca program.
from bigdl.orca import init_orca_context

init_orca_context(cluster_mode="spark-submit")
- Use Conda to install BigDL and the needed Python dependency libraries (see Section 3), then pack the currently activated Conda environment into an archive.
conda pack -o environment.tar.gz
Please submit the example following the script below:
${SPARK_HOME}/bin/spark-submit \
    --master ${RUNTIME_SPARK_MASTER} \
    --deploy-mode client \
    --name orca-k8s-client-tutorial \
    --conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
    --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
    --conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
    --driver-cores ${RUNTIME_DRIVER_CORES} \
    --driver-memory ${RUNTIME_DRIVER_MEMORY} \
    --executor-cores ${RUNTIME_EXECUTOR_CORES} \
    --executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
    --total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
    --conf spark.pyspark.driver.python=python \
    --conf spark.pyspark.python=./env/bin/python \
    --archives /path/to/environment.tar.gz#env \
    --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
    --py-files ${BIGDL_HOME}/python/bigdl-spark_3.1.2-2.1.0-python-api.zip,/path/to/train.py,/path/to/model.py \
    --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
    --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
    --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
    --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl/nfsdata \
    local:///path/to/train.py --cluster_mode "spark-submit" --remote_dir /bigdl/nfsdata/dataset
In the script:
- master: the Spark master with a URL format;
- deploy-mode: set it to client when submitting in client mode;
- name: the name of the Spark application;
- spark.driver.host: the localhost for the driver pod (only required when submitting in client mode);
- spark.kubernetes.container.image: the BigDL Docker image you downloaded;
- spark.kubernetes.authenticate.driver.serviceAccountName: the service account for the driver pod;
- spark.pyspark.driver.python: specify the Python location in the Conda archive as the driver's Python environment;
- spark.pyspark.python: specify the Python location in the Conda archive as the executors' Python environment;
- archives: upload the packed Conda archive to K8s;
- properties-file: upload BigDL configuration properties to K8s;
- py-files: upload extra Python dependency files to K8s;
- spark.driver.extraClassPath: upload and register the BigDL jar files to the driver's classpath;
- spark.executor.extraClassPath: upload and register the BigDL jar files to the executors' classpath;
- spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName: specify the claim name of the persistentVolumeClaim with the specified volumeName to mount the persistentVolume into executor pods;
- spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path: mount the specified volumeName of volumeType persistentVolumeClaim into executor pods at the NFS path specified in the value;
- cluster_mode: the cluster_mode in init_orca_context;
- remote_dir: directory on NFS for loading the dataset.
6.2.2 K8s Cluster
Before submitting the example on k8s-cluster mode, you should:
- On the Client Container:
- Please call init_orca_context at the very beginning of each Orca program.
from bigdl.orca import init_orca_context

init_orca_context(cluster_mode="spark-submit")
- Use Conda to install BigDL and the needed Python dependency libraries (see Section 3), then pack the Conda environment into an archive.
conda pack -o environment.tar.gz
- On the Develop Node (where you launch the Client Container):
- Upload the Conda archive to NFS.
docker cp <containerID>:/path/to/environment.tar.gz /bigdl/nfsdata
- Upload the example Python file to NFS.
mv /path/to/train.py /bigdl/nfsdata
- Upload the extra Python dependencies to NFS.
mv /path/to/model.py /bigdl/nfsdata
Please run the example following the script below in the Client Container:
${SPARK_HOME}/bin/spark-submit \
    --master ${RUNTIME_SPARK_MASTER} \
    --deploy-mode cluster \
    --name orca-k8s-cluster-tutorial \
    --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
    --conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
    --archives file:///bigdl/nfsdata/environment.tar.gz#python_env \
    --conf spark.pyspark.python=python_env/bin/python \
    --conf spark.executorEnv.PYTHONHOME=python_env \
    --conf spark.kubernetes.file.upload.path=/bigdl/nfsdata \
    --executor-cores ${RUNTIME_EXECUTOR_CORES} \
    --executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
    --total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
    --driver-cores ${RUNTIME_DRIVER_CORES} \
    --driver-memory ${RUNTIME_DRIVER_MEMORY} \
    --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
    --py-files local://${BIGDL_HOME}/python/bigdl-spark_3.1.2-2.1.0-SNAPSHOT-python-api.zip,file:///bigdl/nfsdata/train.py,file:///bigdl/nfsdata/model.py \
    --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
    --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
    --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
    --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl/nfsdata \
    --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
    --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl/nfsdata \
    file:///bigdl/nfsdata/train.py --cluster_mode "spark-submit" --remote_dir /bigdl/nfsdata/dataset
In the script:
- master: the Spark master with a URL format;
- deploy-mode: set it to cluster when submitting in cluster mode;
- name: the name of the Spark application;
- spark.kubernetes.container.image: the BigDL Docker image you downloaded;
- spark.kubernetes.authenticate.driver.serviceAccountName: the service account for the driver pod;
- archives: upload the Conda archive to K8s;
- properties-file: upload BigDL configuration properties to K8s;
- py-files: upload the needed extra Python dependency files to K8s;
- spark.pyspark.python: specify the Python location in the Conda archive as the executors' Python environment;
- spark.executorEnv.PYTHONHOME: the search path of Python libraries on the executor pods;
- spark.kubernetes.file.upload.path: the path to store files on the spark-submit side in cluster mode;
- spark.driver.extraClassPath: upload and register the BigDL jar files to the driver's classpath;
- spark.executor.extraClassPath: upload and register the BigDL jar files to the executors' classpath;
- spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName: specify the claim name of the persistentVolumeClaim with the specified volumeName to mount the persistentVolume into the driver pod;
- spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path: mount the specified volumeName of volumeType persistentVolumeClaim into the driver pod at the NFS path specified in the value;
- spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName: specify the claim name of the persistentVolumeClaim with the specified volumeName to mount the persistentVolume into executor pods;
- spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path: mount the specified volumeName of volumeType persistentVolumeClaim into executor pods at the NFS path specified in the value;
- cluster_mode: specify the cluster_mode in init_orca_context;
- remote_dir: directory on NFS for loading the dataset.
Please retrieve training stats on the Develop Node following the commands below:
- Retrieve training logs from the driver pod:
kubectl logs orca-k8s-cluster-tutorial-driver
- Check the pod status or get basic information about the pod:
kubectl describe pod orca-k8s-cluster-tutorial-driver
6.3 Use Kubernetes Deployment (with Conda Archive)
BigDL supports running an application by creating a Kubernetes Deployment object, which is convenient for users who want to execute programs directly on the Develop Node.
Before submitting the Orca application, you should:
- On the Develop Node:
- Use Conda to install BigDL and the needed Python dependency libraries (see Section 3), then pack the activated Conda environment into an archive.
conda pack -o environment.tar.gz
- Upload the Conda archive, the example Python file and the extra Python dependencies to NFS.
# Upload Conda archive
cp /path/to/environment.tar.gz /bigdl/nfsdata
# Upload example Python files
cp /path/to/train.py /bigdl/nfsdata
# Upload extra Python dependencies
cp /path/to/model.py /bigdl/nfsdata
6.3.1 K8s Client
BigDL has provided an example YAML file (see orca-tutorial-client.yaml, which describes a Deployment that runs the intelanalytics/bigdl-k8s:2.1.0 image) to run the tutorial FashionMNIST program on k8s-client mode:
Notes:
- Please call init_orca_context at the very beginning of each Orca program.
from bigdl.orca import init_orca_context

init_orca_context(cluster_mode="spark-submit")
- The Spark client needs to specify spark.pyspark.driver.python, and this Python environment should be on the NFS directory.
--conf spark.pyspark.driver.python=/bigdl/nfsdata/python_env/bin/python \
# orca-tutorial-client.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: orca-pytorch-job
spec:
  template:
    spec:
      serviceAccountName: spark
      restartPolicy: Never
      hostNetwork: true
      containers:
      - name: spark-k8s-client
        image: intelanalytics/bigdl-k8s:2.1.0
        imagePullPolicy: IfNotPresent
        command: ["/bin/sh","-c"]
        args: ["
                export RUNTIME_DRIVER_HOST=$( hostname -I | awk '{print $1}' );
                ${SPARK_HOME}/bin/spark-submit \
                --master ${RUNTIME_SPARK_MASTER} \
                --deploy-mode ${SPARK_MODE} \
                --conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
                --conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
                --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
                --name orca-pytorch-tutorial \
                --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
                --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \
                --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata/ \
                --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \
                --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata/ \
                --archives file:///bigdl/nfsdata/environment.tar.gz#python_env \
                --conf spark.pyspark.driver.python=/bigdl/nfsdata/python_env/bin/python \
                --conf spark.pyspark.python=python_env/bin/python \
                --conf spark.executorEnv.PYTHONHOME=python_env \
                --conf spark.kubernetes.file.upload.path=/bigdl/nfsdata/ \
                --num-executors 2 \
                --executor-cores 16 \
                --executor-memory 50g \
                --total-executor-cores 32 \
                --driver-cores 4 \
                --driver-memory 50g \
                --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
                --py-files local://${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,local:///bigdl/nfsdata/train.py,local:///bigdl/nfsdata/model.py \
                --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
                --conf spark.kubernetes.executor.deleteOnTermination=True \
                --conf spark.sql.catalogImplementation='in-memory' \
                --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
                --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
                local:///bigdl/nfsdata/train.py
                --cluster_mode spark-submit
                --remote_dir file:///bigdl/nfsdata/dataset
                "]
        securityContext:
          privileged: true
        env:
          - name: RUNTIME_K8S_SPARK_IMAGE
            value: intelanalytics/bigdl-k8s:2.1.0
          - name: RUNTIME_SPARK_MASTER
            value: k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>
          - name: RUNTIME_DRIVER_PORT
            value: !!str 54321
          - name: SPARK_MODE
            value: client
          - name: RUNTIME_K8S_SERVICE_ACCOUNT
            value: spark
          - name: BIGDL_HOME
            value: /opt/bigdl-2.1.0
          - name: SPARK_HOME
            value: /opt/spark
          - name: SPARK_VERSION
            value: 3.1.2
          - name: BIGDL_VERSION
            value: 2.1.0
        resources:
          requests:
            cpu: 1
          limits:
            cpu: 4
        volumeMounts:
          - name: nfs-storage
            mountPath: /bigdl/nfsdata
          - name: nfs-storage
            mountPath: /root/.kube/config
            subPath: kubeconfig
      volumes:
      - name: nfs-storage
        persistentVolumeClaim:
          claimName: nfsvolumeclaim
In the YAML file:
- metadata: a nested object field that every deployment object must specify.
- name: a string that uniquely identifies this object and job.
- restartPolicy: the restart policy for all Containers within the pod. One of Always, OnFailure, Never. Defaults to Always.
- containers: a single application Container that you want to run within a pod.
- name: the name of the Container; each Container in a pod must have a unique name.
- image: the name of the Container image.
- imagePullPolicy: the image pull policy. One of Always, Never and IfNotPresent. Defaults to Always if the :latest tag is specified, or IfNotPresent otherwise.
- command: the command for the containers that run in the Pod.
- args: the arguments to submit the Spark application in the Pod. See more details of the spark-submit script in Section 6.2.1.
- securityContext: defines the security options the container should be run with.
- env: the list of environment variables to set in the Container, which will be used when submitting the application.
- env.name: the name of the environment variable.
- env.value: the value of the environment variable.
- resources: allocate resources in the cluster to each pod.
- resources.requests: describes the minimum amount of compute resources required.
- resources.limits: describes the maximum amount of compute resources allowed.
- volumeMounts: declare where to mount volumes into containers.
- name: must match the name of a Volume.
- mountPath: the path within the Container at which the volume should be mounted.
- subPath: the path within the volume from which the Container's volume should be mounted.
- volumes: specify the volumes to provide for the Pod.
- persistentVolumeClaim: mount a PersistentVolume into the Pod.
Create a Pod and run the Fashion-MNIST application based on the YAML file.
kubectl apply -f orca-tutorial-client.yaml
List all pods to find the driver pod, which will be named orca-pytorch-job-xxx.
# find out driver pod
kubectl get pods
View logs from the driver pod to retrieve the training stats.
# retrieve training logs
kubectl logs orca-pytorch-job-xxx
After the task finishes, you can delete the job with the command below.
kubectl delete job orca-pytorch-job
6.3.2 K8s Cluster
BigDL has provided an example YAML file (see orca-tutorial-cluster.yaml, which describes a Deployment that runs the intelanalytics/bigdl-k8s:2.1.0 image) to run the tutorial FashionMNIST program on k8s-cluster mode:
Notes:
- Please call init_orca_context at the very beginning of each Orca program.
from bigdl.orca import init_orca_context

init_orca_context(cluster_mode="spark-submit")
# orca-tutorial-cluster.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: orca-pytorch-job
spec:
  template:
    spec:
      serviceAccountName: spark
      restartPolicy: Never
      hostNetwork: true
      containers:
      - name: spark-k8s-cluster
        image: intelanalytics/bigdl-k8s:2.1.0
        imagePullPolicy: IfNotPresent
        command: ["/bin/sh","-c"]
        args: ["
                ${SPARK_HOME}/bin/spark-submit \
                --master ${RUNTIME_SPARK_MASTER} \
                --deploy-mode ${SPARK_MODE} \
                --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
                --name orca-pytorch-tutorial \
                --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
                --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \
                --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata/ \
                --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \
                --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata/ \
                --archives file:///bigdl/nfsdata/environment.tar.gz#python_env \
                --conf spark.pyspark.python=python_env/bin/python \
                --conf spark.executorEnv.PYTHONHOME=python_env \
                --conf spark.kubernetes.file.upload.path=/bigdl/nfsdata/ \
                --num-executors 2 \
                --executor-cores 16 \
                --executor-memory 50g \
                --total-executor-cores 32 \
                --driver-cores 4 \
                --driver-memory 50g \
                --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
                --py-files local://${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,local:///bigdl/nfsdata/train.py,local:///bigdl/nfsdata/model.py \
                --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
                --conf spark.kubernetes.executor.deleteOnTermination=True \
                --conf spark.sql.catalogImplementation='in-memory' \
                --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
                --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
                local:///bigdl/nfsdata/train.py
                --cluster_mode spark-submit
                --remote_dir file:///bigdl/nfsdata/dataset
                "]
        securityContext:
          privileged: true
        env:
          - name: RUNTIME_K8S_SPARK_IMAGE
            value: intelanalytics/bigdl-k8s:2.1.0
          - name: RUNTIME_SPARK_MASTER
            value: k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>
          - name: SPARK_MODE
            value: cluster
          - name: RUNTIME_K8S_SERVICE_ACCOUNT
            value: spark
          - name: BIGDL_HOME
            value: /opt/bigdl-2.1.0
          - name: SPARK_HOME
            value: /opt/spark
          - name: SPARK_VERSION
            value: 3.1.2
          - name: BIGDL_VERSION
            value: 2.1.0
        resources:
          requests:
            cpu: 1
          limits:
            cpu: 4
        volumeMounts:
          - name: nfs-storage
            mountPath: /bigdl/nfsdata
          - name: nfs-storage
            mountPath: /root/.kube/config
            subPath: kubeconfig
      volumes:
      - name: nfs-storage
        persistentVolumeClaim:
          claimName: nfsvolumeclaim
In the YAML file:
- restartPolicy: the restart policy for all Containers within the pod. One of Always, OnFailure, Never. Defaults to Always.
- containers: a single application Container that you want to run within a pod.
- name: the name of the Container; each Container in a pod must have a unique name.
- image: the name of the Container image.
- imagePullPolicy: the image pull policy. One of Always, Never and IfNotPresent. Defaults to Always if the :latest tag is specified, or IfNotPresent otherwise.
- command: the command for the containers that run in the Pod.
- args: the arguments to submit the Spark application in the Pod. See more details of the spark-submit script in Section 6.2.2.
- securityContext: defines the security options the container should be run with.
- env: the list of environment variables to set in the Container, which will be used when submitting the application.
- env.name: the name of the environment variable.
- env.value: the value of the environment variable.
- resources: allocate resources in the cluster to each pod.
- resources.requests: describes the minimum amount of compute resources required.
- resources.limits: describes the maximum amount of compute resources allowed.
- volumeMounts: declare where to mount volumes into containers.
- name: must match the name of a Volume.
- mountPath: the path within the Container at which the volume should be mounted.
- subPath: the path within the volume from which the Container's volume should be mounted.
- volumes: specify the volumes to provide for the Pod.
- persistentVolumeClaim: mount a PersistentVolume into the Pod.
Create a Pod and run the Fashion-MNIST application based on the YAML file.
kubectl apply -f orca-tutorial-cluster.yaml
List all pods to find the driver pod (since the client pod only returns the training status), which will be named orca-pytorch-job-driver.
# check the training status from the client pod
kubectl logs orca-pytorch-job-xxx
# find out driver pod
kubectl get pods
View logs from the driver pod to retrieve the training stats.
# retrieve training logs
kubectl logs orca-pytorch-job-driver
After the task finishes, you can delete the job with the command below.
kubectl delete job orca-pytorch-job
6.4 Use Kubernetes Deployment (with Integrated Image)
BigDL also allows users to skip preparing the environment by providing a container image (intelanalytics/bigdl-k8s:orca-2.1.0) that has all the required BigDL environments integrated.
Notes:
- The image will be pulled automatically when you deploy pods with the YAML file.
- A Conda archive is no longer needed in this method (you can skip Section 3), since BigDL has integrated the environment in intelanalytics/bigdl-k8s:orca-2.1.0.
- If you need to install extra Python libraries that may not be included in the image, please submit applications with a Conda archive instead (refer to Section 6.3).
Before submitting the example application, you should:
- On the Develop Node:
- Download the dataset and upload it to NFS.
mv /path/to/dataset /bigdl/nfsdata
- Upload the example Python file and the extra Python dependencies to NFS.
# Upload example Python files
cp /path/to/train.py /bigdl/nfsdata
# Upload extra Python dependencies
cp /path/to/model.py /bigdl/nfsdata
6.4.1 K8s Client
BigDL has provided an example YAML file (see integrate_image_client.yaml, which describes a deployment that runs the intelanalytics/bigdl-k8s:orca-2.1.0 image) to run the tutorial FashionMNIST program in k8s-client mode:
Notes:
- Please call init_orca_context at the very beginning of each Orca program.
from bigdl.orca import init_orca_context

init_orca_context(cluster_mode="spark-submit")
- The Spark client needs to specify spark.pyspark.driver.python, and this Python environment should be on the NFS directory.
--conf spark.pyspark.driver.python=/bigdl/nfsdata/orca_env/bin/python \
#integrate_image_client.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: orca-integrate-job
spec:
  template:
    spec:
      serviceAccountName: spark
      restartPolicy: Never
      hostNetwork: true
      containers:
      - name: spark-k8s-client
        image: intelanalytics/bigdl-spark-3.1.2:orca-2.1.0
        imagePullPolicy: IfNotPresent
        command: ["/bin/sh","-c"]
        args: ["
                export RUNTIME_DRIVER_HOST=$( hostname -I | awk '{print $1}' );
                ${SPARK_HOME}/bin/spark-submit \
                --master ${RUNTIME_SPARK_MASTER} \
                --deploy-mode ${SPARK_MODE} \
                --conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
                --conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
                --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
                --name orca-integrate-pod \
                --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
                --conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
                --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \
                --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata \
                --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \
                --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata \
                --conf spark.pyspark.driver.python=python \
                --conf spark.pyspark.python=/usr/local/envs/bigdl/bin/python \
                --conf spark.kubernetes.file.upload.path=/bigdl/nfsdata/ \
                --executor-cores 10 \
                --executor-memory 50g \
                --num-executors 4 \
                --total-executor-cores 40 \
                --driver-cores 10 \
                --driver-memory 50g \
                --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
                --py-files local://${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,local:///bigdl/nfsdata/train.py,local:///bigdl/nfsdata/model.py \
                --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
                --conf spark.sql.catalogImplementation='in-memory' \
                --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
                --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
                local:///bigdl/nfsdata/train.py
                --cluster_mode spark-submit
                --remote_dir file:///bigdl/nfsdata/dataset
                "]
        securityContext:
          privileged: true
        env:
          - name: RUNTIME_K8S_SPARK_IMAGE
            value: intelanalytics/bigdl-spark-3.1.2:orca-2.1.0
          - name: RUNTIME_SPARK_MASTER
            value: k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>
          - name: RUNTIME_DRIVER_PORT
            value: !!str 54321
          - name: SPARK_MODE
            value: client
          - name: RUNTIME_K8S_SERVICE_ACCOUNT
            value: spark
          - name: BIGDL_HOME
            value: /opt/bigdl-2.1.0
          - name: SPARK_HOME
            value: /opt/spark
          - name: SPARK_VERSION
            value: 3.1.2
          - name: BIGDL_VERSION
            value: 2.1.0
        resources:
          requests:
            cpu: 1
          limits:
            cpu: 4
        volumeMounts:
          - name: nfs-storage
            mountPath: /bigdl/nfsdata
          - name: nfs-storage
            mountPath: /root/.kube/config
            subPath: kubeconfig
      volumes:
      - name: nfs-storage
        persistentVolumeClaim:
          claimName: nfsvolumeclaim
In the YAML file:
- restartPolicy: the restart policy for all Containers within the pod. One of Always, OnFailure, Never. Defaults to Always.
- containers: a single application Container that you want to run within a pod.
- name: the name of the Container; each Container in a pod must have a unique name.
- image: the name of the Container image.
- imagePullPolicy: the image pull policy. One of Always, Never and IfNotPresent. Defaults to Always if the :latest tag is specified, or IfNotPresent otherwise.
- command: the command for the containers that run in the Pod.
- args: the arguments to submit the Spark application in the Pod. See more details of the spark-submit script in Section 6.2.1.
- securityContext: defines the security options the container should be run with.
- env: the list of environment variables to set in the Container, which will be used when submitting the application.
- env.name: the name of the environment variable.
- env.value: the value of the environment variable.
- resources: allocate resources in the cluster to each pod.
- resources.requests: describes the minimum amount of compute resources required.
- resources.limits: describes the maximum amount of compute resources allowed.
- volumeMounts: declare where to mount volumes into containers.
- name: must match the name of a Volume.
- mountPath: the path within the Container at which the volume should be mounted.
- subPath: the path within the volume from which the Container's volume should be mounted.
- volumes: specify the volumes to provide for the Pod.
- persistentVolumeClaim: mount a PersistentVolume into the Pod.
Create a Pod and run the Fashion-MNIST application based on the YAML file.
kubectl apply -f integrate_image_client.yaml
List all pods to find the driver pod, which will be named orca-integrate-job-xxx.
# find out driver pod
kubectl get pods
View logs from the driver pod to retrieve the training stats.
# retrieve training logs
kubectl logs orca-integrate-job-xxx
After the task finishes, you can delete the job with the command below.
kubectl delete job orca-integrate-job
6.4.2 K8s Cluster
BigDL has provided an example YAML file (see integrate_image_cluster.yaml, which describes a deployment that runs the intelanalytics/bigdl-k8s:orca-2.1.0 image) to run the tutorial FashionMNIST program on k8s-cluster mode:
Notes:
- Please call init_orca_context at the very beginning of each Orca program.
from bigdl.orca import init_orca_context

init_orca_context(cluster_mode="spark-submit")
# integrate_image_cluster.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: orca-integrate-job
spec:
  template:
    spec:
      serviceAccountName: spark
      restartPolicy: Never
      hostNetwork: true
      containers:
      - name: spark-k8s-cluster
        image: intelanalytics/bigdl-spark-3.1.2:orca-2.1.0
        imagePullPolicy: IfNotPresent
        command: ["/bin/sh","-c"]
        args: ["
                ${SPARK_HOME}/bin/spark-submit \
                --master ${RUNTIME_SPARK_MASTER} \
                --deploy-mode ${SPARK_MODE} \
                --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
                --name orca-integrate-pod \
                --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
                --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \
                --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata \
                --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \
                --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata \
                --conf spark.kubernetes.file.upload.path=/bigdl/nfsdata/ \
                --executor-cores 10 \
                --executor-memory 50g \
                --num-executors 4 \
                --total-executor-cores 40 \
                --driver-cores 10 \
                --driver-memory 50g \
                --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
                --py-files local://${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,local:///bigdl/nfsdata/train.py,local:///bigdl/nfsdata/model.py \
                --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
                --conf spark.sql.catalogImplementation='in-memory' \
                --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
                --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
                local:///bigdl/nfsdata/train.py
                --cluster_mode spark-submit
                --remote_dir file:///bigdl/nfsdata/dataset
                "]
        securityContext:
          privileged: true
        env:
          - name: RUNTIME_K8S_SPARK_IMAGE
            value: intelanalytics/bigdl-spark-3.1.2:orca-2.1.0
          - name: RUNTIME_SPARK_MASTER
            value: k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>
          - name: SPARK_MODE
            value: cluster
          - name: RUNTIME_K8S_SERVICE_ACCOUNT
            value: spark
          - name: BIGDL_HOME
            value: /opt/bigdl-2.1.0
          - name: SPARK_HOME
            value: /opt/spark
          - name: SPARK_VERSION
            value: 3.1.2
          - name: BIGDL_VERSION
            value: 2.1.0
        resources:
          requests:
            cpu: 1
          limits:
            cpu: 4
        volumeMounts:
          - name: nfs-storage
            mountPath: /bigdl/nfsdata
          - name: nfs-storage
            mountPath: /root/.kube/config
            subPath: kubeconfig
      volumes:
      - name: nfs-storage
        persistentVolumeClaim:
          claimName: nfsvolumeclaim
In the YAML file:
- restartPolicy: the restart policy for all Containers within the pod. One of Always, OnFailure, Never. Defaults to Always.
- containers: a single application Container that you want to run within a pod.
- name: the name of the Container; each Container in a pod must have a unique name.
- image: the name of the Container image.
- imagePullPolicy: the image pull policy. One of Always, Never and IfNotPresent. Defaults to Always if the :latest tag is specified, or IfNotPresent otherwise.
- command: the command for the containers that run in the Pod.
- args: the arguments to submit the Spark application in the Pod. See more details of the spark-submit script in Section 6.2.2.
- securityContext: defines the security options the container should be run with.
- env: the list of environment variables to set in the Container, which will be used when submitting the application.
- env.name: the name of the environment variable.
- env.value: the value of the environment variable.
- resources: allocate resources in the cluster to each pod.
- resources.requests: describes the minimum amount of compute resources required.
- resources.limits: describes the maximum amount of compute resources allowed.
- volumeMounts: declare where to mount volumes into containers.
- name: must match the name of a Volume.
- mountPath: the path within the Container at which the volume should be mounted.
- subPath: the path within the volume from which the Container's volume should be mounted.
- volumes: specify the volumes to provide for the Pod.
- persistentVolumeClaim: mount a PersistentVolume into the Pod.
Create a Pod and run the Fashion-MNIST application based on the YAML file.
kubectl apply -f integrate_image_cluster.yaml
List all pods to find the driver pod (since the client pod only returns the training status), which will be named orca-integrate-job-driver.
# check the training status from the client pod
kubectl logs orca-integrate-job-xxx
# find out driver pod
kubectl get pods
View logs from the driver pod to retrieve the training stats.
# retrieve training logs
kubectl logs orca-integrate-job-driver
After the task finishes, you can delete the job with the command below.
kubectl delete job orca-integrate-job