Run on Kubernetes Clusters

This tutorial provides a step-by-step guide on how to run BigDL-Orca programs on Kubernetes (K8s) clusters, using a PyTorch Fashion-MNIST program as a working example.

1. Key Concepts

1.1 Init_orca_context

A BigDL Orca program usually starts with the initialization of OrcaContext. For every BigDL Orca program, you should call init_orca_context at the beginning of the program as below:

from bigdl.orca import init_orca_context

init_orca_context(cluster_mode, master, container_image, 
                  cores, memory, num_nodes, driver_cores, driver_memory, 
                  extra_python_lib, penv_archive, jars, conf)

In init_orca_context, you may specify necessary runtime configurations for running the example on K8s, including:

  • cluster_mode: a String that specifies the underlying cluster; valid values include "local", "yarn-client", "yarn-cluster", "k8s-client", "k8s-cluster", "bigdl-submit", "spark-submit", etc.
  • master: a URL format that specifies the master address of the K8s cluster.
  • container_image: a String that specifies the name of the Docker container image for executors.
  • cores: an Integer that specifies the number of cores for each executor (default to be 2).
  • memory: a String that specifies the memory for each executor (default to be "2g").
  • num_nodes: an Integer that specifies the number of executors (default to be 1).
  • driver_cores: an Integer that specifies the number of cores for the driver node (default to be 4).
  • driver_memory: a String that specifies the memory for the driver node (default to be "1g").
  • extra_python_lib: a String that specifies the path to extra Python packages, one of .py, .zip or .egg files (default to be None).
  • penv_archive: a String that specifies the path to a packed Conda archive (default to be None).
  • jars: a String that specifies the path to needed jar files (default to be None).
  • conf: a Key-Value format to append extra configurations for Spark (default to be None).

Note:

  • All arguments except cluster_mode will be ignored when you use spark-submit or Kubernetes deployment to submit and run Orca programs.
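
For instance, a typical call for running this tutorial's example in k8s-client mode might look like the sketch below (the API server address is a placeholder and the resource sizes are only illustrative; see Section 6.1 for complete examples):

init_orca_context(cluster_mode="k8s-client", num_nodes=2, cores=2, memory="2g",
                  master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
                  container_image="intelanalytics/bigdl-k8s:2.1.0",
                  extra_python_lib="/path/to/model.py")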

After the Orca program finishes, you should call stop_orca_context at the end of the program to release resources and shut down the underlying distributed runtime engine (such as Spark or Ray).

from bigdl.orca import stop_orca_context

stop_orca_context()

For more details, please see OrcaContext.

1.2 K8s Client & Cluster

The difference between k8s-client and k8s-cluster is where you run your Spark driver.

For k8s-client, the Spark driver runs in the client process (outside the K8s cluster), while for k8s-cluster the Spark driver runs inside the K8s cluster.

Please see more details in K8s-Cluster and K8s-Client.

1.3 Load Data from Network File Systems (NFS)

When you run programs on K8s, please load data from volumes; this tutorial uses NFS as an example.

After mounting the volume (NFS) into the BigDL container (see Section 2.2), the Fashion-MNIST example can load data from NFS as shown below.

import torch
import torchvision
import torchvision.transforms as transforms
from bigdl.orca.data.file import get_remote_file_to_local

def train_data_creator(config, batch_size):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (0.5,))])

    # Copy the dataset from the mounted NFS volume to a local directory on each executor
    get_remote_file_to_local(remote_path="/path/to/nfsdata", local_path="/tmp/dataset")

    # Load the Fashion-MNIST training split from the local copy
    trainset = torchvision.datasets.FashionMNIST(root="/tmp/dataset", train=True,
                                                 download=False, transform=transform)

    trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                              shuffle=True, num_workers=0)
    return trainloader
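
If your program also evaluates the model, you may define a validation data creator in the same way. The sketch below simply mirrors train_data_creator with train=False and uses the same placeholder paths; it is an illustration rather than part of the original example.

def test_data_creator(config, batch_size):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (0.5,))])

    # Copy the dataset from NFS to a local directory, same as in train_data_creator
    get_remote_file_to_local(remote_path="/path/to/nfsdata", local_path="/tmp/dataset")

    # train=False loads the Fashion-MNIST test split
    testset = torchvision.datasets.FashionMNIST(root="/tmp/dataset", train=False,
                                                download=False, transform=transform)
    testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                             shuffle=False, num_workers=0)
    return testloader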

2. Create & Launch BigDL K8s Container

2.1 Pull Docker Image

Please pull the BigDL 2.1.0 bigdl-k8s image from Docker Hub as follows:

sudo docker pull intelanalytics/bigdl-k8s:2.1.0

Note:

  • If you need the nightly built BigDL, please pull the latest image as below:
    sudo docker pull intelanalytics/bigdl-k8s:latest
    
  • Both the 2.1.0 and latest BigDL images are built on top of Spark 3.1.2.

2.2 Create a K8s Client Container

Please create the Client Container using the script below:

sudo docker run -itd --net=host \
    -v /etc/kubernetes:/etc/kubernetes \
    -v /root/.kube:/root/.kube \
    intelanalytics/bigdl-k8s:2.1.0 bash

In the script:

  • --net=host: use the host network stack for the Docker container;
  • -v /etc/kubernetes:/etc/kubernetes: mount the Kubernetes configuration directory into the container;
  • -v /root/.kube:/root/.kube: mount the kubeconfig directory into the container;

Notes:

  • Please switch the tag from 2.1.0 to latest if you pull the latest BigDL image.
  • The Client Container contains all the required environment except K8s configs.
  • You don't need to create an Executor Container manually; it is scheduled by K8s at runtime.

We recommend specifying more arguments when creating the container:

sudo docker run -itd --net=host \
    -v /etc/kubernetes:/etc/kubernetes \
    -v /root/.kube:/root/.kube \
    -v /path/to/nfsdata:/bigdl/nfsdata \
    -e NOTEBOOK_PORT=12345 \
    -e NOTEBOOK_TOKEN="your-token" \
    -e http_proxy=http://your-proxy-host:your-proxy-port \
    -e https_proxy=https://your-proxy-host:your-proxy-port \
    -e RUNTIME_SPARK_MASTER=k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    -e RUNTIME_K8S_SERVICE_ACCOUNT=spark \
    -e RUNTIME_K8S_SPARK_IMAGE=intelanalytics/bigdl-k8s:2.1.0 \
    -e RUNTIME_PERSISTENT_VOLUME_CLAIM=nfsvolumeclaim \
    -e RUNTIME_DRIVER_HOST=x.x.x.x \
    -e RUNTIME_DRIVER_PORT=54321 \
    -e RUNTIME_EXECUTOR_INSTANCES=2 \
    -e RUNTIME_EXECUTOR_CORES=2 \
    -e RUNTIME_EXECUTOR_MEMORY=20g \
    -e RUNTIME_TOTAL_EXECUTOR_CORES=4 \
    -e RUNTIME_DRIVER_CORES=4 \
    -e RUNTIME_DRIVER_MEMORY=10g \
    intelanalytics/bigdl-k8s:2.1.0 bash 

Notes:

  • Please make sure to mount the correct volume path (e.g. NFS) into the container.
  • Please switch the 2.1.0 tag to latest if you pull the latest BigDL image.

In the script:

  • --net=host: use the host network stack for the Docker container;
  • /etc/kubernetes:/etc/kubernetes: mount the Kubernetes configuration directory into the container;
  • /root/.kube:/root/.kube: mount the kubeconfig directory into the container;
  • /path/to/nfsdata:/bigdl/nfsdata: mount the NFS path on the host into the container at the specified path;
  • NOTEBOOK_PORT: an Integer that specifies port number for Notebook (only required by notebook);
  • NOTEBOOK_TOKEN: a String that specifies the token for Notebook (only required by notebook);
  • RUNTIME_SPARK_MASTER: a URL format that specifies the Spark master;
  • RUNTIME_K8S_SERVICE_ACCOUNT: a String that specifies the service account for driver pod;
  • RUNTIME_K8S_SPARK_IMAGE: the launched K8s image;
  • RUNTIME_PERSISTENT_VOLUME_CLAIM: a String that specifies the Kubernetes volumeName;
  • RUNTIME_DRIVER_HOST: the host or IP address of the driver (only required by client mode);
  • RUNTIME_DRIVER_PORT: a String that specifies the driver port (only required by client mode);
  • RUNTIME_EXECUTOR_INSTANCES: an Integer that specifies the number of executors;
  • RUNTIME_EXECUTOR_CORES: an Integer that specifies the number of cores for each executor;
  • RUNTIME_EXECUTOR_MEMORY: a String that specifies the memory for each executor;
  • RUNTIME_TOTAL_EXECUTOR_CORES: an Integer that specifies the number of cores for all executors;
  • RUNTIME_DRIVER_CORES: an Integer that specifies the number of cores for the driver node;
  • RUNTIME_DRIVER_MEMORY: a String that specifies the memory for the driver node;

2.3 Launch the K8s Client Container

Once the container is created, the docker run command returns a container ID. Please enter the launched container using the command below:

sudo docker exec -it <containerID> bash

3. Prepare Environment

In the launched BigDL K8s Container, please set up the environment following the steps below:

3.1 Install Python Libraries

3.1.1 Install Conda

Please use Conda to prepare the Python environment in the Client Container (where you submit applications). You can download and install Conda following the Conda User Guide or by executing the commands below.

# Download Anaconda installation script 
wget -P /tmp https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh

# Execute the script to install conda
bash /tmp/Anaconda3-2020.02-Linux-x86_64.sh

# Please type this command in your terminal to activate Conda environment
source ~/.bashrc

3.1.2 Use Conda to Install BigDL and Other Python Libraries

Create a Conda environment and install BigDL and all needed Python libraries in the activated Conda environment:

# "env" is conda environment name, you can use any name you like.
# Please change Python version to 3.8 if you need a Python 3.8 environment.
conda create -n env python=3.7 
conda activate env

Please install the 2.1.0 release version of BigDL (built on top of Spark 3.1.2) as follows:

pip install bigdl-spark3

If you are using the latest BigDL image, please install the nightly build of BigDL as follows:

pip install --pre --upgrade bigdl-spark3

Please install torch and torchvision to run the Fashion-MNIST example:

pip install torch torchvision

Notes:

  • Using Conda to install BigDL will automatically install dependent libraries including pyspark==3.1.2.

  • You can install BigDL Orca built on top of Spark 3.1.2 as follows:

    # Install the latest release version
    pip install bigdl-orca-spark3
    
    # Install the latest nightly build version
    pip install --pre --upgrade bigdl-orca-spark3
    
    # You need to install torch and torchvision manually
    pip install torch torchvision
    

    Installing bigdl-orca-spark3 will automatically install pyspark==3.1.2.

  • You also need to install any additional python libraries that your application depends on in the Conda environment.

  • You only need to install the required Python libraries using Conda, since the BigDL K8s container has already set up JAVA_HOME, BIGDL_HOME, SPARK_HOME, SPARK_VERSION, etc.

Please see more details in Python User Guide.
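
To quickly verify the installation inside the activated Conda environment, you could run a short check like the sketch below (it only confirms that the key packages can be imported and prints their versions):

# Verify that the key Python packages are importable
import pyspark
import torch
import torchvision
import bigdl.orca

print("pyspark:", pyspark.__version__)
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)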

4. Prepare Dataset

To run the example provided by this tutorial on K8s, you should upload the dataset to a K8s volume (e.g. NFS).

Please download the Fashion-MNIST dataset manually on your Develop Node (where you launch the container image).

# Clone the official Fashion-MNIST dataset repository
git clone https://github.com/zalandoresearch/fashion-mnist.git

# Move the dataset to NFS
mv /path/to/fashion-mnist/data/fashion /bigdl/nfsdata/dataset/FashionMNIST/raw

# Extract FashionMNIST archives
gzip -dk /bigdl/nfsdata/dataset/FashionMNIST/raw/*

Note: PyTorch requires the dataset root directory to contain FashionMNIST/raw/train-images-idx3-ubyte and FashionMNIST/raw/t10k-images-idx3-ubyte.
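
As a quick sanity check, you could verify this layout on NFS with a short Python snippet (a sketch; /bigdl/nfsdata/dataset is the mount path used in this tutorial):

import os

# Files that torchvision's FashionMNIST loader expects under <root>/FashionMNIST/raw
required = ["train-images-idx3-ubyte", "train-labels-idx1-ubyte",
            "t10k-images-idx3-ubyte", "t10k-labels-idx1-ubyte"]
raw_dir = "/bigdl/nfsdata/dataset/FashionMNIST/raw"
missing = [f for f in required if not os.path.exists(os.path.join(raw_dir, f))]
print("Missing files:", missing if missing else "none")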

5. Prepare Custom Modules

Spark allows uploading Python files (.py) and zipped Python packages (.zip) to the executors by setting the --py-files option in Spark scripts or extra_python_lib in init_orca_context.

The FashionMNIST example needs to import modules from model.py, sketched below.
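
For reference, a minimal sketch of what model.py might define is shown below. The network itself is only an illustration and may differ from the tutorial's actual model; the model_creator and optimizer_creator signatures follow the Orca creator-function convention.

# model.py (illustrative sketch)
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1),
                                      nn.ReLU(),
                                      nn.MaxPool2d(2))
        self.classifier = nn.Linear(32 * 14 * 14, 10)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

def model_creator(config):
    # Called on each worker to build the model
    return SimpleCNN()

def optimizer_creator(model, config):
    # Called on each worker to build the optimizer for the created model
    return torch.optim.SGD(model.parameters(), lr=config.get("lr", 0.001))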

Note: Please upload the extra Python dependency files to NFS when running the program on k8s-cluster mode (see more details in Section 6.2.2).

  • When using python command, please specify extra_python_lib in init_orca_context.

    import os
    from bigdl.orca import init_orca_context, stop_orca_context
    from model import model_creator, optimizer_creator
    
    conf={
          "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName":"nfsvolumeclaim",
          "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path": "/bigdl/nfsdata",
        }
    
    init_orca_context(cluster_mode="k8s-client", num_nodes=2, cores=2, memory="2g",
                      master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
                      container_image="intelanalytics/bigdl-k8s:latest",
                      extra_python_lib="/path/to/model.py", conf=conf)
    

    Please see more details in Orca Document.

  • When using spark-submit script, please specify --py-files option in the script.

    --py-files ${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,file:///bigdl/nfsdata/model.py
    

    Then import custom modules:

    from bigdl.orca import init_orca_context, stop_orca_context
    from model import model_creator, optimizer_creator
    
    init_orca_context(cluster_mode="spark-submit")
    

    Please see more details in Spark Document.

Notes:

  • You could follow the steps below to use a zipped package instead (recommended if your program depends on a nested directory of Python files):
    1. Compress the directory into a Zipped Package.
      zip -q -r FashionMNIST_zipped.zip FashionMNIST
      
    2. Please follow the same method as above (using .py files) to upload the zipped package (FashionMNIST_zipped.zip) to K8s.
    3. You should import custom modules from the unzipped file as below.
      from FashionMNIST.model import model_creator, optimizer_creator
      

6. Run Jobs on K8s

In the following part, we will show you how to submit and run the Orca example on K8s:

  • Use python command
  • Use spark-submit script
  • Use Kubernetes Deployment (with Conda Archive)
  • Use Kubernetes Deployment (with Integrated Image)

6.1 Use python command

6.1.1 K8s-Client

Before running the example on k8s-client mode, you should:

  • On the Client Container:
    1. Please call init_orca_context at the very beginning of each Orca program.

      from bigdl.orca import init_orca_context, stop_orca_context
      
      conf={
          "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName":"nfsvolumeclaim",
          "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path": "/bigdl/nfsdata",
          }
      
      init_orca_context(cluster_mode="k8s-client", num_nodes=2, cores=2, memory="2g",
                      master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
                      container_image="intelanalytics/bigdl-k8s:2.1.0",
                      extra_python_lib="/path/to/model.py", conf=conf)
      

      To load data from NFS, please use the following configuration properties:

      • spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName: specify the claim name of the persistentVolumeClaim with volumeName nfsvolumeclaim to mount the persistentVolume into executor pods;
      • spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path: add the volumeName nfsvolumeclaim of the volumeType persistentVolumeClaim to executor pods on the NFS path specified in value;
    2. Use Conda to install BigDL and the needed Python dependency libraries (see Section 3).

Please run the Fashion-MNIST example following the command below:

python train.py --cluster_mode k8s-client --remote_dir file:///bigdl/nfsdata/dataset

In the script:

  • cluster_mode: set the cluster_mode in init_orca_context.
  • remote_dir: directory on NFS for loading the dataset.
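
For reference, the overall flow of train.py might look like the sketch below. It assumes the Orca PyTorch Estimator API together with the creator functions described in Section 1.3 and Section 5; the hyper-parameters and the way remote_dir is passed to the data creator are illustrative rather than the tutorial's exact code.

# train.py -- an illustrative sketch of the example's main flow
import argparse
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from bigdl.orca import init_orca_context, stop_orca_context
from bigdl.orca.learn.pytorch import Estimator
from bigdl.orca.learn.metrics import Accuracy
from bigdl.orca.data.file import get_remote_file_to_local
from model import model_creator, optimizer_creator

def train_data_creator(config, batch_size):
    # Same pattern as Section 1.3, taking the dataset location from the config dict
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (0.5,))])
    get_remote_file_to_local(remote_path=config["remote_dir"], local_path="/tmp/dataset")
    trainset = torchvision.datasets.FashionMNIST(root="/tmp/dataset", train=True,
                                                 download=False, transform=transform)
    return torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                       shuffle=True, num_workers=0)

parser = argparse.ArgumentParser()
parser.add_argument("--cluster_mode", type=str, default="k8s-client")
parser.add_argument("--remote_dir", type=str, help="directory on NFS for loading the dataset")
args = parser.parse_args()

conf = {"spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName": "nfsvolumeclaim",
        "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path": "/bigdl/nfsdata"}

if args.cluster_mode == "spark-submit":
    # All other configurations come from the spark-submit script (see Section 6.2)
    init_orca_context(cluster_mode="spark-submit")
else:
    init_orca_context(cluster_mode=args.cluster_mode, num_nodes=2, cores=2, memory="2g",
                      master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
                      container_image="intelanalytics/bigdl-k8s:2.1.0",
                      extra_python_lib="/path/to/model.py", conf=conf)

# Distributed training with the Orca PyTorch Estimator
est = Estimator.from_torch(model=model_creator, optimizer=optimizer_creator,
                           loss=nn.CrossEntropyLoss(), metrics=[Accuracy()],
                           config={"lr": 0.001, "remote_dir": args.remote_dir})
est.fit(data=train_data_creator, epochs=2, batch_size=256)

stop_orca_context()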

6.1.2 K8s-Cluster

Before running the example on k8s-cluster mode, you should:

  • On the Client Container:
    1. Please call init_orca_context at the very beginning of each Orca program.

      from bigdl.orca import init_orca_context, stop_orca_context
      
      conf={
            "spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName":"nfsvolumeclaim",
            "spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path": "/bigdl/nfsdata",
            "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName":"nfsvolumeclaim",
            "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path": "/bigdl/nfsdata",
            "spark.kubernetes.authenticate.driver.serviceAccountName":"spark",
            "spark.kubernetes.file.upload.path":"/bigdl/nfsdata/"
            }
      
      init_orca_context(cluster_mode="k8s-cluster", num_nodes=2, cores=2, memory="2g",
                        master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>", 
                        container_image="intelanalytics/bigdl-k8s:2.1.0",
                        penv_archive="file:///bigdl/nfsdata/environment.tar.gz",
                        extra_python_lib="/bigdl/nfsdata/model.py", conf=conf)
      

      When running Orca programs on k8s-cluster mode, please use the following additional configuration properties:

      • spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName: specify the claim name of the persistentVolumeClaim with volumeName nfsvolumeclaim to mount the persistentVolume into the driver pod;
      • spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path: add the volumeName nfsvolumeclaim of the volumeType persistentVolumeClaim to the driver pod on the NFS path specified in value;
      • spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName: specify the claim name of the persistentVolumeClaim with volumeName nfsvolumeclaim to mount the persistentVolume into executor pods;
      • spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path: add the volumeName nfsvolumeclaim of the volumeType persistentVolumeClaim to executor pods on the NFS path specified in value;
      • spark.kubernetes.authenticate.driver.serviceAccountName: the service account for driver pod;
      • spark.kubernetes.file.upload.path: the path to store files at spark submit side in cluster mode;
    2. Use Conda to install BigDL and the needed Python dependency libraries (see Section 3), then pack the currently activated Conda environment into an archive.

      conda pack -o /path/to/environment.tar.gz
      
  • On the Develop Node:
    1. Upload the Conda archive to NFS.
      docker cp <containerID>:/opt/spark/work-dir/environment.tar.gz /bigdl/nfsdata
      
    2. Upload the example Python file to NFS.
      mv /path/to/train.py /bigdl/nfsdata
      
    3. Upload the extra Python dependency file to NFS.
      mv /path/to/model.py /bigdl/nfsdata
      

Please run the Fashion-MNIST example in Client Container following the command below:

python /bigdl/nfsdata/train.py --cluster_mode k8s-cluster --remote_dir /bigdl/nfsdata/dataset

In the script:

  • cluster_mode: set the cluster_mode in init_orca_context.
  • remote_dir: directory on NFS for loading the dataset.

Note: The driver pod name will be returned when the application completes.

Please retrieve the training stats on the Develop Node using the commands below:

  • Retrieve training logs from the driver pod:

    kubectl logs <driver-pod-name>
    
  • Check the pod status or get basic information about the pod:

    kubectl describe pod <driver-pod-name>
    

6.2 Use spark-submit Script

6.2.1 K8s Client

Before submitting the example on k8s-client mode, you should:

  • On the Client Container:
    1. Please call init_orca_context at the very beginning of each Orca program.
      from bigdl.orca import init_orca_context
      
      init_orca_context(cluster_mode="spark-submit")
      
    2. Use Conda to install BigDL and the needed Python dependency libraries (see Section 3), then pack the currently activated Conda environment into an archive.
      conda pack -o environment.tar.gz
      

Please submit the example following the script below:

${SPARK_HOME}/bin/spark-submit \
    --master ${RUNTIME_SPARK_MASTER} \
    --deploy-mode client \
    --name orca-k8s-client-tutorial \
    --conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
    --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
    --conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
    --driver-cores ${RUNTIME_DRIVER_CORES} \
    --driver-memory ${RUNTIME_DRIVER_MEMORY} \
    --executor-cores ${RUNTIME_EXECUTOR_CORES} \
    --executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
    --total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
    --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
    --conf spark.pyspark.driver.python=python \
    --conf spark.pyspark.python=./env/bin/python \
    --archives /path/to/environment.tar.gz#env \
    --py-files ${BIGDL_HOME}/python/bigdl-spark_3.1.2-2.1.0-python-api.zip,/path/to/train.py,/path/to/model.py \
    --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
    --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
    --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
    --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl/nfsdata \
    local:///path/to/train.py --cluster_mode "spark-submit" --remote_dir /bigdl/nfsdata/dataset

In the script:

  • master: the spark master with a URL format;
  • deploy-mode: set it to client when submitting in client mode;
  • name: the name of Spark application;
  • spark.driver.host: the host or IP address of the driver (only required when submitting in client mode);
  • spark.kubernetes.container.image: the BigDL docker image you downloaded;
  • spark.kubernetes.authenticate.driver.serviceAccountName: the service account for driver pod;
  • spark.pyspark.driver.python: specify the Python location in Conda archive as driver's Python environment;
  • spark.pyspark.python: specify the Python location in Conda archive as executors' Python environment;
  • archives: upload the packed Conda archive to K8s;
  • properties-file: upload BigDL configuration properties to K8s;
  • py-files: upload extra Python dependency files to K8s;
  • spark.driver.extraClassPath: upload and register the BigDL jars files to the driver's classpath;
  • spark.executor.extraClassPath: upload and register the BigDL jars files to the executors' classpath;
  • spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName: specify the claim name of the persistentVolumeClaim with the specified volumeName to mount the persistentVolume into executor pods;
  • spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path: add the specified volumeName of the volumeType persistentVolumeClaim to executor pods on the NFS path specified in value;
  • cluster_mode: the cluster_mode in init_orca_context;
  • remote_dir: directory on NFS for loading the dataset.

6.2.2 K8s Cluster

Before submitting the example on k8s-cluster mode, you should:

  • On the Client Container:

    1. Please call init_orca_context at the very beginning of each Orca program.
      from bigdl.orca import init_orca_context
      
      init_orca_context(cluster_mode="spark-submit")
      
    2. Use Conda to install BigDL and the needed Python dependency libraries (see Section 3), then pack the Conda environment into an archive.
      conda pack -o environment.tar.gz
      
  • On the Develop Node (where you launch the Client Container):

    1. Upload Conda archive to NFS.

      docker cp <containerID>:/path/to/environment.tar.gz /bigdl/nfsdata
      
    2. Upload the example python files to NFS.

      mv /path/to/train.py /bigdl/nfsdata
      
    3. Upload the extra Python dependencies to NFS.

      mv /path/to/model.py /bigdl/nfsdata
      

Please run the example following the script below in the Client Container:

${SPARK_HOME}/bin/spark-submit \
    --master ${RUNTIME_SPARK_MASTER} \
    --deploy-mode cluster \
    --name orca-k8s-cluster-tutorial \
    --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
    --conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
    --archives file:///bigdl/nfsdata/environment.tar.gz#python_env \
    --conf spark.pyspark.python=python_env/bin/python \
    --conf spark.executorEnv.PYTHONHOME=python_env \
    --conf spark.kubernetes.file.upload.path=/bigdl/nfsdata \
    --executor-cores ${RUNTIME_EXECUTOR_CORES} \
    --executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
    --total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
    --driver-cores ${RUNTIME_DRIVER_CORES} \
    --driver-memory ${RUNTIME_DRIVER_MEMORY} \
    --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
    --py-files local://${BIGDL_HOME}/python/bigdl-spark_3.1.2-2.1.0-SNAPSHOT-python-api.zip,file:///bigdl/nfsdata/train.py,file:///bigdl/nfsdata/model.py \
    --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
    --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
    --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
    --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl/nfsdata \
    --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
    --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl/nfsdata \
    file:///bigdl/nfsdata/train.py --cluster_mode "spark-submit" --remote_dir /bigdl/nfsdata/dataset

In the script:

  • master: the spark master with a URL format;
  • deploy-mode: set it to cluster when submitting in cluster mode;
  • name: the name of Spark application;
  • spark.kubernetes.container.image: the BigDL docker image you downloaded;
  • spark.kubernetes.authenticate.driver.serviceAccountName: the service account for driver pod;
  • archives: upload the Conda archive to K8s;
  • properties-file: upload BigDL configuration properties to K8s;
  • py-files: upload needed extra Python dependency files to K8s;
  • spark.pyspark.python: specify the Python location in Conda archive as executors' Python environment;
  • spark.executorEnv.PYTHONHOME: the search path of Python libraries on executor pod;
  • spark.kubernetes.file.upload.path: the path to store files at spark submit side in cluster mode;
  • spark.driver.extraClassPath: upload and register the BigDL jars files to the driver's classpath;
  • spark.executor.extraClassPath: upload and register the BigDL jars files to the executors' classpath;
  • spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName: specify the claim name of the persistentVolumeClaim with the specified volumeName to mount the persistentVolume into the driver pod;
  • spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path: add the specified volumeName of the volumeType persistentVolumeClaim to the driver pod on the NFS path specified in value;
  • spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName: specify the claim name of the persistentVolumeClaim with the specified volumeName to mount the persistentVolume into executor pods;
  • spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path: add the specified volumeName of the volumeType persistentVolumeClaim to executor pods on the NFS path specified in value;
  • cluster_mode: specify the cluster_mode in init_orca_context;
  • remote_dir: directory on NFS for loading the dataset.

Please retrieve training stats on the Develop Node following the commands below:

  • Retrieve training logs from the driver pod:

    kubectl logs `orca-k8s-cluster-tutorial-driver`
    
  • Check the pod status or get basic information about the pod:

    kubectl describe pod `orca-k8s-cluster-tutorial-driver`
    

6.3 Use Kubernetes Deployment (with Conda Archive)

For users who want to execute programs directly on the Develop Node, BigDL supports running an application by creating a Kubernetes Deployment object.

Before submitting the Orca application, you should:

  • On the Develop Node
    1. Use Conda to install BigDL and the needed Python dependency libraries (see Section 3), then pack the activated Conda environment into an archive.
      conda pack -o environment.tar.gz
      
    2. Upload Conda archive, example Python files and extra Python dependencies to NFS.
      # Upload Conda archive
      cp /path/to/environment.tar.gz /bigdl/nfsdata
      
      # Upload example Python files
      cp /path/to/train.py /bigdl/nfsdata
      
      # Upload extra Python dependencies
      cp /path/to/model.py /bigdl/nfsdata
      

6.3.1 K8s Client

BigDL has provided an example YAML file (see orca-tutorial-client.yaml, which describes a Deployment that runs the intelanalytics/bigdl-k8s:2.1.0 image) to run the tutorial FashionMNIST program on k8s-client mode:

Notes:

  • Please call init_orca_context at the very beginning of each Orca program.
    from bigdl.orca import init_orca_context
    
    init_orca_context(cluster_mode="spark-submit")
    
  • The Spark client needs to specify spark.pyspark.driver.python; this Python environment should be located on the NFS directory.
    --conf spark.pyspark.driver.python=/bigdl/nfsdata/python_env/bin/python \
    
# orca-tutorial-client.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: orca-pytorch-job
spec:
  template:
    spec:
      serviceAccountName: spark
      restartPolicy: Never
      hostNetwork: true
      containers:
      - name: spark-k8s-client
        image: intelanalytics/bigdl-k8s:2.1.0
        imagePullPolicy: IfNotPresent
        command: ["/bin/sh","-c"]
        args: ["
                export RUNTIME_DRIVER_HOST=$( hostname -I | awk '{print $1}' );
                ${SPARK_HOME}/bin/spark-submit \
                --master ${RUNTIME_SPARK_MASTER} \
                --deploy-mode ${SPARK_MODE} \
                --conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
                --conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
                --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
                --name orca-pytorch-tutorial \
                --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
                --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \
                --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata/ \
                --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \
                --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata/ \
                --archives file:///bigdl/nfsdata/environment.tar.gz#python_env \
                --conf spark.pyspark.driver.python=/bigdl/nfsdata/python_env/bin/python \
                --conf spark.pyspark.python=python_env/bin/python \
                --conf spark.executorEnv.PYTHONHOME=python_env \
                --conf spark.kubernetes.file.upload.path=/bigdl/nfsdata/ \
                --num-executors 2 \
                --executor-cores 16 \
                --executor-memory 50g \
                --total-executor-cores 32 \
                --driver-cores 4 \
                --driver-memory 50g \
                --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
                --py-files local://${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,local:///bigdl/nfsdata/train.py,local:///bigdl/nfsdata/model.py \
                --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
                --conf spark.kubernetes.executor.deleteOnTermination=True \
                --conf spark.sql.catalogImplementation='in-memory' \
                --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
                --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
                local:///bigdl/nfsdata/train.py
                --cluster_mode spark-submit
                --remote_dir file:///bigdl/nfsdata/dataset
                "]
        securityContext:
          privileged: true
        env:
          - name: RUNTIME_K8S_SPARK_IMAGE
            value: intelanalytics/bigdl-k8s:2.1.0
          - name: RUNTIME_SPARK_MASTER
            value: k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>
          - name: RUNTIME_DRIVER_PORT
            value: !!str 54321
          - name: SPARK_MODE
            value: client
          - name: RUNTIME_K8S_SERVICE_ACCOUNT
            value: spark
          - name: BIGDL_HOME
            value: /opt/bigdl-2.1.0
          - name: SPARK_HOME
            value: /opt/spark
          - name: SPARK_VERSION
            value: 3.1.2
          - name: BIGDL_VERSION
            value: 2.1.0
        resources:
          requests:
            cpu: 1
          limits:
            cpu: 4
        volumeMounts:
          - name: nfs-storage
            mountPath: /bigdl/nfsdata
          - name: nfs-storage
            mountPath: /root/.kube/config
            subPath: kubeconfig
      volumes:
      - name: nfs-storage
        persistentVolumeClaim:
          claimName: nfsvolumeclaim

In the YAML file:

  • metadata: a nested object field that every deployment object must specify.
    • name: A string that uniquely identifies this object and job.
  • restartPolicy: Restart policy for all Containers within the pod. One of Always, OnFailure, Never. Default to Always.
  • containers: A single application Container that you want to run within a pod.
    • name: Name of the Container, each Container in a pod must have a unique name.
    • image: Name of the Container image.
    • imagePullPolicy: Image pull policy. One of Always, Never and IfNotPresent. Defaults to Always if :latest tag is specified, or IfNotPresent otherwise.
    • command: command for the containers that run in the Pod.
    • args: arguments to submit the spark application in the Pod. See more details of the spark-submit script in Section 6.2.1.
    • securityContext: SecurityContext defines the security options the container should be run with.
    • env: List of environment variables to set in the Container, which will be used when submitting the application.
      • env.name: Name of the environment variable.
      • env.value: Value of the environment variable.
    • resources: Allocate resources in the cluster to each pod.
      • resource.limits: Limits describes the maximum amount of compute resources allowed.
      • resource.requests: Requests describes the minimum amount of compute resources required.
    • volumeMounts: Declare where to mount volumes into containers.
      • name: Match with the Name of a Volume.
      • mountPath: Path within the Container at which the volume should be mounted.
      • subPath: Path within the volume from which the Container's volume should be mounted.
    • volume: specify the volumes to provide for the Pod.
      • persistentVolumeClaim: mount a PersistentVolume into a Pod

Create a Pod and run Fashion-MNIST application based on the YAML file.

kubectl apply -f orca-tutorial-client.yaml

List all pods to find the driver pod, which will be named as orca-pytorch-job-xxx.

# find out driver pod
kubectl get pods

View logs from the driver pod to retrieve the training stats.

# retrieve training logs
kubectl logs `orca-pytorch-job-xxx`

After the task finishes, you can delete the job with the command below.

kubectl delete job orca-pytorch-job

6.3.2 K8s Cluster

BigDL has provided an example YAML file (see orca-tutorial-cluster.yaml, which describes a Deployment that runs the intelanalytics/bigdl-k8s:2.1.0 image) to run the tutorial FashionMNIST program on k8s-cluster mode:

Notes:

  • Please call init_orca_context at the very beginning of each Orca program.
    from bigdl.orca import init_orca_context
    
    init_orca_context(cluster_mode="spark-submit")
    
# orca-tutorial-cluster.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: orca-pytorch-job
spec:
  template:
    spec:
      serviceAccountName: spark
      restartPolicy: Never
      hostNetwork: true
      containers:
      - name: spark-k8s-cluster
        image: intelanalytics/bigdl-k8s:2.1.0
        imagePullPolicy: IfNotPresent
        command: ["/bin/sh","-c"]
        args: ["
                ${SPARK_HOME}/bin/spark-submit \
                --master ${RUNTIME_SPARK_MASTER} \
                --deploy-mode ${SPARK_MODE} \
                --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
                --name orca-pytorch-tutorial \
                --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
                --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \
                --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata/ \
                --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \
                --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata/ \
                --archives file:///bigdl/nfsdata/environment.tar.gz#python_env \
                --conf spark.pyspark.python=python_env/bin/python \
                --conf spark.executorEnv.PYTHONHOME=python_env \
                --conf spark.kubernetes.file.upload.path=/bigdl/nfsdata/ \
                --num-executors 2 \
                --executor-cores 16 \
                --executor-memory 50g \
                --total-executor-cores 32 \
                --driver-cores 4 \
                --driver-memory 50g \
                --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
                --py-files local://${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,local:///bigdl/nfsdata/train.py,local:///bigdl/nfsdata/model.py \
                --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
                --conf spark.kubernetes.executor.deleteOnTermination=True \
                --conf spark.sql.catalogImplementation='in-memory' \
                --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
                --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
                local:///bigdl/nfsdata/train.py
                --cluster_mode spark-submit
                --remote_dir file:///bigdl/nfsdata/dataset
                "]
        securityContext:
          privileged: true
        env:
          - name: RUNTIME_K8S_SPARK_IMAGE
            value: intelanalytics/bigdl-k8s:2.1.0
          - name: RUNTIME_SPARK_MASTER
            value: k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>
          - name: SPARK_MODE
            value: cluster
          - name: RUNTIME_K8S_SERVICE_ACCOUNT
            value: spark
          - name: BIGDL_HOME
            value: /opt/bigdl-2.1.0
          - name: SPARK_HOME
            value: /opt/spark
          - name: SPARK_VERSION
            value: 3.1.2
          - name: BIGDL_VERSION
            value: 2.1.0
        resources:
          requests:
            cpu: 1
          limits:
            cpu: 4
        volumeMounts:
          - name: nfs-storage
            mountPath: /bigdl/nfsdata
          - name: nfs-storage
            mountPath: /root/.kube/config
            subPath: kubeconfig
      volumes:
      - name: nfs-storage
        persistentVolumeClaim:
          claimName: nfsvolumeclaim

In the YAML file:

  • restartPolicy: Restart policy for all Containers within the pod. One of Always, OnFailure, Never. Default to Always.
  • containers: A single application Container that you want to run within a pod.
    • name: Name of the Container, each Container in a pod must have a unique name.
    • image: Name of the Container image.
    • imagePullPolicy: Image pull policy. One of Always, Never and IfNotPresent. Defaults to Always if :latest tag is specified, or IfNotPresent otherwise.
    • command: command for the containers that run in the Pod.
    • args: arguments to submit the spark application in the Pod. See more details of the spark-submit script in Section 6.2.2.
    • securityContext: SecurityContext defines the security options the container should be run with.
    • env: List of environment variables to set in the Container, which will be used when submitting the application.
      • env.name: Name of the environment variable.
      • env.value: Value of the environment variable.
    • resources: Allocate resources in the cluster to each pod.
      • resource.limits: Limits describes the maximum amount of compute resources allowed.
      • resource.requests: Requests describes the minimum amount of compute resources required.
    • volumeMounts: Declare where to mount volumes into containers.
      • name: Match with the Name of a Volume.
      • mountPath: Path within the Container at which the volume should be mounted.
      • subPath: Path within the volume from which the Container's volume should be mounted.
    • volume: specify the volumes to provide for the Pod.
      • persistentVolumeClaim: mount a PersistentVolume into a Pod

Create a Pod and run Fashion-MNIST application based on the YAML file.

kubectl apply -f orca-tutorial-cluster.yaml

List all pods to find the driver pod (since the client pod only returns training status), which will be named as orca-pytorch-job-driver.

# check training status
kubectl logs `orca-pytorch-job-xxx`

# find out driver pod
kubectl get pods

View logs from the driver pod to retrieve the training stats.

# retrieve training logs
kubectl logs `orca-pytorch-job-driver`

After the task finishes, you can delete the job with the command below.

kubectl delete job orca-pytorch-job

6.4 Use Kubernetes Deployment (with Integrated Image)

BigDL also allows users to skip preparing the environment by providing a container image (intelanalytics/bigdl-k8s:orca-2.1.0) that has all the required BigDL environments integrated.

Notes:

  • The image will be pulled automatically when you deploy pods with the YAML file.
  • A Conda archive is not needed in this method; please skip Section 3, since BigDL has already integrated the environment in intelanalytics/bigdl-k8s:orca-2.1.0.
  • If you need to install extra Python libraries that may not be included in the image, please submit applications with a Conda archive instead (refer to Section 6.3).

Before submitting the example application, you should:

  • On the Develop Node
    • Download dataset and upload it to NFS.
      mv /path/to/dataset /bigdl/nfsdata 
      
    • Upload example Python files and extra Python dependencies to NFS.
      # Upload example Python files
      cp /path/to/train.py /bigdl/nfsdata
      
      # Upload extra Python dependencies
      cp /path/to/model.py /bigdl/nfsdata
      

6.4.1 K8s Client

BigDL has provided an example YAML file (see integrate_image_client.yaml, which describes a deployment that runs the intelanalytics/bigdl-k8s:orca-2.1.0 image) to run the tutorial FashionMNIST program on k8s-client mode:

Notes:

  • Please call init_orca_context at the very beginning of each Orca program.
    from bigdl.orca import init_orca_context
    
    init_orca_context(cluster_mode="spark-submit")
    
  • The Spark client needs to specify spark.pyspark.driver.python; this Python environment should be located on the NFS directory.
    --conf spark.pyspark.driver.python=/bigdl/nfsdata/orca_env/bin/python \
    
# integrate_image_client.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: orca-integrate-job
spec:
  template:
    spec:
      serviceAccountName: spark
      restartPolicy: Never
      hostNetwork: true
      containers:
      - name: spark-k8s-client
        image: intelanalytics/bigdl-spark-3.1.2:orca-2.1.0
        imagePullPolicy: IfNotPresent
        command: ["/bin/sh","-c"]
        args: ["
                export RUNTIME_DRIVER_HOST=$( hostname -I | awk '{print $1}' );
                ${SPARK_HOME}/bin/spark-submit \
                --master ${RUNTIME_SPARK_MASTER} \
                --deploy-mode ${SPARK_MODE} \
                --conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
                --conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
                --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
                --name orca-integrate-pod \
                --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
                --conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
                --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \
                --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata \
                --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \
                --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata \
                --conf spark.pyspark.driver.python=python \
                --conf spark.pyspark.python=/usr/local/envs/bigdl/bin/python \
                --conf spark.kubernetes.file.upload.path=/bigdl/nfsdata/ \
                --executor-cores 10 \
                --executor-memory 50g \
                --num-executors 4 \
                --total-executor-cores 40 \
                --driver-cores 10 \
                --driver-memory 50g \
                --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
                --py-files local://${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,local:///bigdl/nfsdata/train.py,local:///bigdl/nfsdata/model.py \
                --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
                --conf spark.sql.catalogImplementation='in-memory' \
                --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
                --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
                local:///bigdl/nfsdata/train.py
                --cluster_mode spark-submit
                --remote_dir file:///bigdl/nfsdata/dataset
                "]
        securityContext:
          privileged: true
        env:
          - name: RUNTIME_K8S_SPARK_IMAGE
            value: intelanalytics/bigdl-spark-3.1.2:orca-2.1.0
          - name: RUNTIME_SPARK_MASTER
            value: k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>
          - name: RUNTIME_DRIVER_PORT
            value: !!str 54321
          - name: SPARK_MODE
            value: client
          - name: RUNTIME_K8S_SERVICE_ACCOUNT
            value: spark
          - name: BIGDL_HOME
            value: /opt/bigdl-2.1.0
          - name: SPARK_HOME
            value: /opt/spark
          - name: SPARK_VERSION
            value: 3.1.2
          - name: BIGDL_VERSION
            value: 2.1.0
        resources:
          requests:
            cpu: 1
          limits:
            cpu: 4
        volumeMounts:
          - name: nfs-storage
            mountPath: /bigdl/nfsdata
          - name: nfs-storage
            mountPath: /root/.kube/config
            subPath: kubeconfig
      volumes:
      - name: nfs-storage
        persistentVolumeClaim:
          claimName: nfsvolumeclaim

In the YAML file:

  • restartPolicy: Restart policy for all Containers within the pod. One of Always, OnFailure, Never. Default to Always.
  • containers: A single application Container that you want to run within a pod.
    • name: Name of the Container, each Container in a pod must have a unique name.
    • image: Name of the Container image.
    • imagePullPolicy: Image pull policy. One of Always, Never and IfNotPresent. Defaults to Always if :latest tag is specified, or IfNotPresent otherwise.
    • command: command for the containers that run in the Pod.
    • args: arguments to submit the spark application in the Pod. See more details of the spark-submit script in Section 6.2.1.
    • securityContext: SecurityContext defines the security options the container should be run with.
    • env: List of environment variables to set in the Container, which will be used when submitting the application.
      • env.name: Name of the environment variable.
      • env.value: Value of the environment variable.
    • resources: Allocate resources in the cluster to each pod.
      • resource.limits: Limits describes the maximum amount of compute resources allowed.
      • resource.requests: Requests describes the minimum amount of compute resources required.
    • volumeMounts: Declare where to mount volumes into containers.
      • name: Match with the Name of a Volume.
      • mountPath: Path within the Container at which the volume should be mounted.
      • subPath: Path within the volume from which the Container's volume should be mounted.
    • volume: specify the volumes to provide for the Pod.
      • persistentVolumeClaim: mount a PersistentVolume into a Pod

Create a Pod and run Fashion-MNIST application based on the YAML file.

kubectl apply -f integrate_image_client.yaml

List all pods to find the driver pod, which will be named as orca-integrate-job-xxx.

# find out driver pod
kubectl get pods

View logs from the driver pod to retrieve the training stats.

# retrieve training logs
kubectl logs `orca-integrate-job-xxx`

After the task finishes, you can delete the job with the command below.

kubectl delete job orca-integrate-job

6.4.2 K8s Cluster

BigDL has provided an example YAML file (see integrate_image_cluster.yaml, which describes a deployment that runs the intelanalytics/bigdl-k8s:orca-2.1.0 image) to run the tutorial FashionMNIST program on k8s-cluster mode:

Notes:

  • Please call init_orca_context at the very beginning of each Orca program.
    from bigdl.orca import init_orca_context
    
    init_orca_context(cluster_mode="spark-submit")
    
# integrate_image_cluster.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: orca-integrate-job
spec:
  template:
    spec:
      serviceAccountName: spark
      restartPolicy: Never
      hostNetwork: true
      containers:
      - name: spark-k8s-cluster
        image: intelanalytics/bigdl-spark-3.1.2:orca-2.1.0
        imagePullPolicy: IfNotPresent
        command: ["/bin/sh","-c"]
        args: ["
                ${SPARK_HOME}/bin/spark-submit \
                --master ${RUNTIME_SPARK_MASTER} \
                --deploy-mode ${SPARK_MODE} \
                --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
                --name orca-integrate-pod \
                --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
                --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \
                --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata \
                --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \
                --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata \
                --conf spark.kubernetes.file.upload.path=/bigdl/nfsdata/ \
                --executor-cores 10 \
                --executor-memory 50g \
                --num-executors 4 \
                --total-executor-cores 40 \
                --driver-cores 10 \
                --driver-memory 50g \
                --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
                --py-files local://${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,local:///bigdl/nfsdata/train.py,local:///bigdl/nfsdata/model.py \
                --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
                --conf spark.sql.catalogImplementation='in-memory' \
                --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
                --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
                local:///bigdl/nfsdata/train.py
                --cluster_mode spark-submit
                --remote_dir file:///bigdl/nfsdata/dataset
                "]
        securityContext:
          privileged: true
        env:
          - name: RUNTIME_K8S_SPARK_IMAGE
            value: intelanalytics/bigdl-spark-3.1.2:orca-2.1.0
          - name: RUNTIME_SPARK_MASTER
            value: k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>
          - name: SPARK_MODE
            value: cluster
          - name: RUNTIME_K8S_SERVICE_ACCOUNT
            value: spark
          - name: BIGDL_HOME
            value: /opt/bigdl-2.1.0
          - name: SPARK_HOME
            value: /opt/spark
          - name: SPARK_VERSION
            value: 3.1.2
          - name: BIGDL_VERSION
            value: 2.1.0
        resources:
          requests:
            cpu: 1
          limits:
            cpu: 4
        volumeMounts:
          - name: nfs-storage
            mountPath: /bigdl/nfsdata
          - name: nfs-storage
            mountPath: /root/.kube/config
            subPath: kubeconfig
      volumes:
      - name: nfs-storage
        persistentVolumeClaim:
          claimName: nfsvolumeclaim

In the YAML file:

  • restartPolicy: Restart policy for all Containers within the pod. One of Always, OnFailure, Never. Default to Always.
  • containers: A single application Container that you want to run within a pod.
    • name: Name of the Container, each Container in a pod must have a unique name.
    • image: Name of the Container image.
    • imagePullPolicy: Image pull policy. One of Always, Never and IfNotPresent. Defaults to Always if :latest tag is specified, or IfNotPresent otherwise.
    • command: command for the containers that run in the Pod.
    • args: arguments to submit the spark application in the Pod. See more details of the spark-submit script in Section 6.2.2.
    • securityContext: SecurityContext defines the security options the container should be run with.
    • env: List of environment variables to set in the Container, which will be used when submitting the application.
      • env.name: Name of the environment variable.
      • env.value: Value of the environment variable.
    • resources: Allocate resources in the cluster to each pod.
      • resource.limits: Limits describes the maximum amount of compute resources allowed.
      • resource.requests: Requests describes the minimum amount of compute resources required.
    • volumeMounts: Declare where to mount volumes into containers.
      • name: Match with the Name of a Volume.
      • mountPath: Path within the Container at which the volume should be mounted.
      • subPath: Path within the volume from which the Container's volume should be mounted.
    • volume: specify the volumes to provide for the Pod.
      • persistentVolumeClaim: mount a PersistentVolume into a Pod

Create a Pod and run Fashion-MNIST application based on the YAML file.

kubectl apply -f integrate_image_cluster.yaml

List all pods to find the driver pod (since the client pod only returns training status), which will be named as orca-integrate-job-driver.

# check training status
kubectl logs `orca-integrate-job-xxx`

# find out driver pod
kubectl get pods

View logs from the driver pod to retrieve the training stats.

# retrieve training logs
kubectl logs `orca-integrate-job-driver`

After the task finishes, you can delete the job with the command below.

kubectl delete job orca-integrate-job