From f997cc5d60b8e6a70e05d103d362e3bdda4d4da1 Mon Sep 17 00:00:00 2001 From: Kai Huang Date: Thu, 23 Feb 2023 09:49:12 +0800 Subject: [PATCH] Update k8s yaml (#7625) * remove integrate * update yaml * rename * update * fix path * update * remove local * remove archive * update commands * update spark/bigdl home * fix * align order * update cluster * remove resources --- .../source/doc/Orca/Tutorial/k8s.md | 780 +++++------------- .../source/doc/Orca/Tutorial/yarn.md | 34 +- 2 files changed, 236 insertions(+), 578 deletions(-) diff --git a/docs/readthedocs/source/doc/Orca/Tutorial/k8s.md b/docs/readthedocs/source/doc/Orca/Tutorial/k8s.md index ac8efa9d..7d3a302f 100644 --- a/docs/readthedocs/source/doc/Orca/Tutorial/k8s.md +++ b/docs/readthedocs/source/doc/Orca/Tutorial/k8s.md @@ -2,7 +2,7 @@ This tutorial provides a step-by-step guide on how to run BigDL-Orca programs on Kubernetes (K8s) clusters, using a [PyTorch Fashin-MNIST program](https://github.com/intel-analytics/BigDL/tree/main/python/orca/tutorial/pytorch/FashionMNIST) as a working example. -The **Client Container** that appears in this tutorial refer to the Docker container where you launch or submit your applications. The __Develop Node__ is the host machine where you launch the client container. +The __Develop Node__ is the host machine where you launch the client container or create a Kubernetes Deployment. The **Client Container** is the created Docker container where you launch or submit your applications. --- ## 1. Basic Concepts @@ -31,7 +31,7 @@ In `init_orca_context`, you may specify necessary runtime configurations for run * `conf`: a dictionary to append extra conf for Spark (default to be `None`). __Note__: -* All arguments __except__ `cluster_mode` will be ignored when using [`spark-submit`](#use-spark-submit) or [`Kubernetes deployment`](#use-kubernetes-deployment-with-conda-archive) to submit and run Orca programs, in which case you are supposed to specify these configurations via the submit command or the YAML file. +* All arguments __except__ `cluster_mode` will be ignored when using [`spark-submit`](#use-spark-submit) or [`Kubernetes deployment`](#use-kubernetes-deployment) to submit and run Orca programs, in which case you are supposed to specify these configurations via the submit command or the YAML file. After Orca programs finish, you should always call `stop_orca_context` at the end of the program to release resources and shutdown the underlying distributed runtime engine (such as Spark or Ray). ```python @@ -80,6 +80,11 @@ kubectl logs kubectl describe pod ``` +* You may need to delete the driver pod manually after the application finished: +```bash +kubectl delete pod +``` + ### 1.3 Load Data from Volumes When you are running programs on K8s, please load data from [Volumes](https://kubernetes.io/docs/concepts/storage/volumes/) accessible to all K8s pods. We use Network File Systems (NFS) with path `/bigdl/nfsdata` in this tutorial as an example. You are recommended to put your working directory in the Volume (NFS) as well. @@ -130,23 +135,32 @@ def train_data_creator(config, batch_size): return trainloader ``` - --- -## 2. 
Create BigDL K8s Container -### 2.1 Pull Docker Image -Please pull the BigDL [`bigdl-k8s`](https://hub.docker.com/r/intelanalytics/bigdl-k8s/tags) image (built on top of Spark 3.1.3) from Docker Hub as follows: +### 2 Pull Docker Image +Please pull the BigDL [`bigdl-k8s`](https://hub.docker.com/r/intelanalytics/bigdl-k8s/tags) image (built on top of Spark 3.1.3) from Docker Hub beforehand as follows: ```bash # For the latest nightly build version sudo docker pull intelanalytics/bigdl-k8s:latest -# For the release version, e.g. 2.1.0 -sudo docker pull intelanalytics/bigdl-k8s:2.1.0 +# For the release version, e.g. 2.2.0 +sudo docker pull intelanalytics/bigdl-k8s:2.2.0 ``` +In the docker container: +- Spark is located at `/opt/spark`. Spark version is 3.1.3. +- BigDL is located at `/opt/bigdl-VERSION`. For the latest nightly build image, BigDL version would be `xxx-SNAPSHOT` (e.g. 2.3.0-SNAPSHOT). -### 2.2 Create a K8s Client Container +--- +## 3. Create BigDL K8s Container +Note that you can __skip__ this section if you want to run applications with [`Kubernetes deployment`](#use-kubernetes-deployment). + +You need to create a BigDL K8s client container only when you use [`python` command](#use-python-command) or [`spark-submit`](#use-spark-submit). + +### 3.1 Create a K8s Client Container Please create the __Client Container__ using the script below: ```bash +export RUNTIME_DRIVER_HOST=$( hostname -I | awk '{print $1}' ) + sudo docker run -itd --net=host \ -v /etc/kubernetes:/etc/kubernetes \ -v /root/.kube:/root/.kube \ @@ -159,8 +173,7 @@ sudo docker run -itd --net=host \ -e RUNTIME_K8S_SERVICE_ACCOUNT=spark \ -e RUNTIME_K8S_SPARK_IMAGE=intelanalytics/bigdl-k8s:latest \ -e RUNTIME_PERSISTENT_VOLUME_CLAIM=nfsvolumeclaim \ - -e RUNTIME_DRIVER_HOST=x.x.x.x \ - -e RUNTIME_DRIVER_PORT=54321 \ + -e RUNTIME_DRIVER_HOST=${RUNTIME_DRIVER_HOST} \ intelanalytics/bigdl-k8s:latest bash ``` @@ -175,17 +188,16 @@ In the script: * `NOTEBOOK_TOKEN`: a string that specifies the token for Notebook. This is not necessary if you don't use notebook. * `RUNTIME_SPARK_MASTER`: a URL format that specifies the Spark master: `k8s://https://:`. * `RUNTIME_K8S_SERVICE_ACCOUNT`: a string that specifies the service account for the driver pod. -* `RUNTIME_K8S_SPARK_IMAGE`: the name of the BigDL K8s Docker image. +* `RUNTIME_K8S_SPARK_IMAGE`: the name of the BigDL K8s Docker image. Note that you need to change the version accordingly. * `RUNTIME_PERSISTENT_VOLUME_CLAIM`: a string that specifies the Kubernetes volumeName (e.g. "nfsvolumeclaim"). * `RUNTIME_DRIVER_HOST`: a URL format that specifies the driver localhost (only required if you use k8s-client mode). -* `RUNTIME_DRIVER_PORT`: a string that specifies the driver port (only required if you use k8s-client mode). __Notes:__ * The __Client Container__ already contains all the required environment configurations for Spark and BigDL Orca. * Spark executor containers are scheduled by K8s at runtime and you don't need to create them manually. -### 2.3 Launch the K8s Client Container +### 3.2 Launch the K8s Client Container Once the container is created, a `containerID` would be returned and with which you can enter the container following the command below: ```bash sudo docker exec -it bash @@ -194,12 +206,12 @@ In the remaining part of this tutorial, you are supposed to operate and run comm --- -## 3. Prepare Environment -In the launched BigDL K8s **Client Container**, please setup the environment following the steps below: +## 4. 
Prepare Environment +In the launched BigDL K8s **Client Container** (if you use [`python` command](#use-python-command) or [`spark-submit`](#use-spark-submit)) or on the **Develop Node** (if you use [`Kubernetes deployment`](#use-kubernetes-deployment)), please setup the environment following the steps below: - See [here](../Overview/install.md#install-anaconda) to install conda and prepare the Python environment. -- See [here](../Overview/install.md#to-install-orca-for-spark3) to install BigDL Orca in the created conda environment. *Note that if you use [`spark-submit`](#use-spark-submit), please __skip__ this step and __DO NOT__ install BigDL Orca with pip install command in the conda environment.* +- See [here](../Overview/install.md#to-install-orca-for-spark3) to install BigDL Orca in the created conda environment. *Note that if you use [`spark-submit`](#use-spark-submit) or [`Kubernetes deployment`](#use-kubernetes-deployment), please __skip__ this step and __DO NOT__ install BigDL Orca with pip install command in the conda environment.* - You should install all the other Python libraries that you need in your program in the conda environment as well. `torch` and `torchvision` are needed to run the Fashion-MNIST example we provide: ```bash @@ -208,7 +220,7 @@ pip install torch torchvision tqdm --- -## 4. Prepare Dataset +## 5. Prepare Dataset To run the Fashion-MNIST example provided by this tutorial on K8s, you should upload the dataset to a K8s Volume (e.g. NFS). Please download the Fashion-MNIST dataset manually on your __Develop Node__ and put the data into the Volume. Note that PyTorch `FashionMNIST Dataset` requires unzipped files located in `FashionMNIST/raw/` under the dataset folder. @@ -228,7 +240,7 @@ In the given example, you can specify the argument `--data_dir` to be the direct --- -## 5. Prepare Custom Modules +## 6. Prepare Custom Modules Spark allows to upload Python files(`.py`), and zipped Python packages(`.zip`) across the cluster by setting `--py-files` option in Spark scripts or specifying `extra_python_lib` in `init_orca_context`. The FasionMNIST example needs to import the modules from [`model.py`](https://github.com/intel-analytics/BigDL/blob/main/python/orca/tutorial/pytorch/FashionMNIST/model.py). @@ -246,7 +258,7 @@ For more details, please see [BigDL Python Dependencies](https://bigdl.readthedo ```bash spark-submit ... - --py-files file:///bigdl/nfsdata/model.py + --py-files /bigdl/nfsdata/model.py ... ``` For more details, please see [Spark Python Dependencies](https://spark.apache.org/docs/latest/submitting-applications.html). @@ -274,49 +286,45 @@ If your program depends on a nested directory of Python files, you are recommend --- -## 6. Run Jobs on K8s -In the following part, we will illustrate four ways to submit and run BigDL Orca applications on K8s. +## 7. Run Jobs on K8s +In the following part, we will illustrate three ways to submit and run BigDL Orca applications on K8s. * Use `python` command * Use `spark-submit` -* Use Kubernetes Deployment (with Conda Archive) -* Use Kubernetes Deployment (with Integrated Image) +* Use Kubernetes Deployment You can choose one of them based on your preference or cluster settings. We provide the running command for the [Fashion-MNIST example](https://github.com/intel-analytics/BigDL/blob/main/python/orca/tutorial/pytorch/FashionMNIST/) in the __Client Container__ in this section. 
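
Before submitting the example, it can save debugging time to confirm that the dataset and the Python dependency files are actually visible under the NFS mount. Below is a minimal sanity check, assuming the `/bigdl/nfsdata` mount path and the file layout used in this tutorial:
```bash
# Check that the Fashion-MNIST raw files are unzipped under the dataset folder
ls /bigdl/nfsdata/dataset/FashionMNIST/raw
# Expected files include train-images-idx3-ubyte, train-labels-idx1-ubyte,
# t10k-images-idx3-ubyte and t10k-labels-idx1-ubyte

# Check that the script and its dependencies have been uploaded (needed for cluster modes)
ls /bigdl/nfsdata/train.py /bigdl/nfsdata/model.py
```
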
-### 6.1 Use `python` command +### 7.1 Use `python` command This is the easiest and most recommended way to run BigDL Orca on K8s as a normal Python program. See [here](#init-orca-context) for the runtime configurations. -#### 6.1.1 K8s-Client +#### 7.1.1 K8s-Client Run the example with the following command by setting the cluster_mode to "k8s-client": ```bash python train.py --cluster_mode k8s-client --data_dir /bigdl/nfsdata/dataset ``` -#### 6.1.2 K8s-Cluster -Before running the example on `k8s-cluster` mode, you should: -* In the __Client Container__: +#### 7.1.2 K8s-Cluster +Before running the example on `k8s-cluster` mode in the __Client Container__, you should: -Pack the current activate conda environment to an archive: -```bash -conda pack -o environment.tar.gz -``` - -* On the __Develop Node__: -1. Upload the conda archive to NFS. +1. Pack the current activate conda environment to an archive: ```bash - docker cp :/path/to/environment.tar.gz /bigdl/nfsdata + conda pack -o environment.tar.gz ``` -2. Upload the Python script (`train.py` in our example) to NFS. +2. Upload the conda archive to NFS: + ```bash + cp /path/to/environment.tar.gz /bigdl/nfsdata + ``` +3. Upload the Python script (`train.py` in our example) to NFS: ```bash cp /path/to/train.py /bigdl/nfsdata ``` -3. Upload the extra Python dependency files (`model.py` in our example) to NFS. +4. Upload the extra Python dependency files (`model.py` in our example) to NFS: ```bash cp /path/to/model.py /bigdl/nfsdata ``` @@ -327,26 +335,26 @@ python /bigdl/nfsdata/train.py --cluster_mode k8s-cluster --data_dir /bigdl/nfsd ``` -### 6.2 Use `spark-submit` +### 7.2 Use `spark-submit` -If you prefer to use `spark-submit`, please follow the steps below to prepare the environment in the __Client Container__. +If you prefer to use `spark-submit`, please follow the steps below in the __Client Container__ before submitting the application. . -1. Set the cluster_mode to "spark-submit" in `init_orca_context`. - ```python - sc = init_orca_context(cluster_mode="spark-submit") - ``` - -2. Download the requirement file(s) from [here](https://github.com/intel-analytics/BigDL/tree/main/python/requirements/orca) and install the required Python libraries of BigDL Orca according to your needs. +1. Download the requirement file(s) from [here](https://github.com/intel-analytics/BigDL/tree/main/python/requirements/orca) and install the required Python libraries of BigDL Orca according to your needs. ```bash pip install -r /path/to/requirements.txt ``` Note that you are recommended **NOT** to install BigDL Orca with pip install command in the conda environment if you use spark-submit to avoid possible conflicts. -3. Pack the current activate conda environment to an archive before submitting the example: +2. Pack the current activate conda environment to an archive: ```bash conda pack -o environment.tar.gz ``` +3. Set the cluster_mode to "spark-submit" in `init_orca_context`: + ```python + sc = init_orca_context(cluster_mode="spark-submit") + ``` + Some runtime configurations for Spark are as follows: * `--master`: a URL format that specifies the Spark master: k8s://https://:. @@ -367,7 +375,7 @@ Some runtime configurations for Spark are as follows: * `--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path`: specify the path to be mounted as `persistentVolumeClaim` into executor pods. 
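
The scripts below reference several environment variables such as `${SPARK_HOME}`, `${SPARK_VERSION}`, `${BIGDL_HOME}` and `${BIGDL_VERSION}`. Inside the `bigdl-k8s` container they normally point to the pre-installed locations described earlier; the following is a sketch of how to verify or export them, assuming the default paths of the image (adjust the BigDL version to match your image):
```bash
# Default locations inside the bigdl-k8s image
export SPARK_HOME=/opt/spark
export SPARK_VERSION=3.1.3
export BIGDL_VERSION=2.2.0-SNAPSHOT   # change to the BigDL version of your image
export BIGDL_HOME=/opt/bigdl-${BIGDL_VERSION}

# Quick check that the paths exist before running spark-submit
ls ${SPARK_HOME}/bin/spark-submit ${BIGDL_HOME}/jars > /dev/null && echo "Spark and BigDL found"
```
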
-#### 6.2.1 K8s Client +#### 7.2.1 K8s Client Submit and run the program for `k8s-client` mode following the `spark-submit` script below: ```bash ${SPARK_HOME}/bin/spark-submit \ @@ -398,21 +406,22 @@ In the `spark-submit` script: * `deploy-mode`: set it to `client` when running programs on k8s-client mode. * `--conf spark.driver.host`: the localhost for the driver pod. * `--conf spark.pyspark.driver.python`: set the activate Python location in __Client Container__ as the driver's Python environment. -* `--conf spark.pyspark.python`: set the Python location in conda archive as each executor's Python environment. +* `--conf spark.pyspark.python`: set the Python location in the conda archive as each executor's Python environment. -#### 6.2.2 K8s Cluster +#### 7.2.2 K8s Cluster -* On the __Develop Node__: -1. Upload the conda archive to NFS. +Before running the example on `k8s-cluster` mode in the __Client Container__, you should: + +1. Upload the conda archive to NFS: ```bash - docker cp :/path/to/environment.tar.gz /bigdl/nfsdata + cp /path/to/environment.tar.gz /bigdl/nfsdata ``` -2. Upload the example Python file to NFS. +2. Upload the Python script (`train.py` in our example) to NFS: ```bash cp /path/to/train.py /bigdl/nfsdata ``` -3. Upload the extra Python dependency files to NFS. +3. Upload the extra Python dependency files (`model.py` in our example) to NFS: ```bash cp /path/to/model.py /bigdl/nfsdata ``` @@ -431,69 +440,95 @@ ${SPARK_HOME}/bin/spark-submit \ --executor-memory 2g \ --driver-cores 2 \ --driver-memory 2g \ - --archives file:///bigdl/nfsdata/environment.tar.gz#environment \ + --archives /bigdl/nfsdata/environment.tar.gz#environment \ --conf spark.pyspark.driver.python=environment/bin/python \ --conf spark.pyspark.python=environment/bin/python \ --conf spark.kubernetes.file.upload.path=/bigdl/nfsdata \ --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \ - --py-files ${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,file:///bigdl/nfsdata/train.py,file:///bigdl/nfsdata/model.py \ - --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \ - --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \ + --py-files ${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/bigdl/nfsdata/train.py,/bigdl/nfsdata/model.py \ + --conf spark.driver.extraClassPath=${BIGDL_HOME}/jars/* \ + --conf spark.executor.extraClassPath=${BIGDL_HOME}/jars/* \ --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \ --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl/nfsdata \ --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \ --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl/nfsdata \ - file:///bigdl/nfsdata/train.py --cluster_mode spark-submit --data_dir /bigdl/nfsdata/dataset + /bigdl/nfsdata/train.py --cluster_mode spark-submit --data_dir /bigdl/nfsdata/dataset ``` In the `spark-submit` script: * `deploy-mode`: set it to `cluster` when running programs on k8s-cluster mode. * `--conf spark.kubernetes.authenticate.driver.serviceAccountName`: the service account for the driver pod. -* `--conf spark.pyspark.driver.python`: set the Python location in conda archive as the driver's Python environment. 
-* `--conf spark.pyspark.python`: also set the Python location in conda archive as each executor's Python environment. +* `--conf spark.pyspark.driver.python`: set the Python location in the conda archive as the driver's Python environment. +* `--conf spark.pyspark.python`: also set the Python location in the conda archive as each executor's Python environment. * `--conf spark.kubernetes.file.upload.path`: the path to store files at spark submit side in k8s-cluster mode. * `--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName`: specify the claim name of `persistentVolumeClaim` to mount `persistentVolume` into the driver pod. * `--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path`: specify the path to be mounted as `persistentVolumeClaim` into the driver pod. -### 6.3 Use Kubernetes Deployment (with Conda Archive) -BigDL supports users (which want to execute programs directly on __Develop Node__) to run an application by creating a Kubernetes Deployment object. +### 7.3 Use Kubernetes Deployment +BigDL supports users who want to execute programs directly on __Develop Node__ to run an application by creating a Kubernetes Deployment object. +After preparing the [Conda environment](#prepare-environment) on the __Develop Node__, follow the steps below before submitting the application. -Before submitting the Orca application, you should: -* On the __Develop Node__ - 1. Use Conda to install BigDL and needed Python dependency libraries (see __[Section 3](#3-prepare-environment)__), then pack the activate Conda environment to an archive. - ```bash - conda pack -o environment.tar.gz - ``` - 2. Upload Conda archive, example Python files and extra Python dependencies to NFS. - ```bash - # Upload Conda archive - cp /path/to/environment.tar.gz /bigdl/nfsdata - - # Upload example Python files - cp /path/to/train.py /bigdl/nfsdata - - # Uplaod extra Python dependencies - cp /path/to/model.py /bigdl/nfsdata - ``` - -#### 6.3.1 K8s Client -BigDL has provided an example YAML file (see __[orca-tutorial-client.yaml](../../../../../../python/orca/tutorial/pytorch/docker/orca-tutorial-client.yaml)__, which describes a Deployment that runs the `intelanalytics/bigdl-k8s:2.1.0` image) to run the tutorial FashionMNIST program on k8s-client mode: - -__Notes:__ -* Please call `init_orca_context` at very begining part of each Orca program. - ```python - from bigdl.orca import init_orca_context - - init_orca_context(cluster_mode="spark-submit") - ``` -* Spark client needs to specify `spark.pyspark.driver.python`, this python env should be on NFS dir. +1. Download the requirement file(s) from [here](https://github.com/intel-analytics/BigDL/tree/main/python/requirements/orca) and install the required Python libraries of BigDL Orca according to your needs. ```bash - --conf spark.pyspark.driver.python=/bigdl/nfsdata/python_env/bin/python \ + pip install -r /path/to/requirements.txt ``` + Note that you are recommended **NOT** to install BigDL Orca with pip install command in the conda environment if you use spark-submit to avoid possible conflicts. + +2. Pack the current activate conda environment to an archive before: + ```bash + conda pack -o environment.tar.gz + ``` + +3. Upload the conda archive, Python script (`train.py` in our example) and extra Python dependency files (`model.py` in our example) to NFS. 
+ ```bash + cp /path/to/environment.tar.gz /path/to/nfs + + cp /path/to/train.py /path/to/nfs + + cp /path/to/model.py /path/to/nfs + ``` + +4. Set the cluster_mode to "spark-submit" in `init_orca_context`. + ```python + sc = init_orca_context(cluster_mode="spark-submit") + ``` + +We define a Kubernetes Deployment in a YAML file. Some fields of the YAML are explained as follows: + +* `metadata`: a nested object filed that every deployment object must specify. + * `name`: a string that uniquely identifies this object and job. We use "orca-pytorch-job" in our example. +* `restartPolicy`: the restart policy for all containers within the pod. One of Always, OnFailure, Never. Default to Always. +* `containers`: a single application container to run within a pod. + * `name`: the name of the container. Each container in a pod will have a unique name. + * `image`: the name of the BigDL K8s Docker image. Note that you need to change the version accordingly. + * `imagePullPolicy`: the pull policy of the docker image. One of Always, Never and IfNotPresent. Defaults to Always if `:latest` tag is specified, or IfNotPresent otherwise. + * `command`: the command for the containers to run in the pod. + * `args`: the arguments to submit the spark application in the pod. See more details in [`spark-submit`](#use-spark-submit). + * `securityContext`: the security options the container should be run with. + * `env`: a list of environment variables to set in the container, which will be used when submitting the application. Note that you need to change the environment variables including `BIGDL_VERSION` and `BIGDL_HOME` accordingly. + * `name`: the name of the environment variable. + * `value`: the value of the environment variable. + * `volumeMounts`: the paths to mount Volumes into containers. + * `name`: the name of a Volume. + * `mountPath`: the path in the container to mount the Volume to. + * `subPath`: the sub-path within the volume to mount into the container. +* `volumes`: specify the volumes for the pod. We use NFS as the persistentVolumeClaim in our example. + + +#### 7.3.1 K8s Client +BigDL has provided an example [orca-tutorial-k8s-client.yaml](https://github.com/intel-analytics/BigDL/blob/main/python/orca/tutorial/pytorch/docker/orca-tutorial-client.yaml)__ to directly run the Fashion-MNIST example for k8s-client mode. +Note that you need to change the configurations in the YAML file accordingly, including the version of the docker image, RUNTIME_SPARK_MASTER, BIGDL_VERSION and BIGDL_HOME. 
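
If you are unsure which value to use for `RUNTIME_SPARK_MASTER`, you can query the API server address of your cluster with `kubectl` first (a sketch; the exact address depends on your cluster setup):
```bash
# Print the Kubernetes API server address;
# use it as k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>
kubectl cluster-info

# Or extract it directly from the current kubeconfig context
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'
```
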
+ +You need to uncompress the conda archive in NFS before submitting the job: +```bash +cd /path/to/nfs +mkdir environment +tar -xzvf environment.tar.gz --directory environment +``` ```bash -# orca-tutorial-client.yaml +# orca-tutorial-k8s-client.yaml apiVersion: batch/v1 kind: Job metadata: @@ -506,7 +541,7 @@ spec: hostNetwork: true containers: - name: spark-k8s-client - image: intelanalytics/bigdl-k8s:2.1.0 + image: intelanalytics/bigdl-k8s:latest imagePullPolicy: IfNotPresent command: ["/bin/sh","-c"] args: [" @@ -514,63 +549,47 @@ spec: ${SPARK_HOME}/bin/spark-submit \ --master ${RUNTIME_SPARK_MASTER} \ --deploy-mode ${SPARK_MODE} \ + --name orca-k8s-client-tutorial \ --conf spark.driver.host=${RUNTIME_DRIVER_HOST} \ - --conf spark.driver.port=${RUNTIME_DRIVER_PORT} \ - --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \ - --name orca-pytorch-tutorial \ --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \ - --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \ - --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata/ \ - --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \ - --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata/ \ - --archives file:///bigdl/nfsdata/environment.tar.gz#python_env \ - --conf spark.pyspark.driver.python=/bigdl/nfsdata/python_env/bin/python \ - --conf spark.pyspark.python=python_env/bin/python \ - --conf spark.executorEnv.PYTHONHOME=python_env \ - --conf spark.kubernetes.file.upload.path=/bigdl/nfsdata/ \ --num-executors 2 \ - --executor-cores 16 \ - --executor-memory 50g \ - --total-executor-cores 32 \ - --driver-cores 4 \ - --driver-memory 50g \ + --executor-cores 4 \ + --executor-memory 2g \ + --total-executor-cores 8 \ + --driver-cores 2 \ + --driver-memory 2g \ + --conf spark.pyspark.driver.python=/bigdl/nfsdata/environment/bin/python \ + --conf spark.pyspark.python=/bigdl/nfsdata/environment/bin/python \ --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \ - --py-files local://${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,local:///bigdl/nfsdata/train.py,local:///bigdl/nfsdata/model.py \ - --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \ + --py-files ${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/bigdl/nfsdata/model.py \ --conf spark.kubernetes.executor.deleteOnTermination=True \ - --conf spark.sql.catalogImplementation='in-memory' \ - --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \ - --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \ - local:///bigdl/nfsdata/train.py + --conf spark.driver.extraClassPath=${BIGDL_HOME}/jars/* \ + --conf spark.executor.extraClassPath=${BIGDL_HOME}/jars/* \ + --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \ + --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata/ \ + --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \ + --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata/ \ + /bigdl/nfsdata/train.py --cluster_mode spark-submit - --data_dir file:///bigdl/nfsdata/dataset + --data_dir /bigdl/nfsdata/dataset 
"] securityContext: privileged: true env: - name: RUNTIME_K8S_SPARK_IMAGE - value: intelanalytics/bigdl-k8s:2.1.0 + value: intelanalytics/bigdl-k8s:latest - name: RUNTIME_SPARK_MASTER value: k8s://https://: - - name: RUNTIME_DRIVER_PORT - value: !!str 54321 - name: SPARK_MODE value: client - - name: RUNTIME_K8S_SERVICE_ACCOUNT - value: spark - - name: BIGDL_HOME - value: /opt/bigdl-2.1.0 + - name: SPARK_VERSION + value: 3.1.3 - name: SPARK_HOME value: /opt/spark - - name: SPARK_VERSION - value: 3.1.2 - name: BIGDL_VERSION - value: 2.1.0 - resources: - requests: - cpu: 1 - limits: - cpu: 4 + value: 2.2.0-SNAPSHOT + - name: BIGDL_HOME + value: /opt/bigdl-2.2.0-SNAPSHOT volumeMounts: - name: nfs-storage mountPath: /bigdl/nfsdata @@ -583,65 +602,39 @@ spec: claimName: nfsvolumeclaim ``` -In the YAML file: -* `metadata`: A nested object filed that every deployment object must specify a metadata. - * `name`: A string that uniquely identifies this object and job. -* `restartPolicy`: Restart policy for all Containers within the pod. One of Always, OnFailure, Never. Default to Always. -* `containers`: A single application Container that you want to run within a pod. - * `name`: Name of the Container, each Container in a pod must have a unique name. - * `image`: Name of the Container image. - * `imagePullPolicy`: Image pull policy. One of Always, Never and IfNotPresent. Defaults to Always if `:latest` tag is specified, or IfNotPresent otherwise. - * `command`: command for the containers that run in the Pod. - * `args`: arguments to submit the spark application in the Pod. See more details of the `spark-submit` script in __[Section 6.2.1](#621-k8s-client)__. - * `securityContext`: SecurityContext defines the security options the container should be run with. - * `env`: List of environment variables to set in the Container, which will be used when submitting the application. - * `env.name`: Name of the environment variable. - * `env.value`: Value of the environment variable. - * `resources`: Allocate resources in the cluster to each pod. - * `resource.limits`: Limits describes the maximum amount of compute resources allowed. - * `resource.requests`: Requests describes the minimum amount of compute resources required. - * `volumeMounts`: Declare where to mount volumes into containers. - * `name`: Match with the Name of a Volume. - * `mountPath`: Path within the Container at which the volume should be mounted. - * `subPath`: Path within the volume from which the Container's volume should be mounted. - * `volume`: specify the volumes to provide for the Pod. - * `persistentVolumeClaim`: mount a PersistentVolume into a Pod -Create a Pod and run Fashion-MNIST application based on the YAML file. +Submit the application using `kubectl`: ```bash -kubectl apply -f orca-tutorial-client.yaml +kubectl apply -f orca-tutorial-k8s-client.yaml ``` -List all pods to find the driver pod, which will be named as `orca-pytorch-job-xxx`. -```bash -# find out driver pod -kubectl get pods -``` - -View logs from the driver pod to retrive the training stats. -```bash -# retrive training logs -kubectl logs `orca-pytorch-job-xxx` -``` - -After the task finish, you could delete the job as the command below. 
+Note that you need to delete the job before re-submitting another one: ```bash kubectl delete job orca-pytorch-job ``` -#### 6.3.2 K8s Cluster -BigDL has provided an example YAML file (see __[orca-tutorial-cluster.yaml](../../../../../../python/orca/tutorial/pytorch/docker/orca-tutorial-cluster.yaml)__, which describes a Deployment that runs the `intelanalytics/bigdl-k8s:2.1.0` image) to run the tutorial FashionMNIST program on k8s-cluster mode: +After submitting the job, you can list all the pods and find the driver pod with name `orca-pytorch-job-xxx`: +```bash +kubectl get pods +kubectl get pods | grep orca-pytorch-job +``` -__Notes:__ -* Please call `init_orca_context` at very begining part of each Orca program. - ```python - from bigdl.orca import init_orca_context +Retrieve the logs on the driver pod: +```bash +kubectl logs orca-pytorch-job-xxx +``` - init_orca_context(cluster_mode="spark-submit") - ``` +After the task finishes, delete the job and all related pods if necessary: +```bash +kubectl delete job orca-pytorch-job +``` + +#### 7.3.2 K8s Cluster +BigDL has provided an example [orca-tutorial-k8s-cluster.yaml](https://github.com/intel-analytics/BigDL/blob/main/python/orca/tutorial/pytorch/docker/orca-tutorial-cluster.yaml)__ to run the Fashion-MNIST example for k8s-cluster mode. +Note that you need to change the configurations in the YAML file accordingly, including the version of the docker image, RUNTIME_SPARK_MASTER, BIGDL_VERSION and BIGDL_HOME. ```bash -# orca-tutorial-cluster.yaml +# orca-tutorial-k8s-cluster.yaml apiVersion: batch/v1 kind: Job metadata: @@ -654,376 +647,60 @@ spec: hostNetwork: true containers: - name: spark-k8s-cluster - image: intelanalytics/bigdl-k8s:2.1.0 - imagePullPolicy: IfNotPresent - command: ["/bin/sh","-c"] - args: [" - ${SPARK_HOME}/bin/spark-submit \ - --master ${RUNTIME_SPARK_MASTER} \ - --deploy-mode ${SPARK_MODE} \ - --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \ - --name orca-pytorch-tutorial \ - --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \ - --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \ - --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata/ \ - --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \ - --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata/ \ - --archives file:///bigdl/nfsdata/environment.tar.gz#python_env \ - --conf spark.pyspark.python=python_env/bin/python \ - --conf spark.executorEnv.PYTHONHOME=python_env \ - --conf spark.kubernetes.file.upload.path=/bigdl/nfsdata/ \ - --num_executors 2 \ - --executor-cores 16 \ - --executor-memory 50g \ - --total-executor-cores 32 \ - --driver-cores 4 \ - --driver-memory 50g \ - --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \ - --py-files local://${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,local:///bigdl/nfsdata/train.py,local:///bigdl/nfsdata/model.py \ - --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \ - --conf spark.kubernetes.executor.deleteOnTermination=True \ - --conf spark.sql.catalogImplementation='in-memory' \ - --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \ - --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \ - local:///bigdl/nfsdata/train.py - --cluster_mode spark-submit - 
--data_dir file:///bigdl/nfsdata/dataset - "] - securityContext: - privileged: true - env: - - name: RUNTIME_K8S_SPARK_IMAGE - value: intelanalytics/bigdl-k8s:2.1.0 - - name: RUNTIME_SPARK_MASTER - value: k8s://https://: - - name: SPARK_MODE - value: cluster - - name: RUNTIME_K8S_SERVICE_ACCOUNT - value: spark - - name: BIGDL_HOME - value: /opt/bigdl-2.1.0 - - name: SPARK_HOME - value: /opt/spark - - name: SPARK_VERSION - value: 3.1.2 - - name: BIGDL_VERSION - value: 2.1.0 - resources: - requests: - cpu: 1 - limits: - cpu: 4 - volumeMounts: - - name: nfs-storage - mountPath: /bigdl/nfsdata - - name: nfs-storage - mountPath: /root/.kube/config - subPath: kubeconfig - volumes: - - name: nfs-storage - persistentVolumeClaim: - claimName: nfsvolumeclaim -``` - -In the YAML file: -* `restartPolicy`: Restart policy for all Containers within the pod. One of Always, OnFailure, Never. Default to Always. -* `containers`: A single application Container that you want to run within a pod. - * `name`: Name of the Container, each Container in a pod must have a unique name. - * `image`: Name of the Container image. - * `imagePullPolicy`: Image pull policy. One of Always, Never and IfNotPresent. Defaults to Always if `:latest` tag is specified, or IfNotPresent otherwise. - * `command`: command for the containers that run in the Pod. - * `args`: arguments to submit the spark application in the Pod. See more details of the `spark-submit` script in __[Section 6.2.2](#622-k8s-cluster)__. - * `securityContext`: SecurityContext defines the security options the container should be run with. - * `env`: List of environment variables to set in the Container, which will be used when submitting the application. - * `env.name`: Name of the environment variable. - * `env.value`: Value of the environment variable. - * `resources`: Allocate resources in the cluster to each pod. - * `resource.limits`: Limits describes the maximum amount of compute resources allowed. - * `resource.requests`: Requests describes the minimum amount of compute resources required. - * `volumeMounts`: Declare where to mount volumes into containers. - * `name`: Match with the Name of a Volume. - * `mountPath`: Path within the Container at which the volume should be mounted. - * `subPath`: Path within the volume from which the Container's volume should be mounted. - * `volume`: specify the volumes to provide for the Pod. - * `persistentVolumeClaim`: mount a PersistentVolume into a Pod - -Create a Pod and run Fashion-MNIST application based on the YAML file. -```bash -kubectl apply -f orca-tutorial-cluster.yaml -``` - -List all pods to find the driver pod (since the client pod only returns training status), which will be named as `orca-pytorch-job-driver`. -```bash -# checkout training status -kubectl logs `orca-pytorch-job-xxx` - -# find out driver pod -kubectl get pods -``` - -View logs from the driver pod to retrive the training stats. -```bash -# retrive training logs -kubectl logs `orca-pytorch-job-driver` -``` - -After the task finish, you could delete the job as the command below. -```bash -kubectl delete job orca-pytorch-job -``` - - -### 6.4 Use Kubernetes Deployment (without Integrated Image) -BigDL also supports uses to skip preparing envionment through providing a container image (`intelanalytics/bigdl-k8s:orca-2.1.0`) which has integrated all BigDL required environments. - -__Notes:__ -* The image will be pulled automatically when you deploy pods with the YAML file. 
-* Conda archive is no longer needed in this method, please skip __[Section 3](#3-prepare-environment)__, since BigDL has integrated environment in `intelanalytics/bigdl-k8s:orca-2.1.0`. -* If you need to install extra Python libraries which may not included in the image, please submit applications with Conda archive (refer to __[Section 6.3](#63-use-kubernetes-deployment)__). - -Before submitting the example application, you should: -* On the __Develop Node__ - * Download dataset and upload it to NFS. - ```bash - mv /path/to/dataset /bigdl/nfsdata - ``` - * Upload example Python files and extra Python dependencies to NFS. - ```bash - # Upload example Python files - cp /path/to/train.py /bigdl/nfsdata - - # Uplaod extra Python dependencies - cp /path/to/model.py /bigdl/nfsdata - ``` - -#### 6.4.1 K8s Client -BigDL has provided an example YAML file (see __[integrated_image_client.yaml](../../../../../../python/orca/tutorial/pytorch/docker/integrate_image_client.yaml)__, which describes a deployment that runs the `intelanalytics/bigdl-k8s:orca-2.1.0` image) to run the tutorial FashionMNIST program on k8s-client mode: - -__Notes:__ -* Please call `init_orca_context` at very begining part of each Orca program. - ```python - from bigdl.orca import init_orca_context - - init_orca_context(cluster_mode="spark-submit") - ``` -* Spark client needs to specify `spark.pyspark.driver.python`, this python env should be on NFS dir. - ```bash - --conf spark.pyspark.driver.python=/bigdl/nfsdata/orca_env/bin/python \ - ``` - -```bash -#integrate_image_client.yaml -apiVersion: batch/v1 -kind: Job -metadata: - name: orca-integrate-job -spec: - template: - spec: - serviceAccountName: spark - restartPolicy: Never - hostNetwork: true - containers: - - name: spark-k8s-client - image: intelanalytics/bigdl-spark-3.1.2:orca-2.1.0 + image: intelanalytics/bigdl-k8s:latest imagePullPolicy: IfNotPresent command: ["/bin/sh","-c"] args: [" export RUNTIME_DRIVER_HOST=$( hostname -I | awk '{print $1}' ); ${SPARK_HOME}/bin/spark-submit \ --master ${RUNTIME_SPARK_MASTER} \ + --name orca-k8s-cluster-tutorial \ --deploy-mode ${SPARK_MODE} \ --conf spark.driver.host=${RUNTIME_DRIVER_HOST} \ - --conf spark.driver.port=${RUNTIME_DRIVER_PORT} \ - --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \ - --name orca-integrate-pod \ --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \ - --conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \ - --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \ - --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata \ - --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \ - --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata \ - --conf spark.pyspark.driver.python=python \ - --conf spark.pyspark.python=/usr/local/envs/bigdl/bin/python \ - --conf spark.kubernetes.file.upload.path=/bigdl/nfsdata/ \ - --executor-cores 10 \ - --executor-memory 50g \ - --num-executors 4 \ - --total-executor-cores 40 \ - --driver-cores 10 \ - --driver-memory 50g \ + --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \ + --num-executors 2 \ + --executor-cores 4 \ + --total-executor-cores 8 \ + --executor-memory 2g \ + --driver-cores 2 \ + --driver-memory 2g \ + --archives 
/bigdl/nfsdata/environment.tar.gz#environment \ + --conf spark.pyspark.driver.python=environment/bin/python \ + --conf spark.pyspark.python=environment/bin/python \ + --conf spark.kubernetes.file.upload.path=/bigdl/nfsdata \ --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \ - --py-files local://${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,local:///bigdl/nfsdata/train.py,local:///bigdl/nfsdata/model.py \ - --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \ - --conf spark.sql.catalogImplementation='in-memory' \ - --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \ - --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \ - local:///bigdl/nfsdata/train.py + --py-files ${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/bigdl/nfsdata/train.py,/bigdl/nfsdata/model.py \ + --conf spark.driver.extraClassPath=${BIGDL_HOME}/jars/* \ + --conf spark.executor.extraClassPath=${BIGDL_HOME}/jars/* \ + --conf spark.kubernetes.executor.deleteOnTermination=True \ + --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \ + --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata/ \ + --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \ + --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata/ \ + /bigdl/nfsdata/train.py --cluster_mode spark-submit - --data_dir file:///bigdl/nfsdata/dataset + --data_dir /bigdl/nfsdata/dataset "] securityContext: privileged: true env: - name: RUNTIME_K8S_SPARK_IMAGE - value: intelanalytics/bigdl-spark-3.1.2:orca-2.1.0 + value: intelanalytics/bigdl-k8s:latest - name: RUNTIME_SPARK_MASTER value: k8s://https://: - - name: RUNTIME_DRIVER_PORT - value: !!str 54321 - - name: SPARK_MODE - value: client - name: RUNTIME_K8S_SERVICE_ACCOUNT value: spark - - name: BIGDL_HOME - value: /opt/bigdl-2.1.0 - - name: SPARK_HOME - value: /opt/spark - - name: SPARK_VERSION - value: 3.1.2 - - name: BIGDL_VERSION - value: 2.1.0 - resources: - requests: - cpu: 1 - limits: - cpu: 4 - volumeMounts: - - name: nfs-storage - mountPath: /bigdl/nfsdata - - name: nfs-storage - mountPath: /root/.kube/config - subPath: kubeconfig - volumes: - - name: nfs-storage - persistentVolumeClaim: - claimName: nfsvolumeclaim -``` - -In the YAML file: -* `restartPolicy`: Restart policy for all Containers within the pod. One of Always, OnFailure, Never. Default to Always. -* `containers`: A single application Container that you want to run within a pod. - * `name`: Name of the Container, each Container in a pod must have a unique name. - * `image`: Name of the Container image. - * `imagePullPolicy`: Image pull policy. One of Always, Never and IfNotPresent. Defaults to Always if `:latest` tag is specified, or IfNotPresent otherwise. - * `command`: command for the containers that run in the Pod. - * `args`: arguments to submit the spark application in the Pod. See more details of the `spark-submit` script in __[Section 6.2.1](#621-k8s-client)__. - * `securityContext`: SecurityContext defines the security options the container should be run with. - * `env`: List of environment variables to set in the Container, which will be used when submitting the application. - * `env.name`: Name of the environment variable. - * `env.value`: Value of the environment variable. 
- * `resources`: Allocate resources in the cluster to each pod. - * `resource.limits`: Limits describes the maximum amount of compute resources allowed. - * `resource.requests`: Requests describes the minimum amount of compute resources required. - * `volumeMounts`: Declare where to mount volumes into containers. - * `name`: Match with the Name of a Volume. - * `mountPath`: Path within the Container at which the volume should be mounted. - * `subPath`: Path within the volume from which the Container's volume should be mounted. - * `volume`: specify the volumes to provide for the Pod. - * `persistentVolumeClaim`: mount a PersistentVolume into a Pod - -Create a Pod and run Fashion-MNIST application based on the YAML file. -```bash -kubectl apply -f integrate_image_client.yaml -``` - -List all pods to find the driver pod, which will be named as `orca-integrate-job-xxx`. -```bash -# find out driver pod -kubectl get pods -``` - -View logs from the driver pod to retrive the training stats. -```bash -# retrive training logs -kubectl logs `orca-integrate-job-xxx` -``` - -After the task finish, you could delete the job as the command below. -```bash -kubectl delete job orca-integrate-job -``` - -#### 6.4.2 K8s Cluster -BigDL has provided an example YAML file (see __[integrate_image_cluster.yaml](../../../../../../python/orca/tutorial/pytorch/docker/integrate_image_cluster.yaml)__, which describes a deployment that runs the `intelanalytics/bigdl-k8s:orca-2.1.0` image) to run the tutorial FashionMNIST program on k8s-cluster mode: - -__Notes:__ -* Please call `init_orca_context` at very begining part of each Orca program. - ```python - from bigdl.orca import init_orca_context - - init_orca_context(cluster_mode="spark-submit") - ``` - -```bash -# integrate_image_cluster.yaml -apiVersion: batch/v1 -kind: Job -metadata: - name: orca-integrate-job -spec: - template: - spec: - serviceAccountName: spark - restartPolicy: Never - hostNetwork: true - containers: - - name: spark-k8s-cluster - image: intelanalytics/bigdl-spark-3.1.2:orca-2.1.0 - imagePullPolicy: IfNotPresent - command: ["/bin/sh","-c"] - args: [" - ${SPARK_HOME}/bin/spark-submit \ - --master ${RUNTIME_SPARK_MASTER} \ - --deploy-mode ${SPARK_MODE} \ - --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \ - --name orca-integrate-pod \ - --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \ - --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \ - --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata \ - --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \ - --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl/nfsdata \ - --conf spark.kubernetes.file.upload.path=/bigdl/nfsdata/ \ - --executor-cores 10 \ - --executor-memory 50g \ - --num-executors 4 \ - --total-executor-cores 40 \ - --driver-cores 10 \ - --driver-memory 50g \ - --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \ - --py-files local://${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,local:///bigdl/nfsdata/train.py,local:///bigdl/nfsdata/model.py \ - --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \ - --conf spark.sql.catalogImplementation='in-memory' \ - --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \ - --conf 
spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \ - local:///bigdl/nfsdata/train.py - --cluster_mode spark-submit - --data_dir file:///bigdl/nfsdata/dataset - "] - securityContext: - privileged: true - env: - - name: RUNTIME_K8S_SPARK_IMAGE - value: intelanalytics/bigdl-spark-3.1.2:orca-2.1.0 - - name: RUNTIME_SPARK_MASTER - value: k8s://https://: - name: SPARK_MODE value: cluster - - name: RUNTIME_K8S_SERVICE_ACCOUNT - value: spark - - name: BIGDL_HOME - value: /opt/bigdl-2.1.0 + - name: SPARK_VERSION + value: 3.1.3 - name: SPARK_HOME value: /opt/spark - - name: SPARK_VERSION - value: 3.1.2 - name: BIGDL_VERSION - value: 2.1.0 - resources: - requests: - cpu: 1 - limits: - cpu: 4 + value: 2.2.0-SNAPSHOT + - name: BIGDL_HOME + value: /opt/bigdl-2.2.0-SNAPSHOT volumeMounts: - name: nfs-storage mountPath: /bigdl/nfsdata @@ -1036,49 +713,30 @@ spec: claimName: nfsvolumeclaim ``` -In the YAML file: -* `restartPolicy`: Restart policy for all Containers within the pod. One of Always, OnFailure, Never. Default to Always. -* `containers`: A single application Container that you want to run within a pod. - * `name`: Name of the Container, each Container in a pod must have a unique name. - * `image`: Name of the Container image. - * `imagePullPolicy`: Image pull policy. One of Always, Never and IfNotPresent. Defaults to Always if `:latest` tag is specified, or IfNotPresent otherwise. - * `command`: command for the containers that run in the Pod. - * `args`: arguments to submit the spark application in the Pod. See more details of the `spark-submit` script in __[Section 6.2.2](#622-k8s-cluster)__. - * `securityContext`: SecurityContext defines the security options the container should be run with. - * `env`: List of environment variables to set in the Container, which will be used when submitting the application. - * `env.name`: Name of the environment variable. - * `env.value`: Value of the environment variable. - * `resources`: Allocate resources in the cluster to each pod. - * `resource.limits`: Limits describes the maximum amount of compute resources allowed. - * `resource.requests`: Requests describes the minimum amount of compute resources required. - * `volumeMounts`: Declare where to mount volumes into containers. - * `name`: Match with the Name of a Volume. - * `mountPath`: Path within the Container at which the volume should be mounted. - * `subPath`: Path within the volume from which the Container's volume should be mounted. - * `volume`: specify the volumes to provide for the Pod. - * `persistentVolumeClaim`: mount a PersistentVolume into a Pod - -Create a Pod and run Fashion-MNIST application based on the YAML file. +Submit the application using `kubectl`: ```bash -kubectl apply -f integrate_image_cluster.yaml +kubectl apply -f orca-tutorial-k8s-cluster.yaml ``` -List all pods to find the driver pod (since the client pod only returns training status), which will be named as `orca-integrate-job-driver`. +Note that you need to delete the job before re-submitting another one: ```bash -# checkout training status -kubectl logs `orca-integrate-job-xxx` +kubectl delete job orca-pytorch-job +``` -# find out driver pod +After submitting the job, you can list all the pods and find the driver pod with name `orca-k8s-cluster-tutorial-xxx-driver`. +```bash kubectl get pods +kubectl get pods | grep orca-k8s-cluster-tutorial +# Then find the pod of the driver: orca-k8s-cluster-tutorial-xxx-driver ``` -View logs from the driver pod to retrive the training stats. 
+Retrieve the logs on the driver pod: ```bash -# retrive training logs -kubectl logs `orca-integrate-job-driver` +kubectl logs orca-k8s-cluster-tutorial-xxx-driver ``` -After the task finish, you could delete the job as the command below. +After the task finishes, delete the job and all related pods if necessary: ```bash -kubectl delete job orca-integrate-job +kubectl delete job orca-pytorch-job +kubectl delete pod orca-k8s-cluster-tutorial-xxx-driver ``` diff --git a/docs/readthedocs/source/doc/Orca/Tutorial/yarn.md b/docs/readthedocs/source/doc/Orca/Tutorial/yarn.md index 20a63fc4..7bf1746f 100644 --- a/docs/readthedocs/source/doc/Orca/Tutorial/yarn.md +++ b/docs/readthedocs/source/doc/Orca/Tutorial/yarn.md @@ -251,7 +251,7 @@ Pack the current activate conda environment to an archive on the __Client Node__ conda pack -o environment.tar.gz ``` -Some runtime configurations for Spark are as follows: +Some runtime configurations for `bigdl-submit` are as follows: * `--master`: the spark master, set it to "yarn". * `--num_executors`: the number of executors. @@ -282,7 +282,7 @@ bigdl-submit \ In the `bigdl-submit` script: * `--deploy-mode`: set it to `client` when running programs on yarn-client mode. * `--conf spark.pyspark.driver.python`: set the activate Python location on __Client Node__ as the driver's Python environment. -* `--conf spark.pyspark.python`: set the Python location in conda archive as each executor's Python environment. +* `--conf spark.pyspark.python`: set the Python location in the conda archive as each executor's Python environment. #### 5.2.2 Yarn Cluster @@ -304,42 +304,42 @@ bigdl-submit \ ``` In the `bigdl-submit` script: * `--deploy-mode`: set it to `cluster` when running programs on yarn-cluster mode. -* `--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON`: set the Python location in conda archive as the Python environment of the Application Master. -* `--conf spark.executorEnv.PYSPARK_PYTHON`: also set the Python location in conda archive as each executor's Python environment. The Application Master and the executors will all use the archive for the Python environment. +* `--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON`: set the Python location in the conda archive as the Python environment of the Application Master. +* `--conf spark.executorEnv.PYSPARK_PYTHON`: also set the Python location in the conda archive as each executor's Python environment. The Application Master and the executors will all use the archive for the Python environment. ### 5.3 Use `spark-submit` If you prefer to use `spark-submit` instead of `bigdl-submit`, please follow the steps below to prepare the environment on the __Client Node__. -1. Set the cluster_mode to "spark-submit" in `init_orca_context`. - ```python - sc = init_orca_context(cluster_mode="spark-submit") - ``` - -2. Download the requirement file(s) from [here](https://github.com/intel-analytics/BigDL/tree/main/python/requirements/orca) and install the required Python libraries of BigDL Orca according to your needs. +1. Download the requirement file(s) from [here](https://github.com/intel-analytics/BigDL/tree/main/python/requirements/orca) and install the required Python libraries of BigDL Orca according to your needs. ```bash pip install -r /path/to/requirements.txt ``` Note that you are recommended **NOT** to install BigDL Orca with pip install command in the conda environment if you use spark-submit to avoid possible conflicts. -3. Pack the current activate conda environment to an archive before submitting the example: +2. 
Pack the current activate conda environment to an archive: ```bash conda pack -o environment.tar.gz ``` -4. Download the BigDL assembly package from [here](../Overview/install.html#download-bigdl-orca) and unzip it. Then setup the environment variables `${BIGDL_HOME}` and `${BIGDL_VERSION}`. +3. Download the BigDL assembly package from [here](../Overview/install.html#download-bigdl-orca) and unzip it. Then setup the environment variables `${BIGDL_HOME}` and `${BIGDL_VERSION}`. ```bash export BIGDL_VERSION="downloaded BigDL version" export BIGDL_HOME=/path/to/unzipped_BigDL # the folder path where you extract the BigDL package ``` -5. Download and extract [Spark](https://archive.apache.org/dist/spark/). BigDL is currently released for [Spark 2.4](https://archive.apache.org/dist/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz) and [Spark 3.1](https://archive.apache.org/dist/spark/spark-3.1.3/spark-3.1.3-bin-hadoop2.7.tgz). Make sure the version of your downloaded Spark matches the one that your downloaded BigDL is released with. Then setup the environment variables `${SPARK_HOME}` and `${SPARK_VERSION}`. +4. Download and extract [Spark](https://archive.apache.org/dist/spark/). BigDL is currently released for [Spark 2.4](https://archive.apache.org/dist/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz) and [Spark 3.1](https://archive.apache.org/dist/spark/spark-3.1.3/spark-3.1.3-bin-hadoop2.7.tgz). Make sure the version of your downloaded Spark matches the one that your downloaded BigDL is released with. Then setup the environment variables `${SPARK_HOME}` and `${SPARK_VERSION}`. ```bash export SPARK_VERSION="downloaded Spark version" export SPARK_HOME=/path/to/uncompressed_spark # the folder path where you extract the Spark package ``` -Some runtime configurations for Spark are as follows: +5. Set the cluster_mode to "spark-submit" in `init_orca_context`: + ```python + sc = init_orca_context(cluster_mode="spark-submit") + ``` + +Some runtime configurations for `spark-submit` are as follows: * `--master`: the spark master, set it to "yarn". * `--num_executors`: the number of executors. @@ -374,7 +374,7 @@ ${SPARK_HOME}/bin/spark-submit \ In the `spark-submit` script: * `--deploy-mode`: set it to `client` when running programs on yarn-client mode. * `--conf spark.pyspark.driver.python`: set the activate Python location on __Client Node__ as the driver's Python environment. -* `--conf spark.pyspark.python`: set the Python location in conda archive as each executor's Python environment. +* `--conf spark.pyspark.python`: set the Python location in the conda archive as each executor's Python environment. #### 5.3.2 Yarn Cluster Submit and run the program for `yarn-cluster` mode following the `spark-submit` script below: @@ -397,5 +397,5 @@ ${SPARK_HOME}/bin/spark-submit \ ``` In the `spark-submit` script: * `--deploy-mode`: set it to `cluster` when running programs on yarn-cluster mode. -* `--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON`: set the Python location in conda archive as the Python environment of the Application Master. -* `--conf spark.executorEnv.PYSPARK_PYTHON`: also set the Python location in conda archive as each executor's Python environment. The Application Master and the executors will all use the archive for the Python environment. +* `--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON`: set the Python location in the conda archive as the Python environment of the Application Master. 
+* `--conf spark.executorEnv.PYSPARK_PYTHON`: also set the Python location in the conda archive as each executor's Python environment. The Application Master and the executors will all use the archive for the Python environment.
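
In yarn-cluster mode the driver runs inside the Application Master, so its output does not appear in the local console. A common way to check the result, assuming the Hadoop `yarn` CLI is available on the __Client Node__, is to fetch the aggregated application logs:
```bash
# List recent applications to find your application ID
yarn application -list -appStates ALL

# Fetch the aggregated logs (including the driver output) of the application
yarn logs -applicationId <application_id>
```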