From cdd1f8421efe052a7e82d3bdf812a2f7c5966f54 Mon Sep 17 00:00:00 2001 From: Kai Huang Date: Mon, 21 Nov 2022 19:01:30 +0800 Subject: [PATCH] Update k8s tutorial (part1) (#6696) * update 1.1 * update 3 * update 1.2 1.3 --- .../source/doc/Orca/Tutorial/k8s.md | 161 ++++++------------ .../source/doc/Orca/Tutorial/yarn.md | 16 +- 2 files changed, 66 insertions(+), 111 deletions(-) diff --git a/docs/readthedocs/source/doc/Orca/Tutorial/k8s.md b/docs/readthedocs/source/doc/Orca/Tutorial/k8s.md index c1d222e6..6ca32661 100644 --- a/docs/readthedocs/source/doc/Orca/Tutorial/k8s.md +++ b/docs/readthedocs/source/doc/Orca/Tutorial/k8s.md @@ -1,10 +1,12 @@ # Run on Kubernetes Clusters -This tutorial provides a step-by-step guide on how to run BigDL-Orca programs on Kubernetes (K8s) clusters, using a [PyTorch Fashin-MNIST program](https://github.com/intel-analytics/BigDL/tree/main/docs/docs/tutorials/tutorial_example/Fashion_MNIST/) as a working example. +This tutorial provides a step-by-step guide on how to run BigDL-Orca programs on Kubernetes (K8s) clusters, using a [PyTorch Fashion-MNIST program](https://github.com/intel-analytics/BigDL/tree/main/python/orca/tutorial/pytorch/FashionMNIST) as a working example. +The **Client Container** that appears in this tutorial refers to the Docker container where you launch or submit your applications. -# 1. Key Concepts -## 1.1 Init_orca_context +--- +## 1. Basic Concepts +### 1.1 init_orca_context A BigDL Orca program usually starts with the initialization of OrcaContext. 
For every BigDL Orca program, you should call `init_orca_context` at the beginning of the program as below: ```python @@ -16,34 +18,34 @@ init_orca_context(cluster_mode, master, container_image, ``` In `init_orca_context`, you may specify necessary runtime configurations for running the example on K8s, including: -* `cluster_mode`: a String that specifies the underlying cluster; valid value includes `"local"`, `"yarn-client"`, `"yarn-cluster"`, __`"k8s-client"`__, __`"k8s-cluster"`__, `"bigdl-submit"`, `"spark-submit"`, etc. -* `master`: a URL format to specify the master address of K8s cluster. -* `container_image`: a String that specifies the name of docker container image for executors. -* `cores`: an Integer that specifies the number of cores for each executor (default to be `2`). -* `memory`: a String that specifies the memory for each executor (default to be `"2g"`). -* `num_nodes`: an Integer that specifies the number of executors (default to be `1`). -* `driver_cores`: an Integer that specifies the number of cores for the driver node (default to be `4`). -* `driver_memory`: a String that specifies the memory for the driver node (default to be `"1g"`). -* `extra_python_lib`: a String that specifies the path to extra Python packages, one of `.py`, `.zip` or `.egg` files (default to be `None`). -* `penv_archive`: a String that specifies the path to a packed Conda archive (default to be `None`). -* `jars`: a String that specifies the path to needed jars files (default to be `None`). -* `conf`: a Key-Value format to append extra conf for Spark (default to be `None`). +* `cluster_mode`: one of `"k8s-client"`, `"k8s-cluster"` or `"spark-submit"` when you run on K8s clusters. +* `master`: a URL format to specify the master address of the K8s cluster. +* `container_image`: a string that specifies the name of docker container image for executors. +* `cores`: an integer that specifies the number of cores for each executor (default to be `2`). 
+* `memory`: a string that specifies the memory for each executor (default to be `"2g"`). +* `num_nodes`: an integer that specifies the number of executors (default to be `1`). +* `driver_cores`: an integer that specifies the number of cores for the driver node (default to be `4`). +* `driver_memory`: a string that specifies the memory for the driver node (default to be `"1g"`). +* `extra_python_lib`: a string that specifies the path to extra Python packages, separated by commas (default to be `None`). `.py`, `.zip` or `.egg` files are supported. +* `penv_archive`: a string that specifies the path to a packed conda archive (default to be `None`). +* `jars`: a string that specifies the path to extra jar files, separated by commas (default to be `None`). +* `conf`: a dictionary to append extra conf for Spark (default to be `None`). __Note__: -* All arguments __except__ `cluster_mode` will be ignored when using `spark-submit` and Kubernetes deployment to submit and run Orca programs. +* All arguments __except__ `cluster_mode` will be ignored when using [`spark-submit`](#use-spark-submit) and [Kubernetes deployment](#use-kubernetes-deployment-with-conda-archive) to submit and run Orca programs, in which case you are supposed to specify these configurations via the submit command or the YAML file. -After the Orca programs finish, you should call `stop_orca_context` at the end of the program to release resources and shutdown the underlying distributed runtime engine (such as Spark or Ray). +After Orca programs finish, you should always call `stop_orca_context` at the end of the program to release resources and shut down the underlying distributed runtime engine (such as Spark or Ray). ```python from bigdl.orca import stop_orca_context stop_orca_context() ``` -For more details, please see [OrcaContext](https://bigdl.readthedocs.io/en/latest/doc/Orca/Overview/orca-context.html). +For more details, please see [OrcaContext](../Overview/orca-context.md). 
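To make the configurations above concrete, the snippet below sketches how the arguments for `k8s-client` mode can be assembled, with `conf` carrying the standard Spark-on-K8s volume properties for an NFS PersistentVolumeClaim. This is an illustrative sketch, not taken from the tutorial: the master URL, claim name and mount path are placeholder values you would replace with your own.

```python
# Sketch: assembling init_orca_context arguments for k8s-client mode.
# All concrete values below (master URL, claim name, mount path) are
# placeholders -- substitute your own cluster's values.
nfs_claim = "nfsvolumeclaim"   # hypothetical PersistentVolumeClaim name
mount_path = "/bigdl/nfsdata"

conf = {
    # Mount the NFS claim into every executor pod ...
    f"spark.kubernetes.executor.volumes.persistentVolumeClaim.{nfs_claim}.options.claimName": nfs_claim,
    f"spark.kubernetes.executor.volumes.persistentVolumeClaim.{nfs_claim}.mount.path": mount_path,
    # ... and into the driver pod as well.
    f"spark.kubernetes.driver.volumes.persistentVolumeClaim.{nfs_claim}.options.claimName": nfs_claim,
    f"spark.kubernetes.driver.volumes.persistentVolumeClaim.{nfs_claim}.mount.path": mount_path,
}

init_kwargs = dict(
    cluster_mode="k8s-client",
    master="k8s://https://127.0.0.1:8443",   # placeholder API server address
    container_image="intelanalytics/bigdl-k8s:2.1.0",
    cores=2,
    memory="2g",
    num_nodes=2,
    conf=conf,
)

# With bigdl-orca installed you would then run:
# from bigdl.orca import init_orca_context
# sc = init_orca_context(**init_kwargs)
print(sorted(init_kwargs))
```

The `spark.kubernetes.{driver,executor}.volumes.persistentVolumeClaim.*` keys are regular Spark-on-K8s properties, so the same dictionary also translates directly into `--conf` options for `spark-submit`.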
-## 1.2 K8s Client&Cluster -The difference between k8s-client and k8s-cluster is where you run your Spark driver. +### 1.2 K8s Client & Cluster +The difference between k8s-client mode and k8s-cluster mode is where you run your Spark driver. For k8s-client, the Spark driver runs in the client process (outside the K8s cluster), while for k8s-cluster the Spark driver runs inside the K8s cluster. @@ -51,22 +53,19 @@ Please see more details in [K8s-Cluster](https://spark.apache.org/docs/latest/ru -## 1.3 Load Data from Network File Systems (NFS) -When you are running programs on K8s, please load data from volumes and we use NFS in this tutorial as an example. +### 1.3 Load Data from Volumes +When you are running programs on K8s, please load data from [Volumes](https://kubernetes.io/docs/concepts/storage/volumes/) accessible to all K8s pods. We use a Network File System (NFS) in this tutorial as an example. -After mounting the volume (NFS) into BigDL container (see __[Section 2.2](#22-create-a-k8s-client-container)__), the Fashion-MNIST example could load data from NFS. +After mounting the Volume (NFS) into the BigDL container (see __[Section 2.2](#create-a-k8s-client-container)__), the Fashion-MNIST example can load data from NFS as local storage. ```python import torch import torchvision import torchvision.transforms as transforms -from bigdl.orca.data.file import get_remote_file_to_local def train_data_creator(config, batch_size): transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))]) - - get_remote_file_to_local(remote_path="/path/to/nfsdata", local_path="/tmp/dataset") trainset = torchvision.datasets.FashionMNIST(root="/bigdl/nfsdata/dataset", train=True, download=False, transform=transform) @@ -77,9 +76,9 @@ def train_data_creator(config, batch_size): ``` - -# 2. Create & Lunch BigDL K8s Container -## 2.1 Pull Docker Image +--- +## 2. 
Create & Launch BigDL K8s Container +### 2.1 Pull Docker Image Please pull the BigDL 2.1.0 `bigdl-k8s` image from [Docker Hub](https://hub.docker.com/r/intelanalytics/bigdl-k8s/tags) as follows: ```bash sudo docker pull intelanalytics/bigdl-k8s:2.1.0 @@ -93,7 +92,7 @@ __Note:__ * The 2.1.0 and latest BigDL image is built on top of Spark 3.1.2. -## 2.2 Create a K8s Client Container +### 2.2 Create a K8s Client Container Please launch the __Client Container__ following the script below: ```bash sudo docker run -itd --net=host \ @@ -162,80 +161,31 @@ In the script: * `RUNTIME_DRIVER_MEMORY`: a String that specifies the memory for the driver node; -## 2.3 Launch the K8s Client Container -Once the container is created, docker image would return a `containerID`, please launch the container following the command below: +### 2.3 Enter the K8s Client Container +Once the container is created, a `containerID` will be returned, with which you can enter the container using the command below: ```bash sudo docker exec -it <containerID> bash ``` +--- +## 3. Prepare Environment +In the launched BigDL K8s **Client Container**, please set up the environment following the steps below: -# 3. Prepare Environment -In the launched BigDL K8s Container, please setup environment following steps below: -## 3.1 Install Python Libraries -### 3.1.1 Install Conda -Please use conda to prepare the Python environment on the __Client Container__ (where you submit applications), you could download and install Conda following [Conda User Guide](https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html) or executing the command as below. +- See [here](../Overview/install.md#install-anaconda) to install conda and prepare the Python environment. -```bash -# Download Anaconda installation script -wget -P /tmp https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh +- See [here](../Overview/install.md#install-bigdl-orca) to install BigDL Orca in the created conda environment. 
-# Execute the script to install conda -bash /tmp/Anaconda3-2020.02-Linux-x86_64.sh - -# Please type this command in your terminal to activate Conda environment -source ~/.bashrc -``` - -### 3.1.2 Use Conda to Install BigDL and Other Python Libraries -Create a Conda environment, install BigDL and all needed Python libraries in the activate Conda: -```bash -# "env" is conda environment name, you can use any name you like. -# Please change Python version to 3.8 if you need a Python 3.8 environment. -conda create -n env python=3.7 -conda activate env -``` - -Please install the 2.1.0 release version of BigDL (built on top of Spark 3.1.2) as follows: -```bash -pip install bigdl-spark3 -``` - -When you are running in the latest BigDL image, please install the nightly build of BigDL as follows: -```bash -pip install --pre --upgrade bigdl-spark3 -``` - -Please install torch and torchvision to run the Fashion-MNIST example: +- You should install all the other Python libraries that you need in your program in the conda environment as well. `torch` and `torchvision` are needed to run the Fashion-MNIST example: ```bash pip install torch torchvision ``` -__Notes:__ -* Using Conda to install BigDL will automatically install libraries including `pyspark==3.1.2`, and etc. -* You can install BigDL Orca built on top of Spark 3.1.2 as follows: - ```bash - # Install the latest release version - pip install bigdl-orca-spark3 - - # Install the latest nightly build version - pip install --pre --upgrade bigdl-spark3-orca - - # You need to install torch and torchvision manually - pip install torch torchvision - ``` - - Installing bigdl-orca-spark3 will automatically install `pyspark==3.1.2`. - -* You also need to install any additional python libraries that your application depends on in the Conda environment. 
- -* It's only need for you to install needed Python libraries using Conda, since the BigDL K8s container has already setup `JAVA_HOME`, `BIGDL_HOME`, `SPARK_HOME`, `SPARK_VERSION`, etc. -Please see more details in [Python User Guide](https://bigdl.readthedocs.io/en/latest/doc/UserGuide/python.html). +- For more details, please see [Python User Guide](https://bigdl.readthedocs.io/en/latest/doc/UserGuide/python.html). - -# 4. Prepare Dataset +--- +## 4. Prepare Dataset To run the example provided by this tutorial on K8s, you should upload the dataset to a K8s volume (e.g. NFS). Please download the Fashion-MNIST dataset manually on your __Develop Node__ (where you launch the container image). @@ -254,7 +204,8 @@ gzip -dk /bigdl/nfsdata/dataset/FashionMNIST/raw/* __Note:__ PyTorch requires the dataset directory where `FashionMNIST/raw/train-images-idx3-ubyte` and `FashionMNIST/raw/t10k-images-idx3-ubyte` exist. -# 5. Prepare Custom Modules +--- +## 5. Prepare Custom Modules Spark allows uploading Python files (`.py`) and zipped Python packages (`.zip`) to the executors by setting the `--py-files` option in Spark scripts or `extra_python_lib` in `init_orca_context`. The FashionMNIST example needs to import modules from `model.py`. @@ -308,16 +259,16 @@ __Notes:__ ``` - -# 6. Run Jobs on K8s +--- +## 6. Run Jobs on K8s In the following part, we will show you how to submit and run the Orca example on K8s: * Use `python` command * Use `spark-submit` script * Use Kubernetes Deployment (with Conda Archive) * Use Kubernetes Deployment (with Integrated Image) -## 6.1 Use `python` command -### 6.1.1 K8s-Client +### 6.1 Use `python` command +#### 6.1.1 K8s-Client Before running the example in `k8s-client` mode, you should: * On the __Client Container__: 1. Please call `init_orca_context` at the very beginning of each Orca program. @@ -350,7 +301,7 @@ In the script: * `remote_dir`: directory on NFS for loading the dataset. 
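The module-packaging step described in Section 5 (zipping custom Python packages so `--py-files` or `extra_python_lib` can ship them to the executors) can be sketched with Python's stdlib `zipfile`. This is an illustrative sketch: the `mypkg` package and all paths below are hypothetical, created in a throwaway temp directory; in the tutorial you would zip the directory that contains `model.py` instead.

```python
import os
import tempfile
import zipfile

def zip_package(package_dir, zip_path):
    """Zip a Python package directory so Spark can ship it to executors."""
    base = os.path.dirname(os.path.abspath(package_dir))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _, files in os.walk(package_dir):
            for name in files:
                full = os.path.join(root, name)
                # Store entries relative to the package's parent directory
                # so that `import mypkg.model` works on the executors.
                arcname = os.path.relpath(full, base).replace(os.sep, "/")
                zf.write(full, arcname)
    return zip_path

# Demo with a throwaway package in a temp directory.
tmp = tempfile.mkdtemp()
pkg = os.path.join(tmp, "mypkg")
os.makedirs(pkg)
with open(os.path.join(pkg, "model.py"), "w") as f:
    f.write("def build_model():\n    return 'model'\n")

archive = zip_package(pkg, os.path.join(tmp, "mypkg.zip"))
with zipfile.ZipFile(archive) as zf:
    names = zf.namelist()
print(names)
```

The resulting archive path can then be passed as `extra_python_lib="/path/to/mypkg.zip"` in `init_orca_context`, or as `--py-files /path/to/mypkg.zip` in a `spark-submit` command.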
-### 6.1.2 K8s-Cluster +#### 6.1.2 K8s-Cluster Before running the example in `k8s-cluster` mode, you should: * On the __Client Container__: 1. Please call `init_orca_context` at the very beginning of each Orca program. @@ -421,8 +372,8 @@ Please retrieve training stats on the __Develop Node__ following the command bel ``` -## 6.2 Use `spark-submit` Script -### 6.2.1 K8s Client +### 6.2 Use `spark-submit` +#### 6.2.1 K8s Client Before submitting the example in `k8s-client` mode, you should: * On the __Client Container__: 1. Please call `init_orca_context` at the very beginning of each Orca program. @@ -484,7 +435,7 @@ In the script: * `remote_dir`: directory on NFS for loading the dataset. -### 6.2.2 K8s Cluster +#### 6.2.2 K8s Cluster Before submitting the example in `k8s-cluster` mode, you should: * On the __Client Container__: 1. Please call `init_orca_context` at the very beginning of each Orca program. @@ -577,7 +528,7 @@ Please retrieve training stats on the __Develop Node__ following the commands be ``` -## 6.3 Use Kubernetes Deployment (with Conda Archive) +### 6.3 Use Kubernetes Deployment (with Conda Archive) BigDL allows users who want to execute programs directly on the __Develop Node__ to run an application by creating a Kubernetes Deployment object. Before submitting the Orca application, you should: @@ -598,7 +549,7 @@ Before submitting the Orca application, you should: cp /path/to/model.py /bigdl/nfsdata ``` -### 6.3.1 K8s Client +#### 6.3.1 K8s Client BigDL has provided an example YAML file (see __[orca-tutorial-client.yaml](../../../../../../python/orca/tutorial/pytorch/docker/orca-tutorial-client.yaml)__, which describes a Deployment that runs the `intelanalytics/bigdl-k8s:2.1.0` image) to run the tutorial FashionMNIST program in k8s-client mode: __Notes:__ @@ -750,7 +701,7 @@ After the task finishes, you can delete the job with the command below. 
kubectl delete job orca-pytorch-job ``` -### 6.3.2 K8s Cluster +#### 6.3.2 K8s Cluster BigDL has provided an example YAML file (see __[orca-tutorial-cluster.yaml](../../../../../../python/orca/tutorial/pytorch/docker/orca-tutorial-cluster.yaml)__, which describes a Deployment that runs the `intelanalytics/bigdl-k8s:2.1.0` image) to run the tutorial FashionMNIST program in k8s-cluster mode: __Notes:__ @@ -894,7 +845,7 @@ kubectl delete job orca-pytorch-job ``` -## 6.4 Use Kubernetes Deployment (without Integrared Image) +### 6.4 Use Kubernetes Deployment (with Integrated Image) BigDL also allows users to skip preparing the environment by providing a container image (`intelanalytics/bigdl-k8s:orca-2.1.0`) that has all required BigDL environments integrated. __Notes:__ @@ -917,7 +868,7 @@ Before submitting the example application, you should: cp /path/to/model.py /bigdl/nfsdata ``` -### 6.4.1 K8s Client +#### 6.4.1 K8s Client BigDL has provided an example YAML file (see __[integrated_image_client.yaml](../../../../../../python/orca/tutorial/pytorch/docker/integrate_image_client.yaml)__, which describes a deployment that runs the `intelanalytics/bigdl-k8s:orca-2.1.0` image) to run the tutorial FashionMNIST program in k8s-client mode: __Notes:__ @@ -1065,7 +1016,7 @@ After the task finishes, you can delete the job with the command below. 
kubectl delete job orca-integrate-job ``` -### 6.4.2 K8s Cluster +#### 6.4.2 K8s Cluster BigDL has provided an example YAML file (see __[integrate_image_cluster.yaml](../../../../../../python/orca/tutorial/pytorch/docker/integrate_image_cluster.yaml)__, which describes a deployment that runs the `intelanalytics/bigdl-k8s:orca-2.1.0` image) to run the tutorial FashionMNIST program in k8s-cluster mode: __Notes:__ diff --git a/docs/readthedocs/source/doc/Orca/Tutorial/yarn.md b/docs/readthedocs/source/doc/Orca/Tutorial/yarn.md index 986b2256..b072a387 100644 --- a/docs/readthedocs/source/doc/Orca/Tutorial/yarn.md +++ b/docs/readthedocs/source/doc/Orca/Tutorial/yarn.md @@ -12,7 +12,8 @@ A BigDL Orca program usually starts with the initialization of OrcaContext. For ```python from bigdl.orca import init_orca_context -sc = init_orca_context(cluster_mode, cores, memory, num_nodes, driver_cores, driver_memory, extra_python_lib, conf) +sc = init_orca_context(cluster_mode, cores, memory, num_nodes, + driver_cores, driver_memory, extra_python_lib, conf) ``` In `init_orca_context`, you may specify necessary runtime configurations for running the example on YARN, including: @@ -22,7 +23,7 @@ In `init_orca_context`, you may specify necessary runtime configurations for run * `num_nodes`: an integer that specifies the number of executors (default to be `1`). * `driver_cores`: an integer that specifies the number of cores for the driver node (default to be `4`). * `driver_memory`: a string that specifies the memory for the driver node (default to be `"1g"`). -* `extra_python_lib`: a string that specifies the path to extra Python packages (default to be `None`). `.py`, `.zip` or `.egg` files are supported. +* `extra_python_lib`: a string that specifies the path to extra Python packages, separated by commas (default to be `None`). `.py`, `.zip` or `.egg` files are supported. * `conf`: a dictionary to append extra conf for Spark (default to be `None`). 
__Note__: @@ -48,19 +49,19 @@ For more details, please see [Launching Spark on YARN](https://spark.apache.org/ __Note__: * When you run programs on YARN, you are highly recommended to load/write data from/to a distributed storage (e.g. [HDFS](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html) or [S3](https://aws.amazon.com/s3/)) instead of the local file system. -The Fashion-MNIST example in this tutorial uses a utility function `get_remote_file_to_local` provided by BigDL to download datasets and create the PyTorch DataLoader on each executor. +The Fashion-MNIST example in this tutorial uses a utility function `get_remote_dir_to_local` provided by BigDL to download datasets and create the PyTorch DataLoader on each executor. ```python import torch import torchvision import torchvision.transforms as transforms -from bigdl.orca.data.file import get_remote_file_to_local +from bigdl.orca.data.file import get_remote_dir_to_local def train_data_creator(config, batch_size): transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))]) - get_remote_file_to_local(remote_path="hdfs://path/to/dataset", local_path="/tmp/dataset") + get_remote_dir_to_local(remote_path="hdfs://path/to/dataset", local_path="/tmp/dataset") trainset = torchvision.datasets.FashionMNIST(root="/tmp/dataset", train=True, download=False, transform=transform) @@ -88,7 +89,10 @@ export HADOOP_CONF_DIR=/path/to/hadoop/conf - See [here](../Overview/install.md#install-bigdl-orca) to install BigDL Orca in the created conda environment. -- You should install all the other Python libraries that you need in your program in the conda environment as well. +- You should install all the other Python libraries that you need in your program in the conda environment as well. 
`torch` and `torchvision` are needed to run the Fashion-MNIST example: +```bash +pip install torch torchvision +``` - For more details, please see [Python User Guide](https://bigdl.readthedocs.io/en/latest/doc/UserGuide/python.html).
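Since both the K8s and YARN data creators load FashionMNIST with `download=False`, a quick sanity check can confirm the dataset directory has the layout torchvision expects before a job is submitted. This is a stdlib-only illustrative sketch: the demo root below is a throwaway temp directory, not a path from the tutorials; on K8s the root would typically be the mounted NFS dataset directory.

```python
import os
import tempfile

# The four raw files torchvision's FashionMNIST reads after decompression.
REQUIRED = [
    "FashionMNIST/raw/train-images-idx3-ubyte",
    "FashionMNIST/raw/train-labels-idx1-ubyte",
    "FashionMNIST/raw/t10k-images-idx3-ubyte",
    "FashionMNIST/raw/t10k-labels-idx1-ubyte",
]

def missing_dataset_files(root):
    """Return the required FashionMNIST raw files missing under `root`."""
    return [p for p in REQUIRED if not os.path.exists(os.path.join(root, p))]

# Demo against a throwaway directory laid out the way torchvision expects.
dataset_root = tempfile.mkdtemp()
raw_dir = os.path.join(dataset_root, "FashionMNIST", "raw")
os.makedirs(raw_dir)
for name in ("train-images-idx3-ubyte", "train-labels-idx1-ubyte",
             "t10k-images-idx3-ubyte", "t10k-labels-idx1-ubyte"):
    open(os.path.join(raw_dir, name), "w").close()

print(missing_dataset_files(dataset_root))
```

Running such a check on the driver (or printing the result from a data creator) fails fast with a clear message, instead of surfacing as a torchvision error deep inside `Estimator.fit`.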