Update k8s tutorial (part1) (#6696)

* update 1.1

* update 3

* update 1.2 1.3
This commit is contained in:
Kai Huang 2022-11-21 19:01:30 +08:00 committed by GitHub
parent 78ee9b23f6
commit cdd1f8421e
2 changed files with 66 additions and 111 deletions

View file

@ -1,10 +1,12 @@
# Run on Kubernetes Clusters
This tutorial provides a step-by-step guide on how to run BigDL-Orca programs on Kubernetes (K8s) clusters, using a [PyTorch Fashin-MNIST program](https://github.com/intel-analytics/BigDL/tree/main/docs/docs/tutorials/tutorial_example/Fashion_MNIST/) as a working example.
This tutorial provides a step-by-step guide on how to run BigDL-Orca programs on Kubernetes (K8s) clusters, using a [PyTorch Fashin-MNIST program](https://github.com/intel-analytics/BigDL/tree/main/python/orca/tutorial/pytorch/FashionMNIST) as a working example.
The **Client Container** that appears in this tutorial refer to the docker container where you launch or submit your applications.
# 1. Key Concepts
## 1.1 Init_orca_context
---
## 1. Basic Concepts
### 1.1 init_orca_context
A BigDL Orca program usually starts with the initialization of OrcaContext. For every BigDL Orca program, you should call `init_orca_context` at the beginning of the program as below:
```python
@ -16,34 +18,34 @@ init_orca_context(cluster_mode, master, container_image,
```
In `init_orca_context`, you may specify necessary runtime configurations for running the example on K8s, including:
* `cluster_mode`: a String that specifies the underlying cluster; valid value includes `"local"`, `"yarn-client"`, `"yarn-cluster"`, __`"k8s-client"`__, __`"k8s-cluster"`__, `"bigdl-submit"`, `"spark-submit"`, etc.
* `master`: a URL format to specify the master address of K8s cluster.
* `container_image`: a String that specifies the name of docker container image for executors.
* `cores`: an Integer that specifies the number of cores for each executor (default to be `2`).
* `memory`: a String that specifies the memory for each executor (default to be `"2g"`).
* `num_nodes`: an Integer that specifies the number of executors (default to be `1`).
* `driver_cores`: an Integer that specifies the number of cores for the driver node (default to be `4`).
* `driver_memory`: a String that specifies the memory for the driver node (default to be `"1g"`).
* `extra_python_lib`: a String that specifies the path to extra Python packages, one of `.py`, `.zip` or `.egg` files (default to be `None`).
* `penv_archive`: a String that specifies the path to a packed Conda archive (default to be `None`).
* `jars`: a String that specifies the path to needed jars files (default to be `None`).
* `conf`: a Key-Value format to append extra conf for Spark (default to be `None`).
* `cluster_mode`: one of `"k8s-client"`, `"k8s-cluster"` or `"spark-submit"` when you run on K8s clusters.
* `master`: a URL format to specify the master address of the K8s cluster.
* `container_image`: a string that specifies the name of docker container image for executors.
* `cores`: an integer that specifies the number of cores for each executor (default to be `2`).
* `memory`: a string that specifies the memory for each executor (default to be `"2g"`).
* `num_nodes`: an integer that specifies the number of executors (default to be `1`).
* `driver_cores`: an integer that specifies the number of cores for the driver node (default to be `4`).
* `driver_memory`: a string that specifies the memory for the driver node (default to be `"1g"`).
* `extra_python_lib`: a string that specifies the path to extra Python packages, separated by comma (default to be `None`). `.py`, `.zip` or `.egg` files are supported.
* `penv_archive`: a string that specifies the path to a packed conda archive (default to be `None`).
* `jars`: a string that specifies the path to extra jars files, separated by comma (default to be `None`).
* `conf`: a dictionary to append extra conf for Spark (default to be `None`).
__Note__:
* All arguments __except__ `cluster_mode` will be ignored when using `spark-submit` and Kubernetes deployment to submit and run Orca programs.
* All arguments __except__ `cluster_mode` will be ignored when using [`spark-submit`](#use-spark-submit) and [Kubernetes deployment](#use-kubernetes-deployment-with-conda-archive) to submit and run Orca programs, in which case you are supposed to specify these configurations via the submit command or the YAML file.
After the Orca programs finish, you should call `stop_orca_context` at the end of the program to release resources and shutdown the underlying distributed runtime engine (such as Spark or Ray).
After Orca programs finish, you should always call `stop_orca_context` at the end of the program to release resources and shutdown the underlying distributed runtime engine (such as Spark or Ray).
```python
from bigdl.orca import stop_orca_context
stop_orca_context()
```
For more details, please see [OrcaContext](https://bigdl.readthedocs.io/en/latest/doc/Orca/Overview/orca-context.html).
For more details, please see [OrcaContext](../Overview/orca-context.md).
## 1.2 K8s Client&Cluster
The difference between k8s-client and k8s-cluster is where you run your Spark driver.
### 1.2 K8s Client&Cluster
The difference between k8s-client mode and k8s-cluster mode is where you run your Spark driver.
For k8s-client, the Spark driver runs in the client process (outside the K8s cluster), while for k8s-cluster the Spark driver runs inside the K8s cluster.
@ -51,22 +53,19 @@ Please see more details in [K8s-Cluster](https://spark.apache.org/docs/latest/ru
## 1.3 Load Data from Network File Systems (NFS)
When you are running programs on K8s, please load data from volumes and we use NFS in this tutorial as an example.
### 1.3 Load Data from Volumes
When you are running programs on K8s, please load data from [Volumes](https://kubernetes.io/docs/concepts/storage/volumes/) accessible to all K8s pods. We use Network File Systems (NFS) in this tutorial as an example.
After mounting the volume (NFS) into BigDL container (see __[Section 2.2](#22-create-a-k8s-client-container)__), the Fashion-MNIST example could load data from NFS.
After mounting the Volume (NFS) into the BigDL container (see __[Section 2.2](#create-a-k8s-client-container)__), the Fashion-MNIST example could load data from NFS as local storage.
```python
import torch
import torchvision
import torchvision.transforms as transforms
from bigdl.orca.data.file import get_remote_file_to_local
def train_data_creator(config, batch_size):
transform = transforms.Compose([transforms.ToTensor(),
transforms.Normalize((0.5,), (0.5,))])
get_remote_file_to_local(remote_path="/path/to/nfsdata", local_path="/tmp/dataset")
trainset = torchvision.datasets.FashionMNIST(root="/bigdl/nfsdata/dataset", train=True,
download=False, transform=transform)
@ -77,9 +76,9 @@ def train_data_creator(config, batch_size):
```
# 2. Create & Lunch BigDL K8s Container
## 2.1 Pull Docker Image
---
## 2. Create & Launch BigDL K8s Container
### 2.1 Pull Docker Image
Please pull the BigDL 2.1.0 `bigdl-k8s` image from [Docker Hub](https://hub.docker.com/r/intelanalytics/bigdl-k8s/tags) as follows:
```bash
sudo docker pull intelanalytics/bigdl-k8s:2.1.0
@ -93,7 +92,7 @@ __Note:__
* The 2.1.0 and latest BigDL image is built on top of Spark 3.1.2.
## 2.2 Create a K8s Client Container
### 2.2 Create a K8s Client Container
Please launch the __Client Container__ following the script below:
```bash
sudo docker run -itd --net=host \
@ -162,80 +161,31 @@ In the script:
* `RUNTIME_DRIVER_MEMORY`: a String that specifies the memory for the driver node;
## 2.3 Launch the K8s Client Container
Once the container is created, docker image would return a `containerID`, please launch the container following the command below:
### 2.3 Enter the K8s Client Container
Once the container is created, a `containerID` would be returned and with which you can enter the container following the command below:
```bash
sudo docker exec -it <containerID> bash
```
---
## 3. Prepare Environment
In the launched BigDL K8s **Client Container**, please setup the environment following the steps below:
# 3. Prepare Environment
In the launched BigDL K8s Container, please setup environment following steps below:
## 3.1 Install Python Libraries
### 3.1.1 Install Conda
Please use conda to prepare the Python environment on the __Client Container__ (where you submit applications), you could download and install Conda following [Conda User Guide](https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html) or executing the command as below.
- See [here](../Overview/install.md#install-anaconda) to install conda and prepare the Python environment.
```bash
# Download Anaconda installation script
wget -P /tmp https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh
- See [here](../Overview/install.md#install-bigdl-orca) to install BigDL Orca in the created conda environment.
# Execute the script to install conda
bash /tmp/Anaconda3-2020.02-Linux-x86_64.sh
# Please type this command in your terminal to activate Conda environment
source ~/.bashrc
```
### 3.1.2 Use Conda to Install BigDL and Other Python Libraries
Create a Conda environment, install BigDL and all needed Python libraries in the activate Conda:
```bash
# "env" is conda environment name, you can use any name you like.
# Please change Python version to 3.8 if you need a Python 3.8 environment.
conda create -n env python=3.7
conda activate env
```
Please install the 2.1.0 release version of BigDL (built on top of Spark 3.1.2) as follows:
```bash
pip install bigdl-spark3
```
When you are running in the latest BigDL image, please install the nightly build of BigDL as follows:
```bash
pip install --pre --upgrade bigdl-spark3
```
Please install torch and torchvision to run the Fashion-MNIST example:
- You should install all the other Python libraries that you need in your program in the conda environment as well. `torch` and `torchvision` are needed to run the Fashion-MNIST example:
```bash
pip install torch torchvision
```
__Notes:__
* Using Conda to install BigDL will automatically install libraries including `pyspark==3.1.2`, and etc.
* You can install BigDL Orca built on top of Spark 3.1.2 as follows:
```bash
# Install the latest release version
pip install bigdl-orca-spark3
# Install the latest nightly build version
pip install --pre --upgrade bigdl-spark3-orca
# You need to install torch and torchvision manually
pip install torch torchvision
```
Installing bigdl-orca-spark3 will automatically install `pyspark==3.1.2`.
* You also need to install any additional python libraries that your application depends on in the Conda environment.
* It's only need for you to install needed Python libraries using Conda, since the BigDL K8s container has already setup `JAVA_HOME`, `BIGDL_HOME`, `SPARK_HOME`, `SPARK_VERSION`, etc.
Please see more details in [Python User Guide](https://bigdl.readthedocs.io/en/latest/doc/UserGuide/python.html).
- For more details, please see [Python User Guide](https://bigdl.readthedocs.io/en/latest/doc/UserGuide/python.html).
# 4. Prepare Dataset
---
## 4. Prepare Dataset
To run the example provided by this tutorial on K8s, you should upload the dataset to to a K8s volumn (e.g. NFS).
Please download the Fashion-MNIST dataset manually on your __Develop Node__ (where you launch the container image).
@ -254,7 +204,8 @@ gzip -dk /bigdl/nfsdata/dataset/FashionMNIST/raw/*
__Note:__ PyTorch requires tge directory of dataset where `FashionMNIST/raw/train-images-idx3-ubyte` and `FashionMNIST/raw/t10k-images-idx3-ubyte` exist.
# 5. Prepare Custom Modules
---
## 5. Prepare Custom Modules
Spark allows to upload Python files(`.py`), and zipped Python packages(`.zip`) to the executors by setting `--py-files` option in Spark scripts or `extra_python_lib` in `init_orca_context`.
The FasionMNIST example needs to import modules from `model.py`.
@ -308,16 +259,16 @@ __Notes:__
```
# 6. Run Jobs on K8s
---
## 6. Run Jobs on K8s
In the following part, we will show you how to submit and run the Orca example on K8s:
* Use `python` command
* Use `spark-submit` script
* Use Kubernetes Deployment (with Conda Archive)
* Use Kubernetes Deployment (with Integrated Image)
## 6.1 Use `python` command
### 6.1.1 K8s-Client
### 6.1 Use `python` command
#### 6.1.1 K8s-Client
Before running the example on `k8s-client` mode, you should:
* On the __Client Container__:
1. Please call `init_orca_context` at very begining part of each Orca program.
@ -350,7 +301,7 @@ In the script:
* `remote_dir`: directory on NFS for loading the dataset.
### 6.1.2 K8s-Cluster
#### 6.1.2 K8s-Cluster
Before running the example on `k8s-cluster` mode, you should:
* On the __Client Container__:
1. Please call `init_orca_context` at very begining part of each Orca program.
@ -421,8 +372,8 @@ Please retreive training stats on the __Develop Node__ following the command bel
```
## 6.2 Use `spark-submit` Script
### 6.2.1 K8s Client
### 6.2 Use `spark-submit`
#### 6.2.1 K8s Client
Before submitting the example on `k8s-client` mode, you should:
* On the __Client Container__:
1. Please call `init_orca_context` at very begining part of each Orca program.
@ -484,7 +435,7 @@ In the script:
* `remote_dir`: directory on NFS for loading the dataset.
### 6.2.2 K8s Cluster
#### 6.2.2 K8s Cluster
Before submitting the example on `k8s-cluster` mode, you should:
* On the __Client Container__:
1. Please call `init_orca_context` at very begining part of each Orca program.
@ -577,7 +528,7 @@ Please retrieve training stats on the __Develop Node__ following the commands be
```
## 6.3 Use Kubernetes Deployment (with Conda Archive)
### 6.3 Use Kubernetes Deployment (with Conda Archive)
BigDL supports users (which want to execute programs directly on __Develop Node__) to run an application by creating a Kubernetes Deployment object.
Before submitting the Orca application, you should:
@ -598,7 +549,7 @@ Before submitting the Orca application, you should:
cp /path/to/model.py /bigdl/nfsdata
```
### 6.3.1 K8s Client
#### 6.3.1 K8s Client
BigDL has provided an example YAML file (see __[orca-tutorial-client.yaml](../../../../../../python/orca/tutorial/pytorch/docker/orca-tutorial-client.yaml)__, which describes a Deployment that runs the `intelanalytics/bigdl-k8s:2.1.0` image) to run the tutorial FashionMNIST program on k8s-client mode:
__Notes:__
@ -750,7 +701,7 @@ After the task finish, you could delete the job as the command below.
kubectl delete job orca-pytorch-job
```
### 6.3.2 K8s Cluster
#### 6.3.2 K8s Cluster
BigDL has provided an example YAML file (see __[orca-tutorial-cluster.yaml](../../../../../../python/orca/tutorial/pytorch/docker/orca-tutorial-cluster.yaml)__, which describes a Deployment that runs the `intelanalytics/bigdl-k8s:2.1.0` image) to run the tutorial FashionMNIST program on k8s-cluster mode:
__Notes:__
@ -894,7 +845,7 @@ kubectl delete job orca-pytorch-job
```
## 6.4 Use Kubernetes Deployment (without Integrared Image)
### 6.4 Use Kubernetes Deployment (without Integrated Image)
BigDL also supports uses to skip preparing envionment through providing a container image (`intelanalytics/bigdl-k8s:orca-2.1.0`) which has integrated all BigDL required environments.
__Notes:__
@ -917,7 +868,7 @@ Before submitting the example application, you should:
cp /path/to/model.py /bigdl/nfsdata
```
### 6.4.1 K8s Client
#### 6.4.1 K8s Client
BigDL has provided an example YAML file (see __[integrated_image_client.yaml](../../../../../../python/orca/tutorial/pytorch/docker/integrate_image_client.yaml)__, which describes a deployment that runs the `intelanalytics/bigdl-k8s:orca-2.1.0` image) to run the tutorial FashionMNIST program on k8s-client mode:
__Notes:__
@ -1065,7 +1016,7 @@ After the task finish, you could delete the job as the command below.
kubectl delete job orca-integrate-job
```
### 6.4.2 K8s Cluster
#### 6.4.2 K8s Cluster
BigDL has provided an example YAML file (see __[integrate_image_cluster.yaml](../../../../../../python/orca/tutorial/pytorch/docker/integrate_image_cluster.yaml)__, which describes a deployment that runs the `intelanalytics/bigdl-k8s:orca-2.1.0` image) to run the tutorial FashionMNIST program on k8s-cluster mode:
__Notes:__

View file

@ -12,7 +12,8 @@ A BigDL Orca program usually starts with the initialization of OrcaContext. For
```python
from bigdl.orca import init_orca_context
sc = init_orca_context(cluster_mode, cores, memory, num_nodes, driver_cores, driver_memory, extra_python_lib, conf)
sc = init_orca_context(cluster_mode, cores, memory, num_nodes,
driver_cores, driver_memory, extra_python_lib, conf)
```
In `init_orca_context`, you may specify necessary runtime configurations for running the example on YARN, including:
@ -22,7 +23,7 @@ In `init_orca_context`, you may specify necessary runtime configurations for run
* `num_nodes`: an integer that specifies the number of executors (default to be `1`).
* `driver_cores`: an integer that specifies the number of cores for the driver node (default to be `4`).
* `driver_memory`: a string that specifies the memory for the driver node (default to be `"1g"`).
* `extra_python_lib`: a string that specifies the path to extra Python packages (default to be `None`). `.py`, `.zip` or `.egg` files are supported.
* `extra_python_lib`: a string that specifies the path to extra Python packages, separated by comma (default to be `None`). `.py`, `.zip` or `.egg` files are supported.
* `conf`: a dictionary to append extra conf for Spark (default to be `None`).
__Note__:
@ -48,19 +49,19 @@ For more details, please see [Launching Spark on YARN](https://spark.apache.org/
__Note__:
* When you run programs on YARN, you are highly recommended to load/write data from/to a distributed storage (e.g. [HDFS](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html) or [S3](https://aws.amazon.com/s3/)) instead of the local file system.
The Fashion-MNIST example in this tutorial uses a utility function `get_remote_file_to_local` provided by BigDL to download datasets and create the PyTorch DataLoader on each executor.
The Fashion-MNIST example in this tutorial uses a utility function `get_remote_dir_to_local` provided by BigDL to download datasets and create the PyTorch DataLoader on each executor.
```python
import torch
import torchvision
import torchvision.transforms as transforms
from bigdl.orca.data.file import get_remote_file_to_local
from bigdl.orca.data.file import get_remote_dir_to_local
def train_data_creator(config, batch_size):
transform = transforms.Compose([transforms.ToTensor(),
transforms.Normalize((0.5,), (0.5,))])
get_remote_file_to_local(remote_path="hdfs://path/to/dataset", local_path="/tmp/dataset")
get_remote_dir_to_local(remote_path="hdfs://path/to/dataset", local_path="/tmp/dataset")
trainset = torchvision.datasets.FashionMNIST(root="/tmp/dataset", train=True,
download=False, transform=transform)
@ -88,7 +89,10 @@ export HADOOP_CONF_DIR=/path/to/hadoop/conf
- See [here](../Overview/install.md#install-bigdl-orca) to install BigDL Orca in the created conda environment.
- You should install all the other Python libraries that you need in your program in the conda environment as well.
- You should install all the other Python libraries that you need in your program in the conda environment as well. `torch` and `torchvision` are needed to run the Fashion-MNIST example:
```bash
pip install torch torchvision
```
- For more details, please see [Python User Guide](https://bigdl.readthedocs.io/en/latest/doc/UserGuide/python.html).