Update k8s tutorial (part1) (#6696)

* update 1.1
* update 3
* update 1.2 1.3
2022-11-21 19:01:30 +08:00 · 2022-11-21 19:01:30 +08:00 · cdd1f8421e
commit cdd1f8421e
parent 78ee9b23f6
2 changed files with 66 additions and 111 deletions
--- a/docs/readthedocs/source/doc/Orca/Tutorial/k8s.md
+++ b/docs/readthedocs/source/doc/Orca/Tutorial/k8s.md
@@ -1,10 +1,12 @@
 # Run on Kubernetes Clusters

-This tutorial provides a step-by-step guide on how to run BigDL-Orca programs on Kubernetes (K8s) clusters, using a [PyTorch Fashin-MNIST program](https://github.com/intel-analytics/BigDL/tree/main/docs/docs/tutorials/tutorial_example/Fashion_MNIST/) as a working example.
+This tutorial provides a step-by-step guide on how to run BigDL-Orca programs on Kubernetes (K8s) clusters, using a [PyTorch Fashion-MNIST program](https://github.com/intel-analytics/BigDL/tree/main/python/orca/tutorial/pytorch/FashionMNIST) as a working example.

+The **Client Container** that appears in this tutorial refers to the docker container where you launch or submit your applications.

-# 1. Key Concepts
-## 1.1 Init_orca_context
+---
+## 1. Basic Concepts
+### 1.1 init_orca_context
 A BigDL Orca program usually starts with the initialization of OrcaContext. For every BigDL Orca program, you should call `init_orca_context` at the beginning of the program as below:

 ```python
@@ -16,34 +18,34 @@ init_orca_context(cluster_mode, master, container_image,
 ```

 In `init_orca_context`, you may specify necessary runtime configurations for running the example on K8s, including:
-* `cluster_mode`: a String that specifies the underlying cluster; valid value includes `"local"`, `"yarn-client"`, `"yarn-cluster"`, __`"k8s-client"`__, __`"k8s-cluster"`__, `"bigdl-submit"`, `"spark-submit"`, etc.
-* `master`: a URL format to specify the master address of K8s cluster.
-* `container_image`: a String that specifies the name of docker container image for executors.
-* `cores`: an Integer that specifies the number of cores for each executor (default to be `2`).
-* `memory`: a String that specifies the memory for each executor (default to be `"2g"`).
-* `num_nodes`: an Integer that specifies the number of executors (default to be `1`).
-* `driver_cores`: an Integer that specifies the number of cores for the driver node (default to be `4`).
-* `driver_memory`: a String that specifies the memory for the driver node (default to be `"1g"`).
-* `extra_python_lib`: a String that specifies the path to extra Python packages, one of `.py`, `.zip` or `.egg` files (default to be `None`).
-* `penv_archive`: a String that specifies the path to a packed Conda archive (default to be `None`).
-* `jars`: a String that specifies the path to needed jars files (default to be `None`).
-* `conf`: a Key-Value format to append extra conf for Spark (default to be `None`).
+* `cluster_mode`: one of `"k8s-client"`, `"k8s-cluster"` or `"spark-submit"` when you run on K8s clusters.
+* `master`: a URL format to specify the master address of the K8s cluster.
+* `container_image`: a string that specifies the name of docker container image for executors.
+* `cores`: an integer that specifies the number of cores for each executor (default to be `2`).
+* `memory`: a string that specifies the memory for each executor (default to be `"2g"`).
+* `num_nodes`: an integer that specifies the number of executors (default to be `1`).
+* `driver_cores`: an integer that specifies the number of cores for the driver node (default to be `4`).
+* `driver_memory`: a string that specifies the memory for the driver node (default to be `"1g"`).
+* `extra_python_lib`: a string that specifies the paths to extra Python packages, separated by commas (default to be `None`). `.py`, `.zip` or `.egg` files are supported.
+* `penv_archive`: a string that specifies the path to a packed conda archive (default to be `None`).
+* `jars`: a string that specifies the paths to extra jar files, separated by commas (default to be `None`).
+* `conf`: a dictionary to append extra conf for Spark (default to be `None`).

 __Note__: 
-* All arguments __except__ `cluster_mode` will be ignored when using `spark-submit` and Kubernetes deployment to submit and run Orca programs.
+* All arguments __except__ `cluster_mode` will be ignored when using [`spark-submit`](#use-spark-submit) and [Kubernetes deployment](#use-kubernetes-deployment-with-conda-archive) to submit and run Orca programs, in which case you are supposed to specify these configurations via the submit command or the YAML file.
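
As a rough illustration of how the arguments listed above fit together for `k8s-client` mode, the sketch below collects them in a dictionary before handing them to `init_orca_context`. The helper names `build_k8s_client_kwargs` and `start_orca` are hypothetical and not part of BigDL; only the keyword names and default values are taken from the list above.

```python
def build_k8s_client_kwargs(master, container_image, num_nodes=1,
                            cores=2, memory="2g", extra_python_lib=None):
    """Collect init_orca_context arguments for k8s-client mode.

    Hypothetical helper for illustration only; it is not part of BigDL.
    """
    kwargs = {
        "cluster_mode": "k8s-client",
        "master": master,
        "container_image": container_image,
        "num_nodes": num_nodes,
        "cores": cores,
        "memory": memory,
        "driver_cores": 4,       # documented default
        "driver_memory": "1g",   # documented default
    }
    if extra_python_lib:
        # Several files are passed as one comma-separated string.
        kwargs["extra_python_lib"] = ",".join(extra_python_lib)
    return kwargs


def start_orca(kwargs):
    """Pass the collected arguments to BigDL (requires BigDL to be installed)."""
    # Imported lazily so the helper above can be used without BigDL.
    from bigdl.orca import init_orca_context
    return init_orca_context(**kwargs)
```

On the Client Container, one would call `start_orca(build_k8s_client_kwargs(master, image))` with the master address and image name of the actual cluster. Keeping the arguments in a plain dictionary makes it easy to print or log the configuration before the job starts.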

-After the Orca programs finish, you should call `stop_orca_context` at the end of the program to release resources and shutdown the underlying distributed runtime engine (such as Spark or Ray).
+After an Orca program finishes, you should always call `stop_orca_context` at the end of the program to release resources and shut down the underlying distributed runtime engine (such as Spark or Ray).
 ```python
 from bigdl.orca import stop_orca_context

 stop_orca_context()
 ```
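
One way to make sure `stop_orca_context` is always reached, even when the training code raises an exception, is to wrap the two calls in a context manager. The sketch below is only an assumption about how this could be organized; `orca_session` is a hypothetical name, and the init and stop functions are passed in as parameters so the pattern can be tried without a cluster.

```python
from contextlib import contextmanager


@contextmanager
def orca_session(init_fn, stop_fn, **init_kwargs):
    """Call init_fn on entry and always call stop_fn on exit.

    Hypothetical helper for illustration only; it is not part of BigDL.
    """
    context = init_fn(**init_kwargs)
    try:
        yield context
    finally:
        # Runs even if the body raises, so resources are released.
        stop_fn()
```

With BigDL installed, this could be used as `with orca_session(init_orca_context, stop_orca_context, cluster_mode="k8s-client", ...) as sc:` around the body of the program.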

-For more details, please see [OrcaContext](https://bigdl.readthedocs.io/en/latest/doc/Orca/Overview/orca-context.html).
+For more details, please see [OrcaContext](../Overview/orca-context.md).


-## 1.2 K8s Client&Cluster
-The difference between k8s-client and k8s-cluster is where you run your Spark driver. 
+### 1.2 K8s Client&Cluster
+The difference between k8s-client mode and k8s-cluster mode is where you run your Spark driver. 

 For k8s-client, the Spark driver runs in the client process (outside the K8s cluster), while for k8s-cluster the Spark driver runs inside the K8s cluster.

@@ -51,22 +53,19 @@ Please see more details in [K8s-Cluster](https://spark.apache.org/docs/latest/ru



-## 1.3 Load Data from Network File Systems (NFS)
-When you are running programs on K8s, please load data from volumes and we use NFS in this tutorial as an example.
+### 1.3 Load Data from Volumes
+When you are running programs on K8s, please load data from [Volumes](https://kubernetes.io/docs/concepts/storage/volumes/) accessible to all K8s pods. We use Network File System (NFS) in this tutorial as an example.

-After mounting the volume (NFS) into BigDL container (see __[Section 2.2](#22-create-a-k8s-client-container)__), the Fashion-MNIST example could load data from NFS.
+After mounting the Volume (NFS) into the BigDL container (see __[Section 2.2](#create-a-k8s-client-container)__), the Fashion-MNIST example can load data from NFS as if it were local storage.

 ```python
 import torch
 import torchvision
 import torchvision.transforms as transforms
-from bigdl.orca.data.file import get_remote_file_to_local

 def train_data_creator(config, batch_size):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (0.5,))])
-    
-    get_remote_file_to_local(remote_path="/path/to/nfsdata", local_path="/tmp/dataset")

    trainset = torchvision.datasets.FashionMNIST(root="/bigdl/nfsdata/dataset", train=True, 
                                                 download=False, transform=transform)
@@ -77,9 +76,9 @@ def train_data_creator(config, batch_size):
 ```


-
-# 2. Create & Lunch BigDL K8s Container 
-## 2.1 Pull Docker Image
+---
+## 2. Create & Launch BigDL K8s Container 
+### 2.1 Pull Docker Image
 Please pull the BigDL 2.1.0 `bigdl-k8s` image from [Docker Hub](https://hub.docker.com/r/intelanalytics/bigdl-k8s/tags) as follows:
 ```bash
 sudo docker pull intelanalytics/bigdl-k8s:2.1.0
@@ -93,7 +92,7 @@ __Note:__
 * The 2.1.0 and latest BigDL images are built on top of Spark 3.1.2.


-## 2.2 Create a K8s Client Container
+### 2.2 Create a K8s Client Container
 Please launch the __Client Container__ following the script below:
 ```bash
 sudo docker run -itd --net=host \
@@ -162,80 +161,31 @@ In the script:
 * `RUNTIME_DRIVER_MEMORY`: a String that specifies the memory for the driver node;


-## 2.3 Launch the K8s Client Container
-Once the container is created, docker image would return a `containerID`, please launch the container following the command below:
+### 2.3 Enter the K8s Client Container
+Once the container is created, a `containerID` is returned, with which you can enter the container using the command below:
 ```bash
 sudo docker exec -it <containerID> bash
 ```


+---
+## 3. Prepare Environment
+In the launched BigDL K8s **Client Container**, please set up the environment following the steps below:

-# 3. Prepare Environment
-In the launched BigDL K8s Container, please setup environment following steps below:
-## 3.1 Install Python Libraries
-### 3.1.1 Install Conda
-Please use conda to prepare the Python environment on the __Client Container__ (where you submit applications), you could download and install Conda following [Conda User Guide](https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html) or executing the command as below.
+- See [here](../Overview/install.md#install-anaconda) to install conda and prepare the Python environment.

-```bash
-# Download Anaconda installation script 
-wget -P /tmp https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh
+- See [here](../Overview/install.md#install-bigdl-orca) to install BigDL Orca in the created conda environment.

-# Execute the script to install conda
-bash /tmp/Anaconda3-2020.02-Linux-x86_64.sh
-
-# Please type this command in your terminal to activate Conda environment
-source ~/.bashrc
-```
-
-### 3.1.2 Use Conda to Install BigDL and Other Python Libraries
-Create a Conda environment, install BigDL and all needed Python libraries in the activate Conda:
-```bash
-# "env" is conda environment name, you can use any name you like.
-# Please change Python version to 3.8 if you need a Python 3.8 environment.
-conda create -n env python=3.7 
-conda activate env
-```
-
-Please install the 2.1.0 release version of BigDL (built on top of Spark 3.1.2) as follows:
-```bash
-pip install bigdl-spark3
-```
-
-When you are running in the latest BigDL image, please install the nightly build of BigDL as follows:
-```bash
-pip install --pre --upgrade bigdl-spark3
-```
-
-Please install torch and torchvision to run the Fashion-MNIST example:
+- You should install all the other Python libraries that you need in your program in the conda environment as well. `torch` and `torchvision` are needed to run the Fashion-MNIST example:
 ```bash
 pip install torch torchvision
 ```

-__Notes:__
-* Using Conda to install BigDL will automatically install libraries including `pyspark==3.1.2`, and etc.
-* You can install BigDL Orca built on top of Spark 3.1.2 as follows:
-    ```bash
-    # Install the latest release version
-    pip install bigdl-orca-spark3
-
-    # Install the latest nightly build version
-    pip install --pre --upgrade bigdl-spark3-orca
-
-    # You need to install torch and torchvision manually
-    pip install torch torchvision
-    ```
-
-    Installing bigdl-orca-spark3 will automatically install `pyspark==3.1.2`.
-
-* You also need to install any additional python libraries that your application depends on in the Conda environment.
-
-* It's only need for you to install needed Python libraries using Conda, since the BigDL K8s container has already setup `JAVA_HOME`, `BIGDL_HOME`, `SPARK_HOME`, `SPARK_VERSION`, etc.
-
-Please see more details in [Python User Guide](https://bigdl.readthedocs.io/en/latest/doc/UserGuide/python.html).
+- For more details, please see [Python User Guide](https://bigdl.readthedocs.io/en/latest/doc/UserGuide/python.html).


-
-# 4. Prepare Dataset
+---
+## 4. Prepare Dataset
 To run the example provided by this tutorial on K8s, you should upload the dataset to a K8s volume (e.g. NFS).

 Please download the Fashion-MNIST dataset manually on your __Develop Node__ (where you launch the container image). 
@@ -254,7 +204,8 @@ gzip -dk /bigdl/nfsdata/dataset/FashionMNIST/raw/*
 __Note:__ PyTorch requires the dataset directory in which `FashionMNIST/raw/train-images-idx3-ubyte` and `FashionMNIST/raw/t10k-images-idx3-ubyte` exist.
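
Before submitting a job, it can help to confirm that the extracted files are where PyTorch expects them. The following is a minimal sketch of such a check, not something shipped with BigDL. The function name `missing_raw_files` is hypothetical, and the two label file names are assumed from the standard Fashion-MNIST distribution, since the note above only names the image files.

```python
import os

# Extracted files expected under <root>/FashionMNIST/raw.
# The label file names are an assumption based on the standard distribution.
RAW_FILES = (
    "train-images-idx3-ubyte",
    "train-labels-idx1-ubyte",
    "t10k-images-idx3-ubyte",
    "t10k-labels-idx1-ubyte",
)


def missing_raw_files(root):
    """Return the expected raw files that are absent under the dataset root."""
    raw_dir = os.path.join(root, "FashionMNIST", "raw")
    return [name for name in RAW_FILES
            if not os.path.isfile(os.path.join(raw_dir, name))]
```

Calling `missing_raw_files("/bigdl/nfsdata/dataset")` on the mounted volume should return an empty list once all files have been uploaded and decompressed.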


-# 5. Prepare Custom Modules
+---
+## 5. Prepare Custom Modules
 Spark allows you to upload Python files (`.py`) and zipped Python packages (`.zip`) to the executors by setting the `--py-files` option in Spark scripts or `extra_python_lib` in `init_orca_context`.

 The FashionMNIST example needs to import modules from `model.py`.
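
When the custom code is a package directory rather than a single `.py` file, it has to be zipped first. The sketch below shows one possible way to build such an archive with the standard library; `zip_python_package` is a hypothetical helper, and the layout (package folder at the top level of the archive) is an assumption chosen so that `import <package>` resolves from the zip.

```python
import os
import zipfile


def zip_python_package(package_dir, zip_path):
    """Zip the .py files of a package, keeping the package folder as the top level.

    Hypothetical helper for illustration only; it is not part of BigDL or Spark.
    """
    parent = os.path.dirname(os.path.abspath(package_dir))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for folder, _, files in os.walk(package_dir):
            for name in files:
                if name.endswith(".py"):
                    full_path = os.path.abspath(os.path.join(folder, name))
                    # Store paths relative to the parent, e.g. "mypkg/model.py".
                    archive.write(full_path, os.path.relpath(full_path, parent))
    return zip_path
```

The returned path can then be supplied through `--py-files` or `extra_python_lib`, alongside any single `.py` files.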
@@ -308,16 +259,16 @@ __Notes:__
        ```


-
-# 6. Run Jobs on K8s
+---
+## 6. Run Jobs on K8s
 In the following part, we will show you how to submit and run the Orca example on K8s:
 * Use `python` command
 * Use `spark-submit` script
 * Use Kubernetes Deployment (with Conda Archive)
 * Use Kubernetes Deployment (with Integrated Image)

-## 6.1 Use `python` command
-### 6.1.1 K8s-Client
+### 6.1 Use `python` command
+#### 6.1.1 K8s-Client
 Before running the example on `k8s-client` mode, you should:
 * On the __Client Container__:
    1. Please call `init_orca_context` at the very beginning of each Orca program.
@@ -350,7 +301,7 @@ In the script:
 * `remote_dir`: directory on NFS for loading the dataset.


-### 6.1.2 K8s-Cluster
+#### 6.1.2 K8s-Cluster
 Before running the example on `k8s-cluster` mode, you should:
 * On the __Client Container__:
    1. Please call `init_orca_context` at the very beginning of each Orca program.
@@ -421,8 +372,8 @@ Please retrieve training stats on the __Develop Node__ following the command bel
    ```


-## 6.2 Use `spark-submit` Script
-### 6.2.1 K8s Client
+### 6.2 Use `spark-submit`
+#### 6.2.1 K8s Client
 Before submitting the example on `k8s-client` mode, you should:
 * On the __Client Container__:
    1. Please call `init_orca_context` at the very beginning of each Orca program.
@@ -484,7 +435,7 @@ In the script:
 * `remote_dir`: directory on NFS for loading the dataset.


-### 6.2.2 K8s Cluster
+#### 6.2.2 K8s Cluster
 Before submitting the example on `k8s-cluster` mode, you should:
 * On the __Client Container__:
    1. Please call `init_orca_context` at the very beginning of each Orca program.
@@ -577,7 +528,7 @@ Please retrieve training stats on the __Develop Node__ following the commands be
    ```


-## 6.3 Use Kubernetes Deployment (with Conda Archive)
+### 6.3 Use Kubernetes Deployment (with Conda Archive)
 BigDL allows users who want to execute programs directly on the __Develop Node__ to run an application by creating a Kubernetes Deployment object.

 Before submitting the Orca application, you should:
@@ -598,7 +549,7 @@ Before submitting the Orca application, you should:
        cp /path/to/model.py /bigdl/nfsdata
        ```

-### 6.3.1 K8s Client
+#### 6.3.1 K8s Client
 BigDL has provided an example YAML file (see __[orca-tutorial-client.yaml](../../../../../../python/orca/tutorial/pytorch/docker/orca-tutorial-client.yaml)__, which describes a Deployment that runs the `intelanalytics/bigdl-k8s:2.1.0` image) to run the tutorial FashionMNIST program on k8s-client mode:

 __Notes:__ 
@@ -750,7 +701,7 @@ After the task finishes, you can delete the job with the command below.
 kubectl delete job orca-pytorch-job
 ```

-### 6.3.2 K8s Cluster
+#### 6.3.2 K8s Cluster
 BigDL has provided an example YAML file (see __[orca-tutorial-cluster.yaml](../../../../../../python/orca/tutorial/pytorch/docker/orca-tutorial-cluster.yaml)__, which describes a Deployment that runs the `intelanalytics/bigdl-k8s:2.1.0` image) to run the tutorial FashionMNIST program on k8s-cluster mode:

 __Notes:__ 
@@ -894,7 +845,7 @@ kubectl delete job orca-pytorch-job
 ```


-## 6.4 Use Kubernetes Deployment (without Integrared Image)
+### 6.4 Use Kubernetes Deployment (with Integrated Image)
 BigDL also allows users to skip preparing the environment by providing a container image (`intelanalytics/bigdl-k8s:orca-2.1.0`) that integrates all the environments BigDL requires.

 __Notes:__
@@ -917,7 +868,7 @@ Before submitting the example application, you should:
        cp /path/to/model.py /bigdl/nfsdata
        ```

-### 6.4.1 K8s Client
+#### 6.4.1 K8s Client
 BigDL has provided an example YAML file (see __[integrate_image_client.yaml](../../../../../../python/orca/tutorial/pytorch/docker/integrate_image_client.yaml)__, which describes a deployment that runs the `intelanalytics/bigdl-k8s:orca-2.1.0` image) to run the tutorial FashionMNIST program on k8s-client mode:

 __Notes:__
@@ -1065,7 +1016,7 @@ After the task finishes, you can delete the job with the command below.
 kubectl delete job orca-integrate-job
 ```

-### 6.4.2 K8s Cluster
+#### 6.4.2 K8s Cluster
 BigDL has provided an example YAML file (see __[integrate_image_cluster.yaml](../../../../../../python/orca/tutorial/pytorch/docker/integrate_image_cluster.yaml)__, which describes a deployment that runs the `intelanalytics/bigdl-k8s:orca-2.1.0` image) to run the tutorial FashionMNIST program on k8s-cluster mode:

 __Notes:__
--- a/docs/readthedocs/source/doc/Orca/Tutorial/yarn.md
+++ b/docs/readthedocs/source/doc/Orca/Tutorial/yarn.md
@@ -12,7 +12,8 @@ A BigDL Orca program usually starts with the initialization of OrcaContext. For
 ```python
 from bigdl.orca import init_orca_context

-sc = init_orca_context(cluster_mode, cores, memory, num_nodes, driver_cores, driver_memory, extra_python_lib, conf)
+sc = init_orca_context(cluster_mode, cores, memory, num_nodes,
+                       driver_cores, driver_memory, extra_python_lib, conf)
 ```

 In `init_orca_context`, you may specify necessary runtime configurations for running the example on YARN, including:
@@ -22,7 +23,7 @@ In `init_orca_context`, you may specify necessary runtime configurations for run
 * `num_nodes`: an integer that specifies the number of executors (default to be `1`).
 * `driver_cores`: an integer that specifies the number of cores for the driver node (default to be `4`).
 * `driver_memory`: a string that specifies the memory for the driver node (default to be `"1g"`).
-* `extra_python_lib`: a string that specifies the path to extra Python packages (default to be `None`). `.py`, `.zip` or `.egg` files are supported.
+* `extra_python_lib`: a string that specifies the paths to extra Python packages, separated by commas (default to be `None`). `.py`, `.zip` or `.egg` files are supported.
 * `conf`: a dictionary to append extra conf for Spark (default to be `None`).

 __Note__: 
@@ -48,19 +49,19 @@ For more details, please see [Launching Spark on YARN](https://spark.apache.org/
 __Note__:
 * When you run programs on YARN, you are highly recommended to load/write data from/to a distributed storage (e.g. [HDFS](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html) or [S3](https://aws.amazon.com/s3/)) instead of the local file system.

-The Fashion-MNIST example in this tutorial uses a utility function `get_remote_file_to_local` provided by BigDL to download datasets and create the PyTorch DataLoader on each executor.
+The Fashion-MNIST example in this tutorial uses a utility function `get_remote_dir_to_local` provided by BigDL to download datasets and create the PyTorch DataLoader on each executor.

 ```python
 import torch
 import torchvision
 import torchvision.transforms as transforms
-from bigdl.orca.data.file import get_remote_file_to_local
+from bigdl.orca.data.file import get_remote_dir_to_local

 def train_data_creator(config, batch_size):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (0.5,))])

-    get_remote_file_to_local(remote_path="hdfs://path/to/dataset", local_path="/tmp/dataset")
+    get_remote_dir_to_local(remote_path="hdfs://path/to/dataset", local_path="/tmp/dataset")

    trainset = torchvision.datasets.FashionMNIST(root="/tmp/dataset", train=True,
                                                 download=False, transform=transform)
@@ -88,7 +89,10 @@ export HADOOP_CONF_DIR=/path/to/hadoop/conf

 - See [here](../Overview/install.md#install-bigdl-orca) to install BigDL Orca in the created conda environment.

- You should install all the other Python libraries that you need in your program in the conda environment as well.
+- You should install all the other Python libraries that you need in your program in the conda environment as well. `torch` and `torchvision` are needed to run the Fashion-MNIST example:
+```bash
+pip install torch torchvision
+```

 - For more details, please see [Python User Guide](https://bigdl.readthedocs.io/en/latest/doc/UserGuide/python.html).