Update k8s tutorial (part2) (#6700)

2022-11-21 19:57:22 +08:00 · 2022-11-21 19:57:22 +08:00 · bb3889958c
commit bb3889958c
parent cdd1f8421e
2 changed files with 47 additions and 54 deletions
--- a/docs/readthedocs/source/doc/Orca/Tutorial/k8s.md
+++ b/docs/readthedocs/source/doc/Orca/Tutorial/k8s.md
@@ -32,7 +32,7 @@ In `init_orca_context`, you may specify necessary runtime configurations for run
 * `conf`: a dictionary to append extra conf for Spark (default to be `None`).

 __Note__: 
-* All arguments __except__ `cluster_mode` will be ignored when using [`spark-submit`](#use-spark-submit) and [Kubernetes deployment](#use-kubernetes-deployment-with-conda-archive) to submit and run Orca programs, in which case you are supposed to specify these configurations via the submit command or the YAML file.
+* All arguments __except__ `cluster_mode` will be ignored when using [`spark-submit`](#use-spark-submit) and [`Kubernetes deployment`](#use-kubernetes-deployment-with-conda-archive) to submit and run Orca programs, in which case you are supposed to specify these configurations via the submit command or the YAML file.

 After Orca programs finish, you should always call `stop_orca_context` at the end of the program to release resources and shut down the underlying distributed runtime engine (such as Spark or Ray).
 ```python
@@ -56,7 +56,7 @@ Please see more details in [K8s-Cluster](https://spark.apache.org/docs/latest/ru
 ### 1.3 Load Data from Volumes
 When you are running programs on K8s, please load data from [Volumes](https://kubernetes.io/docs/concepts/storage/volumes/) accessible to all K8s pods. We use Network File Systems (NFS) in this tutorial as an example.

-After mounting the Volume (NFS) into the BigDL container (see __[Section 2.2](#create-a-k8s-client-container)__), the Fashion-MNIST example could load data from NFS as local storage.
+After mounting the Volume (NFS) into the BigDL container (see __[Section 2.2](#create-a-k8s-client-container)__ for more details), the Fashion-MNIST example could load data from NFS as local storage.

 ```python
 import torch
@@ -77,41 +77,38 @@ def train_data_creator(config, batch_size):


 ---
-## 2. Create & Launch BigDL K8s Container 
+## 2. Create BigDL K8s Container 
 ### 2.1 Pull Docker Image
-Please pull the BigDL 2.1.0 `bigdl-k8s` image from [Docker Hub](https://hub.docker.com/r/intelanalytics/bigdl-k8s/tags) as follows:
+Please pull the BigDL [`bigdl-k8s`](https://hub.docker.com/r/intelanalytics/bigdl-k8s/tags) image (built on top of Spark 3.1.3) from Docker Hub as follows:
 ```bash
+# For the latest nightly build version
+sudo docker pull intelanalytics/bigdl-k8s:latest
+
+# For the release version, e.g. 2.1.0
 sudo docker pull intelanalytics/bigdl-k8s:2.1.0
 ```

-__Note:__
-* If you need the nightly built BigDL, please pull the latest image as below:
-    ```bash
-    sudo docker pull intelanalytics/bigdl-k8s:latest
-    ```
-* The 2.1.0 and latest BigDL image is built on top of Spark 3.1.2.
-

 ### 2.2 Create a K8s Client Container
-Please launch the __Client Container__ following the script below:
+Please create the __Client Container__ using the script below:
 ```bash
 sudo docker run -itd --net=host \
    -v /etc/kubernetes:/etc/kubernetes \
    -v /root/.kube:/root/.kube \
-    intelanalytics/bigdl-k8s:2.1.0 bash
+    intelanalytics/bigdl-k8s:latest bash
 ```

 In the script:
-* `--net=host`: use the host network stack for the Docker container;
-* `-v /etc/kubernetes:/etc/kubernetes`: specify the path of kubernetes configurations;
-* `-v /root/.kube:/root/.kube`: specify the path of kubernetes installation;
+* **Please switch the tag according to the BigDL image you pull.**
+* `--net=host`: use the host network stack for the Docker container.
+* `-v /etc/kubernetes:/etc/kubernetes`: specify the path of Kubernetes configurations to mount into the Docker container.
+* `-v /root/.kube:/root/.kube`: specify the path of Kubernetes installation to mount into the Docker container.
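
As a minimal sketch of the tag note above, the `docker run` command can be assembled with the image tag as a parameter, so that the tag always matches the image you pulled. The helper name `client_container_cmd` is hypothetical and not part of BigDL; the flags are the ones used in the script above.

```python
import shlex
from typing import List


def client_container_cmd(tag: str = "latest") -> List[str]:
    """Build the `docker run` argument list for the Client Container.

    `tag` should match the image you pulled (e.g. "latest" or "2.1.0").
    This helper is illustrative only; it is not a BigDL API.
    """
    image = f"intelanalytics/bigdl-k8s:{tag}"
    return [
        "sudo", "docker", "run", "-itd", "--net=host",
        # Mount the Kubernetes configuration paths from the host.
        "-v", "/etc/kubernetes:/etc/kubernetes",
        "-v", "/root/.kube:/root/.kube",
        image, "bash",
    ]


if __name__ == "__main__":
    # Print a shell-ready command line for the 2.1.0 release image.
    print(shlex.join(client_container_cmd("2.1.0")))
```

Keeping the tag in one place avoids pulling one image version and launching another.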

 __Notes:__
-* Please switch the tag from `2.1.0` to `latest` if you pull the latest BigDL image.
-* The __Client Container__ contains all the required environment except K8s configs.
-* You needn't to create an __Executor Container__ manually, which is scheduled by K8s at runtime.
+* The __Client Container__ contains all the required environment except K8s configurations.
+* You don't need to create Spark executor containers manually; they are scheduled by K8s at runtime.

-We recommend you to specify more arguments when creating a container:
+We recommend that you specify more arguments when creating the __Client Container__:
 ```bash
 sudo docker run -itd --net=host \
    -v /etc/kubernetes:/etc/kubernetes \
@@ -123,7 +120,7 @@ sudo docker run -itd --net=host \
    -e https_proxy=https://your-proxy-host:your-proxy-port \
    -e RUNTIME_SPARK_MASTER=k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    -e RUNTIME_K8S_SERVICE_ACCOUNT=spark \
-    -e RUNTIME_K8S_SPARK_IMAGE=intelanalytics/bigdl-k8s:2.1.0 \
+    -e RUNTIME_K8S_SPARK_IMAGE=intelanalytics/bigdl-k8s:latest \
    -e RUNTIME_PERSISTENT_VOLUME_CLAIM=nfsvolumeclaim \
    -e RUNTIME_DRIVER_HOST=x.x.x.x \
    -e RUNTIME_DRIVER_PORT=54321 \
@@ -133,35 +130,30 @@ sudo docker run -itd --net=host \
    -e RUNTIME_TOTAL_EXECUTOR_CORES=4 \
    -e RUNTIME_DRIVER_CORES=4 \
    -e RUNTIME_DRIVER_MEMORY=10g \
-    intelanalytics/bigdl-k8s:2.1.0 bash 
+    intelanalytics/bigdl-k8s:latest bash
 ```

-__Notes:__ 
-* Please make sure you are mounting the correct volumn path (e.g. NFS) in a container.
-* Please switch the `2.1.0` tag to `latest` if you pull the latest BigDL image.
-
 In the script:
-* `--net=host`: use the host network stack for the Docker container;
-* `/etc/kubernetes:/etc/kubernetes`: specify the path of kubernetes configurations;
-* `/root/.kube:/root/.kube`: specify the path of kubernetes installation;
-* `/path/to/nfsdata:/bigdl/data`: mount NFS path on host in a container as the sepcified path in value; 
-* `NOTEBOOK_PORT`: an Integer that specifies port number for Notebook (only required by notebook);
-* `NOTEBOOK_TOKEN`: a String that specifies the token for Notebook (only required by notebook);
-* `RUNTIME_SPARK_MASTER`: a URL format that specifies the Spark master;
-* `RUNTIME_K8S_SERVICE_ACCOUNT`: a String that specifies the service account for driver pod;
-* `RUNTIME_K8S_SPARK_IMAGE`: the lanuched k8s image;
-* `RUNTIME_PERSISTENT_VOLUME_CLAIM`: a String that specifies the Kubernetes volumeName;
-* `RUNTIME_DRIVER_HOST`: a URL format that specifies the driver localhost (only required by client mode);
-* `RUNTIME_DRIVER_PORT`: a String that specifies the driver port (only required by client mode);
-* `RUNTIME_EXECUTOR_INSTANCES`: an Integer that specifies the number of executors;
-* `RUNTIME_EXECUTOR_CORES`: an Integer that specifies the number of cores for each executor;
-* `RUNTIME_EXECUTOR_MEMORY`: a String that specifies the memory for each executor;
-* `RUNTIME_TOTAL_EXECUTOR_CORES`: an Integer that specifies the number of cores for all executors;
-* `RUNTIME_DRIVER_CORES`: an Integer that specifies the number of cores for the driver node;
-* `RUNTIME_DRIVER_MEMORY`: a String that specifies the memory for the driver node;
+* **Please switch the tag according to the BigDL image you pull.**
+* **Please make sure you are mounting the correct Volume path (e.g. NFS) into the container.**
+* `/path/to/nfsdata:/bigdl/nfsdata`: mount NFS path on the host into the container as the specified path (e.g. "/bigdl/nfsdata").
+* `NOTEBOOK_PORT`: an integer that specifies the port number for the Notebook (only required if you use notebook).
+* `NOTEBOOK_TOKEN`: a string that specifies the token for Notebook (only required if you use notebook).
+* `RUNTIME_SPARK_MASTER`: a URL that specifies the Spark master.
+* `RUNTIME_K8S_SERVICE_ACCOUNT`: a string that specifies the service account for driver pod.
+* `RUNTIME_K8S_SPARK_IMAGE`: the launched k8s image for Spark.
+* `RUNTIME_PERSISTENT_VOLUME_CLAIM`: a string that specifies the Kubernetes volumeName (e.g. "nfsvolumeclaim").
+* `RUNTIME_DRIVER_HOST`: a string that specifies the driver host address (only required by k8s-client mode).
+* `RUNTIME_DRIVER_PORT`: a string that specifies the driver port (only required by k8s-client mode).
+* `RUNTIME_EXECUTOR_INSTANCES`: an integer that specifies the number of executors.
+* `RUNTIME_EXECUTOR_CORES`: an integer that specifies the number of cores for each executor.
+* `RUNTIME_EXECUTOR_MEMORY`: a string that specifies the memory for each executor.
+* `RUNTIME_TOTAL_EXECUTOR_CORES`: an integer that specifies the number of cores for all executors.
+* `RUNTIME_DRIVER_CORES`: an integer that specifies the number of cores for the driver node.
+* `RUNTIME_DRIVER_MEMORY`: a string that specifies the memory for the driver node.
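
To make the role of these variables more concrete, the sketch below translates them into `spark-submit` options. The mapping is an assumption made for illustration: the tutorial does not state how the image consumes the `RUNTIME_*` variables, and the function name `spark_submit_args` is hypothetical. The option names and Spark configuration keys themselves are standard Spark ones, though whether each takes effect depends on the deploy mode.

```python
from typing import List, Mapping

# Assumed mapping from RUNTIME_* variables to spark-submit command-line flags.
_FLAG_OPTIONS = {
    "RUNTIME_SPARK_MASTER": "--master",
    "RUNTIME_EXECUTOR_CORES": "--executor-cores",
    "RUNTIME_EXECUTOR_MEMORY": "--executor-memory",
    "RUNTIME_TOTAL_EXECUTOR_CORES": "--total-executor-cores",
    "RUNTIME_DRIVER_CORES": "--driver-cores",
    "RUNTIME_DRIVER_MEMORY": "--driver-memory",
}

# Assumed mapping from RUNTIME_* variables to Spark configuration keys.
_CONF_OPTIONS = {
    "RUNTIME_K8S_SERVICE_ACCOUNT":
        "spark.kubernetes.authenticate.driver.serviceAccountName",
    "RUNTIME_K8S_SPARK_IMAGE": "spark.kubernetes.container.image",
    "RUNTIME_EXECUTOR_INSTANCES": "spark.executor.instances",
    "RUNTIME_DRIVER_HOST": "spark.driver.host",
    "RUNTIME_DRIVER_PORT": "spark.driver.port",
}


def spark_submit_args(env: Mapping[str, str]) -> List[str]:
    """Return spark-submit arguments for the RUNTIME_* variables set in `env`."""
    args: List[str] = []
    # Plain flags first, in the order listed above.
    for var, flag in _FLAG_OPTIONS.items():
        if var in env:
            args += [flag, env[var]]
    # Then `--conf key=value` pairs.
    for var, key in _CONF_OPTIONS.items():
        if var in env:
            args += ["--conf", f"{key}={env[var]}"]
    return args
```

Inside the container, calling `spark_submit_args(os.environ)` (after `import os`) would yield only the options whose variables were actually passed with `-e`.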


-### 2.3 Enter the K8s Client Container
+### 2.3 Launch the K8s Client Container
 Once the container is created, a `containerID` is returned, which you can use to enter the container with the command below:
 ```bash
 sudo docker exec -it <containerID> bash
@@ -174,7 +166,7 @@ In the launched BigDL K8s **Client Container**, please setup the environment fol

 - See [here](../Overview/install.md#install-anaconda) to install conda and prepare the Python environment.

-- See [here](../Overview/install.md#install-bigdl-orca) to install BigDL Orca in the created conda environment.
+- See [here](../Overview/install.md#to-install-orca-for-spark3) to install BigDL Orca in the created conda environment.

 - You should install all the other Python libraries that you need in your program in the conda environment as well. `torch` and `torchvision` are needed to run the Fashion-MNIST example:
 ```bash
@@ -186,9 +178,9 @@ pip install torch torchvision

 ---
 ## 4. Prepare Dataset
-To run the example provided by this tutorial on K8s, you should upload the dataset to to a K8s volumn (e.g. NFS).
+To run the example provided by this tutorial on K8s, you should upload the dataset to a K8s Volume (e.g. NFS).

-Please download the Fashion-MNIST dataset manually on your __Develop Node__ (where you launch the container image). 
+Please download the Fashion-MNIST dataset manually on your __Develop Node__ (where you launch the container image). Note that the PyTorch `FashionMNIST` dataset requires the unzipped files to be located in `FashionMNIST/raw/` under the root folder.

 ```bash
 # PyTorch official dataset download link
@@ -201,8 +193,6 @@ mv /path/to/fashion-mnist/data/fashion /bigdl/nfsdata/dataset/FashionMNIST/raw
 gzip -dk /bigdl/nfsdata/dataset/FashionMNIST/raw/*
 ```
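
Before submitting a job, it may help to confirm that the unzipped files are where PyTorch expects them. The following is a minimal sketch, assuming the standard four Fashion-MNIST file names; the function name `missing_raw_files` is hypothetical.

```python
from pathlib import Path
from typing import List

# Unzipped files expected under <root>/FashionMNIST/raw/.
RAW_FILES = (
    "train-images-idx3-ubyte",
    "train-labels-idx1-ubyte",
    "t10k-images-idx3-ubyte",
    "t10k-labels-idx1-ubyte",
)


def missing_raw_files(root: str) -> List[str]:
    """Return the expected raw files that are absent under `root`."""
    raw_dir = Path(root) / "FashionMNIST" / "raw"
    return [name for name in RAW_FILES if not (raw_dir / name).is_file()]
```

For the layout used in this tutorial, `missing_raw_files("/bigdl/nfsdata/dataset")` should return an empty list once the `gzip -dk` step has finished.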

-__Note:__ PyTorch requires tge directory of dataset where `FashionMNIST/raw/train-images-idx3-ubyte` and `FashionMNIST/raw/t10k-images-idx3-ubyte` exist.
-

 ---
 ## 5. Prepare Custom Modules
--- a/docs/readthedocs/source/doc/Orca/Tutorial/yarn.md
+++ b/docs/readthedocs/source/doc/Orca/Tutorial/yarn.md
@@ -107,14 +107,17 @@ unset ...

 ---
 ## 3. Prepare Dataset 
-To run the example on YARN, you should upload the Fashion-MNIST dataset to a distributed storage (such as HDFS or S3).   
+To run the example provided by this tutorial on YARN, you should upload the Fashion-MNIST dataset to a distributed storage (such as HDFS or S3).   

-First, download the Fashion-MNIST dataset manually on your __Client Node__:
+First, download the Fashion-MNIST dataset manually on your __Client Node__. Note that the PyTorch `FashionMNIST` dataset requires the unzipped files to be located in `FashionMNIST/raw/` under the root folder.
 ```bash
 # PyTorch official dataset download link
 git clone https://github.com/zalandoresearch/fashion-mnist.git

-mv /path/to/fashion-mnist/data/fashion /path/to/local/data/FashionMNIST/raw 
+mv /path/to/fashion-mnist/data/fashion /path/to/local/data/FashionMNIST/raw
+
+# Extract FashionMNIST archives
+gzip -dk /path/to/local/data/FashionMNIST/raw/*
 ```
 Then upload it to a distributed storage. Sample command to upload data to HDFS is as follows:
 ```bash
@@ -288,7 +291,7 @@ Set the cluster_mode to "spark-submit" in `init_orca_context`.
 sc = init_orca_context(cluster_mode="spark-submit")
 ```

-Before submitting the application on the Client Node, you need to:
+Before submitting the application on the __Client Node__, you need to:

 1. Prepare the conda environment on a __Development Node__ where conda is available and pack the conda environment to an archive:
 ```bash