Update k8s tutorial (part2) (#6700)

Kai Huang 2022-11-21 19:57:22 +08:00 committed by GitHub
parent cdd1f8421e
commit bb3889958c
2 changed files with 47 additions and 54 deletions


@ -32,7 +32,7 @@ In `init_orca_context`, you may specify necessary runtime configurations for run
* `conf`: a dictionary to append extra configurations for Spark (default to be `None`).
__Note__:
* All arguments __except__ `cluster_mode` will be ignored when using [`spark-submit`](#use-spark-submit) and [`Kubernetes deployment`](#use-kubernetes-deployment-with-conda-archive) to submit and run Orca programs, in which case you should specify these configurations via the submit command or the YAML file.
After Orca programs finish, you should always call `stop_orca_context` at the end of the program to release resources and shut down the underlying distributed runtime engine (such as Spark or Ray).
```python
@ -56,7 +56,7 @@ Please see more details in [K8s-Cluster](https://spark.apache.org/docs/latest/ru
### 1.3 Load Data from Volumes
When you are running programs on K8s, please load data from [Volumes](https://kubernetes.io/docs/concepts/storage/volumes/) accessible to all K8s pods. We use the Network File System (NFS) in this tutorial as an example.
After mounting the Volume (NFS) into the BigDL container (see __[Section 2.2](#create-a-k8s-client-container)__ for more details), the Fashion-MNIST example can load data from NFS as local storage.
```python
import torch
@ -77,41 +77,38 @@ def train_data_creator(config, batch_size):
---
## 2. Create BigDL K8s Container
### 2.1 Pull Docker Image
Please pull the BigDL [`bigdl-k8s`](https://hub.docker.com/r/intelanalytics/bigdl-k8s/tags) image (built on top of Spark 3.1.3) from Docker Hub as follows:
```bash
sudo docker pull intelanalytics/bigdl-k8s:2.1.0
```
If you need the nightly build of BigDL instead, pull the `intelanalytics/bigdl-k8s:latest` image.
### 2.2 Create a K8s Client Container
Please create the __Client Container__ using the script below:
```bash
sudo docker run -itd --net=host \
    -v /etc/kubernetes:/etc/kubernetes \
    -v /root/.kube:/root/.kube \
    intelanalytics/bigdl-k8s:latest bash
```
In the script:
* **Please switch the tag according to the BigDL image you pull.**
* `--net=host`: use the host network stack for the Docker container.
* `-v /etc/kubernetes:/etc/kubernetes`: specify the path of Kubernetes configurations to mount into the Docker container.
* `-v /root/.kube:/root/.kube`: specify the path of the Kubernetes user configuration (kubeconfig) to mount into the Docker container.

__Notes:__
* The __Client Container__ contains all the required environment except the K8s configurations.
* You don't need to create Spark executor containers manually; they are scheduled by K8s at runtime.
We recommend specifying more arguments when creating the __Client Container__:
```bash
sudo docker run -itd --net=host \
    -v /etc/kubernetes:/etc/kubernetes \
@ -123,7 +120,7 @@ sudo docker run -itd --net=host \
    -e https_proxy=https://your-proxy-host:your-proxy-port \
    -e RUNTIME_SPARK_MASTER=k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    -e RUNTIME_K8S_SERVICE_ACCOUNT=spark \
    -e RUNTIME_K8S_SPARK_IMAGE=intelanalytics/bigdl-k8s:latest \
    -e RUNTIME_PERSISTENT_VOLUME_CLAIM=nfsvolumeclaim \
    -e RUNTIME_DRIVER_HOST=x.x.x.x \
    -e RUNTIME_DRIVER_PORT=54321 \
@ -133,35 +130,30 @@ sudo docker run -itd --net=host \
    -e RUNTIME_TOTAL_EXECUTOR_CORES=4 \
    -e RUNTIME_DRIVER_CORES=4 \
    -e RUNTIME_DRIVER_MEMORY=10g \
    intelanalytics/bigdl-k8s:latest bash
```
In the script:
* **Please switch the tag according to the BigDL image you pull.**
* **Please make sure you are mounting the correct Volume path (e.g. NFS) into the container.**
* `/path/to/nfsdata:/bigdl/nfsdata`: mount the NFS path on the host into the container as the specified path (e.g. "/bigdl/nfsdata").
* `NOTEBOOK_PORT`: an integer that specifies the port number for the Notebook (only required if you use a notebook).
* `NOTEBOOK_TOKEN`: a string that specifies the token for the Notebook (only required if you use a notebook).
* `RUNTIME_SPARK_MASTER`: a URL that specifies the Spark master.
* `RUNTIME_K8S_SERVICE_ACCOUNT`: a string that specifies the service account for the driver pod.
* `RUNTIME_K8S_SPARK_IMAGE`: the K8s image used to launch the Spark pods.
* `RUNTIME_PERSISTENT_VOLUME_CLAIM`: a string that specifies the Kubernetes volumeName (e.g. "nfsvolumeclaim").
* `RUNTIME_DRIVER_HOST`: a URL that specifies the host of the driver (only required in k8s-client mode).
* `RUNTIME_DRIVER_PORT`: a string that specifies the port of the driver (only required in k8s-client mode).
* `RUNTIME_EXECUTOR_INSTANCES`: an integer that specifies the number of executors.
* `RUNTIME_EXECUTOR_CORES`: an integer that specifies the number of cores for each executor.
* `RUNTIME_EXECUTOR_MEMORY`: a string that specifies the memory for each executor.
* `RUNTIME_TOTAL_EXECUTOR_CORES`: an integer that specifies the total number of cores for all executors.
* `RUNTIME_DRIVER_CORES`: an integer that specifies the number of cores for the driver node.
* `RUNTIME_DRIVER_MEMORY`: a string that specifies the memory for the driver node.
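Inside the container, these environment variables are typically consumed by the `spark-submit` command that launches your Orca application. The following is only a minimal sketch of how they can map to Spark options, not the exact command shipped with the image; the client deploy mode and the application file `train.py` are assumptions for illustration:

```bash
# A hedged sketch: mapping the container's environment variables to spark-submit options.
# The application file (train.py) and the client deploy mode are assumptions.
${SPARK_HOME}/bin/spark-submit \
  --master ${RUNTIME_SPARK_MASTER} \
  --deploy-mode client \
  --conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
  --conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
  --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
  --num-executors ${RUNTIME_EXECUTOR_INSTANCES} \
  --executor-cores ${RUNTIME_EXECUTOR_CORES} \
  --executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
  --driver-memory ${RUNTIME_DRIVER_MEMORY} \
  train.py
```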
### 2.3 Launch the K8s Client Container
Once the container is created, a `containerID` is returned, with which you can enter the container using the command below:
```bash
sudo docker exec -it <containerID> bash
@ -174,7 +166,7 @@ In the launched BigDL K8s **Client Container**, please setup the environment fol
- See [here](../Overview/install.md#install-anaconda) to install conda and prepare the Python environment.
- See [here](../Overview/install.md#to-install-orca-for-spark3) to install BigDL Orca in the created conda environment.
- You should also install any other Python libraries that your program needs in the same conda environment. `torch` and `torchvision` are needed to run the Fashion-MNIST example:
```bash
@ -186,9 +178,9 @@ pip install torch torchvision
---
## 4. Prepare Dataset
To run the example provided by this tutorial on K8s, you should upload the dataset to a K8s Volume (e.g. NFS).
Please download the Fashion-MNIST dataset manually on your __Develop Node__ (where you launch the container image). Note that the PyTorch `FashionMNIST` dataset class requires the unzipped files to be located in `FashionMNIST/raw/` under the dataset root folder.
```bash
# PyTorch official dataset download link
@ -201,8 +193,6 @@ mv /path/to/fashion-mnist/data/fashion /bigdl/nfsdata/dataset/FashionMNIST/raw
gzip -dk /bigdl/nfsdata/dataset/FashionMNIST/raw/*
```
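After extraction, you can quickly verify that the layout matches what the PyTorch `FashionMNIST` dataset class expects; the file names below are those of the standard Fashion-MNIST archives:

```bash
ls /bigdl/nfsdata/dataset/FashionMNIST/raw/
# The unzipped files should include:
#   train-images-idx3-ubyte  train-labels-idx1-ubyte
#   t10k-images-idx3-ubyte   t10k-labels-idx1-ubyte
```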
---
## 5. Prepare Custom Modules


@ -107,14 +107,17 @@ unset ...
---
## 3. Prepare Dataset
To run the example provided by this tutorial on YARN, you should upload the Fashion-MNIST dataset to a distributed storage (such as HDFS or S3).
First, download the Fashion-MNIST dataset manually on your __Client Node__. Note that the PyTorch `FashionMNIST` dataset class requires the unzipped files to be located in `FashionMNIST/raw/` under the dataset root folder.
```bash
# PyTorch official dataset download link
git clone https://github.com/zalandoresearch/fashion-mnist.git
mv /path/to/fashion-mnist/data/fashion /path/to/local/data/FashionMNIST/raw
# Extract FashionMNIST archives
gzip -dk /path/to/local/data/FashionMNIST/raw/*
```
Then upload it to the distributed storage. A sample command to upload the data to HDFS is as follows:
```bash
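# A hedged example; the HDFS destination path below is an assumption for illustration
hdfs dfs -mkdir -p hdfs://path/to/remote/data
hdfs dfs -put /path/to/local/data/FashionMNIST hdfs://path/to/remote/data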
@ -288,7 +291,7 @@ Set the cluster_mode to "spark-submit" in `init_orca_context`.
sc = init_orca_context(cluster_mode="spark-submit")
```
Before submitting the application on the __Client Node__, you need to:
1. Prepare the conda environment on a __Development Node__ where conda is available and pack the conda environment to an archive:
```bash
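# A hedged sketch; assumes the conda-pack tool is installed and the conda environment is named "bigdl"
conda pack -o environment.tar.gz -n bigdl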