Update yarn tutorial (#6456)

* update yarn tutorial

* style

* update

* revert

* minor
This commit is contained in:
Kai Huang 2022-11-07 13:58:58 +08:00 committed by GitHub
parent 5346ef45d8
commit 1e7681325b
2 changed files with 208 additions and 303 deletions


@@ -1,6 +1,37 @@
# Installation
## Install Java
You need to download and install JDK in the environment, and properly set the environment variable `JAVA_HOME`. JDK8 is highly recommended.
```bash
# For Ubuntu
sudo apt-get install openjdk-8-jre
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
# For CentOS
su -c "yum install java-1.8.0-openjdk"
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.282.b08-1.el7_9.x86_64/jre
export PATH=$PATH:$JAVA_HOME/bin
java -version # Verify the version of JDK.
```
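As a quick sanity check before moving on, you can verify the Java setup from Python as well. This is a minimal sketch (the `check_java_env` helper is hypothetical, not part of BigDL) that inspects the variables exported above:

```python
import os
import shutil

def check_java_env(env=None):
    """Return a list of problems with the Java setup, empty if it looks OK."""
    env = os.environ if env is None else env
    problems = []
    java_home = env.get("JAVA_HOME")
    if not java_home:
        problems.append("JAVA_HOME is not set")
    elif not os.path.isdir(java_home):
        problems.append("JAVA_HOME does not point to an existing directory")
    # Look for the java executable on the given PATH
    if shutil.which("java", path=env.get("PATH", "")) is None:
        problems.append("java is not found on PATH")
    return problems

if __name__ == "__main__":
    for p in check_java_env():
        print("WARNING:", p)
```

If the script prints nothing, the JDK installation above is in place.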
## Install Anaconda
We recommend using [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) to prepare the Python environment.
You can follow the steps below to install conda:
```bash
# Download Anaconda installation script
wget -P /tmp https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh
# Execute the script to install conda
bash /tmp/Anaconda3-2020.02-Linux-x86_64.sh
# Run this command in your terminal to activate conda
source ~/.bashrc
```
Then create a Python environment for BigDL Orca:
```bash
conda create -n py37 python=3.7 # "py37" is conda environment name, you can use any name you like.
conda activate py37
```


@@ -1,51 +1,54 @@
# Run on Hadoop/YARN Clusters
This tutorial provides a step-by-step guide on how to run BigDL-Orca programs on Apache Hadoop/YARN clusters, using a [PyTorch Fashion-MNIST program](https://github.com/intel-analytics/BigDL/blob/main/python/orca/tutorial/pytorch/FashionMNIST/) as a working example.
The **Client Node** that appears in this tutorial refers to the machine where you launch or submit your applications.

---
## 1. Basic Concepts
### 1.1 init_orca_context
A BigDL Orca program usually starts with the initialization of OrcaContext. For every BigDL Orca program, you should call `init_orca_context` at the beginning of the program as below:
```python
from bigdl.orca import init_orca_context
sc = init_orca_context(cluster_mode, cores, memory, num_nodes, driver_cores, driver_memory, extra_python_lib, conf)
```
In `init_orca_context`, you may specify necessary runtime configurations for running the example on YARN, including:
* `cluster_mode`: one of `"yarn-client"`, `"yarn-cluster"`, `"bigdl-submit"` or `"spark-submit"` when you run on Hadoop/YARN clusters.
* `cores`: an integer that specifies the number of cores for each executor (default to be `2`).
* `memory`: a string that specifies the memory for each executor (default to be `"2g"`).
* `num_nodes`: an integer that specifies the number of executors (default to be `1`).
* `driver_cores`: an integer that specifies the number of cores for the driver node (default to be `4`).
* `driver_memory`: a string that specifies the memory for the driver node (default to be `"1g"`).
* `extra_python_lib`: a string that specifies the path to extra Python packages (default to be `None`). `.py`, `.zip` or `.egg` files are supported.
* `conf`: a dictionary to append extra conf for Spark (default to be `None`).

__Note__:
* All the arguments __except__ `cluster_mode` will be ignored when using [`bigdl-submit`](#use-bigdl-submit) or [`spark-submit`](#use-spark-submit) to submit and run Orca programs, in which case you are supposed to specify these configurations via the submit command.
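As an aside on the `conf` argument above: it is a plain dictionary of extra Spark properties. A small illustration (the property names are standard Spark settings, but the values here are arbitrary examples):

```python
# Extra Spark properties to pass through the `conf` argument of init_orca_context.
# The keys are standard Spark configuration names; the values are only examples.
spark_conf = {
    "spark.executor.memoryOverhead": "2g",
    "spark.sql.shuffle.partitions": "64",
}

# These would then be forwarded along the lines of:
# sc = init_orca_context(cluster_mode="yarn-client", cores=4, memory="10g",
#                        num_nodes=2, conf=spark_conf)
print(spark_conf)
```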
After Orca programs finish, you should always call `stop_orca_context` at the end of the program to release resources and shut down the underlying distributed runtime engine (such as Spark or Ray).
```python
from bigdl.orca import stop_orca_context
stop_orca_context()
```
For more details, please see [OrcaContext](../Overview/orca-context.md).
### 1.2 Yarn-Client & Yarn-Cluster
The difference between yarn-client mode and yarn-cluster mode is where you run your Spark driver.
For yarn-client, the Spark driver runs in the client process, and the application master is only used for requesting resources from YARN, while for yarn-cluster the Spark driver runs inside an application master process which is managed by YARN in the cluster.
For more details, please see [Launching Spark on YARN](https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn).
### 1.3 Distributed storage on YARN
__Note__:
* When you run programs on YARN, you are highly recommended to load/write data from/to a distributed storage (e.g. [HDFS](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html) or [S3](https://aws.amazon.com/s3/)) instead of the local file system.

The Fashion-MNIST example in this tutorial uses a utility function `get_remote_file_to_local` provided by BigDL to download datasets and create the PyTorch DataLoader on each executor.
```python
import torch
@@ -68,160 +71,97 @@ def train_data_creator(config, batch_size):
    return trainloader
```
---
## 2. Prepare Environment
Before running BigDL Orca programs on YARN, you need to properly setup the environment following the steps below.

### 2.1 Setup JAVA & Hadoop Environment
- See [here](../Overview/install.md#install-java) to prepare Java in your cluster.
- Check the Hadoop setup and configurations of your cluster. Make sure you correctly set the environment variable `HADOOP_CONF_DIR`, which is needed to initialize Spark on YARN:
```bash
export HADOOP_CONF_DIR=/path/to/hadoop/conf
```
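A hedged sanity check for this step (the `check_hadoop_conf` helper is hypothetical; it assumes the standard Hadoop layout where `core-site.xml` sits directly under the configuration directory):

```python
import os

def check_hadoop_conf(conf_dir):
    """Return True if conf_dir looks like a Hadoop configuration directory.

    A typical HADOOP_CONF_DIR contains core-site.xml (and usually
    hdfs-site.xml / yarn-site.xml as well).
    """
    if not conf_dir or not os.path.isdir(conf_dir):
        return False
    return os.path.isfile(os.path.join(conf_dir, "core-site.xml"))

if __name__ == "__main__":
    conf_dir = os.environ.get("HADOOP_CONF_DIR", "")
    print("HADOOP_CONF_DIR looks valid:", check_hadoop_conf(conf_dir))
```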
### 2.2 Install Python Libraries
- See [here](../Overview/install.md#install-anaconda) to install conda and prepare the Python environment on the __Client Node__.
- See [here](../Overview/install.md#to-use-basic-orca-features) to install BigDL Orca in the created conda environment.
- You should install all the other Python libraries that you need in your program in the conda environment as well.
- For more details, please see [Python User Guide](https://bigdl.readthedocs.io/en/latest/doc/UserGuide/python.html).
### 2.3 Run on CDH
* For [CDH](https://www.cloudera.com/products/open-source/apache-hadoop/key-cdh-components.html) users, the environment variable `HADOOP_CONF_DIR` should be `/etc/hadoop/conf` by default.
* The __Client Node__ may have already installed a different version of Spark than the one installed with BigDL. To avoid conflicts, unset all Spark-related environment variables (you may use `env | grep SPARK` to find all of them):
```bash
unset SPARK_HOME
unset ...
```
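The `env | grep SPARK` lookup above can also be scripted. A small sketch (the `spark_env_vars` helper is illustrative, not a BigDL API) that lists every Spark-related variable in the current environment and prints the corresponding `unset` commands:

```python
import os

def spark_env_vars(environ=None):
    """Return the names of all environment variables that mention SPARK."""
    environ = os.environ if environ is None else environ
    return sorted(name for name in environ if "SPARK" in name.upper())

if __name__ == "__main__":
    for name in spark_env_vars():
        # Copy these lines into your shell to clear the conflicting variables.
        print(f"unset {name}")
```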
---
## 3. Prepare Dataset
To run the example on YARN, you should upload the Fashion-MNIST dataset to a distributed storage (such as HDFS or S3).

First, download the Fashion-MNIST dataset manually on your __Client Node__:
```bash
# PyTorch official dataset download link
git clone https://github.com/zalandoresearch/fashion-mnist.git
mv /path/to/fashion-mnist/data/fashion /path/to/local/data/FashionMNIST/raw
```
Then upload it to a distributed storage. Sample command to upload data to HDFS is as follows:
```bash
hdfs dfs -put /path/to/local/data/FashionMNIST hdfs://path/to/remote/data
```
In the given example, you can specify the argument `--remote_dir` to be the directory on a distributed storage for the Fashion-MNIST dataset.
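Before running the upload command, a quick illustrative check that the local dataset directory is in place (the `local_dataset_ready` helper is hypothetical; the exact file names inside `raw/` depend on the dataset download, so it only checks that the directory exists and is non-empty):

```python
import os

def local_dataset_ready(data_dir):
    """Check that <data_dir>/FashionMNIST/raw exists and contains files."""
    raw_dir = os.path.join(data_dir, "FashionMNIST", "raw")
    return os.path.isdir(raw_dir) and len(os.listdir(raw_dir)) > 0

# Usage: local_dataset_ready("/path/to/local/data")
```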
---
## 4. Prepare Custom Modules
Spark allows you to upload Python files (`.py`) and zipped Python packages (`.zip`) across the cluster by setting the `--py-files` option in Spark scripts or specifying `extra_python_lib` in `init_orca_context`.

The Fashion-MNIST example needs to import modules from [`model.py`](https://github.com/intel-analytics/BigDL/blob/main/python/orca/tutorial/pytorch/FashionMNIST/model.py).
* When using the [`python` command](#use-python-command), please specify `extra_python_lib` in `init_orca_context`.
```python
init_orca_context(..., extra_python_lib="model.py")
```
For more details, please see [BigDL Python Dependencies](https://bigdl.readthedocs.io/en/latest/doc/Orca/Overview/orca-context.html#python-dependencies).
* When using [`bigdl-submit`](#use-bigdl-submit) or [`spark-submit`](#use-spark-submit), please specify the `--py-files` option in the submit command.
```bash
# Use bigdl-submit or spark-submit
bigdl-submit \
    --master yarn \
    --deploy-mode client \
    --py-files model.py \
    train.py
```
For more details, please see [Spark Python Dependencies](https://spark.apache.org/docs/latest/submitting-applications.html).
* After uploading `model.py` to YARN, you can import this custom module as follows:
```python
from model import model_creator, optimizer_creator
```

__Note__:
If your program depends on a nested directory of Python files, you are recommended to follow the steps below to use a zipped package instead.
1. Compress the directory into a zipped package.
```bash
zip -q -r FashionMNIST_zipped.zip FashionMNIST
```
2. Upload the zipped package (`FashionMNIST_zipped.zip`) to YARN by setting `--py-files` or specifying `extra_python_lib` as discussed above.
3. You can then import the custom modules from the unzipped file in your program as follows:
```python
from FashionMNIST.model import model_creator, optimizer_creator
```
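Step 1 above can equivalently be done from Python with the standard library. A sketch (the `zip_package` helper is illustrative): like `zip -q -r`, it keeps the top-level directory name inside the archive, which is what makes `from FashionMNIST.model import ...` work after YARN unzips it.

```python
import os
import zipfile

def zip_package(src_dir, zip_path):
    """Recursively zip src_dir (e.g. FashionMNIST/) into zip_path,
    preserving the top-level directory name in the archive entries."""
    parent = os.path.dirname(os.path.abspath(src_dir))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                # Store entries relative to the parent, so the archive
                # contains "FashionMNIST/..." rather than bare file names.
                zf.write(full, os.path.relpath(full, parent))

# Usage: zip_package("FashionMNIST", "FashionMNIST_zipped.zip")
```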
---
## 5. Run Jobs on YARN
In the following part, we will illustrate three ways to submit and run BigDL Orca applications on YARN.
@@ -231,99 +171,74 @@ In the following part, we will illustrate three ways to submit and run BigDL Orc
You can choose one of them based on your preference or cluster settings.

We provide the running command for the [Fashion-MNIST example](https://github.com/intel-analytics/BigDL/blob/main/python/orca/tutorial/pytorch/FashionMNIST/) in this section.

### 5.1 Use `python` Command
This is the easiest and most recommended way to run BigDL Orca on YARN as a normal Python program. In this way, you only need to prepare the environment on the __Client Node__, and the environment will be automatically packaged and distributed to the YARN cluster.
See [here](#init-orca-context) for the runtime configurations.
#### 5.1.1 Yarn Client
Run the example with the following command by setting the cluster_mode to "yarn-client":
```bash
python train.py --cluster_mode yarn-client --remote_dir hdfs://path/to/remote/data
```
* `--cluster_mode`: set the cluster_mode in `init_orca_context`.
* `--remote_dir`: directory on a distributed storage for the dataset (see __[Section 3](#3-prepare-dataset)__).
__Note__:
* Please refer to __[Section 4](#4-prepare-custom-modules)__ for the description of `extra_python_lib`.
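The two flags above are ordinary command-line arguments. A minimal sketch of how a script like `train.py` might parse them (the actual example may define its arguments differently):

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Fashion-MNIST on YARN (sketch)")
    parser.add_argument("--cluster_mode", default="local",
                        help="cluster_mode passed to init_orca_context, "
                             "e.g. yarn-client or yarn-cluster")
    parser.add_argument("--remote_dir", default=None,
                        help="dataset directory on distributed storage (HDFS/S3)")
    return parser.parse_args(argv)

# Example:
# args = parse_args(["--cluster_mode", "yarn-client",
#                    "--remote_dir", "hdfs://path/to/remote/data"])
```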
#### 5.1.2 Yarn Cluster
Run the example with the following command by setting the cluster_mode to "yarn-cluster":
```bash
python train.py --cluster_mode yarn-cluster --remote_dir hdfs://path/to/remote/data
```
* `--cluster_mode`: set the cluster_mode in `init_orca_context`.
* `--remote_dir`: directory on a distributed storage for the dataset (see __[Section 3](#3-prepare-dataset)__).
__Note__:
* Please refer to __[Section 4](#4-prepare-custom-modules)__ for the description of `extra_python_lib`.
#### 5.1.3 Jupyter Notebook
You can easily run the example in a Jupyter Notebook using __`yarn-client` mode__. Launch the notebook using the following command:
```bash
jupyter notebook --notebook-dir=/path/to/notebook/directory --ip=* --no-browser
```

You can copy the code in [train.py](https://github.com/intel-analytics/BigDL/blob/main/python/orca/tutorial/pytorch/FashionMNIST/train.py) to the notebook and run the cells. Set the cluster_mode to "yarn-client" in `init_orca_context`.
```python
sc = init_orca_context(cluster_mode="yarn-client", cores=4, memory="10g", num_nodes=2,
                       driver_cores=2, driver_memory="4g",
                       extra_python_lib="model.py")
```
Note that Jupyter Notebook cannot run on `yarn-cluster` mode, as the driver is not running on the __Client Node__ (where you run the notebook).
### 5.2 Use `bigdl-submit`
For users who want to use a script instead of the `python` command, BigDL provides an easy-to-use `bigdl-submit` script, which can automatically set up the BigDL configuration and jar files from the currently activated conda environment.

Set the cluster_mode to "bigdl-submit" in `init_orca_context`.
```python
sc = init_orca_context(cluster_mode="bigdl-submit")
```

Pack the currently activated conda environment to an archive on the __Client Node__ before submitting the example:
```bash
conda pack -o environment.tar.gz
```

Some runtime configurations for Spark are as follows:
* `--executor-memory`: the memory for each executor.
* `--driver-memory`: the memory for the driver node.
* `--executor-cores`: the number of cores for each executor.
* `--num-executors`: the number of executors.
* `--py-files`: the extra Python dependency files to be uploaded to YARN.
* `--archives`: the conda archive to be uploaded to YARN.
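To keep these flags consistent across runs, the submit command can also be assembled programmatically. A hypothetical helper (the option names match the bullets above; the default values are only illustrative):

```python
def build_submit_command(tool="bigdl-submit", deploy_mode="client",
                         executor_memory="10g", driver_memory="4g",
                         executor_cores=4, num_executors=2,
                         py_files="model.py",
                         archives="/path/to/environment.tar.gz#environment",
                         script="train.py", script_args=()):
    """Assemble a bigdl-submit/spark-submit command as an argument list."""
    cmd = [
        tool,
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        "--executor-memory", executor_memory,
        "--driver-memory", driver_memory,
        "--executor-cores", str(executor_cores),
        "--num-executors", str(num_executors),
        "--py-files", py_files,
        "--archives", archives,
        script,
    ]
    cmd.extend(script_args)
    return cmd

# e.g. subprocess.run(build_submit_command(script_args=["--cluster_mode", "bigdl-submit"]))
```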
#### 5.2.1 Yarn Client
Submit and run the example for `yarn-client` mode following the `bigdl-submit` script below:
```bash
bigdl-submit \
    --master yarn \
    --deploy-mode client \
    --executor-memory 10g \
    --driver-memory 4g \
    --executor-cores 4 \
    --num-executors 2 \
    --py-files model.py \
    --archives /path/to/environment.tar.gz#environment \
@@ -332,32 +247,21 @@ bigdl-submit \
    train.py --cluster_mode bigdl-submit --remote_dir hdfs://path/to/remote/data
```
In the `bigdl-submit` script:
* `--master`: the spark master, set it to "yarn".
* `--deploy-mode`: set it to "client" when running programs on yarn-client mode.
* `--conf spark.pyspark.driver.python`: set the activated Python location on the __Client Node__ as the driver's Python environment. You can find it by running `which python`.
* `--conf spark.pyspark.python`: set the Python location in the conda archive as each executor's Python environment.
#### 5.2.2 Yarn Cluster
Submit and run the program for `yarn-cluster` mode following the `bigdl-submit` script below:
```bash
bigdl-submit \
    --master yarn \
    --deploy-mode cluster \
    --executor-memory 10g \
    --driver-memory 4g \
    --executor-cores 4 \
    --num-executors 2 \
    --py-files model.py \
    --archives /path/to/environment.tar.gz#environment \
@@ -366,67 +270,63 @@ bigdl-submit \
    train.py --cluster_mode bigdl-submit --remote_dir hdfs://path/to/remote/data
```
In the `bigdl-submit` script:
* `--master`: the spark master, set it to "yarn".
* `--deploy-mode`: set it to "cluster" when running programs on yarn-cluster mode.
* `--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON`: set the Python location in the conda archive as the Python environment of the Application Master.
* `--conf spark.executorEnv.PYSPARK_PYTHON`: also set the Python location in the conda archive as each executor's Python environment. The Application Master and the executors will all use the archive for the Python environment.
### 5.3 Use `spark-submit` ### 5.3 Use `spark-submit`
When the __Client Node__ (where you submit applications) is not able to install BigDL using Conda, please use `spark-submit` script instead. When you are not able to install BigDL using conda on the __Client Node__ , please use the `spark-submit` script instead.
Please call `init_orca_context` at the very beginning of the program. Set the cluster_mode to "spark-submit" in `init_orca_context`.
```python ```python
from bigdl.orca import init_orca_context sc = init_orca_context(cluster_mode="spark-submit")
# Please set cluster_mode to "spark-submit".
init_orca_context(cluster_mode="spark-submit")
``` ```
Before submitting the application on the __Client Node__, you need to:

1. Prepare the conda environment on a __Development Node__ where conda is available and pack the conda environment to an archive:
    ```bash
    conda pack -o environment.tar.gz
    ```
2. Send the conda archive to the __Client Node__:
    ```bash
    scp /path/to/environment.tar.gz username@client_ip:/path/to/
    ```

On the __Client Node__:

1. Download Spark and setup the environment variables `${SPARK_HOME}` and `${SPARK_VERSION}`:
    ```bash
    export SPARK_HOME=/path/to/spark  # the folder path where you extract the Spark package
    export SPARK_VERSION="downloaded Spark version"
    ```
2. Download and unzip a BigDL assembly package from [BigDL Assembly Spark 2.4.6](https://repo1.maven.org/maven2/com/intel/analytics/bigdl/bigdl-assembly-spark_2.4.6/2.1.0/bigdl-assembly-spark_2.4.6-2.1.0-fat-jars.zip) or [BigDL Assembly Spark 3.1.2](https://repo1.maven.org/maven2/com/intel/analytics/bigdl/bigdl-assembly-spark_3.1.2/2.1.0/bigdl-assembly-spark_3.1.2-2.1.0-fat-jars.zip) (according to your Spark version), then setup the environment variables `${BIGDL_HOME}` and `${BIGDL_VERSION}`:
    ```bash
    export BIGDL_HOME=/path/to/unzipped_BigDL
    export BIGDL_VERSION="downloaded BigDL version"
    ```
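A missing or misspelled variable usually only surfaces later as a confusing `spark-submit` failure, so it can be worth checking them up front. A small illustrative helper (not part of BigDL; the example values are hypothetical):

```python
# Illustrative helper (not part of BigDL): check that the environment
# variables used by the spark-submit scripts below are set before submitting.
import os

REQUIRED_VARS = ["SPARK_HOME", "SPARK_VERSION", "BIGDL_HOME", "BIGDL_VERSION"]

def missing_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

# Example with a hypothetical, fully configured environment:
example_env = {
    "SPARK_HOME": "/path/to/spark",
    "SPARK_VERSION": "3.1.2",
    "BIGDL_HOME": "/path/to/unzipped_BigDL",
    "BIGDL_VERSION": "2.1.0",
}
print(missing_vars(example_env))  # → []
```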
Some runtime configurations for Spark are as follows:
* `--executor-memory`: the memory for each executor.
* `--driver-memory`: the memory for the driver node.
* `--executor-cores`: the number of cores for each executor.
* `--num-executors`: the number of executors.
* `--py-files`: the extra Python dependency files to be uploaded to YARN.
* `--archives`: the conda archive to be uploaded to YARN.
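Together, these flags determine the total resources the application requests from YARN: the cluster must be able to grant `num-executors × executor-cores` cores and `num-executors × executor-memory` of executor memory. A quick illustration using the values from the yarn-client script below:

```python
# Illustration: total YARN resources implied by the resource flags.
executor_cores = 4       # --executor-cores
num_executors = 2        # --num-executors
executor_memory_gb = 10  # --executor-memory, per executor

total_cores = num_executors * executor_cores
total_executor_memory_gb = num_executors * executor_memory_gb
# Note: YARN additionally reserves some per-executor memory overhead
# on top of --executor-memory.
print(total_cores, total_executor_memory_gb)  # → 8 20
```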
#### 5.3.1 Yarn Client
Submit and run the program for `yarn-client` mode following the `spark-submit` script below:
```bash
${SPARK_HOME}/bin/spark-submit \
    --master yarn \
    --deploy-mode client \
    --executor-memory 10g \
    --driver-memory 4g \
    --executor-cores 4 \
    --num-executors 2 \
    --archives /path/to/environment.tar.gz#environment \
    --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
@@ -438,38 +338,23 @@ ${SPARK_HOME}/bin/spark-submit \
    train.py --cluster_mode spark-submit --remote_dir hdfs://path/to/remote/data
```
In the `spark-submit` script:
* `--master`: the spark master, set it to "yarn".
* `--deploy-mode`: set it to "client" when running programs on yarn-client mode.
* `--properties-file`: the BigDL configuration properties to be uploaded to YARN.
* `--conf spark.pyspark.driver.python`: set the activated Python location on the __Client Node__ as the driver's Python environment. You can find the location by running `which python`.
* `--conf spark.pyspark.python`: set the Python location in the conda archive as each executor's Python environment.
* `--conf spark.driver.extraClassPath`: upload and register the BigDL jars to the driver's classpath.
* `--conf spark.executor.extraClassPath`: upload and register the BigDL jars to each executor's classpath.
__Notes:__
* `--cluster_mode`: set the cluster_mode in `init_orca_context`.
* `--remote_dir`: directory on a distributed storage for the dataset (see __[Section 3](#3-prepare-dataset)__).
* Please refer to __[Section 4](#4-prepare-custom-modules)__ for the description of extra Python dependencies.
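Note that `--cluster_mode` and `--remote_dir` above are arguments of the example `train.py` itself, not of `spark-submit`. A hedged sketch of how such flags might be parsed (the actual Fashion-MNIST script may define them differently):

```python
# Sketch of argument parsing for a train.py-style script. The real
# Fashion-MNIST example may define these flags differently.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Orca example script")
    parser.add_argument("--cluster_mode", default="local",
                        help='cluster mode passed to init_orca_context, '
                             'e.g. "yarn-client" or "spark-submit"')
    parser.add_argument("--remote_dir", default=None,
                        help="dataset directory on distributed storage (e.g. HDFS)")
    return parser

args = build_parser().parse_args(
    ["--cluster_mode", "spark-submit",
     "--remote_dir", "hdfs://path/to/remote/data"])
print(args.cluster_mode)  # → spark-submit
```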
#### 5.3.2 Yarn Cluster
Submit and run the program for `yarn-cluster` mode following the `spark-submit` script below:
```bash
${SPARK_HOME}/bin/spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --executor-memory 4g \
    --driver-memory 4g \
    --executor-cores 4 \
    --num-executors 2 \
    --archives /path/to/environment.tar.gz#environment \
@@ -480,19 +365,8 @@ ${SPARK_HOME}/bin/spark-submit \
    train.py --cluster_mode spark-submit --remote_dir hdfs://path/to/remote/data
```
In the `spark-submit` script:
* `--master`: the spark master, set it to "yarn".
* `--deploy-mode`: set it to "cluster" when running programs on yarn-cluster mode.
* `--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON`: set the Python location in the conda archive as the Python environment of the Application Master.
* `--conf spark.executorEnv.PYSPARK_PYTHON`: also set the Python location in the conda archive as each executor's Python environment. The Application Master and the executors will all use the archive for the Python environment.
* `--jars`: upload and register the BigDL jars to YARN.
__Notes:__
* `--cluster_mode`: set the cluster_mode in `init_orca_context`.
* `--remote_dir`: directory on a distributed storage for the dataset (see __[Section 3](#3-prepare-dataset)__).
* Please refer to __[Section 4](#4-prepare-custom-modules)__ for the description of extra Python dependencies.
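Both submission modes rely on the `#environment` suffix in `--archives`: YARN unpacks the archive into a directory named by the part after `#` in each container's working directory, which is why the executor-side Python interpreter can be referenced by a relative path such as `environment/bin/python`. A small illustration of this naming rule (the helper below is not a BigDL or Spark API):

```python
# Illustration (not a BigDL/Spark API): how a --archives entry of the form
# "/path/to/environment.tar.gz#environment" determines the relative Python
# interpreter path that PYSPARK_PYTHON can point to in YARN containers.
def executor_python_path(archives_arg: str) -> str:
    path, sep, alias = archives_arg.partition("#")
    # Without an explicit "#alias", the archive's file name is used as the
    # directory name for the unpacked archive.
    directory = alias if sep else path.rsplit("/", 1)[-1]
    return f"{directory}/bin/python"

print(executor_python_path("/path/to/environment.tar.gz#environment"))
# → environment/bin/python
```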