Run on Hadoop/YARN Clusters
This tutorial provides a step-by-step guide on how to run BigDL-Orca programs on Apache Hadoop/YARN clusters, using a PyTorch Fashion-MNIST program as a working example.
1. Key Concepts
1.1 init_orca_context
A BigDL Orca program usually starts with the initialization of OrcaContext. For every BigDL Orca program, you should call init_orca_context at the beginning of the program as below:
from bigdl.orca import init_orca_context
init_orca_context(cluster_mode, cores, memory, num_nodes, driver_cores, driver_memory, extra_python_lib, conf)
In init_orca_context, you may specify necessary runtime configurations for running the example on YARN, including:
- cluster_mode: a String that specifies the underlying cluster; valid values include "local", "yarn-client", "yarn-cluster", "k8s-client", "k8s-cluster", "bigdl-submit" and "spark-submit".
- cores: an Integer that specifies the number of cores for each executor (default to be 2).
- memory: a String that specifies the memory for each executor (default to be "2g").
- num_nodes: an Integer that specifies the number of executors (default to be 1).
- driver_cores: an Integer that specifies the number of cores for the driver node (default to be 4).
- driver_memory: a String that specifies the memory for the driver node (default to be "1g").
- extra_python_lib: a String that specifies the path to extra Python packages, one of .py, .zip or .egg files (default to be None).
- conf: a Key-Value format to append extra configurations for Spark (default to be None).
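For example, a fully configured call for a two-executor YARN job might look as below (a minimal sketch; the Spark property passed through conf is only an illustration, use whatever properties your application needs):
from bigdl.orca import init_orca_context

# Two executors with 4 cores and 10g memory each, plus an extra Spark property.
init_orca_context(cluster_mode="yarn-client", cores=4, memory="10g", num_nodes=2,
                  driver_cores=2, driver_memory="4g",
                  conf={"spark.driver.maxResultSize": "4g"})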
Note:
- All arguments except cluster_mode will be ignored when using bigdl-submit or spark-submit to submit and run Orca programs, in which case you are supposed to specify these configurations via the submit command.
After the Orca program finishes, you should call stop_orca_context at the end of the program to release resources and shut down the underlying distributed runtime engine (such as Spark or Ray).
from bigdl.orca import stop_orca_context
stop_orca_context()
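A typical Orca program therefore follows the skeleton below (a minimal sketch; the training logic in the middle is whatever your application does):
from bigdl.orca import init_orca_context, stop_orca_context

init_orca_context(cluster_mode="yarn-client", cores=4, memory="10g", num_nodes=2)
try:
    # Build Estimators and run distributed training/inference here.
    pass
finally:
    # Always release resources, even if the job above fails.
    stop_orca_context()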
For more details, please see OrcaContext.
1.2 Yarn-Client & Yarn-Cluster
The difference between yarn-client and yarn-cluster is where you run your Spark driver.
For yarn-client, the Spark driver runs in the client process, and the application master is only used for requesting resources from YARN. For yarn-cluster, the Spark driver runs inside an application master process which is managed by YARN in the cluster.
For more details, please see Launching Spark on YARN.
1.3 Use Distributed Storage When Running on YARN
Note:
- When running programs on YARN, you are recommended to load data from a distributed storage (e.g. HDFS or S3) instead of the local file system.
The Fashion-MNIST example uses a utility function get_remote_file_to_local provided by BigDL to download datasets and create PyTorch Dataloader on each executor.
import torch
import torchvision
import torchvision.transforms as transforms
from bigdl.orca.data.file import get_remote_file_to_local
def train_data_creator(config, batch_size):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (0.5,))])
    # Download the dataset from HDFS to the local file system of each executor.
    get_remote_file_to_local(remote_path="hdfs://path/to/dataset", local_path="/tmp/dataset")
    trainset = torchvision.datasets.FashionMNIST(root="/tmp/dataset", train=True,
                                                 download=False, transform=transform)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                              shuffle=True, num_workers=0)
    return trainloader
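Such a data creator function is then passed to an Orca Estimator for distributed training. A minimal sketch of this usage, assuming model_creator and optimizer_creator are defined in model.py as in the rest of this tutorial:
import torch
from bigdl.orca.learn.pytorch import Estimator
from bigdl.orca.learn.metrics import Accuracy
from model import model_creator, optimizer_creator

# Each executor calls train_data_creator to build its own local DataLoader.
est = Estimator.from_torch(model=model_creator,
                           optimizer=optimizer_creator,
                           loss=torch.nn.CrossEntropyLoss(),
                           metrics=[Accuracy()])
est.fit(data=train_data_creator, epochs=1, batch_size=32)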
2. Prepare Environment
Before running the BigDL program on YARN, you need to set up the environment following the steps below:
2.1 Setup JAVA & Hadoop Environment
Setup JAVA Environment
You need to download and install a JDK in the environment, and properly set the environment variable JAVA_HOME, which is required by Spark. JDK 8 is highly recommended.
# For Ubuntu
sudo apt-get install openjdk-8-jre
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
# For CentOS
su -c "yum install java-1.8.0-openjdk"
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.282.b08-1.el7_9.x86_64/jre
export PATH=$PATH:$JAVA_HOME/bin
java -version # Verify the version of JDK.
Setup Hadoop Environment
Check the Hadoop setup and configurations of your cluster. Make sure you correctly set the environment variable HADOOP_CONF_DIR, which is needed to initialize Spark on YARN:
export HADOOP_CONF_DIR=/path/to/hadoop/conf
2.2 Install Python Libraries
Install Conda
You first need to use Conda to prepare the Python environment on the Client Node (where you submit applications). You can download and install Conda by following the Conda User Guide or by executing the commands below.
# Download Anaconda installation script
wget -P /tmp https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh
# Execute the script to install conda
bash /tmp/Anaconda3-2020.02-Linux-x86_64.sh
# Please type this command in your terminal to activate Conda environment
source ~/.bashrc
Use Conda to Install BigDL and Other Python Libraries
Create a Conda environment and install BigDL and all the needed Python libraries in it:
# "env" is conda environment name, you can use any name you like.
# Please change Python version to 3.6 if you need a Python 3.6 environment.
conda create -n env python=3.7
conda activate env
You can install the latest release version of BigDL (built on top of Spark 2.4.6 by default) as follows:
pip install bigdl
You can install the latest nightly build of BigDL as follows:
pip install --pre --upgrade bigdl
Notes:
- Using Conda to install BigDL will automatically install libraries including conda-pack, pyspark==2.4.6, and other related dependencies.
- You can install BigDL built on top of Spark 3.1.2 as follows:
# Install the latest release version
pip install bigdl-spark3
# Install the latest nightly build version
pip install --pre --upgrade bigdl-spark3
Installing bigdl-spark3 will automatically install pyspark==3.1.2.
- You also need to install any additional Python libraries that your application depends on in this Conda environment.
Please see more details in Python User Guide.
2.3 Notes for CDH Users
- For CDH users, the environment variable HADOOP_CONF_DIR should be /etc/hadoop/conf by default.
- The Client Node (where you submit applications) may have already installed a different version of Spark than the one installed with BigDL. To avoid conflicts, unset all Spark-related environment variables (you may use env | grep SPARK to find all of them):
unset SPARK_HOME
unset SPARK_VERSION
unset ...
3. Prepare Dataset
To run the example on YARN, you should upload the Fashion-MNIST dataset to a distributed storage (such as HDFS or S3).
First, please download the Fashion-MNIST dataset manually on your Client Node (where you submit the program to YARN).
# Download the Fashion-MNIST dataset from its official repository
git clone https://github.com/zalandoresearch/fashion-mnist.git
# Move the raw data files to the folder layout expected by torchvision
mv /path/to/fashion-mnist/data/fashion /path/to/local/data/FashionMNIST/raw
Then upload it to a distributed storage.
# Upload to HDFS
hdfs dfs -put /path/to/local/data/FashionMNIST hdfs://path/to/remote/data
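To sanity-check the upload, you can pull the data back with the same BigDL utility used in Section 1.3 (a minimal sketch; the paths are placeholders and mirror the directory usage shown earlier):
from bigdl.orca.data.file import get_remote_file_to_local
import os

# Download the uploaded dataset back to a temporary folder and list its files.
get_remote_file_to_local(remote_path="hdfs://path/to/remote/data/FashionMNIST",
                         local_path="/tmp/check/FashionMNIST")
print(os.listdir("/tmp/check/FashionMNIST"))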
4. Prepare Custom Modules
Spark allows you to upload Python files (.py) and zipped Python packages (.zip) to the cluster by setting the --py-files option in Spark scripts or the extra_python_lib argument in init_orca_context.
The Fashion-MNIST example needs to import modules from model.py.
- When using the python command, please specify extra_python_lib in init_orca_context.
from bigdl.orca import init_orca_context, stop_orca_context
from model import model_creator, optimizer_creator

# Please switch `cluster_mode` to `yarn-cluster` when running in cluster mode.
init_orca_context(cluster_mode="yarn-client", cores=4, memory="10g", num_nodes=2,
                  driver_cores=2, driver_memory="4g",
                  extra_python_lib="model.py")
Please see more details in the Orca Document.
- When using the bigdl-submit or spark-submit script, please specify the --py-files option in the script.
# Use bigdl-submit or spark-submit
bigdl-submit \
    --master yarn \
    --deploy-mode client \
    --py-files model.py \
    train.py
Then import the custom modules at the beginning of the example:
from bigdl.orca import init_orca_context, stop_orca_context
from model import model_creator, optimizer_creator

init_orca_context(cluster_mode="bigdl-submit")  # or "spark-submit"
Please see more details in the Spark Document.
Note:
- If your program depends on a nested directory of Python files, you are recommended to follow the steps below to use a zipped package instead.
- Compress the directory into a zipped package:
zip -q -r FashionMNIST_zipped.zip FashionMNIST
- Upload the zipped package (FashionMNIST_zipped.zip) to YARN:
  - When using the python command, please specify the extra_python_lib argument in init_orca_context.
  - When using the bigdl-submit or spark-submit script, please specify the --py-files option in the script.
- You can then import the custom modules from the unzipped file in your program as below:
from FashionMNIST.model import model_creator, optimizer_creator
5. Run Jobs on YARN
In the following part, we will illustrate three ways to submit and run BigDL Orca applications on YARN.
- Use the python command
- Use bigdl-submit
- Use spark-submit
You can choose one of them based on your preference or cluster settings.
5.1 Use python Command
This is the easiest and most recommended way to run BigDL on YARN.
Note:
- You only need to prepare the environment on the Client Node (where you submit applications); all dependencies will be automatically packaged and distributed to the YARN cluster.
5.1.1 Yarn Client
Please call init_orca_context at the very beginning of each Orca program.
from bigdl.orca import init_orca_context
init_orca_context(cluster_mode="yarn-client", cores=4, memory="10g", num_nodes=2,
driver_cores=2, driver_memory="4g",
extra_python_lib="model.py")
Run the example with the command below:
python train.py --cluster_mode yarn-client --remote_dir hdfs://path/to/remote/data
- --cluster_mode: set the cluster_mode in init_orca_context. (A sketch of how train.py might parse these flags is shown after the note below.)
- --remote_dir: directory on a distributed storage for the dataset (see Section 3).
Note:
- Please refer to Section 4 for the description of extra_python_lib.
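As referenced above, here is a minimal sketch of how train.py might parse these flags before calling init_orca_context (hypothetical argument parsing; the flag names match the commands in this section):
import argparse

from bigdl.orca import init_orca_context

parser = argparse.ArgumentParser(description="PyTorch Fashion-MNIST on Orca")
parser.add_argument("--cluster_mode", type=str, default="local",
                    help="The cluster mode, such as local, yarn-client or yarn-cluster.")
parser.add_argument("--remote_dir", type=str,
                    help="The directory on distributed storage that stores the dataset.")
args = parser.parse_args()

init_orca_context(cluster_mode=args.cluster_mode, cores=4, memory="10g", num_nodes=2,
                  driver_cores=2, driver_memory="4g", extra_python_lib="model.py")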
5.1.2 Yarn Cluster
Please call init_orca_context at the very beginning of each Orca program.
from bigdl.orca import init_orca_context
init_orca_context(cluster_mode="yarn-cluster", cores=4, memory="10g", num_nodes=2,
driver_cores=2, driver_memory="4g",
extra_python_lib="model.py")
Run the example with the command below:
python train.py --cluster_mode yarn-cluster --remote_dir hdfs://path/to/remote/data
- --cluster_mode: set the cluster_mode in init_orca_context.
- --remote_dir: directory on a distributed storage for the dataset (see Section 3).
Note:
- Please refer to Section 4 for the description of extra_python_lib.
5.1.3 Jupyter Notebook
You can easily run the example in a Jupyter Notebook.
# Start a jupyter notebook
jupyter notebook --notebook-dir=/path/to/notebook/directory --ip=* --no-browser
You can copy the code of train.py to the notebook and run the cells in yarn-client mode.
from bigdl.orca import init_orca_context
init_orca_context(cluster_mode="yarn-client", cores=4, memory="10g", num_nodes=2,
driver_cores=2, driver_memory="4g",
extra_python_lib="model.py")
Note:
- Jupyter Notebook cannot run in yarn-cluster mode, as the driver does not run on the Client Node (where the notebook page is).
5.2 Use bigdl-submit
For users who want to use a script instead of the python command, BigDL provides an easy-to-use bigdl-submit script, which automatically sets up the configuration and jar files from the currently active Conda environment.
Please call init_orca_context at the very beginning of the program.
from bigdl.orca import init_orca_context
init_orca_context(cluster_mode="bigdl-submit")
On the Client Node (where you submit applications), before submitting the example:
- Install and activate Conda environment (see Section 2.2.1).
- Use Conda to install BigDL and other Python libraries (see Section 2.2.2).
- Pack the currently active Conda environment into an archive:
conda pack -o environment.tar.gz
5.2.1 Yarn Client
Submit and run the example in yarn-client mode with the bigdl-submit script below:
bigdl-submit \
--master yarn \
--deploy-mode client \
--executor-memory 10g \
--driver-memory 10g \
--executor-cores 8 \
--num-executors 2 \
--py-files model.py \
--archives /path/to/environment.tar.gz#environment \
--conf spark.pyspark.driver.python=/path/to/python \
--conf spark.pyspark.python=environment/bin/python \
train.py --cluster_mode bigdl-submit --remote_dir hdfs://path/to/remote/data
In the bigdl-submit script:
- --master: the Spark master; set it to yarn.
- --deploy-mode: set it to client when running programs in yarn-client mode.
- --executor-memory: the memory for each executor.
- --driver-memory: the memory for the driver node.
- --executor-cores: the number of cores for each executor.
- --num-executors: the number of executors.
- --py-files: upload extra Python dependency files to YARN.
- --archives: upload the Conda archive to YARN.
- --conf spark.pyspark.driver.python: set the Python location of the activated Conda environment on the Client Node as the driver's Python environment (find the location by running which python).
- --conf spark.pyspark.python: set the Python location in the Conda archive as the executors' Python environment.
Notes:
- --cluster_mode: set the cluster_mode in init_orca_context.
- --remote_dir: directory on a distributed storage for the dataset (see Section 3).
- Please refer to Section 4 for the description of extra Python dependencies.
5.2.2 Yarn Cluster
Submit and run the program in yarn-cluster mode with the bigdl-submit script below:
bigdl-submit \
--master yarn \
--deploy-mode cluster \
--executor-memory 10g \
--driver-memory 10g \
--executor-cores 8 \
--num-executors 2 \
--py-files model.py \
--archives /path/to/environment.tar.gz#environment \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=environment/bin/python \
--conf spark.executorEnv.PYSPARK_PYTHON=environment/bin/python \
train.py --cluster_mode bigdl-submit --remote_dir hdfs://path/to/remote/data
In the bigdl-submit script:
- --master: the Spark master; set it to yarn.
- --deploy-mode: set it to cluster when running programs in yarn-cluster mode.
- --executor-memory: the memory for each executor.
- --driver-memory: the memory for the driver node.
- --executor-cores: the number of cores for each executor.
- --num-executors: the number of executors.
- --py-files: upload extra Python dependency files to YARN.
- --archives: upload the Conda archive to YARN.
- --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON: set the Python location in the Conda archive as the Python environment of the Application Master process.
- --conf spark.executorEnv.PYSPARK_PYTHON: set the Python location in the Conda archive as the executors' Python environment; the Application Master and the executors will all use the archive for their Python environment.
Notes:
- --cluster_mode: set the cluster_mode in init_orca_context.
- --remote_dir: directory on a distributed storage for the dataset (see Section 3).
- Please refer to Section 4 for the description of extra Python dependencies.
5.3 Use spark-submit
When the Client Node (where you submit applications) is not able to install BigDL using Conda, please use the spark-submit script instead.
Please call init_orca_context at the very beginning of the program.
from bigdl.orca import init_orca_context
# Please set cluster_mode to "spark-submit".
init_orca_context(cluster_mode="spark-submit")
Before submitting the application, you need to:
- On the Development Node (which can use Conda):
  - Install and activate a Conda environment (see Section 2.2.1).
  - Use Conda to install BigDL and other Python libraries (see Section 2.2.2).
  - Pack the currently active Conda environment into an archive:
    conda pack -o environment.tar.gz
  - Send the Conda archive to the Client Node:
    scp /path/to/environment.tar.gz username@client_ip:/path/to/
- On the Client Node (where you submit applications):
  - Set up the Spark environment variables ${SPARK_HOME} and ${SPARK_VERSION}:
    export SPARK_HOME=/path/to/spark  # the folder path where you extract the Spark package
    export SPARK_VERSION="your spark version"
  - Download and unzip a BigDL assembly package from BigDL Assembly Spark 2.4.6 or BigDL Assembly Spark 3.1.2 (according to your Spark version), then set up ${BIGDL_HOME} and ${BIGDL_VERSION}:
    export BIGDL_HOME=/path/to/unzipped_BigDL
    export BIGDL_VERSION="downloaded BigDL version"
5.3.1 Yarn Client
Submit and run the program in yarn-client mode with the spark-submit script below:
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode client \
--executor-memory 10g \
--driver-memory 10g \
--executor-cores 8 \
--num-executors 2 \
--archives /path/to/environment.tar.gz#environment \
--properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
--py-files ${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,model.py \
--conf spark.pyspark.driver.python=/path/to/python \
--conf spark.pyspark.python=environment/bin/python \
--conf spark.driver.extraClassPath=${BIGDL_HOME}/jars/* \
--conf spark.executor.extraClassPath=${BIGDL_HOME}/jars/* \
train.py --cluster_mode spark-submit --remote_dir hdfs://path/to/remote/data
In the spark-submit script:
- --master: the Spark master; set it to yarn.
- --deploy-mode: set it to client when running programs in yarn-client mode.
- --executor-memory: the memory for each executor.
- --driver-memory: the memory for the driver node.
- --executor-cores: the number of cores for each executor.
- --num-executors: the number of executors.
- --archives: upload the Conda archive to YARN.
- --properties-file: upload the BigDL configuration properties to YARN.
- --py-files: upload the BigDL Python API zip and extra Python dependency files to YARN.
- --conf spark.pyspark.driver.python: set the Python location on the Client Node as the driver's Python environment (find the location by running which python).
- --conf spark.pyspark.python: set the Python location in the Conda archive as the executors' Python environment.
- --conf spark.driver.extraClassPath: add the BigDL jar files to the driver's classpath.
- --conf spark.executor.extraClassPath: add the BigDL jar files to the executors' classpath.
Notes:
- --cluster_mode: set the cluster_mode in init_orca_context.
- --remote_dir: directory on a distributed storage for the dataset (see Section 3).
- Please refer to Section 4 for the description of extra Python dependencies.
5.3.2 Yarn Cluster
Note:
- Please register the BigDL jars through the --jars option in the spark-submit script.
Submit and run the program in yarn-cluster mode with the spark-submit script below:
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--executor-memory 10g \
--driver-memory 10g \
--executor-cores 4 \
--num-executors 2 \
--archives /path/to/environment.tar.gz#environment \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=environment/bin/python \
--conf spark.executorEnv.PYSPARK_PYTHON=environment/bin/python \
--py-files ${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,model.py \
--jars ${BIGDL_HOME}/jars/bigdl-assembly-spark_${SPARK_VERSION}-${BIGDL_VERSION}-jar-with-dependencies.jar \
train.py --cluster_mode spark-submit --remote_dir hdfs://path/to/remote/data
In the spark-submit script:
- --master: the Spark master; set it to yarn.
- --deploy-mode: set it to cluster when running programs in yarn-cluster mode.
- --executor-memory: the memory for each executor.
- --driver-memory: the memory for the driver node.
- --executor-cores: the number of cores for each executor.
- --num-executors: the number of executors.
- --archives: upload the Conda archive to YARN.
- --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON: set the Python location in the Conda archive as the Python environment of the Application Master process.
- --conf spark.executorEnv.PYSPARK_PYTHON: set the Python location in the Conda archive as the executors' Python environment; the Application Master and the executors will all use the archive for their Python environment.
- --py-files: upload the BigDL Python API zip and extra Python dependency files to YARN.
- --jars: upload and register the BigDL dependency jar files to YARN.
Notes:
- --cluster_mode: set the cluster_mode in init_orca_context.
- --remote_dir: directory on a distributed storage for the dataset (see Section 3).
- Please refer to Section 4 for the description of extra Python dependencies.