
Running BigDL-Orca on Hadoop/YARN Clusters

This tutorial provides a step-by-step guide on how to run BigDL-Orca programs on Apache Hadoop/YARN clusters, using a PyTorch Fashion-MNIST program as a working example.

1. Key Concepts

1.1 init_orca_context

A BigDL Orca program usually starts by initializing an OrcaContext. You should call init_orca_context at the beginning of every Orca program as below:

from bigdl.orca import init_orca_context

init_orca_context(cluster_mode, cores, memory, num_nodes, driver_cores, driver_memory, extra_python_lib, conf)

In init_orca_context, you may specify necessary runtime configurations for running the example on YARN, including:

  • cluster_mode: a String that specifies the underlying cluster; valid values include "local", "yarn-client", "yarn-cluster", "k8s-client", "k8s-cluster", "bigdl-submit" and "spark-submit".
  • cores: an Integer that specifies the number of cores for each executor (default to be 2).
  • memory: a String that specifies the memory for each executor (default to be "2g").
  • num_nodes: an Integer that specifies the number of executors (default to be 1).
  • driver_cores: an Integer that specifies the number of cores for the driver node (default to be 4).
  • driver_memory: a String that specifies the memory for the driver node (default to be "1g").
  • extra_python_lib: a String that specifies the path(s) to extra Python files or packages, in .py, .zip or .egg format (default to be None).
  • conf: a dictionary of extra Spark configurations to append, in key-value format (default to be None).

Note:

  • All arguments except cluster_mode will be ignored when using bigdl-submit or spark-submit to submit and run Orca programs, in which case you should specify these configurations in the submit command instead.
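
For example, a minimal sketch (with illustrative resource values) that initializes Orca in yarn-client mode and appends an extra Spark configuration through conf:

from bigdl.orca import init_orca_context

# Illustrative values: 2 executors, each with 4 cores and 10g memory.
# `conf` takes a dict of additional Spark configurations.
init_orca_context(cluster_mode="yarn-client", cores=4, memory="10g",
                  num_nodes=2, driver_cores=2, driver_memory="4g",
                  conf={"spark.executor.memoryOverhead": "2g"})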

When the Orca program finishes, you should call stop_orca_context at the end of the program to release resources and shut down the underlying distributed runtime engine (such as Spark or Ray).

from bigdl.orca import stop_orca_context

stop_orca_context()
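
Since resources should be released even if the program fails, you may want to guard your main logic with try/finally, as in this minimal sketch:

from bigdl.orca import init_orca_context, stop_orca_context

init_orca_context(cluster_mode="yarn-client")
try:
    # Run your training or inference logic here.
    pass
finally:
    # Always release resources and shut down the runtime engine.
    stop_orca_context()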

For more details, please see OrcaContext.

1.2 Yarn-Client & Yarn-Cluster

The difference between yarn-client and yarn-cluster is where you run your Spark driver.

In yarn-client mode, the Spark driver runs in the client process, and the Application Master is only used for requesting resources from YARN; in yarn-cluster mode, the Spark driver runs inside an Application Master process that is managed by YARN in the cluster.

For more details, please see Launching Spark on YARN.

1.3 Use Distributed Storage When Running on YARN

Note:

  • When running programs on YARN, we recommend loading data from distributed storage (e.g. HDFS or S3) instead of the local file system.

The Fashion-MNIST example uses a utility function get_remote_file_to_local provided by BigDL to download the dataset and create a PyTorch DataLoader on each executor.

import torch
import torchvision
import torchvision.transforms as transforms
from bigdl.orca.data.file import get_remote_file_to_local

def train_data_creator(config, batch_size):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (0.5,))])

    get_remote_file_to_local(remote_path="hdfs://path/to/dataset", local_path="/tmp/dataset")

    trainset = torchvision.datasets.FashionMNIST(root="/tmp/dataset", train=True,
                                                 download=False, transform=transform)

    trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                              shuffle=True, num_workers=0)

    return trainloader
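
A validation data creator can be defined in the same way; the sketch below assumes the test split lives under the same remote dataset directory:

def validation_data_creator(config, batch_size):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (0.5,))])

    # Download the dataset from remote storage to each executor's local disk.
    get_remote_file_to_local(remote_path="hdfs://path/to/dataset", local_path="/tmp/dataset")

    testset = torchvision.datasets.FashionMNIST(root="/tmp/dataset", train=False,
                                                download=False, transform=transform)

    return torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                       shuffle=False, num_workers=0)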

2. Prepare Environment

Before running the BigDL program on YARN, you need to set up the environment following the steps below:

2.1 Setup JAVA & Hadoop Environment

Setup JAVA Environment

You need to download and install a JDK in the environment, and properly set the environment variable JAVA_HOME, which is required by Spark. JDK 8 is highly recommended.

# For Ubuntu
sudo apt-get install openjdk-8-jre
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/

# For CentOS
su -c "yum install java-1.8.0-openjdk"
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.282.b08-1.el7_9.x86_64/jre

export PATH=$PATH:$JAVA_HOME/bin
java -version  # Verify the version of JDK.

Setup Hadoop Environment

Check the Hadoop setup and configurations of your cluster. Make sure you correctly set the environment variable HADOOP_CONF_DIR, which is needed to initialize Spark on YARN:

export HADOOP_CONF_DIR=/path/to/hadoop/conf

2.2 Install Python Libraries

Install Conda

You first need to use Conda to prepare the Python environment on the Client Node (where you submit applications). You can download and install Conda by following the Conda User Guide or by executing the commands below.

# Download Anaconda installation script 
wget -P /tmp https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh

# Execute the script to install conda
bash /tmp/Anaconda3-2020.02-Linux-x86_64.sh

# Please type this command in your terminal to activate Conda environment
source ~/.bashrc

Use Conda to install BigDL and other Python libraries

Create a Conda environment, then install BigDL and all the needed Python libraries in it:

# "env" is conda environment name, you can use any name you like.
# Please change Python version to 3.6 if you need a Python 3.6 environment.
conda create -n env python=3.7 
conda activate env

You can install the latest release version of BigDL (built on top of Spark 2.4.6 by default) as follows:

pip install bigdl

You can install the latest nightly build of BigDL as follows:

pip install --pre --upgrade bigdl

Notes:

  • Installing BigDL will automatically install dependencies including conda-pack, pyspark==2.4.6, and other related libraries.

  • You can install BigDL built on top of Spark 3.1.2 as follows:

    # Install the latest release version
    pip install bigdl-spark3
    
    # Install the latest nightly build version
    pip install --pre --upgrade bigdl-spark3 
    

    Installing bigdl-spark3 will automatically install pyspark==3.1.2 (a quick way to verify the installed version is shown after this list).

  • You also need to install any additional python libraries that your application depends on in this Conda environment.
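
To confirm which PySpark version was pulled in, a quick check (assuming the Conda environment is activated):

import pyspark

# Should print 2.4.6 for bigdl, or 3.1.2 for bigdl-spark3.
print(pyspark.__version__)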

Please see more details in Python User Guide.

2.3 Notes for CDH Users

  • For CDH users, the environment variable HADOOP_CONF_DIR should be /etc/hadoop/conf by default.

  • The Client Node (where you submit applications) may already have a different version of Spark installed than the one installed with BigDL. To avoid conflicts, unset all Spark-related environment variables (you may use env | grep SPARK to find all of them):

    unset SPARK_HOME
    unset SPARK_VERSION
    unset ...
    

3. Prepare Dataset

To run the example on YARN, you should upload the Fashion-MNIST dataset to a distributed storage (such as HDFS or S3).

First, please download the Fashion-MNIST dataset manually on your Client Node (where you submit the program to YARN).

# Clone the Fashion-MNIST repository, which contains the dataset files
git clone https://github.com/zalandoresearch/fashion-mnist.git

mv /path/to/fashion-mnist/data/fashion /path/to/local/data/FashionMNIST/raw 

Then upload it to a distributed storage.

# Upload to HDFS
hdfs dfs -put /path/to/local/data/FashionMNIST hdfs://path/to/remote/data

4. Prepare Custom Modules

Spark allows you to upload Python files (.py) and zipped Python packages (.zip) to the cluster by setting the --py-files option in Spark scripts or the extra_python_lib argument in init_orca_context.

The FashionMNIST example needs to import modules from model.py; an illustrative sketch of this file is shown below.
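
For reference, a hypothetical sketch of what model.py might define (the actual example's network architecture may differ):

# model.py (illustrative sketch)
import torch
import torch.nn as nn

def model_creator(config):
    # A simple CNN for 28x28 grayscale Fashion-MNIST images.
    return nn.Sequential(
        nn.Conv2d(1, 32, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(32 * 14 * 14, 10),
    )

def optimizer_creator(model, config):
    # The learning rate here is a placeholder; read it from `config` if provided.
    return torch.optim.SGD(model.parameters(), lr=config.get("lr", 1e-3))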

  • When using the python command, please specify extra_python_lib in init_orca_context.

    from bigdl.orca import init_orca_context, stop_orca_context
    from model import model_creator, optimizer_creator
    
    # Please switch the `cluster_mode` to `yarn-cluster` when running in yarn-cluster mode.
    init_orca_context(cluster_mode="yarn-client", cores=4, memory="10g", num_nodes=2,
                      driver_cores=2, driver_memory="4g",
                      extra_python_lib="model.py")
    

    Please see more details in Orca Document.

  • When using the bigdl-submit or spark-submit script, please specify the --py-files option in the script.

    # Use bigdl-submit or spark-submit
    bigdl-submit \
        --master yarn \
        --deploy-mode client \
        --py-files model.py \
        train.py
    

    Import custom modules at the beginning of the example:

    from bigdl.orca import init_orca_context, stop_orca_context
    from model import model_creator, optimizer_creator
    
    init_orca_context(cluster_mode="bigdl-submit") # or spark-submit
    

    Please see more details in Spark Document.

Note:

  • If your program depends on a nested directory of Python files, we recommend following the steps below to use a zipped package instead.
    1. Compress the directory into a zipped package.
      zip -q -r FashionMNIST_zipped.zip FashionMNIST
      
    2. Please upload the zipped package (FashionMNIST_zipped.zip) to YARN.
      • When using the python command, please specify the extra_python_lib argument in init_orca_context.

      • When using the bigdl-submit or spark-submit script, please specify the --py-files option in the script.

    3. You can then import the custom modules from the zipped package in your program as below.
      from FashionMNIST.model import model_creator, optimizer_creator
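
Putting these steps together, a hypothetical call that ships the zipped package with the python command:

from bigdl.orca import init_orca_context

# Ship the zipped package to YARN; extra_python_lib can typically take
# multiple files as a comma-separated string.
init_orca_context(cluster_mode="yarn-client",
                  extra_python_lib="FashionMNIST_zipped.zip")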
      

5. Run Jobs on YARN

In the following part, we will illustrate three ways to submit and run BigDL Orca applications on YARN.

  • Use python command
  • Use bigdl-submit
  • Use spark-submit

You can choose one of them based on your preference or cluster settings.

5.1 Use python Command

This is the easiest and most recommended way to run BigDL on YARN.

Note:

  • You only need to prepare the environment on the Client Node (where you submit applications); all dependencies will be automatically packaged and distributed to the YARN cluster.

5.1.1 Yarn Client

Please call init_orca_context at the very beginning of each Orca program.

from bigdl.orca import init_orca_context

init_orca_context(cluster_mode="yarn-client", cores=4, memory="10g", num_nodes=2, 
                  driver_cores=2, driver_memory="4g",
                  extra_python_lib="model.py")

Run the example with the following command:

python train.py --cluster_mode yarn-client --remote_dir hdfs://path/to/remote/data
  • --cluster_mode: set the cluster_mode in init_orca_context.
  • --remote_dir: directory on a distributed storage for the dataset (see Section 3).
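
As a sketch, train.py might wire these command-line arguments into init_orca_context along the following lines (the parsing details are illustrative):

import argparse
from bigdl.orca import init_orca_context

parser = argparse.ArgumentParser()
parser.add_argument("--cluster_mode", type=str, default="local",
                    help="The cluster mode passed to init_orca_context.")
parser.add_argument("--remote_dir", type=str,
                    help="Directory on distributed storage for the dataset.")
args = parser.parse_args()

init_orca_context(cluster_mode=args.cluster_mode, cores=4, memory="10g",
                  num_nodes=2, driver_cores=2, driver_memory="4g",
                  extra_python_lib="model.py")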

Note:

  • Please refer to Section 4 for the description of extra_python_lib.

5.1.2 Yarn Cluster

Please call init_orca_context at the very beginning of each Orca program.

from bigdl.orca import init_orca_context

init_orca_context(cluster_mode="yarn-cluster", cores=4, memory="10g", num_nodes=2,
                  driver_cores=2, driver_memory="4g",
                  extra_python_lib="model.py")

Run the example with the following command:

python train.py --cluster_mode yarn-cluster --remote_dir hdfs://path/to/remote/data
  • --cluster_mode: set the cluster_mode in init_orca_context.
  • --remote_dir: directory on a distributed storage for the dataset (see Section 3).

Note:

  • Please refer to Section 4 for the description of extra_python_lib.

5.1.3 Jupyter Notebook

You can easily run the example in a Jupyter Notebook.

# Start a jupyter notebook
jupyter notebook --notebook-dir=/path/to/notebook/directory --ip=* --no-browser

You can copy the code of train.py to the notebook and run the cells in yarn-client mode.

from bigdl.orca import init_orca_context

init_orca_context(cluster_mode="yarn-client", cores=4, memory="10g", num_nodes=2, 
                  driver_cores=2, driver_memory="4g",
                  extra_python_lib="model.py")

Note:

  • Jupyter Notebooks cannot run in yarn-cluster mode, as the driver does not run on the Client Node (where the notebook runs).

5.2 Use bigdl-submit

For users who prefer a script to the python command, BigDL provides an easy-to-use bigdl-submit script, which automatically sets up the Spark configuration and BigDL jar files from the currently activated Conda environment.

Please call init_orca_context at the very beginning of the program.

from bigdl.orca import init_orca_context

init_orca_context(cluster_mode="bigdl-submit")

On the Client Node (where you submit applications), before submitting the example:

  1. Install and activate Conda environment (see Section 2.2.1).
  2. Use Conda to install BigDL and other Python libraries (see Section 2.2.2).
  3. Pack the currently activated Conda environment into an archive:
    conda pack -o environment.tar.gz
    

5.2.1 Yarn Client

Submit and run the example in yarn-client mode with the bigdl-submit script below:

bigdl-submit \
    --master yarn \
    --deploy-mode client \
    --executor-memory 10g \
    --driver-memory 10g \
    --executor-cores 8 \
    --num-executors 2 \
    --py-files model.py \
    --archives /path/to/environment.tar.gz#environment \
    --conf spark.pyspark.driver.python=/path/to/python \
    --conf spark.pyspark.python=environment/bin/python \
    train.py --cluster_mode bigdl-submit --remote_dir hdfs://path/to/remote/data

In the bigdl-submit script:

  • --master: the Spark master; set it to yarn;
  • --deploy-mode: set it to client when running programs in yarn-client mode;
  • --executor-memory: set the memory for each executor;
  • --driver-memory: set the memory for the driver node;
  • --executor-cores: set the number of cores for each executor;
  • --num-executors: set the number of executors;
  • --py-files: upload extra Python dependency files to YARN;
  • --archives: upload the Conda archive to YARN;
  • --conf spark.pyspark.driver.python: set the Python location of the activated Conda environment on the Client Node as the driver's Python environment (find it by running which python);
  • --conf spark.pyspark.python: set the Python location in the Conda archive as the executors' Python environment.

Notes:

  • --cluster_mode: set the cluster_mode in init_orca_context.
  • --remote_dir: directory on a distributed storage for the dataset (see Section 3).
  • Please refer to Section 4 for the description of extra Python dependencies.

5.2.2 Yarn Cluster

Submit and run the program in yarn-cluster mode with the bigdl-submit script below:

bigdl-submit \
    --master yarn \
    --deploy-mode cluster \
    --executor-memory 10g \
    --driver-memory 10g \
    --executor-cores 8 \
    --num-executors 2 \
    --py-files model.py \
    --archives /path/to/environment.tar.gz#environment \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=environment/bin/python \
    --conf spark.executorEnv.PYSPARK_PYTHON=environment/bin/python \
    train.py --cluster_mode bigdl-submit --remote_dir hdfs://path/to/remote/data

In the bigdl-submit script:

  • --master: the Spark master; set it to yarn;
  • --deploy-mode: set it to cluster when running programs in yarn-cluster mode;
  • --executor-memory: set the memory for each executor;
  • --driver-memory: set the memory for the driver node;
  • --executor-cores: set the number of cores for each executor;
  • --num-executors: set the number of executors;
  • --py-files: upload extra Python dependency files to YARN;
  • --archives: upload the Conda archive to YARN;
  • --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON: set the Python location in the Conda archive as the Python environment of the Application Master process;
  • --conf spark.executorEnv.PYSPARK_PYTHON: set the Python location in the Conda archive as the executors' Python environment; the Application Master and the executors will all use the archive for their Python environment.

Notes:

  • --cluster_mode: set the cluster_mode in init_orca_context;
  • --remote_dir: directory on a distributed storage for the dataset (see Section 3).
  • Please refer to Section 4 for the description of extra Python dependencies.

5.3 Use spark-submit

When the Client Node (where you submit applications) cannot install BigDL using Conda, please use the spark-submit script instead.

Please call init_orca_context at the very beginning of the program.

from bigdl.orca import init_orca_context

# Please set cluster_mode to "spark-submit".
init_orca_context(cluster_mode="spark-submit")

Before submitting the application, you need to:

  • On the Development Node (where Conda can be used):
    1. Install and activate Conda environment (see Section 2.2.1).
    2. Use Conda to install BigDL and other Python libraries (see Section 2.2.2).
    3. Pack the currently activated Conda environment into an archive:
      conda pack -o environment.tar.gz
      
    4. Send the Conda archive to the Client Node:
      scp /path/to/environment.tar.gz username@client_ip:/path/to/
      
  • On the Client Node (where you submit applications):
    1. Set up the Spark environment variables ${SPARK_HOME} and ${SPARK_VERSION}:
      export SPARK_HOME=/path/to/spark  # the folder where you extracted the Spark package
      export SPARK_VERSION="your Spark version"
      
    2. Download and unzip a BigDL assembly package from BigDL Assembly Spark 2.4.6 or BigDL Assembly Spark 3.1.2 (according to your Spark version), then set up ${BIGDL_HOME} and ${BIGDL_VERSION}:
      export BIGDL_HOME=/path/to/unzipped_BigDL
      export BIGDL_VERSION="your BigDL version"  # the version of the downloaded BigDL package
      

5.3.1 Yarn Client

Submit and run the program in yarn-client mode with the spark-submit script below:

${SPARK_HOME}/bin/spark-submit \
    --master yarn \
    --deploy-mode client \
    --executor-memory 10g \
    --driver-memory 10g \
    --executor-cores 8 \
    --num-executors 2 \
    --archives /path/to/environment.tar.gz#environment \
    --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
    --py-files ${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,model.py \
    --conf spark.pyspark.driver.python=/path/to/python \
    --conf spark.pyspark.python=environment/bin/python \
    --conf spark.driver.extraClassPath=${BIGDL_HOME}/jars/* \
    --conf spark.executor.extraClassPath=${BIGDL_HOME}/jars/* \
    train.py --cluster_mode spark-submit --remote_dir hdfs://path/to/remote/data

In the spark-submit script:

  • --master: the Spark master; set it to yarn;
  • --deploy-mode: set it to client when running programs in yarn-client mode;
  • --executor-memory: set the memory for each executor;
  • --driver-memory: set the memory for the driver node;
  • --executor-cores: set the number of cores for each executor;
  • --num-executors: set the number of executors;
  • --archives: upload the Conda archive to YARN;
  • --properties-file: upload the BigDL configuration properties to YARN;
  • --py-files: upload extra Python dependency files to YARN;
  • --conf spark.pyspark.driver.python: set a Python location on the Client Node as the driver's Python environment (e.g. the Python inside the Conda archive after extracting it locally);
  • --conf spark.pyspark.python: set the Python location in the Conda archive as the executors' Python environment;
  • --conf spark.driver.extraClassPath: add the BigDL jar files to the driver's classpath;
  • --conf spark.executor.extraClassPath: add the BigDL jar files to the executors' classpath.

Notes:

  • --cluster_mode: set the cluster_mode in init_orca_context;
  • --remote_dir: directory on a distributed storage for the dataset (see Section 3).
  • Please refer to Section 4 for the description of extra Python dependencies.

5.3.2 Yarn Cluster

Note:

  • Please register the BigDL jars through the --jars option in the spark-submit script.

Submit and run the program in yarn-cluster mode with the spark-submit script below:

${SPARK_HOME}/bin/spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --executor-memory 10g \
    --driver-memory 10g \
    --executor-cores 4 \
    --num-executors 2 \
    --archives /path/to/environment.tar.gz#environment \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=environment/bin/python \
    --conf spark.executorEnv.PYSPARK_PYTHON=environment/bin/python \
    --py-files ${BIGDL_HOME}/python/bigdl-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,model.py \
    --jars ${BIGDL_HOME}/jars/bigdl-assembly-spark_${SPARK_VERSION}-${BIGDL_VERSION}-jar-with-dependencies.jar \
    train.py --cluster_mode spark-submit --remote_dir hdfs://path/to/remote/data

In the spark-submit script:

  • --master: the Spark master; set it to yarn;
  • --deploy-mode: set it to cluster when running programs in yarn-cluster mode;
  • --executor-memory: set the memory for each executor;
  • --driver-memory: set the memory for the driver node;
  • --executor-cores: set the number of cores for each executor;
  • --num-executors: set the number of executors;
  • --archives: upload the Conda archive to YARN;
  • --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON: set the Python location in the Conda archive as the Python environment of the Application Master process;
  • --conf spark.executorEnv.PYSPARK_PYTHON: set the Python location in the Conda archive as the executors' Python environment; the Application Master and the executors will all use the archive for their Python environment;
  • --py-files: upload extra Python dependency files to YARN;
  • --jars: upload and register the BigDL dependency jar files to YARN.

Notes:

  • --cluster_mode: set the cluster_mode in init_orca_context;
  • --remote_dir: directory on a distributed storage for the dataset (see Section 3).
  • Please refer to Section 4 for the description of extra Python dependencies.