Update hadoop.md (#3433)

* Update hadoop.md

* add pyspark with dllib

* some update

* add spark-submit-with-dllib
Xin Qiu 2021-11-12 13:34:04 +08:00 committed by GitHub
parent 829ee31f1b
commit 453c991c19


@@ -4,10 +4,10 @@ Hadoop version: Apache Hadoop >= 2.7 (3.X included) or [CDH](https://www.clouder
---
For _**Scala users**_, please see the [scala user guide](./scala.md) for how to run BigDL on a Hadoop/YARN cluster.
For _**Python users**_, you can run BigDL programs on standard Hadoop/YARN clusters without any changes to the cluster (i.e., there is no need to pre-install BigDL or any Python libraries on the cluster).
### **1. Prepare Python Environment**
- You need to first use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) to prepare the Python environment _**on the local client machine**_. Create a conda environment and install all the needed Python libraries in it:
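For example, a minimal sketch (the environment name `bigdl` and the Python version are placeholders; install whatever libraries your program needs):
```bash
# Create and activate a dedicated conda environment on the client machine
conda create -n bigdl python=3.7
conda activate bigdl
```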
@@ -50,19 +50,19 @@ Thus before running bigdl applications, you should unset all the spark related e
Also, on a CDH cluster, `HADOOP_CONF_DIR` should be `/etc/hadoop/conf` by default.
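A minimal sketch of that client-side setup (which variables need unsetting depends on what your shell already exports; `SPARK_HOME` and `SPARK_CONF_DIR` are just examples):
```bash
# Clear any Spark settings inherited from a local Spark installation
unset SPARK_HOME SPARK_CONF_DIR
# Point BigDL at the cluster's Hadoop configuration (CDH default shown)
export HADOOP_CONF_DIR=/etc/hadoop/conf
```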
---
### **2. Run on YARN with built-in functions**
_**This is the recommended way to run BigDL on YARN,**_ as we have built conda packing and all the spark-submit settings into our code, so you can easily switch your job between local and YARN execution.
- Install BigDL components in the created conda environment via pip, like dllib and orca:
```bash
pip install bigdl-dllib bigdl-orca
```
View the [Python User Guide](./python.md) for more details.
- We recommend using `init_orca_context` at the very beginning of your code to initiate and run BigDL on standard [Hadoop/YARN clusters](https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn):
```python
from bigdl.orca import init_orca_context
@@ -70,37 +70,27 @@ Also, CDH cluster's `HADOOP_CONF_DIR` should be `/etc/hadoop/conf` on CDH by def
sc = init_orca_context(cluster_mode="yarn-client", cores=4, memory="10g", num_nodes=2)
```
`init_orca_context` would automatically prepare the runtime Python environment, detect the current Hadoop configurations from `HADOOP_CONF_DIR`, and initiate the distributed execution engine on the underlying YARN cluster. View [Orca Context](../Orca/Overview/orca-context.md) for more details.
Setting `cluster_mode` to "yarn-client" makes `init_orca_context` submit the job to YARN in client mode, while setting it to "yarn-cluster" submits the job in cluster mode.
The difference between "yarn-client" and "yarn-cluster" is where the Spark driver runs: in "yarn-client" mode the driver runs on the node where you start Python, while in "yarn-cluster" mode it runs on an arbitrary node in the YARN cluster. So if you are running in "yarn-cluster" mode, you should change your application to read from a network file system such as HDFS instead of from local files.
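For example, switching the same program to cluster mode is just a change of the `cluster_mode` argument (a sketch mirroring the client-mode call above):
```python
from bigdl.orca import init_orca_context

# The driver now runs in a YARN container, so inputs should come from HDFS
# rather than the local file system.
sc = init_orca_context(cluster_mode="yarn-cluster", cores=4, memory="10g", num_nodes=2)
```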
- You can then simply run your BigDL program in a Jupyter notebook; please notice that _**Jupyter cannot use yarn-cluster mode**_, as the driver would not be running on the local node:
```bash
jupyter notebook --notebook-dir=./ --ip=* --no-browser
```
or as a normal Python script (e.g. script.py); both "yarn-client" and "yarn-cluster" are supported:
```bash
python script.py
```
---
### **3. Run on YARN with spark-submit**
Follow the steps below if you need to run BigDL with [spark-submit](https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn).
- Pack the current conda environment to `environment.tar.gz` (you can use any name you like):
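For example, a sketch using the [conda-pack](https://conda.github.io/conda-pack/) tool (an assumption; any method that produces a relocatable archive of the environment would do):
```bash
# conda-pack is not bundled with conda; install it first if needed
pip install conda-pack
# Archive the currently activated conda environment
conda pack -o environment.tar.gz
```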
@@ -118,9 +108,11 @@ Follow the steps below if you need to run BigDL in [YARN cluster mode](https://s
- Use `spark-submit` to submit your BigDL program (e.g. script.py):
yarn-cluster mode:
```bash
spark-submit-with-dllib \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=environment/bin/python \
--conf spark.executorEnv.PYSPARK_PYTHON=environment/bin/python \
--master yarn-cluster \
--executor-memory 10g \
--driver-memory 10g \
@@ -131,3 +123,18 @@ Follow the steps below if you need to run BigDL in [YARN cluster mode](https://s
```
You can adjust the configurations according to your cluster settings.
yarn-client mode:
```bash
spark-submit-with-dllib \
--conf spark.driverEnv.PYSPARK_PYTHON=environment/bin/python \
--conf spark.executorEnv.PYSPARK_PYTHON=environment/bin/python \
--master yarn-client \
--executor-memory 10g \
--driver-memory 10g \
--executor-cores 8 \
--num-executors 2 \
--archives environment.tar.gz#environment \
script.py
```
Notice: in `yarn-client` mode the driver runs locally, while in `yarn-cluster` mode the driver runs in a YARN container, so the driver's `PYSPARK_PYTHON` is set differently: use `spark.driverEnv.PYSPARK_PYTHON` for `yarn-client` mode and `spark.yarn.appMasterEnv.PYSPARK_PYTHON` for `yarn-cluster` mode.