Update hadoop.md (#3433)

* Update hadoop.md
* Update hadoop.md
* Update hadoop.md
* add pyspark with dllib
* some update
* add spark-submit-with-dllib
* Update hadoop.md
* Update hadoop.md
* Update hadoop.md
* Update hadoop.md
* Update hadoop.md
* Update hadoop.md
* Update hadoop.md
* Update hadoop.md
* Update hadoop.md
2021-11-12 13:34:04 +08:00
commit 453c991c19
parent 829ee31f1b
1 changed files with 37 additions and 30 deletions
--- a/docs/readthedocs/source/doc/UserGuide/hadoop.md
+++ b/docs/readthedocs/source/doc/UserGuide/hadoop.md
@@ -4,10 +4,10 @@ Hadoop version: Apache Hadoop >= 2.7 (3.X included) or [CDH](https://www.clouder

 ---

-For scala user, please see [scala user guide](./scala.md) for how to run BigDL on hadoop/yarn cluster.  
-For python user, you can run BigDL programs on standard Hadoop/YARN clusters without any changes to the cluster(i.e., no need to pre-install BigDL or any Python libraries in the cluster).
+For _**Scala users**_, please see the [Scala user guide](./scala.md) for how to run BigDL on a Hadoop/YARN cluster.  
+For _**Python users**_, you can run BigDL programs on standard Hadoop/YARN clusters without any changes to the cluster (i.e., no need to pre-install BigDL or any Python libraries in the cluster).

-### **1. Prepare python Environment**
+### **1. Prepare Python Environment**

 - You need to first use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) to prepare the Python environment _**on the local client machine**_. Create a conda environment and install all the needed Python libraries in the created conda environment:

@@ -43,26 +43,26 @@ For python user, you can run BigDL programs on standard Hadoop/YARN clusters wit

 - **For CDH users**

-If you are using BigDL with pip and your CDH cluster has already installed Spark, the CDH's spark will have conflict with the pyspark installed by pip required by bigdl in next section.
+  If you are using BigDL with pip and your CDH cluster already has Spark installed, CDH's Spark will conflict with the pyspark that pip installs as a BigDL dependency in the next section.

-Thus before running bigdl applications, you should unset all the spark related environment variables. You can use `env | grep SPARK` to find all the existing spark environment variables.
+  Thus, before running BigDL applications, you should unset all the Spark-related environment variables. You can use `env | grep SPARK` to find all the existing Spark environment variables.

-Also, CDH cluster's `HADOOP_CONF_DIR` should be `/etc/hadoop/conf` on CDH by default.
+  Also, `HADOOP_CONF_DIR` should be `/etc/hadoop/conf` on a CDH cluster by default.

 ---
-### **2. YARN Client Mode**
+### **2. Run on YARN with built-in function**

+_**This is the most recommended way to run BigDL on YARN,**_ as we have put conda pack and all the spark-submit settings into our code, so you can easily switch your job between local and YARN.
 - Install BigDL components in the created conda environment via pip, like dllib and orca:

  ```bash
-  pip install bigdl-dllib
-  pip install bigdl-orca
+  pip install bigdl-dllib bigdl-orca
  ```

  View the [Python User Guide](./python.md) for more details.
  

- We recommend using `init_orca_context` at the very beginning of your code to initiate and run BigDL on standard Hadoop/YARN clusters in [YARN client mode](https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn):
+- We recommend using `init_orca_context` at the very beginning of your code to initiate and run BigDL on standard [Hadoop/YARN clusters](https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn):

  ```python
  from bigdl.orca import init_orca_context
@@ -70,37 +70,27 @@ Also, CDH cluster's `HADOOP_CONF_DIR` should be `/etc/hadoop/conf` on CDH by def
  sc = init_orca_context(cluster_mode="yarn-client", cores=4, memory="10g", num_nodes=2)
  ```

-  By specifying cluster_mode to be "yarn-client", `init_orca_context` would automatically prepare the runtime Python environment, detect the current Hadoop configurations from `HADOOP_CONF_DIR` and initiate the distributed execution engine on the underlying YARN cluster. View [Orca Context](../Orca/Overview/orca-context.md) for more details.
-  
+  `init_orca_context` automatically prepares the runtime Python environment, detects the current Hadoop configurations from `HADOOP_CONF_DIR` and initiates the distributed execution engine on the underlying YARN cluster. View [Orca Context](../Orca/Overview/orca-context.md) for more details.  
+  When cluster_mode is "yarn-client", `init_orca_context` submits the job to YARN in client mode.  
+  When cluster_mode is "yarn-cluster", `init_orca_context` submits the job to YARN in cluster mode.  
+  The difference between "yarn-client" and "yarn-cluster" is where the Spark driver runs: with "yarn-client", the driver runs on the node where you start Python, while with "yarn-cluster", the driver runs on an arbitrary node in the YARN cluster. So if you are running with "yarn-cluster", you should change the application to read from a network file system, such as HDFS, instead of from local files.  

- You can then simply run your BigDL program in a Jupyter notebook:
+- You can then simply run your BigDL program in a Jupyter notebook. Please note that _**Jupyter cannot use yarn-cluster**_, as the driver does not run on the local node:

  ```bash
  jupyter notebook --notebook-dir=./ --ip=* --no-browser
  ```

-  or as a normal Python script (e.g. script.py):
+  or as a normal Python script (e.g. script.py); both "yarn-client" and "yarn-cluster" are supported:

  ```bash
  python script.py
  ```

 ---
-### **3. YARN Cluster Mode**
+### **3. Run on YARN with spark-submit**

-Follow the steps below if you need to run BigDL in [YARN cluster mode](https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn).
-
- Download and extract [Spark](https://spark.apache.org/downloads.html). You are recommended to use [Spark 2.4.6](https://archive.apache.org/dist/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz). Set the environment variable `SPARK_HOME`:
-
-  ```bash
-  export SPARK_HOME=the root directory where you extract the downloaded Spark package
-  ```
-
- Download and extract [BigDL](../release.md). Make sure the BigDL package you download is built with the compatible version with your Spark. Set the environment variable `BIGDL_HOME`:
-
-  ```bash
-  export BIGDL_HOME=the root directory where you extract the downloaded BigDL package
-  ```
+Follow the steps below if you need to run BigDL with [spark-submit](https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn).  

 - Pack the current conda environment to `environment.tar.gz` (you can use any name you like):

@@ -118,9 +108,11 @@ Follow the steps below if you need to run BigDL in [YARN cluster mode](https://s

 - Use `spark-submit` to submit your BigDL program (e.g. script.py):

+  yarn-cluster mode:
  ```bash
-  PYSPARK_PYTHON=./environment/bin/python ${BIGDL_HOME}/bin/spark-submit-python-with-bigdl.sh \
-      --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
+  spark-submit-with-dllib \
+      --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=environment/bin/python \
+      --conf spark.executorEnv.PYSPARK_PYTHON=environment/bin/python \
      --master yarn-cluster \
      --executor-memory 10g \
      --driver-memory 10g \
@@ -131,3 +123,18 @@ Follow the steps below if you need to run BigDL in [YARN cluster mode](https://s
  ```

  You can adjust the configurations according to your cluster settings.
+
+  yarn-client mode:
+  ```bash
+  spark-submit-with-dllib \
+      --conf spark.driverEnv.PYSPARK_PYTHON=environment/bin/python \
+      --conf spark.executorEnv.PYSPARK_PYTHON=environment/bin/python \
+      --master yarn-client \
+      --executor-memory 10g \
+      --driver-memory 10g \
+      --executor-cores 8 \
+      --num-executors 2 \
+      --archives environment.tar.gz#environment \
+      script.py
+  ```
+  Notice: In `yarn-client` mode the driver runs on the local machine, while in `yarn-cluster` mode the driver runs in a YARN container, so the setting for the driver's `PYSPARK_PYTHON` differs: `yarn-client` mode uses `spark.driverEnv.PYSPARK_PYTHON`, and `yarn-cluster` mode uses `spark.yarn.appMasterEnv.PYSPARK_PYTHON`.
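
The two `spark-submit-with-dllib` invocations added in this change differ only in the `--master` value and in the conf key that carries the driver's `PYSPARK_PYTHON`. As a rough illustration of that rule (not part of the patch), the sketch below assembles the argument list for either mode. The helper name `build_submit_command` and the `DRIVER_PYTHON_CONF` table are hypothetical; the conf keys, flags, and default file names are copied from the guide as written, and the memory and core options are omitted for brevity.

```python
from typing import List

# Driver-side conf key per mode, exactly as stated in the guide's notice.
DRIVER_PYTHON_CONF = {
    "yarn-client": "spark.driverEnv.PYSPARK_PYTHON",
    "yarn-cluster": "spark.yarn.appMasterEnv.PYSPARK_PYTHON",
}


def build_submit_command(mode: str,
                         script: str = "script.py",
                         archive: str = "environment.tar.gz",
                         alias: str = "environment") -> List[str]:
    """Return the spark-submit-with-dllib argument list for the given mode.

    Hypothetical helper for illustration only; it just mirrors the two
    command lines shown in the guide.
    """
    if mode not in DRIVER_PYTHON_CONF:
        raise ValueError(f"unsupported mode: {mode!r}")
    # The packed conda env is unpacked under `alias` in each YARN container.
    python_path = f"{alias}/bin/python"
    return [
        "spark-submit-with-dllib",
        "--conf", f"{DRIVER_PYTHON_CONF[mode]}={python_path}",
        "--conf", f"spark.executorEnv.PYSPARK_PYTHON={python_path}",
        "--master", mode,
        "--archives", f"{archive}#{alias}",
        script,
    ]


if __name__ == "__main__":
    for mode in ("yarn-client", "yarn-cluster"):
        print(" ".join(build_submit_command(mode)))
```

Keeping the mode-to-key mapping in one table makes the single difference between the two commands explicit, which is the point the notice above is making.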