Update hadoop.md (#3433)

* Update hadoop.md

* add pyspark with dllib

* some update

* add spark-submit-with-dllib
Xin Qiu 2021-11-12 13:34:04 +08:00 committed by GitHub
parent 829ee31f1b
commit 453c991c19


@@ -4,10 +4,10 @@ Hadoop version: Apache Hadoop >= 2.7 (3.X included) or [CDH](https://www.clouder
---
For _**Scala users**_, please see the [scala user guide](./scala.md) for how to run BigDL on a Hadoop/YARN cluster.
For _**Python users**_, you can run BigDL programs on standard Hadoop/YARN clusters without any changes to the cluster (i.e., there is no need to pre-install BigDL or any Python libraries on the cluster).
### **1. Prepare Python Environment**
- You need to first use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) to prepare the Python environment _**on the local client machine**_. Create a conda environment and install all the needed Python libraries in it:
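For example, a minimal sketch (the environment name `bigdl` and the Python version are placeholders; install whatever libraries your program needs):
```bash
# Create and activate a dedicated conda environment on the client machine
conda create -n bigdl python=3.7
conda activate bigdl
```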
@@ -50,19 +50,19 @@ Thus before running bigdl applications, you should unset all the spark related e
Also, on a CDH cluster, `HADOOP_CONF_DIR` should be `/etc/hadoop/conf` by default.
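A minimal sketch of that client-side setup (which variables need unsetting depends on what your shell already exports; `SPARK_HOME` and `SPARK_CONF_DIR` are just examples):
```bash
# Clear any Spark settings inherited from a local Spark installation
unset SPARK_HOME SPARK_CONF_DIR
# Point BigDL at the cluster's Hadoop configuration (CDH default shown)
export HADOOP_CONF_DIR=/etc/hadoop/conf
```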
---
### **2. Run on YARN with built-in functions**
_**This is the recommended way to run BigDL on YARN,**_ as we have built conda packing and all the spark-submit settings into our code, so you can easily switch your job between local and YARN execution.
- Install BigDL components in the created conda environment via pip, like dllib and orca:
```bash
pip install bigdl-dllib bigdl-orca
```
View the [Python User Guide](./python.md) for more details.
- We recommend using `init_orca_context` at the very beginning of your code to initiate and run BigDL on standard [Hadoop/YARN clusters](https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn):
```python
from bigdl.orca import init_orca_context
@@ -70,37 +70,27 @@ Also, CDH cluster's `HADOOP_CONF_DIR` should be `/etc/hadoop/conf` on CDH by def
sc = init_orca_context(cluster_mode="yarn-client", cores=4, memory="10g", num_nodes=2)
```
`init_orca_context` would automatically prepare the runtime Python environment, detect the current Hadoop configurations from `HADOOP_CONF_DIR`, and initiate the distributed execution engine on the underlying YARN cluster. View [Orca Context](../Orca/Overview/orca-context.md) for more details.
Setting `cluster_mode` to "yarn-client" makes `init_orca_context` submit the job to YARN in client mode, while setting it to "yarn-cluster" submits the job in cluster mode.
The difference between "yarn-client" and "yarn-cluster" is where the Spark driver runs: in "yarn-client" mode the driver runs on the node where you start Python, while in "yarn-cluster" mode it runs on an arbitrary node in the YARN cluster. So if you are running in "yarn-cluster" mode, you should change your application to read from a network file system such as HDFS instead of from local files.
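For example, switching the same program to cluster mode is just a change of the `cluster_mode` argument (a sketch mirroring the client-mode call above):
```python
from bigdl.orca import init_orca_context

# The driver now runs in a YARN container, so inputs should come from HDFS
# rather than the local file system.
sc = init_orca_context(cluster_mode="yarn-cluster", cores=4, memory="10g", num_nodes=2)
```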
- You can then simply run your BigDL program in a Jupyter notebook; please notice that _**Jupyter cannot use yarn-cluster mode**_, as the driver would not be running on the local node:
```bash
jupyter notebook --notebook-dir=./ --ip=* --no-browser
```
or as a normal Python script (e.g. script.py); both "yarn-client" and "yarn-cluster" are supported:
```bash
python script.py
```
---
### **3. Run on YARN with spark-submit**
Follow the steps below if you need to run BigDL with [spark-submit](https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn).
- Pack the current conda environment to `environment.tar.gz` (you can use any name you like):
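For example, a sketch using the [conda-pack](https://conda.github.io/conda-pack/) tool (an assumption; any method that produces a relocatable archive of the environment would do):
```bash
# conda-pack is not bundled with conda; install it first if needed
pip install conda-pack
# Archive the currently activated conda environment
conda pack -o environment.tar.gz
```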
@@ -118,9 +108,11 @@ Follow the steps below if you need to run BigDL in [YARN cluster mode](https://s
- Use `spark-submit` to submit your BigDL program (e.g. script.py):
yarn-cluster mode:
```bash
spark-submit-with-dllib \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=environment/bin/python \
--conf spark.executorEnv.PYSPARK_PYTHON=environment/bin/python \
--master yarn-cluster \
--executor-memory 10g \
--driver-memory 10g \
@@ -131,3 +123,18 @@ Follow the steps below if you need to run BigDL in [YARN cluster mode](https://s
```
You can adjust the configurations according to your cluster settings.
yarn-client mode:
```bash
spark-submit-with-dllib \
--conf spark.driverEnv.PYSPARK_PYTHON=environment/bin/python \
--conf spark.executorEnv.PYSPARK_PYTHON=environment/bin/python \
--master yarn-client \
--executor-memory 10g \
--driver-memory 10g \
--executor-cores 8 \
--num-executors 2 \
--archives environment.tar.gz#environment \
script.py
```
Notice: in `yarn-client` mode the driver runs locally, while in `yarn-cluster` mode the driver runs in a YARN container, so the driver's `PYSPARK_PYTHON` is set differently: use `spark.driverEnv.PYSPARK_PYTHON` for `yarn-client` mode and `spark.yarn.appMasterEnv.PYSPARK_PYTHON` for `yarn-cluster` mode.