Fix spark-submit scripts and doc (#4344)
* fix yarn client
* fix shell script
* revert sh
* revert
* update doc
* finish doc update
* fix
* minor
This commit is contained in:
parent
39f7fd2d1f
commit
0c0e99fee5
1 changed file with 32 additions and 35 deletions

---

For _**Scala users**_, please see the [Scala User Guide](./scala.md) for how to run BigDL on Hadoop/YARN clusters.

For _**Python users**_, you can run BigDL programs on standard Hadoop/YARN clusters without any changes to the cluster (i.e., no need to pre-install BigDL or other Python libraries on all nodes in the cluster).

### **1. Prepare Python Environment**

- You need to first use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) to prepare the Python environment _**on the local machine**_ where you submit your application. Create a conda environment, and install BigDL and all the other needed Python libraries in it:
```bash
conda create -n bigdl python=3.7  # "bigdl" is the conda environment name; you can use any name you like.
conda activate bigdl

pip install bigdl

# Use conda or pip to install all the other needed Python dependencies in the created conda environment.
```

View the [Python User Guide](./python.md) for more details on BigDL installation.

- You need to download and install a JDK in the environment, and properly set the environment variable `JAVA_HOME`, which is required by Spark. __JDK8__ is highly recommended.
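
As a quick sanity check (a minimal sketch, not part of the original guide), you can verify the setting from Python before submitting anything:

```python
import os
import subprocess

# Spark requires JAVA_HOME to point at the JDK installation directory.
java_home = os.environ.get("JAVA_HOME")
assert java_home, "JAVA_HOME is not set"

# Run `$JAVA_HOME/bin/java -version` to confirm the JDK is usable.
subprocess.run([os.path.join(java_home, "bin", "java"), "-version"], check=True)
```
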
- **For CDH users**

If your CDH cluster has already installed Spark, CDH's Spark might conflict with the pyspark that pip installs as a dependency of BigDL.

Thus before running BigDL applications, you should unset all the Spark-related environment variables. You can use `env | grep SPARK` to find all the existing Spark environment variables.

Also, a CDH cluster's `HADOOP_CONF_DIR` should be `/etc/hadoop/conf` by default.

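For example, here is a small sketch (our own, mirroring the `env | grep SPARK` check above) that lists and clears the Spark-related variables from Python:

```python
import os

# List all Spark-related environment variables (the Python analogue of `env | grep SPARK`).
spark_vars = [name for name in os.environ if "SPARK" in name.upper()]
print("Spark-related environment variables:", spark_vars)

# Clearing them here only affects the current Python process; to clean up your
# shell session before spark-submit, run `unset <NAME>` in the shell instead.
for name in spark_vars:
    del os.environ[name]
```
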
---

### **2. Run on YARN with built-in function**

_**This is the easiest and most recommended way to run BigDL on YARN,**_ as you don't need to care about environment preparation or Spark-related commands. In this way, you can easily switch your job between local (for testing) and YARN (for production) by simply changing the "cluster_mode".

- Call `init_orca_context` at the very beginning of your code to initiate and run BigDL on standard [Hadoop/YARN clusters](https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn):
```python
from bigdl.orca import init_orca_context

sc = init_orca_context(cluster_mode="yarn-client")
```

`init_orca_context` would automatically prepare the runtime Python environment, detect the current Hadoop configurations from `HADOOP_CONF_DIR` and initiate the distributed execution engine on the underlying YARN cluster. View [Orca Context](../Orca/Overview/orca-context.md) for more details.

By specifying "cluster_mode" to be `yarn-client` or `yarn-cluster`, `init_orca_context` will submit the job to YARN in client or cluster mode respectively.

The difference between `yarn-client` and `yarn-cluster` is where your Spark driver runs: for `yarn-client`, the Spark driver runs on the node where you start Python, while for `yarn-cluster` it runs on a random node in the YARN cluster. So if you are running with `yarn-cluster`, you should change the application's data loading from local files to a network file system (e.g. HDFS).

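To illustrate the last point, here is a minimal sketch (the HDFS URI and file path are placeholders of our own, not from the original guide):

```python
from bigdl.orca import init_orca_context

# In yarn-cluster mode the driver runs on an arbitrary cluster node, so load data
# from a shared file system such as HDFS instead of a driver-local path.
sc = init_orca_context(cluster_mode="yarn-cluster")
rdd = sc.textFile("hdfs://namenode:8020/user/me/input.txt")  # placeholder HDFS URI
print(rdd.count())
```
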
- You can then simply run your BigDL program in a Jupyter notebook. Note that _**Jupyter cannot run in yarn-cluster mode**_, as the driver is not running on the local node:
```bash
jupyter notebook --notebook-dir=./ --ip=* --no-browser
```

Or you can run your BigDL program as a normal Python script (e.g. script.py), in which case both `yarn-client` and `yarn-cluster` are supported:
```bash
python script.py
```

---

### **3. Run on YARN with spark-submit**

Follow the steps below if you need to run BigDL with [spark-submit](https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn).

- Pack the current active conda environment to `environment.tar.gz` (you can use any name you like) in the current working directory:
```bash
conda pack -o environment.tar.gz
```
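
Note that `conda pack` is provided by the separate [conda-pack](https://conda.github.io/conda-pack/) tool; if the command is not found, install it first (e.g. via `pip install conda-pack`).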

- _**You need to write your BigDL program as a Python script.**_ In the script, you need to call `init_orca_context` at the very beginning of your code and specify "cluster_mode" to be `spark-submit`:
```python
from bigdl.orca import init_orca_context

sc = init_orca_context(cluster_mode="spark-submit")
```
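
For reference, a minimal `script.py` might look like the following; the workload is a placeholder of our own, and only the `init_orca_context`/`stop_orca_context` calls are prescribed by this guide:

```python
from bigdl.orca import init_orca_context, stop_orca_context

# Let spark-submit supply the cluster settings; the script only declares the mode.
sc = init_orca_context(cluster_mode="spark-submit")

# Placeholder workload: any PySpark or BigDL Orca code can go here.
rdd = sc.parallelize(range(100))
print("Sum:", rdd.sum())

stop_orca_context()
```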

- Use `spark-submit-with-bigdl` to submit your BigDL program (e.g. script.py). You can adjust the configurations according to your cluster settings. Note that if `environment.tar.gz` is not under the same directory as `script.py`, you may need to modify its path in `--archives` in the commands below.

For `yarn-cluster` mode:
```bash
spark-submit-with-bigdl \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=environment/bin/python \
    --conf spark.executorEnv.PYSPARK_PYTHON=environment/bin/python \
    --master yarn \
    --deploy-mode cluster \
    --executor-memory 10g \
    --driver-memory 10g \
    --executor-cores 8 \
    --archives environment.tar.gz#environment \
    script.py
```

Note: For `yarn-cluster`, the Spark driver runs in a YARN container as well, and thus both the driver and the executors will use the Python interpreter in `environment.tar.gz`.

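The `#environment` suffix in `--archives` tells YARN to extract `environment.tar.gz` into a directory named `environment` in each container's working directory, which is why `PYSPARK_PYTHON` points to `environment/bin/python`.
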
For `yarn-client` mode:
```bash
PYSPARK_PYTHON=environment/bin/python spark-submit-with-bigdl \
    --master yarn \
    --deploy-mode client \
    --executor-memory 10g \
    --driver-memory 10g \
    --executor-cores 8 \
    --archives environment.tar.gz#environment \
    script.py
```

Note: For `yarn-client`, the Spark driver runs locally and will use the Python interpreter in the current active conda environment, while the executors will use the Python interpreter in `environment.tar.gz`.