Update hadoop.md (#3433)

* Update hadoop.md
* add pyspark with dllib
* some update
* add spark-submit-with-dllib

parent 829ee31f1b
commit 453c991c19
1 changed file with 37 additions and 30 deletions
@@ -4,10 +4,10 @@ Hadoop version: Apache Hadoop >= 2.7 (3.X included) or [CDH](https://www.clouder

---
For _**Scala users**_, please see the [Scala user guide](./scala.md) for how to run BigDL on a Hadoop/YARN cluster.

For _**Python users**_, you can run BigDL programs on standard Hadoop/YARN clusters without any changes to the cluster (i.e., there is no need to pre-install BigDL or any Python libraries on the cluster).
### **1. Prepare Python Environment**

- You need to first use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) to prepare the Python environment _**on the local client machine**_. Create a conda environment and install all the needed Python libraries in the created conda environment:
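A minimal sketch of this step (the environment name `bigdl` and the Python version are illustrative assumptions, not requirements of this guide):

```bash
# Create and activate a conda environment on the local client machine
conda create -n bigdl python=3.7
conda activate bigdl
```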
@@ -50,19 +50,19 @@ Thus before running bigdl applications, you should unset all the spark related e

Also, a CDH cluster's `HADOOP_CONF_DIR` should be `/etc/hadoop/conf` by default.
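For example, on the client machine:

```bash
# Make the cluster's Hadoop configuration visible to BigDL/Spark
export HADOOP_CONF_DIR=/etc/hadoop/conf
```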
---

### **2. Run on YARN with Built-in Functions**

_**This is the recommended way to run BigDL on YARN,**_ as conda packaging and all the spark-submit settings are handled inside our code, so you can easily switch your job between local and YARN execution.
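For instance, switching a job between local and YARN execution is just a change of `cluster_mode` (a sketch; the `"local"` mode shown here is an assumption not covered on this page):

```python
from bigdl.orca import init_orca_context

# Local development run on the client machine (assumed mode name "local")
# sc = init_orca_context(cluster_mode="local", cores=4)

# The same program submitted to YARN in client mode
sc = init_orca_context(cluster_mode="yarn-client", cores=4, memory="10g", num_nodes=2)
```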
- Install BigDL components (e.g. dllib and orca) in the created conda environment via pip:

```bash
pip install bigdl-dllib bigdl-orca
```

View the [Python User Guide](./python.md) for more details.
- We recommend using `init_orca_context` at the very beginning of your code to initiate and run BigDL on standard [Hadoop/YARN clusters](https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn):

```python
from bigdl.orca import init_orca_context

sc = init_orca_context(cluster_mode="yarn-client", cores=4, memory="10g", num_nodes=2)
```
`init_orca_context` will automatically prepare the runtime Python environment, detect the current Hadoop configurations from `HADOOP_CONF_DIR` and initiate the distributed execution engine on the underlying YARN cluster. View [Orca Context](../Orca/Overview/orca-context.md) for more details.

By specifying `cluster_mode` to be "yarn-client", `init_orca_context` will submit the job to YARN in client mode.

By specifying `cluster_mode` to be "yarn-cluster", `init_orca_context` will submit the job to YARN in cluster mode.

The difference between "yarn-client" and "yarn-cluster" is where the Spark driver runs: in "yarn-client" mode, the driver runs on the node where you start Python, while in "yarn-cluster" mode, it runs on an arbitrary node in the YARN cluster. So if you run with "yarn-cluster", your application should read its input from a network file system such as HDFS instead of from local files.
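For example, a minimal "yarn-cluster" sketch (the HDFS URI below is an illustrative assumption):

```python
from bigdl.orca import init_orca_context

# Submit in cluster mode: the driver runs inside a YARN container,
# so input data should live on a shared file system such as HDFS.
sc = init_orca_context(cluster_mode="yarn-cluster", cores=4, memory="10g", num_nodes=2)

# An HDFS path is readable from any node; a client-local path is not.
rdd = sc.textFile("hdfs://namenode:8020/data/input.txt")
```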
- You can then simply run your BigDL program in a Jupyter notebook. Please note that _**Jupyter cannot use yarn-cluster mode**_, as the driver would not be running on the local node:

```bash
jupyter notebook --notebook-dir=./ --ip=* --no-browser
```
or as a normal Python script (e.g. script.py); both "yarn-client" and "yarn-cluster" are supported:

```bash
python script.py
```
---

### **3. Run on YARN with spark-submit**

Follow the steps below if you need to run BigDL with [spark-submit](https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn).
- Pack the current conda environment to `environment.tar.gz` (you can use any name you like):
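A minimal sketch of this step, assuming the [conda-pack](https://conda.github.io/conda-pack/) tool is available (e.g. via `pip install conda-pack`):

```bash
# Archive the currently activated conda environment
conda pack -o environment.tar.gz
```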
@@ -118,9 +108,11 @@ Follow the steps below if you need to run BigDL in [YARN cluster mode](https://s
- Use `spark-submit` to submit your BigDL program (e.g. script.py):

yarn-cluster mode:

```bash
spark-submit-with-dllib \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=environment/bin/python \
    --conf spark.executorEnv.PYSPARK_PYTHON=environment/bin/python \
    --master yarn-cluster \
    --executor-memory 10g \
    --driver-memory 10g \
    ...
```
You can adjust the configurations according to your cluster settings.
yarn-client mode:

```bash
spark-submit-with-dllib \
    --conf spark.driverEnv.PYSPARK_PYTHON=environment/bin/python \
    --conf spark.executorEnv.PYSPARK_PYTHON=environment/bin/python \
    --master yarn-client \
    --executor-memory 10g \
    --driver-memory 10g \
    --executor-cores 8 \
    --num-executors 2 \
    --archives environment.tar.gz#environment \
    script.py
```
Note: in `yarn-client` mode the driver runs locally, while in `yarn-cluster` mode the driver runs in a YARN container, so the driver's `PYSPARK_PYTHON` is set through a different property: use `spark.driverEnv.PYSPARK_PYTHON` in `yarn-client` mode and `spark.yarn.appMasterEnv.PYSPARK_PYTHON` in `yarn-cluster` mode.