diff --git a/docs/readthedocs/source/doc/UserGuide/hadoop.md b/docs/readthedocs/source/doc/UserGuide/hadoop.md
index 45f0ae53..c0493800 100644
--- a/docs/readthedocs/source/doc/UserGuide/hadoop.md
+++ b/docs/readthedocs/source/doc/UserGuide/hadoop.md
@@ -1,18 +1,19 @@
 # Hadoop/YARN User Guide
 
-Hadoop version: Hadoop >= 2.7 or [CDH](https://www.cloudera.com/products/open-source/apache-hadoop/key-cdh-components.html) 5.X. Hadoop 3.X or CDH 6.X have not been tested and thus currently not supported.
+Hadoop version: Apache Hadoop >= 2.7 (3.X included) or [CDH](https://www.cloudera.com/products/open-source/apache-hadoop/key-cdh-components.html) 5.X. CDH 6.X has not been tested and is thus currently not supported.
 
 ---
 
-You can run Analytics Zoo programs on standard Hadoop/YARN clusters without any changes to the cluster (i.e., no need to pre-install Analytics Zoo or any Python libraries in the cluster).
+For Scala users, please see the [Scala user guide](./scala.md) for how to run BigDL on a Hadoop/YARN cluster.
+For Python users, you can run BigDL programs on standard Hadoop/YARN clusters without any changes to the cluster (i.e., no need to pre-install BigDL or any Python libraries in the cluster).
 
-### **1. Prepare Environment**
+### **1. Prepare Python Environment**
 
 - You need to first use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) to prepare the Python environment _**on the local client machine**_. Create a conda environment and install all the needed Python libraries in the created conda environment:
 
   ```bash
-  conda create -n zoo python=3.7 # "zoo" is conda environment name, you can use any name you like.
-  conda activate zoo
+  conda create -n bigdl python=3.7 # "bigdl" is the conda environment name; you can use any name you like.
+  conda activate bigdl
 
   # Use conda or pip to install all the needed Python dependencies in the created conda environment.
   ```
@@ -42,28 +43,29 @@ You can run Analytics Zoo programs on standard Hadoop/YARN clusters without any
 
 - **For CDH users**
 
-If your CDH cluster has already installed Spark, the CDH's spark will have conflict with the pyspark installed by pip required by analytics-zoo in next section.
+If you are using BigDL with pip and your CDH cluster already has Spark installed, CDH's Spark will conflict with the pyspark installed by pip that BigDL requires in the next section.
 
-Thus before running analytics-zoo applications, you should unset all the spark related environment variables. You can use `env | grep SPARK` to find all the existing spark environment variables.
+Thus, before running BigDL applications, you should unset all the Spark-related environment variables. You can use `env | grep SPARK` to find all the existing Spark environment variables.
 
-Also, CDH cluster's `HADOOP_CONF_DIR` should by default be set to `/etc/hadoop/conf`.
+Also, `HADOOP_CONF_DIR` should be `/etc/hadoop/conf` on a CDH cluster by default.
 
 ---
 
 ### **2. YARN Client Mode**
 
-- Install Analytics Zoo in the created conda environment via pip:
+- Install BigDL components in the created conda environment via pip, such as dllib and orca:
 
   ```bash
-  pip install analytics-zoo
+  pip install bigdl-dllib
+  pip install bigdl-orca
   ```
 
   View the [Python User Guide](./python.md) for more details.
 
-- We recommend using `init_orca_context` at the very beginning of your code to initiate and run Analytics Zoo on standard Hadoop/YARN clusters in [YARN client mode](https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn):
+- We recommend using `init_orca_context` at the very beginning of your code to initiate and run BigDL on standard Hadoop/YARN clusters in [YARN client mode](https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn):
 
   ```python
-  from zoo.orca import init_orca_context
+  from bigdl.orca import init_orca_context
 
   sc = init_orca_context(cluster_mode="yarn-client", cores=4, memory="10g", num_nodes=2)
   ```
@@ -71,7 +73,7 @@ Also, CDH cluster's `HADOOP_CONF_DIR` should by default be set to `/etc/hadoop/c
 
   By specifying cluster_mode to be "yarn-client", `init_orca_context` would automatically prepare the runtime Python environment, detect the current Hadoop configurations from `HADOOP_CONF_DIR` and initiate the distributed execution engine on the underlying YARN cluster. View [Orca Context](../Orca/Overview/orca-context.md) for more details.
 
-- You can then simply run your Analytics Zoo program in a Jupyter notebook:
+- You can then simply run your BigDL program in a Jupyter notebook:
 
   ```bash
   jupyter notebook --notebook-dir=./ --ip=* --no-browser
@@ -86,18 +88,18 @@ Also, CDH cluster's `HADOOP_CONF_DIR` should by default be set to `/etc/hadoop/c
 
 ---
 
 ### **3. YARN Cluster Mode**
 
-Follow the steps below if you need to run Analytics Zoo in [YARN cluster mode](https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn).
+Follow the steps below if you need to run BigDL in [YARN cluster mode](https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn).
 
-- Download and extract [Spark](https://spark.apache.org/downloads.html). You are recommended to use [Spark 2.4.3](https://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz). Set the environment variable `SPARK_HOME`:
+- Download and extract [Spark](https://spark.apache.org/downloads.html). It is recommended to use [Spark 2.4.6](https://archive.apache.org/dist/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz). Set the environment variable `SPARK_HOME`:
 
   ```bash
   export SPARK_HOME=the root directory where you extract the downloaded Spark package
   ```
 
-- Download and extract [Analytics Zoo](../release.md). Make sure the Analytics Zoo package you download is built with the compatible version with your Spark. Set the environment variable `ANALYTICS_ZOO_HOME`:
+- Download and extract [BigDL](../release.md). Make sure the BigDL package you download is built with a version compatible with your Spark. Set the environment variable `BIGDL_HOME`:
 
   ```bash
-  export ANALYTICS_ZOO_HOME=the root directory where you extract the downloaded Analytics Zoo package
+  export BIGDL_HOME=the root directory where you extract the downloaded BigDL package
   ```
 
 - Pack the current conda environment to `environment.tar.gz` (you can use any name you like):
@@ -106,18 +108,18 @@ Follow the steps below if you need to run Analytics Zoo in [YARN cluster mode](h
   conda pack -o environment.tar.gz
   ```
 
-- _You need to write your Analytics Zoo program as a Python script._ In the script, you can call `init_orca_context` and specify cluster_mode to be "spark-submit":
+- _You need to write your BigDL program as a Python script._ In the script, you can call `init_orca_context` and specify cluster_mode to be "spark-submit":
 
   ```python
-  from zoo.orca import init_orca_context
+  from bigdl.orca import init_orca_context
 
   sc = init_orca_context(cluster_mode="spark-submit")
   ```
 
-- Use `spark-submit` to submit your Analytics Zoo program (e.g. script.py):
+- Use `spark-submit` to submit your BigDL program (e.g. script.py):
 
   ```bash
-  PYSPARK_PYTHON=./environment/bin/python ${ANALYTICS_ZOO_HOME}/bin/spark-submit-python-with-zoo.sh \
+  PYSPARK_PYTHON=./environment/bin/python ${BIGDL_HOME}/bin/spark-submit-python-with-bigdl.sh \
     --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
     --master yarn-cluster \
     --executor-memory 10g \
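Reviewer note: the diff touches two `init_orca_context` call sites that take different arguments depending on launch mode. A minimal sketch of that difference, assuming the argument names shown in the guide (`cores`, `memory`, `num_nodes`); the helper function itself is illustrative, not part of the BigDL API:

```python
def orca_init_kwargs(cluster_mode, cores=4, memory="10g", num_nodes=2):
    """Return keyword arguments for init_orca_context for the two YARN modes
    described in the guide (hypothetical helper for illustration)."""
    if cluster_mode == "yarn-client":
        # Client mode: init_orca_context itself prepares the Python runtime,
        # reads HADOOP_CONF_DIR, and requests resources, so they are passed here.
        return {"cluster_mode": "yarn-client", "cores": cores,
                "memory": memory, "num_nodes": num_nodes}
    if cluster_mode == "spark-submit":
        # Cluster mode: resources are set on the spark-submit command line
        # (e.g. --executor-memory), so the script passes only the mode.
        return {"cluster_mode": "spark-submit"}
    raise ValueError(f"unsupported cluster_mode: {cluster_mode}")
```

In other words, a script submitted via `spark-submit` defers all resource settings to the command line, while a client-mode program declares them in code.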
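The CDH note in the diff tells users to find Spark-related variables with `env | grep SPARK` and unset them before running BigDL. A small Python sketch of the same idea (the helper names are hypothetical; in a real shell session you would simply `unset` each variable):

```python
import os


def find_spark_vars(environ=None):
    """Mimic `env | grep SPARK`: list environment variable names mentioning SPARK."""
    environ = os.environ if environ is None else environ
    return sorted(name for name in environ if "SPARK" in name)


def unset_spark_vars(environ):
    """Remove all Spark-related variables from the given mapping; return their names."""
    removed = find_spark_vars(environ)
    for name in removed:
        del environ[name]
    return removed
```

For example, on a CDH node where `SPARK_HOME` and `SPARK_CONF_DIR` are set, `unset_spark_vars` would report and clear both, leaving unrelated variables such as `PATH` untouched.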