# Hadoop/YARN User Guide
Hadoop version: Apache Hadoop >= 2.7 (3.X included) or CDH 5.X. CDH 6.X has not been tested and is thus currently not supported.

For Scala users, please see the Scala user guide for how to run BigDL on a Hadoop/YARN cluster.

For Python users, you can run BigDL programs on standard Hadoop/YARN clusters without any changes to the cluster (i.e., no need to pre-install BigDL or any Python libraries on the cluster).
## 1. Prepare Python Environment
- You need to first use conda to prepare the Python environment on the local client machine. Create a conda environment and install all the needed Python libraries in it:

  ```bash
  conda create -n bigdl python=3.7  # "bigdl" is the conda environment name; you can use any name you like.
  conda activate bigdl

  # Use conda or pip to install all the needed Python dependencies in the created conda environment.
  ```
- You need to download and install JDK in the environment, and properly set the environment variable `JAVA_HOME`, which is required by Spark. JDK8 is highly recommended. You may take the following commands as a reference for installing OpenJDK:

  ```bash
  # For Ubuntu
  sudo apt-get install openjdk-8-jre
  export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/

  # For CentOS
  su -c "yum install java-1.8.0-openjdk"
  export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.282.b08-1.el7_9.x86_64/jre

  export PATH=$PATH:$JAVA_HOME/bin
  java -version  # Verify the version of JDK.
  ```
- Check the Hadoop setup and configurations of your cluster. Make sure you properly set the environment variable `HADOOP_CONF_DIR`, which is needed to initialize Spark on YARN (see the sanity-check sketch after this list):

  ```bash
  export HADOOP_CONF_DIR=the directory of the hadoop and yarn configurations
  ```
- For CDH users

  If you are using BigDL with pip and your CDH cluster already has Spark installed, CDH's Spark will conflict with the `pyspark` installed by pip, which is required by BigDL in the next section. Thus, before running BigDL applications, you should unset all the Spark-related environment variables. You can use `env | grep SPARK` to find all the existing Spark environment variables.

  Also, a CDH cluster's `HADOOP_CONF_DIR` should be `/etc/hadoop/conf` by default.
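Before moving on, you may want to confirm that the environment variables above are in place. The following is a minimal, hypothetical sanity-check script (not part of BigDL; the file name is an assumption) that only uses the Python standard library:

```python
# sanity_check.py -- a hypothetical helper, not part of BigDL.
# Verifies the environment variables described above before submitting a job.
import os
import shutil

for var in ("JAVA_HOME", "HADOOP_CONF_DIR"):
    value = os.environ.get(var)
    if not value:
        raise SystemExit(f"{var} is not set; Spark on YARN requires it.")
    print(f"{var}={value}")

# After exporting JAVA_HOME/bin to PATH, `java` should be resolvable.
if shutil.which("java") is None:
    raise SystemExit("java not found on PATH; check the JDK installation.")
print("Environment looks OK.")
```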
## 2. Run on YARN with built-in function
This is the most recommended way to run BigDL on YARN: conda packing and all the spark-submit settings are handled inside the code, so you can easily switch your job between local and YARN, as sketched below.
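For example, the following is a minimal sketch of switching between local and YARN runs by changing only `cluster_mode` (the command-line flag handling is an assumption for illustration; `init_orca_context` is introduced in the list below):

```python
# run.py -- a hypothetical example: pass "local" or "yarn-client" on the command line.
import sys
from bigdl.orca import init_orca_context

mode = sys.argv[1] if len(sys.argv) > 1 else "local"
if mode == "local":
    sc = init_orca_context(cluster_mode="local", cores=4, memory="10g")
else:
    # num_nodes only applies to distributed runs on YARN.
    sc = init_orca_context(cluster_mode=mode, cores=4, memory="10g", num_nodes=2)
```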
- Install BigDL components in the created conda environment via pip, such as dllib and orca:

  ```bash
  pip install bigdl-dllib bigdl-orca
  ```

  View the Python User Guide for more details.
- We recommend using `init_orca_context` at the very beginning of your code to initialize and run BigDL on standard Hadoop/YARN clusters:

  ```python
  from bigdl.orca import init_orca_context

  sc = init_orca_context(cluster_mode="yarn-client", cores=4, memory="10g", num_nodes=2)
  ```

  `init_orca_context` would automatically prepare the runtime Python environment, detect the current Hadoop configurations from `HADOOP_CONF_DIR`, and initiate the distributed execution engine on the underlying YARN cluster. View Orca Context for more details.
  By specifying `cluster_mode` to be "yarn-client", `init_orca_context` will submit the job to YARN in client mode.

  By specifying `cluster_mode` to be "yarn-cluster", `init_orca_context` will submit the job to YARN in cluster mode.

  The difference between "yarn-client" and "yarn-cluster" is where your Spark driver runs: in "yarn-client" mode, the Spark driver runs on the node where you start Python, while in "yarn-cluster" mode, the Spark driver runs on a random node in the YARN cluster. So if you are running with "yarn-cluster", you should change the application to read from a network file system, such as HDFS, instead of local files (see the example script after this list).
- You can then simply run your BigDL program in a Jupyter notebook (note that Jupyter cannot use "yarn-cluster" mode, as the driver would not be running on the local node):

  ```bash
  jupyter notebook --notebook-dir=./ --ip=* --no-browser
  ```

  or as a normal Python script (e.g. script.py); both "yarn-client" and "yarn-cluster" are supported:

  ```bash
  python script.py
  ```
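Below is a minimal sketch of such a script.py; the HDFS path and the computation are assumptions for illustration. Reading from HDFS keeps the script working under both "yarn-client" and "yarn-cluster" modes:

```python
# script.py -- a hypothetical example, assuming a text file already exists
# at the HDFS path below.
from bigdl.orca import init_orca_context, stop_orca_context

# Switch to cluster_mode="yarn-cluster" to run the driver inside YARN.
sc = init_orca_context(cluster_mode="yarn-client", cores=4, memory="10g", num_nodes=2)

# Use HDFS (not a local path) so the data is reachable in both modes.
lines = sc.textFile("hdfs:///user/yourname/input.txt")  # hypothetical path
print("Number of lines:", lines.count())

stop_orca_context()
```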
## 3. Run on YARN with spark-submit
Follow the steps below if you need to run BigDL with spark-submit.
- Pack the current conda environment to `environment.tar.gz` (you can use any name you like):

  ```bash
  conda pack -o environment.tar.gz
  ```
- You need to write your BigDL program as a Python script. In the script, you can call `init_orca_context` and specify `cluster_mode` to be "spark-submit" (a fuller sketch is given after this list):

  ```python
  from bigdl.orca import init_orca_context

  sc = init_orca_context(cluster_mode="spark-submit")
  ```
- Use `spark-submit-with-dllib` to submit your BigDL program (e.g. script.py):

  yarn-cluster mode:

  ```bash
  spark-submit-with-dllib \
      --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=environment/bin/python \
      --conf spark.executorEnv.PYSPARK_PYTHON=environment/bin/python \
      --master yarn-cluster \
      --executor-memory 10g \
      --driver-memory 10g \
      --executor-cores 8 \
      --num-executors 2 \
      --archives environment.tar.gz#environment \
      script.py
  ```

  You can adjust the configurations according to your cluster settings.

  yarn-client mode:

  ```bash
  spark-submit-with-dllib \
      --conf spark.driverEnv.PYSPARK_PYTHON=environment/bin/python \
      --conf spark.executorEnv.PYSPARK_PYTHON=environment/bin/python \
      --master yarn-client \
      --executor-memory 10g \
      --driver-memory 10g \
      --executor-cores 8 \
      --num-executors 2 \
      --archives environment.tar.gz#environment \
      script.py
  ```

  Notice: the yarn-client driver runs locally, while the yarn-cluster driver runs in a YARN container, so the setting for the driver's `PYSPARK_PYTHON` differs: use `spark.driverEnv.PYSPARK_PYTHON` in yarn-client mode and `spark.yarn.appMasterEnv.PYSPARK_PYTHON` in yarn-cluster mode.
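For reference, here is a minimal sketch of a script.py for spark-submit mode; the workload is a placeholder assumption. Unlike the built-in function approach, all resource settings (cores, memory, executors) come from the spark-submit command line above rather than from `init_orca_context` arguments:

```python
# script.py -- a hypothetical example for spark-submit mode; resources are
# configured entirely by the spark-submit command, not in this script.
from bigdl.orca import init_orca_context, stop_orca_context

sc = init_orca_context(cluster_mode="spark-submit")

# Placeholder distributed workload to confirm the executors are reachable.
rdd = sc.parallelize(range(1000), numSlices=8)
print("Sum:", rdd.sum())

stop_orca_context()
```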