diff --git a/docs/readthedocs/source/doc/UserGuide/databricks.md b/docs/readthedocs/source/doc/UserGuide/databricks.md
index 642b59ab..06f2bebe 100644
--- a/docs/readthedocs/source/doc/UserGuide/databricks.md
+++ b/docs/readthedocs/source/doc/UserGuide/databricks.md
@@ -5,44 +5,63 @@
 You can run BigDL program on the [Databricks](https://databricks.com/) cluster as follows.
 ### **1. Create a Databricks Cluster**
 
-- Create either [AWS Databricks](https://docs.databricks.com/getting-started/try-databricks.html) workspace or [Azure Databricks](https://docs.microsoft.com/en-us/azure/azure-databricks/) workspace.
-- Create a Databricks [clusters](https://docs.databricks.com/clusters/create.html) using the UI. Choose Databricks runtime version. This guide is tested on Runtime 7.3 (includes Apache Spark 3.0.1, Scala 2.12).
+- Create either an [AWS Databricks](https://docs.databricks.com/getting-started/try-databricks.html) workspace or an [Azure Databricks](https://docs.microsoft.com/en-us/azure/azure-databricks/) workspace.
+- Create a Databricks [cluster](https://docs.databricks.com/clusters/create.html) using the UI. Choose a Databricks runtime version. This guide was tested on Runtime 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12).
 
-### **2. Installing BigDL Python libraries**
+### 2. Download BigDL Libraries
 
-In the left pane, click **Clusters** and select your cluster.
+Download the BigDL package from [here](https://oss.sonatype.org/content/repositories/snapshots/com/intel/analytics/bigdl/bigdl-assembly-spark_3.1.2/2.1.0-SNAPSHOT/): scroll to the bottom of the page and choose the **latest** release, **bigdl-assembly-spark_3.1.2-2.1.0-*-fat-jars.zip**.
 
-![](images/cluster.png)
+![](images/fat-jars.png)
 
-Install BigDL DLLib python environment using prebuilt release Wheel package. Click **Libraries > Install New > Upload > Python Whl**. Download BigDL DLLib prebuilt Wheel [here](https://sourceforge.net/projects/analytics-zoo/files/dllib-py). Choose a wheel with timestamp for the same Spark version and platform as Databricks runtime. Download and drop it on Databricks.
+Unzip the zip file. Only two of its files are needed:
 
-![](images/dllib-whl.png)
+- jars/**bigdl-assembly-spark_3.1.2-2.1.0-SNAPSHOT-jar-with-dependencies.jar**
+- python/**bigdl-spark_3.1.2-2.1.0-SNAPSHOT-python-api.zip**
 
-Install BigDL Orca python environment using prebuilt release Wheel package. Click **Libraries > Install New > Upload > Python Whl**. Download Bigdl Orca prebuilt Wheel [here](https://sourceforge.net/projects/analytics-zoo/files/dllib-py). Choose a wheel with timestamp for the same Spark version and platform as Databricks runtime. Download and drop it on Databricks.
+### 3. Install BigDL Java dependencies
 
-![](images/orca-whl.png)
+In the Databricks left panel, click **Compute** and select your cluster.
 
-If you want to use other BigDL libraries (Friesian, Chronos, Nano, Serving, etc.), download prebuilt release Wheel package from [here](https://sourceforge.net/projects/analytics-zoo/files/) and install to cluster in the similar ways.
+![](images/compute.png)
 
+Install the BigDL Java packages using **bigdl-assembly-spark_3.1.2-2.1.0-SNAPSHOT-jar-with-dependencies.jar** from [step 2](#2-download-bigdl-libraries). Click **Libraries > Install New > Library Source(Upload) > Library Type (Jar)**. Drop the jar on Databricks.
 
-### **3. Installing BigDL Java libraries**
+![](images/assembly-jar.png)
 
-Install BigDL DLLib prebuilt jar package. Click **Libraries > Install New > Upload > Jar**. Download BigDL DLLib prebuilt package from [Release Page](../release.md). Please note that you should choose the same spark version of package as your Databricks runtime version. Find jar named "bigdl-dllib-spark_*-jar-with-dependencies.jar" in the lib directory. Drop the jar on Databricks.
+After the upload finishes, click **Install**.
 
-![](images/dllib-jar.png)
+> Tip: if the upload is very slow, use the **Databricks CLI** to upload instead; see [Appendix B](#appendix-b) for details.
 
-Install BigDL Orca prebuilt jar package. Click **Libraries > Install New > Upload > Jar**. Download BigDL Orca prebuilt package from [Release Page](../release.md). Please note that you should choose the same spark version of package as your Databricks runtime version. Find jar named "bigdl-orca-spark_*-jar-with-dependencies.jar" in the lib directory. Drop the jar on Databricks.
+### 4. Install BigDL Python libraries
 
-![](images/orca-jar.png)
+Install the BigDL Python environment using **bigdl-spark_3.1.2-2.1.0-SNAPSHOT-python-api.zip** from [step 2](#2-download-bigdl-libraries). Databricks can only upload **Jar**, **Python Egg** and **Python Whl** libraries and does not support **Zip**, so the Python API zip cannot simply be uploaded and installed as in [step 3](#3-install-bigdl-java-dependencies). You can upload and install the zip package in one of the following ways.
 
-If you want to use other BigDL libraries (Friesian, Chronos, Nano, Serving, etc.), download prebuilt jar package from [Release Page](../release.md) and install to cluster in the similar ways.
+#### 4.1 Upload and Install through DBFS
 
+**First, upload the zip package to [DBFS](https://docs.databricks.com/dbfs/index.html).** In the left panel, click **Data > DBFS** (if your panel does not show DBFS, see [Appendix A](#appendix-a)). Then choose or create a folder, right-click inside it, and choose **Upload here**.
 
-Make sure the jar files and whl files are installed on all clusters. In **Libraries** tab of your cluster, check installed libraries and click “Install automatically on all clusters” option in **Admin Settings**.
+![](images/upload.png)
 
-![](images/apply-all.png)
+Upload your zip package.
 
-### **4. Setting Spark configuration**
+![](images/upload-success.png)
+
+Right-click the uploaded zip package, choose **Copy path**, and copy the **Spark API Format** path.
+
+![](images/copy-path.png)
+
+**Then install the zip package from DBFS.** In the left panel, click **Compute > choose your cluster > Libraries > Install new > Library Source(DBFS/ADLS) > Library Type(Python Egg) > paste the path > Install**.
+
+![](images/install-zip.png)
+
+#### 4.2 Change the File Extension Name
+
+You can simply change the extension of **bigdl-spark_3.1.2-2.1.0-SNAPSHOT-python-api.zip** from **.zip** to **.egg**, since an Egg is essentially a zip-format package. Then in the left panel, click **Compute > choose your cluster > Libraries > Install new > Library Source(Upload) > Library Type(Python Egg) > Install**.
+
+![](images/egg.png)
+
+### **5. Set Spark configuration**
 
 On the cluster configuration page, click the **Advanced Options** toggle. Click the **Spark** tab. You can provide custom [Spark configuration properties](https://spark.apache.org/docs/latest/configuration.html) in a cluster configuration. Please set it according to your cluster resource and program needs.
 
@@ -55,7 +74,7 @@ spark.executor.cores 2
 spark.cores.max 4
 ```
 
-### **5. Running BigDL on Databricks**
+### **6. Run BigDL on Databricks**
 
 Open a new notebook, and call `init_orca_context` at the beginning of your code (with `cluster_mode` set to "spark-submit").
 
@@ -66,9 +85,61 @@ init_orca_context(cluster_mode="spark-submit")
 
 Output on Databricks:
 
-![](images/spark-context.png)
+![](images/init-orca-context.png)
 
 
-### **6. Install other third-party libraries on Databricks if necessary**
+### **7. Install other third-party libraries on Databricks if necessary**
 
 If you want to use other third-party libraries, check related Databricks documentation of [libraries for AWS Databricks](https://docs.databricks.com/libraries/index.html) and [libraries for Azure Databricks](https://docs.microsoft.com/en-us/azure/databricks/libraries/).
+
+### Appendix A
+
+If there is no DBFS in your panel, go to **User profile > Admin Console > Workspace settings > Advanced > Enabled DBFS File Browser**
+
+![](images/dbfs.png)
+
+### Appendix B
+
+Use the **Databricks CLI** to upload files to DBFS.
+
+**Install and configure the Azure Databricks CLI**
+
+1. Install Python: version 2.7.9 or above if you are using Python 2, or version 3.6 or above if you are using Python 3.
+
+2. Run `pip install databricks-cli`
+
+3. Set up authentication. Click **user profile icon > User Settings > Access tokens > Generate new token > generate > copy the token**. Make sure to **copy** the token and store it in a secure location; **it won't show again**.
+
+    ![](images/token.png)
+
+4. Copy the URL of the Databricks host. The format is `https://adb-..azuredatabricks.net`; you can copy it from the URL of your Databricks web page.
+
+    ![](images/url.png)
+
+5. In cmd, run `dbfs configure --token` as shown below:
+
+    ```
+    dbfs configure --token
+    Databricks Host (should begin with https://): https://your.url.from.step.4
+    Token: your-token-from-step-3
+    ```
+
+6. To verify that you can connect to DBFS, run `databricks fs ls`.
+
+    ![](images/verify-dbfs.png)
+
+**Upload through Databricks CLI**
+
+Now we can use the Databricks CLI to upload a file to DBFS. Run the command:
+
+```
+dbfs cp /your/local/filepath/bigdl-assembly-spark_3.1.2-2.1.0-SNAPSHOT-jar-with-dependencies.jar dbfs:/FileStore/jars/stable/bigdl-assembly-spark_3.1.2-2.1.0-SNAPSHOT-jar-with-dependencies.jar
+```
+
+After the command finishes, check DBFS in Databricks: in the left panel, click **Data > DBFS > your upload directory**. If you do not see DBFS in your panel, see [Appendix A](#appendix-a).
+
+**Install package from DBFS**
+
+In the left panel, click **Compute > choose your cluster > Libraries > Install new > Library Source(DBFS/ADLS) > Library Type(your package type)**.
+
+![](images/install-zip.png)
\ No newline at end of file
diff --git a/docs/readthedocs/source/doc/UserGuide/images/assembly-jar.png b/docs/readthedocs/source/doc/UserGuide/images/assembly-jar.png
new file mode 100644
index 00000000..db1e0815
Binary files /dev/null and b/docs/readthedocs/source/doc/UserGuide/images/assembly-jar.png differ
diff --git a/docs/readthedocs/source/doc/UserGuide/images/compute.png b/docs/readthedocs/source/doc/UserGuide/images/compute.png
new file mode 100644
index 00000000..05629901
Binary files /dev/null and b/docs/readthedocs/source/doc/UserGuide/images/compute.png differ
diff --git a/docs/readthedocs/source/doc/UserGuide/images/copy-path.png b/docs/readthedocs/source/doc/UserGuide/images/copy-path.png
new file mode 100644
index 00000000..c996f358
Binary files /dev/null and b/docs/readthedocs/source/doc/UserGuide/images/copy-path.png differ
diff --git a/docs/readthedocs/source/doc/UserGuide/images/dbfs.png b/docs/readthedocs/source/doc/UserGuide/images/dbfs.png
new file mode 100644
index 00000000..ebc86d9d
Binary files /dev/null and b/docs/readthedocs/source/doc/UserGuide/images/dbfs.png differ
diff --git a/docs/readthedocs/source/doc/UserGuide/images/egg.png b/docs/readthedocs/source/doc/UserGuide/images/egg.png
new file mode 100644
index 00000000..8b96eb41
Binary files /dev/null and b/docs/readthedocs/source/doc/UserGuide/images/egg.png differ
diff --git a/docs/readthedocs/source/doc/UserGuide/images/fat-jars.png b/docs/readthedocs/source/doc/UserGuide/images/fat-jars.png
new file mode 100644
index 00000000..d8d87753
Binary files /dev/null and b/docs/readthedocs/source/doc/UserGuide/images/fat-jars.png differ
diff --git a/docs/readthedocs/source/doc/UserGuide/images/init-orca-context.png b/docs/readthedocs/source/doc/UserGuide/images/init-orca-context.png
new file mode 100644
index 00000000..653be141
Binary files /dev/null and b/docs/readthedocs/source/doc/UserGuide/images/init-orca-context.png differ
diff --git a/docs/readthedocs/source/doc/UserGuide/images/install-zip.png b/docs/readthedocs/source/doc/UserGuide/images/install-zip.png
new file mode 100644
index 00000000..9777492c
Binary files /dev/null and b/docs/readthedocs/source/doc/UserGuide/images/install-zip.png differ
diff --git a/docs/readthedocs/source/doc/UserGuide/images/token.png b/docs/readthedocs/source/doc/UserGuide/images/token.png
new file mode 100644
index 00000000..a035333c
Binary files /dev/null and b/docs/readthedocs/source/doc/UserGuide/images/token.png differ
diff --git a/docs/readthedocs/source/doc/UserGuide/images/upload-success.png b/docs/readthedocs/source/doc/UserGuide/images/upload-success.png
new file mode 100644
index 00000000..da952392
Binary files /dev/null and b/docs/readthedocs/source/doc/UserGuide/images/upload-success.png differ
diff --git a/docs/readthedocs/source/doc/UserGuide/images/upload.png b/docs/readthedocs/source/doc/UserGuide/images/upload.png
new file mode 100644
index 00000000..f9aab2f3
Binary files /dev/null and b/docs/readthedocs/source/doc/UserGuide/images/upload.png differ
diff --git a/docs/readthedocs/source/doc/UserGuide/images/url.png b/docs/readthedocs/source/doc/UserGuide/images/url.png
new file mode 100644
index 00000000..e483bfaa
Binary files /dev/null and b/docs/readthedocs/source/doc/UserGuide/images/url.png differ
diff --git a/docs/readthedocs/source/doc/UserGuide/images/verify-dbfs.png b/docs/readthedocs/source/doc/UserGuide/images/verify-dbfs.png
new file mode 100644
index 00000000..5ba9fc23
Binary files /dev/null and b/docs/readthedocs/source/doc/UserGuide/images/verify-dbfs.png differ
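
Note on section 4.2 of the updated guide: the rename works because an Egg is essentially a zip-format package. A minimal sketch of doing that step locally before upload is given here, assuming you have a local copy of the python-api zip. The helper name `zip_to_egg` is hypothetical and is not part of BigDL or Databricks; it copies the archive rather than renaming it, so the original download stays in place.

```python
import shutil
import zipfile
from pathlib import Path


def zip_to_egg(zip_path):
    """Copy a .zip archive to a sibling file with the .egg extension.

    Hypothetical helper for illustration only; not a BigDL or Databricks API.
    """
    src = Path(zip_path)
    # Refuse anything that is not a real zip archive, since changing the
    # extension cannot turn other formats into an Egg.
    if not zipfile.is_zipfile(str(src)):
        raise ValueError(f"{src} is not a valid zip archive")
    # Only the final suffix (.zip) is replaced; the rest of the name is kept.
    dst = src.with_suffix(".egg")
    shutil.copyfile(str(src), str(dst))
    return dst
```

For example, `zip_to_egg("bigdl-spark_3.1.2-2.1.0-SNAPSHOT-python-api.zip")` would write `bigdl-spark_3.1.2-2.1.0-SNAPSHOT-python-api.egg` next to the original, which can then be uploaded with Library Type set to Python Egg as the guide describes.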