update Databricks user guide (#5779)

* update databricks doc

Co-authored-by: Zhou <jian.zhou@intel.com>
You can run BigDL programs on the [Databricks](https://databricks.com/) cluster as follows.
### **1. Create a Databricks Cluster**
- Create either an [AWS Databricks](https://docs.databricks.com/getting-started/try-databricks.html) workspace or an [Azure Databricks](https://docs.microsoft.com/en-us/azure/azure-databricks/) workspace.
- Create a Databricks [cluster](https://docs.databricks.com/clusters/create.html) using the UI. Choose Databricks runtime version. This guide is tested on Runtime 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12).
### 2. Download BigDL Libraries
Download the BigDL package from [here](https://oss.sonatype.org/content/repositories/snapshots/com/intel/analytics/bigdl/bigdl-assembly-spark_3.1.2/2.1.0-SNAPSHOT/), scroll down to the bottom, and choose the **latest** release **bigdl-assembly-spark_3.1.2-2.1.0-*-fat-jars.zip**.
![](images/fat-jars.png)
Unzip the zip file; we only need two files (a command-line sketch follows the list):
- jars/**bigdl-assembly-spark_3.1.2-2.1.0-SNAPSHOT-jar-with-dependencies.jar**
- python/**bigdl-spark_3.1.2-2.1.0-SNAPSHOT-python-api.zip**
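If you prefer to fetch and unpack the archive from a terminal instead of the browser, a rough sketch is shown below. The file name is illustrative: the real snapshot carries a build timestamp in place of `SNAPSHOT`, so copy the actual name from the repository page.

```
# Illustrative only -- substitute the timestamped file name you picked on the page.
wget https://oss.sonatype.org/content/repositories/snapshots/com/intel/analytics/bigdl/bigdl-assembly-spark_3.1.2/2.1.0-SNAPSHOT/bigdl-assembly-spark_3.1.2-2.1.0-SNAPSHOT-fat-jars.zip
unzip bigdl-assembly-spark_3.1.2-2.1.0-SNAPSHOT-fat-jars.zip
# The two files we need end up under jars/ and python/:
ls jars/bigdl-assembly-spark_3.1.2-2.1.0-SNAPSHOT-jar-with-dependencies.jar
ls python/bigdl-spark_3.1.2-2.1.0-SNAPSHOT-python-api.zip
```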
### 3. Install BigDL Java dependencies
In the Databricks left panel, click **Compute** and select your cluster.
![](images/compute.png)
Install the BigDL Java packages using **bigdl-assembly-spark_3.1.2-2.1.0-SNAPSHOT-jar-with-dependencies.jar** from [step 2](#2-download-bigdl-libraries). Click **Libraries > Install New > Library Source(Upload) > Library Type (Jar)**. Drop the jar on Databricks.
![](images/assembly-jar.png)
After the upload finishes, click **Install**.
> Tip: if the upload is very slow, try uploading with the **Databricks CLI** instead; see [Appendix B](#appendix-b) for details.
### 4. Install BigDL Python libraries
Install the BigDL Python environment using **bigdl-spark_3.1.2-2.1.0-SNAPSHOT-python-api.zip** from [step 2](#2-download-bigdl-libraries). However, Databricks can only upload **Jar**, **Python Egg** and **Python Whl** libraries, and doesn't support **Zip**, so we cannot simply upload the Python API zip and install it as we did in [step 3](#3-install-bigdl-java-dependencies). You can upload and install the zip package in one of the following ways.
#### 4.1 Upload and Install through DBFS
**First, upload the zip package to [DBFS](https://docs.databricks.com/dbfs/index.html).** In the left panel, click **Data > DBFS** (if DBFS does not appear in your panel, see [Appendix A](#appendix-a)), then choose or create a folder, right-click in the folder, and choose **Upload here**.
![](images/upload.png)
Upload your zip package.
![](images/upload-success.png)
Right-click the uploaded zip package, choose **Copy path**, and copy the **Spark API Format** path.
![](images/copy-path.png)
**Then install the zip package from DBFS.** In the left panel, click **Compute > choose your cluster > Libraries > Install new > Library Source(DBFS/ADLS) > Library Type(Python Egg) > paste the path > Install**
![](images/install-zip.png)
#### 4.2 Change the File Extension Name
You can simply change the **bigdl-spark_3.1.2-2.1.0-SNAPSHOT-python-api.zip** extension (**.zip**) to **.egg**, since an Egg is essentially a zip-format package. Then in the left panel, click **Compute > choose your cluster > Libraries > Install new > Library Source(Upload) > Library Type(Python Egg) > Install**.
![](images/egg.png)
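For reference, the rename can be done in a local shell before uploading; this is just a sketch with an illustrative file name (your snapshot will carry a build timestamp):

```
cp bigdl-spark_3.1.2-2.1.0-SNAPSHOT-python-api.zip bigdl-spark_3.1.2-2.1.0-SNAPSHOT-python-api.egg
```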
### **5. Set Spark configuration**
On the cluster configuration page, click the **Advanced Options** toggle. Click the **Spark** tab. You can provide custom [Spark configuration properties](https://spark.apache.org/docs/latest/configuration.html) in a cluster configuration. Please set it according to your cluster resource and program needs. For example:
```
spark.executor.cores 2
spark.cores.max 4
```
### **6. Run BigDL on Databricks**
Open a new notebook, and call `init_orca_context` at the beginning of your code (with `cluster_mode` set to "spark-submit").
```
from bigdl.orca import init_orca_context

init_orca_context(cluster_mode="spark-submit")
```
Output on Databricks:
![](images/init-orca-context.png)
### **7. Install other third-party libraries on Databricks if necessary**
If you want to use other third-party libraries, check the related Databricks documentation: [libraries for AWS Databricks](https://docs.databricks.com/libraries/index.html) and [libraries for Azure Databricks](https://docs.microsoft.com/en-us/azure/databricks/libraries/).
### Appendix A
If there is no DBFS in your panel, go to **User profile > Admin Console > Workspace settings > Advanced** and enable **DBFS File Browser**.
![](images/dbfs.png)
### Appendix B
Use the **Databricks CLI** to upload files to DBFS.
**Install and configure the Azure Databricks CLI**
1. Install Python. You need Python 2.7.9 or above if you're using Python 2, or Python 3.6 or above if you're using Python 3.
2. Run `pip install databricks-cli`
3. Set up authentication. Click **user profile icon > User Settings > Access tokens > Generate new token > Generate**, then **copy** the token and store it in a secure location; **it won't be shown again**.
![](images/token.png)
4. Copy the URL of the Databricks host; the format is `https://adb-<workspace-id>.<random-number>.azuredatabricks.net`. You can copy it from your Databricks web page URL.
![](images/url.png)
5. In cmd, run `dbfs configure --token` as shown below:
```
dbfs configure --token
Databricks Host (should begin with https://): https://your.url.from.step.4
Token: your-token-from-step-3
```
6. Verify that you can connect to DBFS by running `databricks fs ls`.
![](images/verify-dbfs.png)
**Upload through the Databricks CLI**
Now we can use the Databricks CLI to upload files to DBFS. Run:
```
dbfs cp /your/local/filepath/bigdl-assembly-spark_3.1.2-2.1.0-SNAPSHOT-jar-with-dependencies.jar dbfs:/FileStore/jars/stable/bigdl-assembly-spark_3.1.2-2.1.0-SNAPSHOT-jar-with-dependencies.jar
```
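The same pattern works for the Python API zip from step 4 if you want to push it through the CLI as well; the local path and DBFS folder below are illustrative:

```
dbfs cp /your/local/filepath/bigdl-spark_3.1.2-2.1.0-SNAPSHOT-python-api.zip dbfs:/FileStore/jars/stable/bigdl-spark_3.1.2-2.1.0-SNAPSHOT-python-api.zip
# List the target folder to confirm the upload:
dbfs ls dbfs:/FileStore/jars/stable/
```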
After the command finishes, check DBFS in Databricks: in the left panel, click **Data > DBFS > your upload directory** (if you do not see DBFS in your panel, see [Appendix A](#appendix-a)).
**Install the package from DBFS**
In the left panel, click **Compute > choose your cluster > Libraries > Install new > Library Source(DBFS/ADLS) > Library Type(your package type)**.
![](images/install-zip.png)
