update Databricks user guide (#5779)
* update databricks doc

Co-authored-by: Zhou <jian.zhou@intel.com>
You can run BigDL programs on a [Databricks](https://databricks.com/) cluster as follows.

### **1. Create a Databricks Cluster**

- Create either an [AWS Databricks](https://docs.databricks.com/getting-started/try-databricks.html) workspace or an [Azure Databricks](https://docs.microsoft.com/en-us/azure/azure-databricks/) workspace.

- Create a Databricks [cluster](https://docs.databricks.com/clusters/create.html) using the UI. Choose a Databricks runtime version. This guide is tested on Runtime 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12).

### **2. Download BigDL Libraries**

Download the BigDL package from [here](https://oss.sonatype.org/content/repositories/snapshots/com/intel/analytics/bigdl/bigdl-assembly-spark_3.1.2/2.1.0-SNAPSHOT/): scroll down to the bottom and choose the **latest** release, **bigdl-assembly-spark_3.1.2-2.1.0-*-fat-jars.zip**.

![](images/fat-jars.png)

Unzip the downloaded file; we only need two files from it (a short extraction sketch follows the list):

- jars/**bigdl-assembly-spark_3.1.2-2.1.0-SNAPSHOT-jar-with-dependencies.jar**

- python/**bigdl-spark_3.1.2-2.1.0-SNAPSHOT-python-api.zip**

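For reference, here is a minimal extraction sketch in Python, assuming the downloaded archive sits in the current directory (the real file name carries a snapshot timestamp, so adjust `archive` accordingly):

```python
import zipfile

# Assumed local name of the downloaded archive; substitute the
# timestamped file you actually downloaded.
archive = "bigdl-assembly-spark_3.1.2-2.1.0-SNAPSHOT-fat-jars.zip"

with zipfile.ZipFile(archive) as zf:
    for name in zf.namelist():
        # Keep only the assembly jar and the python-api zip.
        if name.endswith(("jar-with-dependencies.jar", "python-api.zip")):
            zf.extract(name)
            print("extracted:", name)
```
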
### **3. Install BigDL Java dependencies**

In the Databricks left panel, click **Compute** and select your cluster.

![](images/compute.png)

Install the BigDL Java packages using **bigdl-assembly-spark_3.1.2-2.1.0-SNAPSHOT-jar-with-dependencies.jar** from [step 2](#2-download-bigdl-libraries). Click **Libraries > Install New > Library Source (Upload) > Library Type (Jar)**, then drop the jar on Databricks.

![](images/assembly-jar.png)

After the upload finishes, click **Install**.

> Tip: if the upload is slow, try uploading with the **Databricks CLI** instead; see [Appendix B](#appendix-b) for details.

### **4. Install BigDL Python libraries**

Install the BigDL Python environment using **bigdl-spark_3.1.2-2.1.0-SNAPSHOT-python-api.zip** from [step 2](#2-download-bigdl-libraries). Databricks can only upload **Jar**, **Python Egg**, and **Python Whl** files, not **Zip**, so we cannot simply upload the python-api zip and install it as we did in [step 3](#3-install-bigdl-java-dependencies). Instead, upload and install the zip package in one of the following ways.

#### 4.1 Upload and Install through DBFS

**First, upload the zip package to [DBFS](https://docs.databricks.com/dbfs/index.html).** In the left panel, click **Data > DBFS** (if DBFS is not shown in your panel, see [Appendix A](#appendix-a)), then choose or create a folder, right-click in it, and choose **Upload here**.

![](images/dbfs.png)

Upload your zip package.

![](images/upload.png)

Right-click the uploaded zip package, choose **Copy path**, and copy the **Spark API Format** path.

![](images/copy-path.png)

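For reference, the two formats differ only in their prefix; a hypothetical upload to `FileStore/jars` would yield paths like the following, and we need the first one:

```
# Spark API Format (use this one when installing the library):
dbfs:/FileStore/jars/bigdl-spark_3.1.2-2.1.0-SNAPSHOT-python-api.zip

# File API Format (not used here):
/dbfs/FileStore/jars/bigdl-spark_3.1.2-2.1.0-SNAPSHOT-python-api.zip
```
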
**Then install the zip package from DBFS.** In the left panel, click **Compute > your cluster > Libraries > Install new > Library Source (DBFS/ADLS) > Library Type (Python Egg)**, paste the path, and click **Install**.

![](images/install-zip.png)

#### 4.2 Change the File Extension Name

Alternatively, since an egg is essentially a zip-format package, you can simply change the extension of **bigdl-spark_3.1.2-2.1.0-SNAPSHOT-python-api.zip** from **.zip** to **.egg** (see the sketch below). Then, in the left panel, click **Compute > your cluster > Libraries > Install new > Library Source (Upload) > Library Type (Python Egg)** and click **Install**.

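A minimal sketch of the rename, assuming the zip sits in the current directory:

```python
import os

# An egg is just a zip archive, so renaming the file is enough.
src = "bigdl-spark_3.1.2-2.1.0-SNAPSHOT-python-api.zip"
os.rename(src, src[:-len(".zip")] + ".egg")
```
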
![](images/egg.png)

### **5. Set Spark configuration**

On the cluster configuration page, click the **Advanced Options** toggle, then click the **Spark** tab. You can provide custom [Spark configuration properties](https://spark.apache.org/docs/latest/configuration.html) in the cluster configuration; set them according to your cluster resources and program needs. For example:

```
spark.executor.cores 2
spark.cores.max 4
```

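To confirm the properties took effect, here is a minimal sketch you can run in a notebook cell (Databricks provides the `spark` session object automatically; adjust the keys to whatever you set above):

```python
# Read back the values set on the cluster's Spark tab.
conf = spark.sparkContext.getConf()
print(conf.get("spark.executor.cores"))  # expected: 2
print(conf.get("spark.cores.max"))       # expected: 4
```
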
### **6. Run BigDL on Databricks**

Open a new notebook and call `init_orca_context` at the beginning of your code (with `cluster_mode` set to "spark-submit"):

```python
from bigdl.orca import init_orca_context

init_orca_context(cluster_mode="spark-submit")
```

Output on Databricks:

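As a quick smoke test that the jar from [step 3](#3-install-bigdl-java-dependencies) and the Python package from [step 4](#4-install-bigdl-python-libraries) are both installed correctly, you can run a small Spark job through the `SparkContext` that `init_orca_context` returns (the job below is illustrative, not from the original guide):

```python
from bigdl.orca import init_orca_context, stop_orca_context

# init_orca_context returns a SparkContext; reuse it for a tiny job.
sc = init_orca_context(cluster_mode="spark-submit")

# Sum the integers 0..99 as an end-to-end sanity check (expected: 4950).
print(sc.range(100).reduce(lambda a, b: a + b))

# Release the resources once your program is done.
stop_orca_context()
```
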
### **7. Install other third-party libraries on Databricks if necessary**

If you want to use other third-party libraries, check the related Databricks documentation: [libraries for AWS Databricks](https://docs.databricks.com/libraries/index.html) and [libraries for Azure Databricks](https://docs.microsoft.com/en-us/azure/databricks/libraries/).

### Appendix A

If there is no DBFS in your panel, go to **User profile > Admin Console > Workspace settings > Advanced** and enable **DBFS File Browser**.

### Appendix B

Use the **Databricks CLI** to upload files to DBFS.

**Install and configure the Databricks CLI**

1. Install Python: the CLI needs Python 2.7.9+ if you're using Python 2, or Python 3.6+ if you're using Python 3.

2. Run `pip install databricks-cli`.

3. Set up authentication. Click **user profile icon > User Settings > Access tokens > Generate new token > Generate**, then **copy** the token and store it in a secure location; **it won't be shown again**.

![](images/token.png)

4. Copy the Databricks host URL, which has the format `https://adb-<workspace-id>.<random-number>.azuredatabricks.net`; you can copy it from your Databricks web page URL.

![](images/url.png)

5. In a terminal, run `dbfs configure --token` as shown below:

```
dbfs configure --token
Databricks Host (should begin with https://): https://your.url.from.step.4
Token: your-token-from-step-3
```

6. Verify that you can connect to DBFS by running `databricks fs ls`.

![](images/verify-dbfs.png)

**Upload through Databricks CLI**

Now we can use the Databricks CLI to upload files to DBFS. Run:

```
dbfs cp /your/local/filepath/bigdl-assembly-spark_3.1.2-2.1.0-SNAPSHOT-jar-with-dependencies.jar dbfs:/FileStore/jars/stable/bigdl-assembly-spark_3.1.2-2.1.0-SNAPSHOT-jar-with-dependencies.jar
```

After the command finishes, check DBFS in Databricks: in the left panel, click **Data > DBFS > your upload directory**. If you do not see DBFS in your panel, see [Appendix A](#appendix-a).

**Install package from DBFS**

In the left panel, click **Compute > your cluster > Libraries > Install new > Library Source (DBFS/ADLS) > Library Type (your package type)**.