Update Orca install doc (#6518)
* update install
* update
* update
* update link
* update 5min
* update 5mins
* update link
This commit is contained in:
parent 17fb75f8d7
commit 7da102243e
4 changed files with 90 additions and 36 deletions
@@ -1,6 +1,10 @@
 # Installation
 
-## Install Java
+---
+## Prepare the environment
+
+You can follow the commands in this section to install Java and conda before installing BigDL Orca.
+
+### Install Java
 
 You need to download and install JDK in the environment, and properly set the environment variable `JAVA_HOME`. JDK8 is highly recommended.
 
 ```bash
@@ -16,7 +20,7 @@ export PATH=$PATH:$JAVA_HOME/bin
 java -version # Verify the version of JDK.
 ```
 
-## Install Anaconda
+### Install Anaconda
 We recommend using [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) to prepare the Python environment.
 
 You can follow the steps below to install conda:
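On a Linux machine, those steps typically amount to downloading and running the Miniconda installer; the sketch below assumes the generic latest Linux x86_64 installer rather than any specific version pinned by the doc:

```bash
# Sketch only: download and silently install Miniconda, then activate it
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
source $HOME/miniconda3/bin/activate
# Afterwards, create and activate the conda environment as shown below
```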
@@ -37,7 +41,10 @@ conda create -n py37 python=3.7 # "py37" is conda environment name, you can use
 conda activate py37
 ```
 
-## To use basic Orca features
+---
+## Install BigDL Orca
+
+### To use basic Orca features
 You can install Orca in your created conda environment for distributed data processing, training and inference with the following command:
 ```bash
 pip install bigdl-orca # For the official release version
@@ -48,7 +55,9 @@ or for the nightly build version, use:
 pip install --pre --upgrade bigdl-orca # For the latest nightly build version
 ```
 
-## To additionally use RayOnSpark
+Note that installing Orca will automatically install the dependencies, including `bigdl-dllib`, `bigdl-tf`, `bigdl-math`, `packaging`, `filelock`, `pyzmq` and their dependencies, if they haven't been detected in your conda environment.
+
+### To additionally use RayOnSpark
 
 If you wish to run [RayOnSpark](ray.md) or [sklearn-style Estimator APIs in Orca](distributed-training-inference.md) with the "ray" backend, use the extra key `[ray]` during the installation above:
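Once Orca is installed with the `[ray]` extra, RayOnSpark lets you run plain Ray code on top of the Spark cluster. The snippet below is an illustrative sketch (the `init_ray_on_spark` flag and the local resources shown are assumptions for a minimal local run, not code from the original doc):

```python
# Sketch: bootstrap Ray on Spark via Orca and run a trivial Ray task
from bigdl.orca import init_orca_context, stop_orca_context
import ray

sc = init_orca_context(cluster_mode="local", cores=4, init_ray_on_spark=True)

@ray.remote
def ping():
    return "pong"

print(ray.get(ping.remote()))
stop_orca_context()
```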
@@ -64,7 +73,7 @@ pip install --pre --upgrade bigdl-orca[ray] # For the latest nightly build vers
 Note that with the extra key of [ray], `pip` will automatically install the additional dependencies for RayOnSpark,
 including `ray[default]==1.9.2`, `aiohttp==3.8.1`, `async-timeout==4.0.1`, `aioredis==1.3.1`, `hiredis==2.0.0`, `prometheus-client==0.11.0`, `psutil`, `setproctitle`.
 
-## To additionally use AutoML
+### To additionally use AutoML
 
 If you wish to run AutoML, use the extra key `[automl]` during the installation above:
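A small practical note on the extras syntax (not from the original doc): some shells, zsh in particular, treat square brackets as glob patterns, so it is safest to quote the package specifier:

```bash
# Quote the extras so the shell does not try to expand the brackets
pip install "bigdl-orca[ray]"
pip install "bigdl-orca[automl]"
```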
@@ -83,3 +92,32 @@ including ray[tune]==1.9.2, scikit-learn, tensorboard, xgboost together
 - To use [Pytorch AutoEstimator](distributed-tuning.md#pytorch-autoestimator), you need to install Pytorch with `pip install torch==1.8.1`.
 
 - To use [TensorFlow/Keras AutoEstimator](distributed-tuning.md#tensorflow-keras-autoestimator), you need to install TensorFlow with `pip install tensorflow==1.15.0`.
+
+### To install Orca for Spark3
+
+By default, Orca is built on top of Spark 2.4.6 (with pyspark==2.4.6 as a dependency). If you want to install Orca built on top of Spark 3.1.3 (with pyspark==3.1.3 as a dependency), you can use the following commands instead:
+
+```bash
+# For the official release version
+pip install bigdl-orca-spark3
+pip install bigdl-orca-spark3[ray]
+pip install bigdl-orca-spark3[automl]
+
+# For the latest nightly build version
+pip install --pre --upgrade bigdl-orca-spark3
+pip install --pre --upgrade bigdl-orca-spark3[ray]
+pip install --pre --upgrade bigdl-orca-spark3[automl]
+```
+
+__Note__: You should only install Orca built on top of __ONE__ Spark version, but not both. If you want to switch the Spark version, please [**uninstall**](#to-uninstall-orca) Orca cleanly before reinstalling.
+
+### To uninstall Orca
+```bash
+# For default Orca built on top of Spark 2.4.6
+pip uninstall bigdl-orca bigdl-dllib bigdl-tf bigdl-math bigdl-core
+
+# For Orca built on top of Spark 3.1.3
+pip uninstall bigdl-orca-spark3 bigdl-dllib-spark3 bigdl-tf bigdl-math bigdl-core
+```
+
+__Note__: If necessary, you need to manually uninstall `pyspark` and other [dependencies](https://github.com/intel-analytics/BigDL/tree/main/python/requirements/orca) introduced by Orca.
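After installing, or after switching between the Spark 2 and Spark 3 builds, it can help to confirm exactly which variant is present in the active environment. This check is only an illustration, not a step from the doc:

```bash
# List the BigDL packages installed in the current environment
pip list | grep -i bigdl

# Show details (version, dependencies) of whichever Orca package is installed
pip show bigdl-orca || pip show bigdl-orca-spark3
```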
@@ -2,13 +2,15 @@
 
 ### Overview
 
-Most AI projects start with a Python notebook running on a single laptop; however, one usually needs to go through a mountain of pains to scale it to handle larger data set in a distributed fashion. The _**Orca**_ library seamlessly scales out your single node Python notebook across large clusters (so as to process distributed Big Data).
+The _**Orca**_ library in BigDL can seamlessly scale out your single node Python notebook across large clusters to process large-scale data.
+
+This page demonstrates how to scale the distributed training and inference of a standard TensorFlow model to a large cluster with minimum code changes to your notebook using Orca. We use [Neural Collaborative Filtering](https://arxiv.org/abs/1708.05031) for recommendation as an example.
 
 ---
 
 ### TensorFlow Bite-sized Example
 
-First of all, follow the steps [here](install.md#to-use-basic-orca-features) to install Orca in your environment.
+Before running this example, follow the steps [here](install.md) to prepare the environment and install Orca in your environment.
 
 This section uses **TensorFlow 2.x**, and you should also install TensorFlow before running this example:
 ```bash
@@ -24,11 +26,10 @@ from bigdl.orca import init_orca_context, stop_orca_context, OrcaContext
 sc = init_orca_context(cluster_mode="local", cores=4, memory="10g", num_nodes=1)
 ```
 
-Next, perform [data-parallel processing in Orca](data-parallel-processing.md) (supporting standard Spark Dataframes, TensorFlow Dataset, PyTorch DataLoader, Pandas, etc.). Here to make things simple, we just generate some random data with Spark DataFrame:
+Next, perform [data-parallel processing in Orca](data-parallel-processing.md) (supporting standard Spark DataFrames, TensorFlow Dataset, PyTorch DataLoader, Pandas, etc.). Here to make things simple, we just generate some random data with Spark DataFrame:
 
 ```python
 import random
 from pyspark.sql.functions import array
 from pyspark.sql.types import StructType, StructField, IntegerType
 from bigdl.orca import OrcaContext
@@ -41,44 +42,59 @@ schema = StructType([StructField("user", IntegerType(), False),
                      StructField("item", IntegerType(), False),
                      StructField("label", IntegerType(), False)])
 df = spark.createDataFrame(rdd, schema)
-train, test = df.randomSplit([0.8, 0.2], seed=1)
+train_df, test_df = df.randomSplit([0.8, 0.2], seed=1)
 ```
 
 Finally, use [sklearn-style Estimator APIs in Orca](distributed-training-inference.md) to perform distributed _TensorFlow_, _PyTorch_, _Keras_ and _BigDL_ training and inference:
 
 ```python
-from tensorflow import keras
 from bigdl.orca.learn.tf2.estimator import Estimator
 
 def model_creator(config):
-    user_input = keras.layers.Input(shape=(1,), dtype="int32", name="use_input")
-    item_input = keras.layers.Input(shape=(1,), dtype="int32", name="item_input")
-
-    mlp_embed_user = keras.layers.Embedding(input_dim=num_users, output_dim=config["embed_dim"],
-                                            input_length=1)(user_input)
-    mlp_embed_item = keras.layers.Embedding(input_dim=num_items, output_dim=config["embed_dim"],
-                                            input_length=1)(item_input)
-
-    user_latent = keras.layers.Flatten()(mlp_embed_user)
-    item_latent = keras.layers.Flatten()(mlp_embed_item)
-
-    mlp_latent = keras.layers.concatenate([user_latent, item_latent], axis=1)
-    predictions = keras.layers.Dense(2, activation="sigmoid")(mlp_latent)
-    model = keras.models.Model(inputs=[user_input, item_input], outputs=predictions)
-    model.compile(optimizer='adam',
-                  loss='sparse_categorical_crossentropy',
-                  metrics=['accuracy'])
-    return model
-
-est = Estimator.from_keras(model_creator=model_creator, backend="spark", config={"embed_dim": 8})
-est.fit(data=train,
-        batch_size=64,
+    from tensorflow import keras
+
+    user_input = keras.layers.Input(shape=(1,), dtype="int32", name="use_input")
+    item_input = keras.layers.Input(shape=(1,), dtype="int32", name="item_input")
+
+    mlp_embed_user = keras.layers.Embedding(input_dim=config["num_users"], output_dim=config["embed_dim"],
+                                            input_length=1)(user_input)
+    mlp_embed_item = keras.layers.Embedding(input_dim=config["num_items"], output_dim=config["embed_dim"],
+                                            input_length=1)(item_input)
+
+    user_latent = keras.layers.Flatten()(mlp_embed_user)
+    item_latent = keras.layers.Flatten()(mlp_embed_item)
+
+    mlp_latent = keras.layers.concatenate([user_latent, item_latent], axis=1)
+    predictions = keras.layers.Dense(1, activation="sigmoid")(mlp_latent)
+    model = keras.models.Model(inputs=[user_input, item_input], outputs=predictions)
+    model.compile(optimizer='adam',
+                  loss='binary_crossentropy',
+                  metrics=['accuracy'])
+    return model
+
+
+batch_size = 64
+train_steps = int(train_df.count() / batch_size)
+val_steps = int(test_df.count() / batch_size)
+
+est = Estimator.from_keras(model_creator=model_creator, backend="spark",
+                           config={"embed_dim": 8, "num_users": num_users, "num_items": num_items})
+est.fit(data=train_df,
+        batch_size=batch_size,
         epochs=4,
         feature_cols=['user', 'item'],
         label_cols=['label'],
-        steps_per_epoch=int(train.count()/64),
-        validation_data=test,
-        validation_steps=int(test.count()/64))
+        steps_per_epoch=train_steps,
+        validation_data=test_df,
+        validation_steps=val_steps)
+prediction_df = est.predict(test_df,
+                            batch_size=batch_size,
+                            feature_cols=['user', 'item'],
+                            steps=val_steps)
 ```
 
 Stop [Orca Context](orca-context.md) after you finish your program:
 
 ```python
 stop_orca_context()
 ```
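The same example is meant to scale from a laptop to a cluster mostly by changing `init_orca_context`. The snippet below is a sketch under assumptions (yarn-client mode, arbitrary resource sizes, HADOOP_CONF_DIR already exported on the client machine), not code taken from the original page:

```python
# Sketch: run the identical notebook code on YARN instead of locally
from bigdl.orca import init_orca_context, stop_orca_context

sc = init_orca_context(cluster_mode="yarn-client",
                       cores=4, memory="10g", num_nodes=2)

# ... same data preparation, est.fit(...) and est.predict(...) as above ...

stop_orca_context()
```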
@@ -86,7 +86,7 @@ export HADOOP_CONF_DIR=/path/to/hadoop/conf
 ### 2.2 Install Python Libraries
 - See [here](../Overview/install.md#install-anaconda) to install conda and prepare the Python environment on the __Client Node__.
 
-- See [here](../Overview/install.md#to-use-basic-orca-features) to install BigDL Orca in the created conda environment.
+- See [here](../Overview/install.md#install-bigdl-orca) to install BigDL Orca in the created conda environment.
 
 - You should install all the other Python libraries that you need in your program in the conda environment as well.
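Taken together, the client-node setup described by these bullets typically looks like the sketch below; the extra packages listed are placeholders for whatever your own program imports:

```bash
# Prepare the conda environment on the Client Node and install Orca plus program dependencies
conda create -n py37 python=3.7
conda activate py37
pip install bigdl-orca            # or bigdl-orca-spark3 for the Spark 3 build
pip install tensorflow pandas     # placeholders: install the libraries your program actually needs
```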
@@ -62,12 +62,12 @@ pip uninstall bigdl-dllib bigdl-core bigdl-tf bigdl-math bigdl-orca bigdl-chrono
 
 #### 1.3 BigDL on Spark 3
 
-You can install BigDL built on top of Spark 3.1.2 as follows:
+You can install BigDL built on top of Spark 3.1.3 as follows:
 ```bash
 pip install bigdl-spark3 # Install the latest release version
 pip install --pre --upgrade bigdl-spark3 # Install the latest nightly build version
 ```
-You can find the list of the nightly build versions built on top of Spark 3.1.2 [here](https://pypi.org/project/bigdl-spark3/#history).
+You can find the list of the nightly build versions built on top of Spark 3.1.3 [here](https://pypi.org/project/bigdl-spark3/#history).
 
 You could uninstall all the packages of BigDL on Spark3 as follows:
@@ -123,7 +123,7 @@ For more details, please refer to [Orca Context](../Orca/Overview/orca-context.m
 BigDL has been tested on __Python 3.6 and 3.7__ with the following library versions:
 
 ```bash
-pyspark==2.4.6 or 3.1.2
+pyspark==2.4.6 or 3.1.3
 ray==1.9.2
 tensorflow==1.15.0 or >2.0
 pytorch>=1.5.0