Update Orca install doc (#6518)

* update install

* update

* update

* update link

* update 5min

* update 5mins

* update link
Kai Huang 2022-11-10 10:28:46 +08:00 committed by GitHub
parent 17fb75f8d7
commit 7da102243e
4 changed files with 90 additions and 36 deletions


@@ -1,6 +1,10 @@
# Installation
## Install Java
---
## Prepare the environment
You can follow the commands in this section to install Java and conda before installing BigDL Orca.
### Install Java
You need to download and install a JDK in the environment and properly set the environment variable `JAVA_HOME`. JDK 8 is highly recommended.
```bash
@@ -16,7 +20,7 @@ export PATH=$PATH:$JAVA_HOME/bin
java -version # Verify the version of JDK.
```
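For reference, a complete setup on Ubuntu with OpenJDK 8 might look like the following sketch (the package name and `JAVA_HOME` path are illustrative and vary across systems):
```bash
# Illustrative only: install OpenJDK 8 on Debian/Ubuntu and point JAVA_HOME at it
sudo apt-get install openjdk-8-jdk
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
java -version  # Should report version 1.8.x
```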
## Install Anaconda
### Install Anaconda
We recommend using [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) to prepare the Python environment.
You can follow the steps below to install conda:
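As a sketch, a typical Miniconda installation on Linux x86_64 looks like this (the installer URL may change over time; see the conda documentation for the current one):
```bash
# Illustrative Miniconda installation for Linux x86_64
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc  # Reload the shell so that conda is on PATH
```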
@@ -37,7 +41,10 @@ conda create -n py37 python=3.7 # "py37" is conda environment name, you can use
conda activate py37
```
## To use basic Orca features
---
## Install BigDL Orca
### To use basic Orca features
You can install Orca in your created conda environment for distributed data processing, training and inference with the following command:
```bash
pip install bigdl-orca # For the official release version
@@ -48,7 +55,9 @@ or for the nightly build version, use:
pip install --pre --upgrade bigdl-orca # For the latest nightly build version
```
## To additionally use RayOnSpark
Note that installing Orca will automatically install `bigdl-dllib`, `bigdl-tf`, `bigdl-math`, `packaging`, `filelock`, `pyzmq` and their dependencies if they haven't been detected in your conda environment.
### To additionally use RayOnSpark
If you wish to run [RayOnSpark](ray.md) or [sklearn-style Estimator APIs in Orca](distributed-training-inference.md) with the "ray" backend, use the extra key `[ray]` during the installation above:
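For the official release version, the command follows the same pattern as the basic installation above:
```bash
pip install bigdl-orca[ray]  # For the official release version
```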
@@ -64,7 +73,7 @@ pip install --pre --upgrade bigdl-orca[ray] # For the latest nightly build vers
Note that with the extra key `[ray]`, `pip` will automatically install the additional dependencies for RayOnSpark,
including `ray[default]==1.9.2`, `aiohttp==3.8.1`, `async-timeout==4.0.1`, `aioredis==1.3.1`, `hiredis==2.0.0`, `prometheus-client==0.11.0`, `psutil`, `setproctitle`.
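As a minimal sketch of what this enables (assuming `bigdl-orca[ray]` is installed, and that `init_orca_context` accepts the `init_ray_on_spark` flag), you can start Ray alongside Spark and then run Ray code or "ray"-backend Estimators:
```python
# A minimal sketch: start RayOnSpark locally (resource numbers are illustrative)
from bigdl.orca import init_orca_context, stop_orca_context

sc = init_orca_context(cluster_mode="local", cores=4, init_ray_on_spark=True)
# ... run Ray tasks or "ray"-backend Estimator code here ...
stop_orca_context()
```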
## To additionally use AutoML
### To additionally use AutoML
If you wish to run AutoML, use the extra key `[automl]` during the installation above:
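The commands follow the same pattern as the installations above:
```bash
pip install bigdl-orca[automl]                  # For the official release version
pip install --pre --upgrade bigdl-orca[automl]  # For the latest nightly build version
```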
@@ -83,3 +92,32 @@ including `ray[tune]==1.9.2`, `scikit-learn`, `tensorboard`, `xgboost` together
- To use [PyTorch AutoEstimator](distributed-tuning.md#pytorch-autoestimator), you need to install PyTorch with `pip install torch==1.8.1`.
- To use [TensorFlow/Keras AutoEstimator](distributed-tuning.md#tensorflow-keras-autoestimator), you need to install TensorFlow with `pip install tensorflow==1.15.0`.
### To install Orca for Spark3
By default, Orca is built on top of Spark 2.4.6 (with pyspark==2.4.6 as a dependency). If you want to install Orca built on top of Spark 3.1.3 (with pyspark==3.1.3 as a dependency), use the following commands instead:
```bash
# For the official release version
pip install bigdl-orca-spark3
pip install bigdl-orca-spark3[ray]
pip install bigdl-orca-spark3[automl]
# For the latest nightly build version
pip install --pre --upgrade bigdl-orca-spark3
pip install --pre --upgrade bigdl-orca-spark3[ray]
pip install --pre --upgrade bigdl-orca-spark3[automl]
```
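A quick sanity check on which variant ended up in your environment:
```bash
pip list | grep bigdl  # Expect either bigdl-orca or bigdl-orca-spark3, never both
```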
__Note__: You should only install Orca built on top of __ONE__ Spark version, not both. If you want to switch Spark versions, please [**uninstall**](#to-uninstall-orca) Orca cleanly before reinstalling.
### To uninstall Orca
```bash
# For default Orca built on top of Spark 2.4.6
pip uninstall bigdl-orca bigdl-dllib bigdl-tf bigdl-math bigdl-core
# For Orca built on top of Spark 3.1.3
pip uninstall bigdl-orca-spark3 bigdl-dllib-spark3 bigdl-tf bigdl-math bigdl-core
```
__Note__: If necessary, you may also need to manually uninstall `pyspark` and other [dependencies](https://github.com/intel-analytics/BigDL/tree/main/python/requirements/orca) introduced by Orca.
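For example (a sketch; only run this if the packages should not remain in your environment):
```bash
pip uninstall pyspark  # Plus any other leftover dependencies you no longer need
```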


@@ -2,13 +2,15 @@
### Overview
Most AI projects start with a Python notebook running on a single laptop; however, one usually needs to go through a mountain of pain to scale it to handle larger data sets in a distributed fashion. The _**Orca**_ library seamlessly scales out your single node Python notebook across large clusters (so as to process distributed Big Data).
The _**Orca**_ library in BigDL can seamlessly scale out your single node Python notebook across large clusters to process large-scale data.
This page demonstrates how to scale the distributed training and inference of a standard TensorFlow model to a large cluster with minimum code changes to your notebook using Orca. We use [Neural Collaborative Filtering](https://arxiv.org/abs/1708.05031) for recommendation as an example.
---
### TensorFlow Bite-sized Example
First of all, follow the steps [here](install.md#to-use-basic-orca-features) to install Orca in your environment.
Before running this example, follow the steps [here](install.md) to prepare the environment and install Orca.
This section uses **TensorFlow 2.x**, and you should also install TensorFlow before running this example:
```bash
@@ -24,11 +26,10 @@ from bigdl.orca import init_orca_context, stop_orca_context, OrcaContext
sc = init_orca_context(cluster_mode="local", cores=4, memory="10g", num_nodes=1)
```
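Scaling out later usually means changing only this line. For instance, a hypothetical YARN setup (the resource numbers are illustrative; see [Orca Context](orca-context.md) for the supported cluster modes):
```python
# Hypothetical: run the same notebook on a YARN cluster instead of a single node
sc = init_orca_context(cluster_mode="yarn-client", cores=4, memory="10g", num_nodes=2)
```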
Next, perform [data-parallel processing in Orca](data-parallel-processing.md) (supporting standard Spark Dataframes, TensorFlow Dataset, PyTorch DataLoader, Pandas, etc.). Here to make things simple, we just generate some random data with Spark DataFrame:
Next, perform [data-parallel processing in Orca](data-parallel-processing.md) (supporting standard Spark DataFrames, TensorFlow Dataset, PyTorch DataLoader, Pandas, etc.). Here, to keep things simple, we just generate some random data with Spark DataFrame:
```python
import random
from pyspark.sql.functions import array
from pyspark.sql.types import StructType, StructField, IntegerType
from bigdl.orca import OrcaContext
@@ -41,44 +42,59 @@ schema = StructType([StructField("user", IntegerType(), False),
StructField("item", IntegerType(), False),
StructField("label", IntegerType(), False)])
df = spark.createDataFrame(rdd, schema)
train, test = df.randomSplit([0.8, 0.2], seed=1)
train_df, test_df = df.randomSplit([0.8, 0.2], seed=1)
```
Finally, use [sklearn-style Estimator APIs in Orca](distributed-training-inference.md) to perform distributed _TensorFlow_, _PyTorch_, _Keras_ and _BigDL_ training and inference:
```python
from tensorflow import keras
from bigdl.orca.learn.tf2.estimator import Estimator
def model_creator(config):
    user_input = keras.layers.Input(shape=(1,), dtype="int32", name="user_input")
    item_input = keras.layers.Input(shape=(1,), dtype="int32", name="item_input")
    from tensorflow import keras
    mlp_embed_user = keras.layers.Embedding(input_dim=num_users, output_dim=config["embed_dim"],
                                            input_length=1)(user_input)
    mlp_embed_item = keras.layers.Embedding(input_dim=num_items, output_dim=config["embed_dim"],
                                            input_length=1)(item_input)
    user_input = keras.layers.Input(shape=(1,), dtype="int32", name="user_input")
    item_input = keras.layers.Input(shape=(1,), dtype="int32", name="item_input")
    user_latent = keras.layers.Flatten()(mlp_embed_user)
    item_latent = keras.layers.Flatten()(mlp_embed_item)
    mlp_embed_user = keras.layers.Embedding(input_dim=config["num_users"], output_dim=config["embed_dim"],
                                            input_length=1)(user_input)
    mlp_embed_item = keras.layers.Embedding(input_dim=config["num_items"], output_dim=config["embed_dim"],
                                            input_length=1)(item_input)
    mlp_latent = keras.layers.concatenate([user_latent, item_latent], axis=1)
    predictions = keras.layers.Dense(2, activation="sigmoid")(mlp_latent)
    model = keras.models.Model(inputs=[user_input, item_input], outputs=predictions)
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model
    user_latent = keras.layers.Flatten()(mlp_embed_user)
    item_latent = keras.layers.Flatten()(mlp_embed_item)
est = Estimator.from_keras(model_creator=model_creator, backend="spark", config={"embed_dim": 8})
est.fit(data=train,
        batch_size=64,
    mlp_latent = keras.layers.concatenate([user_latent, item_latent], axis=1)
    predictions = keras.layers.Dense(1, activation="sigmoid")(mlp_latent)
    model = keras.models.Model(inputs=[user_input, item_input], outputs=predictions)
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model
batch_size = 64
train_steps = int(train_df.count() / batch_size)
val_steps = int(test_df.count() / batch_size)
est = Estimator.from_keras(model_creator=model_creator, backend="spark",
                           config={"embed_dim": 8, "num_users": num_users, "num_items": num_items})
est.fit(data=train_df,
        batch_size=batch_size,
        epochs=4,
        feature_cols=['user', 'item'],
        label_cols=['label'],
        steps_per_epoch=int(train.count()/64),
        validation_data=test,
        validation_steps=int(test.count()/64))
        steps_per_epoch=train_steps,
        validation_data=test_df,
        validation_steps=val_steps)
prediction_df = est.predict(test_df,
                            batch_size=batch_size,
                            feature_cols=['user', 'item'],
                            steps=val_steps)
```
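The `prediction_df` returned by `est.predict` is a Spark DataFrame; assuming a `prediction` column is appended to the input columns, a quick inspection might look like:
```python
prediction_df.select("user", "item", "prediction").show(5)
```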
Stop [Orca Context](orca-context.md) after you finish your program:
```python
stop_orca_context()
```


@@ -86,7 +86,7 @@ export HADOOP_CONF_DIR=/path/to/hadoop/conf
### 2.2 Install Python Libraries
- See [here](../Overview/install.md#install-anaconda) to install conda and prepare the Python environment on the __Client Node__.
- See [here](../Overview/install.md#to-use-basic-orca-features) to install BigDL Orca in the created conda environment.
- See [here](../Overview/install.md#install-bigdl-orca) to install BigDL Orca in the created conda environment.
- You should also install any other Python libraries that you need in your program in the same conda environment, as sketched below.
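A condensed sketch of the whole preparation on the __Client Node__ (the extra package names are illustrative):
```bash
conda create -n py37 python=3.7  # Prepare the Python environment
conda activate py37
pip install bigdl-orca           # Install BigDL Orca
pip install pandas scikit-learn  # Plus whatever else your program needs (illustrative)
```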


@@ -62,12 +62,12 @@ pip uninstall bigdl-dllib bigdl-core bigdl-tf bigdl-math bigdl-orca bigdl-chrono
#### 1.3 BigDL on Spark 3
You can install BigDL built on top of Spark 3.1.2 as follows:
You can install BigDL built on top of Spark 3.1.3 as follows:
```bash
pip install bigdl-spark3 # Install the latest release version
pip install --pre --upgrade bigdl-spark3 # Install the latest nightly build version
```
You can find the list of the nightly build versions built on top of Spark 3.1.2 [here](https://pypi.org/project/bigdl-spark3/#history).
You can find the list of the nightly build versions built on top of Spark 3.1.3 [here](https://pypi.org/project/bigdl-spark3/#history).
You can uninstall all the BigDL packages for Spark 3 as follows:
@@ -123,7 +123,7 @@ For more details, please refer to [Orca Context](../Orca/Overview/orca-context.m
BigDL has been tested on __Python 3.6 and 3.7__ with the following library versions:
```bash
pyspark==2.4.6 or 3.1.2
pyspark==2.4.6 or 3.1.3
ray==1.9.2
tensorflow==1.15.0 or >2.0
pytorch>=1.5.0