diff --git a/docs/readthedocs/source/doc/Orca/Overview/install.md b/docs/readthedocs/source/doc/Orca/Overview/install.md
index 5a016fbd..a3e47da8 100644
--- a/docs/readthedocs/source/doc/Orca/Overview/install.md
+++ b/docs/readthedocs/source/doc/Orca/Overview/install.md
@@ -1,6 +1,10 @@
 # Installation
 
-## Install Java
+---
+## Prepare the environment
+You can follow the commands in this section to install Java and conda before installing BigDL Orca.
+
+### Install Java
 You need to download and install JDK in the environment, and properly set the environment variable `JAVA_HOME`. JDK8 is highly recommended.
 
 ```bash
@@ -16,7 +20,7 @@ export PATH=$PATH:$JAVA_HOME/bin
 java -version # Verify the version of JDK.
 ```
 
-## Install Anaconda
+### Install Anaconda
 We recommend using [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) to prepare the Python environment.
 
 You can follow the steps below to install conda:
@@ -37,7 +41,10 @@ conda create -n py37 python=3.7 # "py37" is conda environment name, you can use
 conda activate py37
 ```
 
-## To use basic Orca features
+---
+## Install BigDL Orca
+
+### To use basic Orca features
 You can install Orca in your created conda environment for distributed data processing, training and inference with the following command:
 ```bash
 pip install bigdl-orca # For the official release version
@@ -48,7 +55,9 @@ or for the nightly build version, use:
 pip install --pre --upgrade bigdl-orca # For the latest nightly build version
 ```
 
-## To additionally use RayOnSpark
+Note that installing Orca will automatically install `bigdl-dllib`, `bigdl-tf`, `bigdl-math`, `packaging`, `filelock`, `pyzmq` and their dependencies if they are not already present in your conda environment.
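+
+As a quick sanity check (an optional, illustrative snippet rather than a required step), you can verify the installation by importing the Orca entry points in the activated conda environment:
+
+```python
+# Should complete without an ImportError if bigdl-orca was installed correctly.
+from bigdl.orca import init_orca_context, stop_orca_context
+
+print("BigDL Orca is installed")
+```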
+
+### To additionally use RayOnSpark
 
 If you wish to run [RayOnSpark](ray.md) or [sklearn-style Estimator APIs in Orca](distributed-training-inference.md) with "ray" backend, use the extra key `[ray]` during the installation above:
 
@@ -64,7 +73,7 @@ pip install --pre --upgrade bigdl-orca[ray] # For the latest nightly build version
 
 Note that with the extra key of [ray], `pip` will automatically install the additional dependencies for RayOnSpark, including `ray[default]==1.9.2`, `aiohttp==3.8.1`, `async-timeout==4.0.1`, `aioredis==1.3.1`, `hiredis==2.0.0`, `prometheus-client==0.11.0`, `psutil`, `setproctitle`.
 
-## To additionally use AutoML
+### To additionally use AutoML
 
 If you wish to run AutoML, use the extra key `[automl]` during the installation above:
 
@@ -83,3 +92,32 @@ including `ray[tune]==1.9.2`, `scikit-learn`, `tensorboard`, `xgboost` together
 
 - To use [Pytorch AutoEstimator](distributed-tuning.md#pytorch-autoestimator), you need to install Pytorch with `pip install torch==1.8.1`.
 - To use [TensorFlow/Keras AutoEstimator](distributed-tuning.md#tensorflow-keras-autoestimator), you need to install TensorFlow with `pip install tensorflow==1.15.0`.
+
+### To install Orca for Spark3
+
+By default, Orca is built on top of Spark 2.4.6 (with pyspark==2.4.6 as a dependency).
+If you want to install Orca built on top of Spark 3.1.3 (with pyspark==3.1.3 as a dependency), you can use the following commands instead:
+
+```bash
+# For the official release version
+pip install bigdl-orca-spark3
+pip install bigdl-orca-spark3[ray]
+pip install bigdl-orca-spark3[automl]
+
+# For the latest nightly build version
+pip install --pre --upgrade bigdl-orca-spark3
+pip install --pre --upgrade bigdl-orca-spark3[ray]
+pip install --pre --upgrade bigdl-orca-spark3[automl]
+```
+
+__Note__: You should install Orca built on top of only __ONE__ Spark version, not both. If you want to switch the Spark version, please [**uninstall**](#to-uninstall-orca) Orca cleanly before reinstalling it.
+
+### To uninstall Orca
+```bash
+# For default Orca built on top of Spark 2.4.6
+pip uninstall bigdl-orca bigdl-dllib bigdl-tf bigdl-math bigdl-core
+
+# For Orca built on top of Spark 3.1.3
+pip uninstall bigdl-orca-spark3 bigdl-dllib-spark3 bigdl-tf bigdl-math bigdl-core
+```
+
+__Note__: If necessary, you may also need to manually uninstall `pyspark` and the other [dependencies](https://github.com/intel-analytics/BigDL/tree/main/python/requirements/orca) introduced by Orca.
diff --git a/docs/readthedocs/source/doc/Orca/Overview/orca.md b/docs/readthedocs/source/doc/Orca/Overview/orca.md
index 5f7de428..9b83efda 100644
--- a/docs/readthedocs/source/doc/Orca/Overview/orca.md
+++ b/docs/readthedocs/source/doc/Orca/Overview/orca.md
@@ -2,13 +2,15 @@
 ### Overview
 
-Most AI projects start with a Python notebook running on a single laptop; however, one usually needs to go through a mountain of pains to scale it to handle larger data set in a distributed fashion. The _**Orca**_ library seamlessly scales out your single node Python notebook across large clusters (so as to process distributed Big Data).
+The _**Orca**_ library in BigDL can seamlessly scale out your single-node Python notebook across large clusters to process large-scale data.
+
+This page demonstrates how to use Orca to scale the training and inference of a standard TensorFlow model to a large cluster with minimal code changes to your notebook. We use [Neural Collaborative Filtering](https://arxiv.org/abs/1708.05031) for recommendation as an example.
 
 ---
 
 ### TensorFlow Bite-sized Example
 
-First of all, follow the steps [here](install.md#to-use-basic-orca-features) to install Orca in your environment.
+Before running this example, follow the steps [here](install.md) to prepare the environment and install Orca.
 
 This section uses **TensorFlow 2.x**, and you should also install TensorFlow before running this example:
 
 ```bash
@@ -24,11 +26,10 @@ from bigdl.orca import init_orca_context, stop_orca_context, OrcaContext
 sc = init_orca_context(cluster_mode="local", cores=4, memory="10g", num_nodes=1)
 ```
 
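+The same code can later run on a Hadoop/YARN cluster: typically only the arguments of `init_orca_context` need to change. For instance, a minimal sketch (the resource numbers below are illustrative, and `HADOOP_CONF_DIR` must point to your Hadoop configuration directory; see the [YARN tutorial](../Tutorial/yarn.md) for details):
+
+```python
+# Illustrative only: adjust cores, memory and num_nodes to your own cluster.
+sc = init_orca_context(cluster_mode="yarn-client", cores=4, memory="10g", num_nodes=2)
+```
+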
-Next, perform [data-parallel processing in Orca](data-parallel-processing.md) (supporting standard Spark Dataframes, TensorFlow Dataset, PyTorch DataLoader, Pandas, etc.). Here to make things simple, we just generate some random data with Spark DataFrame:
+Next, perform [data-parallel processing in Orca](data-parallel-processing.md) (supporting standard Spark DataFrames, TensorFlow Dataset, PyTorch DataLoader, Pandas, etc.). To keep things simple, here we just generate some random data with a Spark DataFrame:
 
 ```python
 import random
-from pyspark.sql.functions import array
 from pyspark.sql.types import StructType, StructField, IntegerType
 from bigdl.orca import OrcaContext
 
@@ -41,44 +42,59 @@ schema = StructType([StructField("user", IntegerType(), False),
                      StructField("item", IntegerType(), False),
                      StructField("label", IntegerType(), False)])
 df = spark.createDataFrame(rdd, schema)
-train, test = df.randomSplit([0.8, 0.2], seed=1)
+train_df, test_df = df.randomSplit([0.8, 0.2], seed=1)
 ```
 
 Finally, use [sklearn-style Estimator APIs in Orca](distributed-training-inference.md) to perform distributed _TensorFlow_, _PyTorch_, _Keras_ and _BigDL_ training and inference:
 
 ```python
-from tensorflow import keras
 from bigdl.orca.learn.tf2.estimator import Estimator
 
 def model_creator(config):
-    user_input = keras.layers.Input(shape=(1,), dtype="int32", name="use_input")
-    item_input = keras.layers.Input(shape=(1,), dtype="int32", name="item_input")
+    from tensorflow import keras
 
-    mlp_embed_user = keras.layers.Embedding(input_dim=num_users, output_dim=config["embed_dim"],
-                                            input_length=1)(user_input)
-    mlp_embed_item = keras.layers.Embedding(input_dim=num_items, output_dim=config["embed_dim"],
-                                            input_length=1)(item_input)
+    user_input = keras.layers.Input(shape=(1,), dtype="int32", name="user_input")
+    item_input = keras.layers.Input(shape=(1,), dtype="int32", name="item_input")
 
-    user_latent = keras.layers.Flatten()(mlp_embed_user)
-    item_latent = keras.layers.Flatten()(mlp_embed_item)
+    mlp_embed_user = keras.layers.Embedding(input_dim=config["num_users"], output_dim=config["embed_dim"],
+                                            input_length=1)(user_input)
+    mlp_embed_item = keras.layers.Embedding(input_dim=config["num_items"], output_dim=config["embed_dim"],
+                                            input_length=1)(item_input)
 
-    mlp_latent = keras.layers.concatenate([user_latent, item_latent], axis=1)
-    predictions = keras.layers.Dense(2, activation="sigmoid")(mlp_latent)
-    model = keras.models.Model(inputs=[user_input, item_input], outputs=predictions)
-    model.compile(optimizer='adam',
-                  loss='sparse_categorical_crossentropy',
-                  metrics=['accuracy'])
-    return model
+    user_latent = keras.layers.Flatten()(mlp_embed_user)
+    item_latent = keras.layers.Flatten()(mlp_embed_item)
 
-est = Estimator.from_keras(model_creator=model_creator, backend="spark", config={"embed_dim": 8})
-est.fit(data=train,
-        batch_size=64,
+    mlp_latent = keras.layers.concatenate([user_latent, item_latent], axis=1)
+    predictions = keras.layers.Dense(1, activation="sigmoid")(mlp_latent)
+    model = keras.models.Model(inputs=[user_input, item_input], outputs=predictions)
+    model.compile(optimizer='adam',
+                  loss='binary_crossentropy',
+                  metrics=['accuracy'])
+    return model
+
+
+batch_size = 64
+train_steps = int(train_df.count() / batch_size)
+val_steps = int(test_df.count() / batch_size)
+
+est = Estimator.from_keras(model_creator=model_creator, backend="spark",
+                           config={"embed_dim": 8, "num_users": num_users, "num_items": num_items})
+est.fit(data=train_df,
+        batch_size=batch_size,
         epochs=4,
         feature_cols=['user', 'item'],
         label_cols=['label'],
-        steps_per_epoch=int(train.count()/64),
-        validation_data=test,
-        validation_steps=int(test.count()/64))
+        steps_per_epoch=train_steps,
+        validation_data=test_df,
+        validation_steps=val_steps)
+prediction_df = est.predict(test_df,
+                            batch_size=batch_size,
+                            feature_cols=['user', 'item'],
+                            steps=val_steps)
+```
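+
+`est.predict` returns a Spark DataFrame. As a quick check (a sketch only; the predicted scores are typically appended as an extra column, and the Estimator API documentation gives the exact column name), you can inspect a few rows before shutting down:
+
+```python
+prediction_df.printSchema()   # the schema should now include the appended prediction column
+prediction_df.show(5)         # look at a few predicted scores
+```
+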
+Stop [Orca Context](orca-context.md) after you finish your program:
+
+```python
+stop_orca_context()
 ```
diff --git a/docs/readthedocs/source/doc/Orca/Tutorial/yarn.md b/docs/readthedocs/source/doc/Orca/Tutorial/yarn.md
index 4d963b9b..a0fcefed 100644
--- a/docs/readthedocs/source/doc/Orca/Tutorial/yarn.md
+++ b/docs/readthedocs/source/doc/Orca/Tutorial/yarn.md
@@ -86,7 +86,7 @@ export HADOOP_CONF_DIR=/path/to/hadoop/conf
 ### 2.2 Install Python Libraries
 
 - See [here](../Overview/install.md#install-anaconda) to install conda and prepare the Python environment on the __Client Node__.
-- See [here](../Overview/install.md#to-use-basic-orca-features) to install BigDL Orca in the created conda environment.
+- See [here](../Overview/install.md#install-bigdl-orca) to install BigDL Orca in the created conda environment.
 - You should install all the other Python libraries that you need in your program in the conda environment as well.
 
diff --git a/docs/readthedocs/source/doc/UserGuide/python.md b/docs/readthedocs/source/doc/UserGuide/python.md
index 336d1541..3848b5c0 100644
--- a/docs/readthedocs/source/doc/UserGuide/python.md
+++ b/docs/readthedocs/source/doc/UserGuide/python.md
@@ -62,12 +62,12 @@ pip uninstall bigdl-dllib bigdl-core bigdl-tf bigdl-math bigdl-orca bigdl-chrono
 
 #### 1.3 BigDL on Spark 3
 
-You can install BigDL built on top of Spark 3.1.2 as follows:
+You can install BigDL built on top of Spark 3.1.3 as follows:
 ```bash
 pip install bigdl-spark3 # Install the latest release version
 pip install --pre --upgrade bigdl-spark3 # Install the latest nightly build version
 ```
-You can find the list of the nightly build versions built on top of Spark 3.1.2 [here](https://pypi.org/project/bigdl-spark3/#history).
+You can find the list of the nightly build versions built on top of Spark 3.1.3 [here](https://pypi.org/project/bigdl-spark3/#history).
 
 You could uninstall all the packages of BigDL on Spark3 as follows:
 
@@ -123,7 +123,7 @@ For more details, please refer to [Orca Context](../Orca/Overview/orca-context.m
 
 BigDL has been tested on __Python 3.6 and 3.7__ with the following library versions:
 
 ```bash
-pyspark==2.4.6 or 3.1.2
+pyspark==2.4.6 or 3.1.3
 ray==1.9.2
 tensorflow==1.15.0 or >2.0
 pytorch>=1.5.0