Update Orca install doc (#6518)

* update install

* update

* update

* update link

* update 5min

* update 5mins

* update link
Kai Huang 2022-11-10 10:28:46 +08:00 committed by GitHub
parent 17fb75f8d7
commit 7da102243e
4 changed files with 90 additions and 36 deletions


@@ -1,6 +1,10 @@
# Installation
## Install Java
---
## Prepare the environment
You can follow the commands in this section to install Java and conda before installing BigDL Orca.
### Install Java
You need to download and install a JDK in the environment and properly set the environment variable `JAVA_HOME`. JDK 8 is highly recommended.
```bash
@@ -16,7 +20,7 @@ export PATH=$PATH:$JAVA_HOME/bin
java -version # Verify the version of JDK.
```
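For reference, a complete setup on Ubuntu with OpenJDK 8 might look like the following sketch (the package name and `JAVA_HOME` path are illustrative and vary across systems):
```bash
# Illustrative only: install OpenJDK 8 on Debian/Ubuntu and point JAVA_HOME at it
sudo apt-get install openjdk-8-jdk
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
java -version  # Should report version 1.8.x
```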
## Install Anaconda
### Install Anaconda
We recommend using [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) to prepare the Python environment.
You can follow the steps below to install conda:
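As a sketch, a typical Miniconda installation on Linux x86_64 looks like this (the installer URL may change over time; see the conda documentation for the current one):
```bash
# Illustrative Miniconda installation for Linux x86_64
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc  # Reload the shell so that conda is on PATH
```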
@@ -37,7 +41,10 @@ conda create -n py37 python=3.7 # "py37" is conda environment name, you can use
conda activate py37
```
## To use basic Orca features
---
## Install BigDL Orca
### To use basic Orca features
You can install Orca in your created conda environment for distributed data processing, training and inference with the following command:
```bash
pip install bigdl-orca # For the official release version
@@ -48,7 +55,9 @@ or for the nightly build version, use:
pip install --pre --upgrade bigdl-orca # For the latest nightly build version
```
## To additionally use RayOnSpark
Note that installing Orca will automatically install `bigdl-dllib`, `bigdl-tf`, `bigdl-math`, `packaging`, `filelock`, `pyzmq` and their dependencies if they haven't been detected in your conda environment.
### To additionally use RayOnSpark
If you wish to run [RayOnSpark](ray.md) or [sklearn-style Estimator APIs in Orca](distributed-training-inference.md) with the "ray" backend, use the extra key `[ray]` during the installation above:
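For the official release version, the command follows the same pattern as the basic installation above:
```bash
pip install bigdl-orca[ray]  # For the official release version
```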
@@ -64,7 +73,7 @@ pip install --pre --upgrade bigdl-orca[ray] # For the latest nightly build vers
Note that with the extra key `[ray]`, `pip` will automatically install the additional dependencies for RayOnSpark,
including `ray[default]==1.9.2`, `aiohttp==3.8.1`, `async-timeout==4.0.1`, `aioredis==1.3.1`, `hiredis==2.0.0`, `prometheus-client==0.11.0`, `psutil`, `setproctitle`.
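As a minimal sketch of what this enables (assuming `bigdl-orca[ray]` is installed, and that `init_orca_context` accepts the `init_ray_on_spark` flag), you can start Ray alongside Spark and then run Ray code or "ray"-backend Estimators:
```python
# A minimal sketch: start RayOnSpark locally (resource numbers are illustrative)
from bigdl.orca import init_orca_context, stop_orca_context

sc = init_orca_context(cluster_mode="local", cores=4, init_ray_on_spark=True)
# ... run Ray tasks or "ray"-backend Estimator code here ...
stop_orca_context()
```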
## To additionally use AutoML
### To additionally use AutoML
If you wish to run AutoML, use the extra key `[automl]` during the installation above:
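The commands follow the same pattern as the installations above:
```bash
pip install bigdl-orca[automl]                  # For the official release version
pip install --pre --upgrade bigdl-orca[automl]  # For the latest nightly build version
```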
@@ -83,3 +92,32 @@ including `ray[tune]==1.9.2`, `scikit-learn`, `tensorboard`, `xgboost` together
- To use [PyTorch AutoEstimator](distributed-tuning.md#pytorch-autoestimator), you need to install PyTorch with `pip install torch==1.8.1`.
- To use [TensorFlow/Keras AutoEstimator](distributed-tuning.md#tensorflow-keras-autoestimator), you need to install TensorFlow with `pip install tensorflow==1.15.0`.
### To install Orca for Spark3
By default, Orca is built on top of Spark 2.4.6 (with pyspark==2.4.6 as a dependency). If you want to install Orca built on top of Spark 3.1.3 (with pyspark==3.1.3 as a dependency), use the following commands instead:
```bash
# For the official release version
pip install bigdl-orca-spark3
pip install bigdl-orca-spark3[ray]
pip install bigdl-orca-spark3[automl]
# For the latest nightly build version
pip install --pre --upgrade bigdl-orca-spark3
pip install --pre --upgrade bigdl-orca-spark3[ray]
pip install --pre --upgrade bigdl-orca-spark3[automl]
```
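A quick sanity check on which variant ended up in your environment:
```bash
pip list | grep bigdl  # Expect either bigdl-orca or bigdl-orca-spark3, never both
```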
__Note__: You should only install Orca built on top of __ONE__ Spark version, not both. If you want to switch Spark versions, please [**uninstall**](#to-uninstall-orca) Orca cleanly before reinstalling.
### To uninstall Orca
```bash
# For default Orca built on top of Spark 2.4.6
pip uninstall bigdl-orca bigdl-dllib bigdl-tf bigdl-math bigdl-core
# For Orca built on top of Spark 3.1.3
pip uninstall bigdl-orca-spark3 bigdl-dllib-spark3 bigdl-tf bigdl-math bigdl-core
```
__Note__: If necessary, you may also need to manually uninstall `pyspark` and other [dependencies](https://github.com/intel-analytics/BigDL/tree/main/python/requirements/orca) introduced by Orca.
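For example (a sketch; only run this if the packages should not remain in your environment):
```bash
pip uninstall pyspark  # Plus any other leftover dependencies you no longer need
```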


@@ -2,13 +2,15 @@
### Overview
Most AI projects start with a Python notebook running on a single laptop; however, one usually needs to go through a mountain of pain to scale it to handle larger data sets in a distributed fashion. The _**Orca**_ library seamlessly scales out your single node Python notebook across large clusters (so as to process distributed Big Data).
The _**Orca**_ library in BigDL can seamlessly scale out your single node Python notebook across large clusters to process large-scale data.
This page demonstrates how to scale the distributed training and inference of a standard TensorFlow model to a large cluster with minimum code changes to your notebook using Orca. We use [Neural Collaborative Filtering](https://arxiv.org/abs/1708.05031) for recommendation as an example.
---
### TensorFlow Bite-sized Example
First of all, follow the steps [here](install.md#to-use-basic-orca-features) to install Orca in your environment.
Before running this example, follow the steps [here](install.md) to prepare the environment and install Orca.
This section uses **TensorFlow 2.x**, and you should also install TensorFlow before running this example:
```bash
@@ -24,11 +26,10 @@ from bigdl.orca import init_orca_context, stop_orca_context, OrcaContext
sc = init_orca_context(cluster_mode="local", cores=4, memory="10g", num_nodes=1)
```
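Scaling out later usually means changing only this line. For instance, a hypothetical YARN setup (the resource numbers are illustrative; see [Orca Context](orca-context.md) for the supported cluster modes):
```python
# Hypothetical: run the same notebook on a YARN cluster instead of a single node
sc = init_orca_context(cluster_mode="yarn-client", cores=4, memory="10g", num_nodes=2)
```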
Next, perform [data-parallel processing in Orca](data-parallel-processing.md) (supporting standard Spark Dataframes, TensorFlow Dataset, PyTorch DataLoader, Pandas, etc.). Here to make things simple, we just generate some random data with Spark DataFrame:
Next, perform [data-parallel processing in Orca](data-parallel-processing.md) (supporting standard Spark DataFrames, TensorFlow Dataset, PyTorch DataLoader, Pandas, etc.). Here, to keep things simple, we just generate some random data with Spark DataFrame:
```python
import random
from pyspark.sql.functions import array
from pyspark.sql.types import StructType, StructField, IntegerType
from bigdl.orca import OrcaContext
@@ -41,44 +42,59 @@ schema = StructType([StructField("user", IntegerType(), False),
StructField("item", IntegerType(), False),
StructField("label", IntegerType(), False)])
df = spark.createDataFrame(rdd, schema)
train, test = df.randomSplit([0.8, 0.2], seed=1)
train_df, test_df = df.randomSplit([0.8, 0.2], seed=1)
```
Finally, use [sklearn-style Estimator APIs in Orca](distributed-training-inference.md) to perform distributed _TensorFlow_, _PyTorch_, _Keras_ and _BigDL_ training and inference:
```python
from tensorflow import keras
from bigdl.orca.learn.tf2.estimator import Estimator
def model_creator(config):
    user_input = keras.layers.Input(shape=(1,), dtype="int32", name="user_input")
    item_input = keras.layers.Input(shape=(1,), dtype="int32", name="item_input")
    from tensorflow import keras
    mlp_embed_user = keras.layers.Embedding(input_dim=num_users, output_dim=config["embed_dim"],
                                            input_length=1)(user_input)
    mlp_embed_item = keras.layers.Embedding(input_dim=num_items, output_dim=config["embed_dim"],
                                            input_length=1)(item_input)
    user_input = keras.layers.Input(shape=(1,), dtype="int32", name="user_input")
    item_input = keras.layers.Input(shape=(1,), dtype="int32", name="item_input")
    user_latent = keras.layers.Flatten()(mlp_embed_user)
    item_latent = keras.layers.Flatten()(mlp_embed_item)
    mlp_embed_user = keras.layers.Embedding(input_dim=config["num_users"], output_dim=config["embed_dim"],
                                            input_length=1)(user_input)
    mlp_embed_item = keras.layers.Embedding(input_dim=config["num_items"], output_dim=config["embed_dim"],
                                            input_length=1)(item_input)
    mlp_latent = keras.layers.concatenate([user_latent, item_latent], axis=1)
    predictions = keras.layers.Dense(2, activation="sigmoid")(mlp_latent)
    model = keras.models.Model(inputs=[user_input, item_input], outputs=predictions)
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model
    user_latent = keras.layers.Flatten()(mlp_embed_user)
    item_latent = keras.layers.Flatten()(mlp_embed_item)
est = Estimator.from_keras(model_creator=model_creator, backend="spark", config={"embed_dim": 8})
est.fit(data=train,
        batch_size=64,
    mlp_latent = keras.layers.concatenate([user_latent, item_latent], axis=1)
    predictions = keras.layers.Dense(1, activation="sigmoid")(mlp_latent)
    model = keras.models.Model(inputs=[user_input, item_input], outputs=predictions)
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model
batch_size = 64
train_steps = int(train_df.count() / batch_size)
val_steps = int(test_df.count() / batch_size)
est = Estimator.from_keras(model_creator=model_creator, backend="spark",
                           config={"embed_dim": 8, "num_users": num_users, "num_items": num_items})
est.fit(data=train_df,
        batch_size=batch_size,
        epochs=4,
        feature_cols=['user', 'item'],
        label_cols=['label'],
        steps_per_epoch=int(train.count()/64),
        validation_data=test,
        validation_steps=int(test.count()/64))
        steps_per_epoch=train_steps,
        validation_data=test_df,
        validation_steps=val_steps)
prediction_df = est.predict(test_df,
                            batch_size=batch_size,
                            feature_cols=['user', 'item'],
                            steps=val_steps)
```
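The `prediction_df` returned by `est.predict` is a Spark DataFrame; assuming a `prediction` column is appended to the input columns, a quick inspection might look like:
```python
prediction_df.select("user", "item", "prediction").show(5)
```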
Stop [Orca Context](orca-context.md) after you finish your program:
```python
stop_orca_context()
```


@@ -86,7 +86,7 @@ export HADOOP_CONF_DIR=/path/to/hadoop/conf
### 2.2 Install Python Libraries
- See [here](../Overview/install.md#install-anaconda) to install conda and prepare the Python environment on the __Client Node__.
- See [here](../Overview/install.md#to-use-basic-orca-features) to install BigDL Orca in the created conda environment.
- See [here](../Overview/install.md#install-bigdl-orca) to install BigDL Orca in the created conda environment.
- You should also install any other Python libraries that you need in your program in the same conda environment, as sketched below.
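A condensed sketch of the whole preparation on the __Client Node__ (the extra package names are illustrative):
```bash
conda create -n py37 python=3.7  # Prepare the Python environment
conda activate py37
pip install bigdl-orca           # Install BigDL Orca
pip install pandas scikit-learn  # Plus whatever else your program needs (illustrative)
```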


@@ -62,12 +62,12 @@ pip uninstall bigdl-dllib bigdl-core bigdl-tf bigdl-math bigdl-orca bigdl-chrono
#### 1.3 BigDL on Spark 3
You can install BigDL built on top of Spark 3.1.2 as follows:
You can install BigDL built on top of Spark 3.1.3 as follows:
```bash
pip install bigdl-spark3 # Install the latest release version
pip install --pre --upgrade bigdl-spark3 # Install the latest nightly build version
```
You can find the list of the nightly build versions built on top of Spark 3.1.2 [here](https://pypi.org/project/bigdl-spark3/#history).
You can find the list of the nightly build versions built on top of Spark 3.1.3 [here](https://pypi.org/project/bigdl-spark3/#history).
You can uninstall all the BigDL packages for Spark 3 as follows:
@@ -123,7 +123,7 @@ For more details, please refer to [Orca Context](../Orca/Overview/orca-context.m
BigDL has been tested on __Python 3.6 and 3.7__ with the following library versions:
```bash
pyspark==2.4.6 or 3.1.2
pyspark==2.4.6 or 3.1.3
ray==1.9.2
tensorflow==1.15.0 or >2.0
pytorch>=1.5.0