Update Project Readme (#6086)

parent 6140b81eab
commit 8fdd068ca0

1 changed file: README.md (327 additions, 157 deletions)

<div align="center">
<p align="center"> <img src="docs/readthedocs/image/bigdl_logo.jpg" height="140px"><br></p>

**Building Large-Scale AI Applications for Distributed Big Data**

_**Fast, Distributed, Secure AI for Big Data**_

</div>

---

BigDL seamlessly scales your data analytics & AI applications from laptop to cloud, with the following libraries:

- [Orca](#orca): Distributed Big Data & AI (TF & PyTorch) Pipeline on Spark and Ray

- [Nano](#nano): Transparent Acceleration of TensorFlow & PyTorch Programs

- [DLlib](#dllib): “Equivalent of Spark MLlib” for Deep Learning

- [Chronos](#chronos): Scalable Time Series Analysis using AutoML

- [Friesian](#friesian): End-to-End Recommendation Systems

- [PPML](#ppml) (experimental): Secure Big Data and AI (with SGX Hardware Security)

For more information, you may [read the docs](https://bigdl.readthedocs.io/).

---

## Choosing the right BigDL library

```mermaid
flowchart TD;
    Feature1{{HW Secured Big Data & AI?}};
    Feature1-- No -->Feature2{{Python vs. Scala/Java?}};
    Feature1-- "Yes" -->ReferPPML([<em><strong>PPML</strong></em>]);
    Feature2-- Python -->Feature3{{What type of application?}};
    Feature2-- Scala/Java -->ReferDLlib([<em><strong>DLlib</strong></em>]);
    Feature3-- "Distributed Big Data + AI (TF/PyTorch)" -->ReferOrca([<em><strong>Orca</strong></em>]);
    Feature3-- Accelerating TensorFlow / PyTorch -->ReferNano([<em><strong>Nano</strong></em>]);
    Feature3-- DL for Spark MLlib -->ReferDLlib2([<em><strong>DLlib</strong></em>]);
    Feature3-- High Level App Framework -->Feature4{{Domain?}};
    Feature4-- Time Series -->ReferChronos([<em><strong>Chronos</strong></em>]);
    Feature4-- Recommendation System -->ReferFriesian([<em><strong>Friesian</strong></em>]);

    click ReferNano "https://github.com/intel-analytics/bigdl#nano"
    click ReferOrca "https://github.com/intel-analytics/bigdl#orca"
    click ReferDLlib "https://github.com/intel-analytics/bigdl#dllib"
    click ReferDLlib2 "https://github.com/intel-analytics/bigdl#dllib"
    click ReferChronos "https://github.com/intel-analytics/bigdl#chronos"
    click ReferFriesian "https://github.com/intel-analytics/bigdl#friesian"
    click ReferPPML "https://github.com/intel-analytics/bigdl#ppml"

    classDef ReferStyle1 fill:#5099ce,stroke:#5099ce;
    classDef Feature fill:#FFF,stroke:#08409c,stroke-width:1px;
    class ReferNano,ReferOrca,ReferDLlib,ReferDLlib2,ReferChronos,ReferFriesian,ReferPPML ReferStyle1;
    class Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7 Feature;
```

---

## Installing

You can use BigDL on [Google Colab](https://bigdl.readthedocs.io/en/latest/doc/UserGuide/colab.html) without any installation. BigDL also includes a set of [notebooks](https://bigdl.readthedocs.io/en/latest/doc/UserGuide/notebooks.html) that you can directly open and run in Colab.

- To install BigDL, we recommend using a [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) environment:

  ```bash
  conda create -n my_env
  conda activate my_env
  pip install bigdl
  ```

- To install the latest nightly build, use `pip install --pre --upgrade bigdl`; see the [Python](https://bigdl.readthedocs.io/en/latest/doc/UserGuide/python.html) and [Scala](https://bigdl.readthedocs.io/en/latest/doc/UserGuide/scala.html) user guides for more details.

- To install each individual library, such as Chronos, use `pip install bigdl-chronos`; see the [document website](https://bigdl.readthedocs.io/) for more details.

---

## Getting Started

### Orca

- The _Orca_ library seamlessly scales out your single node **TensorFlow**, **PyTorch** or **OpenVINO** programs across large clusters (so as to process distributed Big Data).

<details><summary>Show Orca example</summary>
<br/>

You can build end-to-end, distributed data processing & AI programs using _Orca_ in 4 simple steps:

```python
# 1. Initialize Orca Context (to run your program on K8s, YARN or a local laptop)
from bigdl.orca import init_orca_context, OrcaContext
sc = init_orca_context(cluster_mode="k8s", cores=4, memory="10g", num_nodes=2)

# 2. Perform distributed data processing (supporting Spark DataFrames,
# TensorFlow Dataset, PyTorch DataLoader, Ray Dataset, Pandas, Pillow, etc.)
spark = OrcaContext.get_spark_session()
df = spark.read.parquet(file_path)
df = df.withColumn('label', df.label - 1)
...

# 3. Build deep learning models using standard framework APIs
# (supporting TensorFlow, PyTorch, Keras, OpenVINO, etc.)
from tensorflow import keras
...
model = keras.models.Model(inputs=[user, item], outputs=predictions)
model.compile(...)

# 4. Use Orca Estimator for distributed training/inference
from bigdl.orca.learn.tf.estimator import Estimator
est = Estimator.from_keras(keras_model=model)
est.fit(data=df,
        feature_cols=['user', 'item'],
        label_cols=['label'],
        ...)
```

</details>

*See Orca [user guide](https://bigdl.readthedocs.io/en/latest/doc/Orca/Overview/orca.html), as well as the [TensorFlow](https://bigdl.readthedocs.io/en/latest/doc/Orca/QuickStart/orca-tf-quickstart.html) and [PyTorch](https://bigdl.readthedocs.io/en/latest/doc/Orca/QuickStart/orca-pytorch-quickstart.html) quickstarts, for more details.*
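
<details><summary>Show Orca PyTorch sketch</summary>
<br/>

Orca exposes a similar sklearn-style `Estimator` for PyTorch. The snippet below is only a rough sketch of that flow: the creator-function style and parameter names follow the PyTorch quickstart linked above, so treat them as assumptions and consult the quickstart for the authoritative API; the toy model and synthetic data are illustrative placeholders.

```python
# A rough, hedged sketch of the Orca PyTorch Estimator flow (see the PyTorch quickstart for the exact API)
import torch
import torch.nn as nn
from bigdl.orca import init_orca_context
from bigdl.orca.learn.pytorch import Estimator

sc = init_orca_context(cluster_mode="local", cores=4)

def model_creator(config):
    # toy two-layer network; replace with your own model
    return nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

def optimizer_creator(model, config):
    return torch.optim.Adam(model.parameters(), lr=config.get("lr", 1e-3))

def train_data_creator(config, batch_size):
    # synthetic data just to make the sketch self-contained
    from torch.utils.data import TensorDataset, DataLoader
    X = torch.randn(256, 10)
    y = torch.randint(0, 2, (256,))
    return DataLoader(TensorDataset(X, y), batch_size=batch_size)

est = Estimator.from_torch(model=model_creator,
                           optimizer=optimizer_creator,
                           loss=nn.CrossEntropyLoss(),
                           backend="ray")
est.fit(data=train_data_creator, epochs=2, batch_size=32)
```

</details>
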

- In addition, you can also run standard **Ray** programs directly on Spark clusters using _**RayOnSpark**_ in Orca.

<details><summary>Show RayOnSpark example</summary>
<br/>

Using _RayOnSpark_ in Orca, you can not only run Ray programs on a Spark cluster, but also write Ray code inline with your Spark code (so as to process in-memory Spark RDDs or DataFrames).

```python
# 1. Initialize Orca Context (to run your program on K8s, YARN or a local laptop)
from bigdl.orca import init_orca_context, OrcaContext
sc = init_orca_context(cluster_mode="yarn", cores=4, memory="10g", num_nodes=2, init_ray_on_spark=True)

# 2. Distributed data processing using Spark
spark = OrcaContext.get_spark_session()
df = spark.read.parquet(file_path).withColumn(...)

# 3. Convert Spark DataFrame to Ray Dataset
from bigdl.orca.data import spark_df_to_ray_dataset
dataset = spark_df_to_ray_dataset(df)

# 4. Use Ray to operate on Ray Datasets
import ray

@ray.remote
def consume(data) -> int:
    num_batches = 0
    for batch in data.iter_batches(batch_size=10):
        num_batches += 1
    return num_batches

print(ray.get(consume.remote(dataset)))
```

</details>

*See RayOnSpark [user guide](https://bigdl.readthedocs.io/en/latest/doc/Ray/Overview/ray.html) and [quickstart](https://bigdl.readthedocs.io/en/latest/doc/Ray/QuickStart/ray-quickstart.html) for more details.*
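
<details><summary>Show Ray actor example</summary>
<br/>

Standard Ray programs (such as Ray actors) run unchanged once the Orca context is initialized with `init_ray_on_spark=True`; for example:

```python
import ray

@ray.remote
class Counter(object):
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

# create 5 Counter actors and increment each of them once
counters = [Counter.remote() for i in range(5)]
print(ray.get([c.increment.remote() for c in counters]))
```

</details>
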

### Nano

You can transparently accelerate your TensorFlow or PyTorch programs on your laptop or server using *Nano*. With minimal code changes, *Nano* automatically applies modern CPU optimizations (e.g., SIMD, multiprocessing, low precision, etc.) to standard TensorFlow and PyTorch code, with up to 10x speedup.

<details><summary>Show Nano inference example</summary>
<br/>

You can automatically optimize a trained PyTorch model for inference or deployment using _Nano_:

```python
model = ResNet18()
model.load_state_dict(...)
train_dataloader = ...
val_dataloader = ...

def accuracy(pred, target):
    ...

from bigdl.nano.pytorch import InferenceOptimizer
optimizer = InferenceOptimizer()
optimizer.optimize(model,
                   training_data=train_dataloader,
                   validation_data=val_dataloader,
                   metric=accuracy)
new_model, config = optimizer.get_best_model()

optimizer.summary()
```

The output of `optimizer.summary()` will be something like:

```
 ------------------------------ ----------------- ------------- ----------------
| method                       | status          | latency(ms) | accuracy       |
 ------------------------------ ----------------- ------------- ----------------
| original                     | successful      | 43.688      | 0.969          |
| fp32_ipex                    | successful      | 33.383      | not recomputed |
| bf16                         | fail to forward | None        | None           |
| bf16_ipex                    | early stopped   | 203.897     | None           |
| int8                         | successful      | 10.74       | 0.969          |
| jit_fp32                     | successful      | 38.732      | not recomputed |
| jit_fp32_ipex                | successful      | 35.205      | not recomputed |
| jit_fp32_ipex_channels_last  | successful      | 19.327      | not recomputed |
| openvino_fp32                | successful      | 10.215      | not recomputed |
| openvino_int8                | successful      | 8.192       | 0.969          |
| onnxruntime_fp32             | successful      | 20.931      | not recomputed |
| onnxruntime_int8_qlinear     | successful      | 8.274       | 0.969          |
| onnxruntime_int8_integer     | fail to convert | None        | None           |
 ------------------------------ ----------------- ------------- ----------------

Optimization cost 64.3s in total.
```
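
The returned `new_model` can then generally be used like a regular PyTorch module for inference. A minimal sketch (the random tensor below is only a stand-in for a real preprocessed batch):

```python
import torch

x = torch.rand(2, 3, 224, 224)  # placeholder batch shaped like typical ResNet18 input
with torch.no_grad():
    y_hat = new_model(x)
print(y_hat.shape)
```
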

</details>

<details><summary>Show Nano Training example</summary>
<br/>

You may easily accelerate PyTorch training (e.g., IPEX, BF16, Multi-Instance Training, etc.) using Nano:

```python
from bigdl.nano.pytorch import TorchNano

# Define your training loop inside `TorchNano.train`
class Trainer(TorchNano):
    def train(self):
        model = ResNet18()
        optimizer = torch.optim.SGD(...)
        loss_func = ...
        train_loader = ...
        val_loader = ...
        num_epochs = ...

        # call `setup` to prepare the model, optimizer(s) and dataloader(s) for accelerated training
        model, optimizer, (train_loader, val_loader) = self.setup(model, optimizer,
                                                                  train_loader, val_loader)

        for epoch in range(num_epochs):
            model.train()
            for data, target in train_loader:
                optimizer.zero_grad()
                output = model(data)
                # replace loss.backward() with self.backward(loss)
                loss = loss_func(output, target)
                self.backward(loss)
                optimizer.step()

# Accelerated training (IPEX, BF16 and Multi-Instance Training)
Trainer(use_ipex=True, precision='bf16', num_processes=2).train()
```
|
||||
|
||||
</details>
|
||||
|
||||
*See Nano [user guide](https://bigdl.readthedocs.io/en/latest/doc/Nano/Overview/nano.html) and [tutotial](https://github.com/intel-analytics/BigDL/tree/main/python/nano/tutorial) for more details.*
|
||||
|
||||
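<details><summary>Show Nano TensorFlow sketch</summary>
<br/>

Nano provides similar transparent acceleration for TensorFlow. The snippet below is only a rough sketch assuming the drop-in `bigdl.nano.tf.keras` classes described in the Nano user guide; the toy data and model are illustrative, and options such as multi-instance training are covered in the user guide.

```python
# A rough sketch, assuming the drop-in `bigdl.nano.tf.keras` classes from the Nano user guide.
import numpy as np
import tensorflow as tf
from bigdl.nano.tf.keras import Sequential  # drop-in replacement for tf.keras.Sequential

# toy data just for illustration
x_train = np.random.rand(256, 10).astype("float32")
y_train = np.random.randint(0, 2, size=(256,))

model = Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=2, batch_size=32)
```

</details>
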
### DLlib

With _DLlib_, you can write distributed deep learning applications as standard (**Scala** or **Python**) Spark programs, using the same **Spark DataFrames** and **ML Pipeline** APIs.

<details><summary>Show DLlib Scala example</summary>
<br/>

You can build distributed deep learning applications for Spark using *DLlib* Scala APIs in 3 simple steps:

```scala
// 1. Call `initNNContext` at the beginning of the code:
import com.intel.analytics.bigdl.dllib.NNContext
val sc = NNContext.initNNContext()
```
```scala
// 2. Define the deep learning model using Keras-style API in DLlib:
import com.intel.analytics.bigdl.dllib.keras.layers._
import com.intel.analytics.bigdl.dllib.keras.Model
val input = Input[Float](inputShape = Shape(10))
val dense = Dense[Float](12).inputs(input)
val output = Activation[Float]("softmax").inputs(dense)
val model = Model(input, output)
```
```scala
// 3. Use `NNEstimator` to train/predict/evaluate the model using Spark DataFrame and ML pipeline APIs
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.Pipeline
import com.intel.analytics.bigdl.dllib.nnframes.NNEstimator
import com.intel.analytics.bigdl.dllib.nn.CrossEntropyCriterion
import com.intel.analytics.bigdl.dllib.optim.Adam

val spark = SparkSession.builder().getOrCreate()
val trainDF = spark.read.parquet("train_data")
val validationDF = spark.read.parquet("val_data")
val scaler = new MinMaxScaler().setInputCol("in").setOutputCol("value")
val estimator = NNEstimator(model, CrossEntropyCriterion())
  .setBatchSize(128).setOptimMethod(new Adam()).setMaxEpoch(5)
val pipeline = new Pipeline().setStages(Array(scaler, estimator))

val pipelineModel = pipeline.fit(trainDF)
val predictions = pipelineModel.transform(validationDF)
```

</details>

<details><summary>Show DLlib Python example</summary>
<br/>

You can build distributed deep learning applications for Spark using *DLlib* Python APIs in 3 simple steps:

```python
# 1. Call `init_nncontext` at the beginning of the code:
from bigdl.dllib.nncontext import init_nncontext
sc = init_nncontext()

# 2. Define the deep learning model using Keras-style API in DLlib:
from bigdl.dllib.keras.layers import Input, Dense, Activation
from bigdl.dllib.keras.models import Model
input = Input(shape=(10,))
dense = Dense(12)(input)
output = Activation("softmax")(dense)
model = Model(input, output)

# 3. Use `NNEstimator` to train/predict/evaluate the model using Spark DataFrame and ML pipeline APIs
from pyspark.sql import SparkSession
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml import Pipeline
from bigdl.dllib.nnframes import NNEstimator
from bigdl.dllib.nn.criterion import CrossEntropyCriterion
from bigdl.dllib.optim.optimizer import Adam

spark = SparkSession.builder.getOrCreate()
train_df = spark.read.parquet("train_data")
validation_df = spark.read.parquet("val_data")
scaler = MinMaxScaler().setInputCol("in").setOutputCol("value")
estimator = NNEstimator(model, CrossEntropyCriterion())\
    .setBatchSize(128)\
    .setOptimMethod(Adam())\
    .setMaxEpoch(5)
pipeline = Pipeline(stages=[scaler, estimator])

pipelineModel = pipeline.fit(train_df)
predictions = pipelineModel.transform(validation_df)
```

</details>

*See DLlib [NNFrames](https://bigdl.readthedocs.io/en/latest/doc/DLlib/Overview/nnframes.html) and [Keras API](https://bigdl.readthedocs.io/en/latest/doc/DLlib/Overview/keras-api.html) user guides for more details.*

### Chronos

The *Chronos* library makes it easy to build fast, accurate and scalable **time series analysis** applications (with AutoML).

<details><summary>Show Chronos example</summary>
<br/>

You can train a time series forecaster using _Chronos_ in 3 simple steps:

```python
from bigdl.chronos.forecaster import TCNForecaster
from bigdl.chronos.data.repo_dataset import get_public_dataset

# 1. Process time series data using `TSDataset`
tsdata_train, tsdata_val, tsdata_test = get_public_dataset(name='nyc_taxi')
for tsdata in [tsdata_train, tsdata_val, tsdata_test]:
    tsdata.roll(lookback=100, horizon=1)

# 2. Create a `TCNForecaster` (automatically configured based on tsdata_train)
forecaster = TCNForecaster.from_tsdataset(tsdata_train)

# 3. Train the forecaster for prediction
forecaster.fit(tsdata_train)

pred = forecaster.predict(tsdata_test)
```

To apply AutoML, use an `AutoTSEstimator` instead of a normal forecaster: calling `fit` searches for the best model and hyper-parameters and returns a _TSPipeline_, which can then be used for prediction or evaluation. (AutoML training also requires initializing the Orca context with `init_ray_on_spark=True`.)

```python
from bigdl.chronos.autots import AutoTSEstimator

autotsest = AutoTSEstimator(model="tcn", future_seq_len=10)

# train a pipeline with AutoML support
tsppl = autotsest.fit(data=tsdata_train, validation_data=tsdata_val)

# predict with the returned TSPipeline
pred = tsppl.predict(tsdata_test)
```

</details>

*See Chronos [user guide](https://bigdl.readthedocs.io/en/latest/doc/Chronos/Overview/chronos.html) and [quick start](https://bigdl.readthedocs.io/en/latest/doc/Chronos/QuickStart/chronos-autotsest-quickstart.html) for more details.*

### Friesian

The *Friesian* library makes it easy to build end-to-end, large-scale **recommendation systems** (including *offline* feature transformation and training, *near-line* feature and model updates, and an *online* serving pipeline).

*See Friesian [readme](https://github.com/intel-analytics/BigDL/blob/main/python/friesian/README.md) for more details.*

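<details><summary>Show Friesian sketch</summary>
<br/>

As a flavor of the offline feature-engineering part, here is a highly simplified sketch built around Friesian's `FeatureTable`. The specific method names used below (`gen_string_idx`, `encode_string`, `fillna`, `random_split`) and the input file are assumptions for illustration only; verify them against the Friesian readme and examples.

```python
# A rough, hedged sketch; method names are assumptions (see the Friesian readme for real examples).
from bigdl.orca import init_orca_context
from bigdl.friesian.feature import FeatureTable

sc = init_orca_context(cluster_mode="local", cores=4)

# offline feature engineering on a (hypothetical) user-item interaction table
tbl = FeatureTable.read_parquet("interactions.parquet")
user_idx, item_idx = tbl.gen_string_idx(["user", "item"], freq_limit=10)
tbl = tbl.encode_string(["user", "item"], [user_idx, item_idx]).fillna(0, ["label"])
train_tbl, test_tbl = tbl.random_split([0.8, 0.2])
```

</details>
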
### PPML

*BigDL PPML* provides a **hardware (Intel SGX) protected** *Trusted Cluster Environment* for running unmodified, distributed Big Data & AI applications (such as Apache Spark, Apache Flink, TensorFlow, PyTorch, etc.) in a secure fashion on private or public cloud.

*See PPML [tutorial](https://github.com/intel-analytics/BigDL/blob/main/ppml/README.md) and [user guide](https://bigdl.readthedocs.io/en/latest/doc/PPML/Overview/ppml.html) for more details.*

## Getting Support

- [Document Website](https://bigdl.readthedocs.io/)
- [Mail List](mailto:bigdl-user-group+subscribe@googlegroups.com)
- [User Group](https://groups.google.com/forum/#!forum/bigdl-user-group)
- [Powered-By](https://bigdl.readthedocs.io/en/latest/doc/Application/powered-by.html)
- [Presentations](https://bigdl.readthedocs.io/en/latest/doc/Application/presentations.html)
- [Github Issues](https://github.com/intel-analytics/BigDL/issues)

---

## Citation

If you've found BigDL useful for your project, you may cite the [paper](https://arxiv.org/abs/2204.01715) as follows:

```
@inproceedings{9880257,
  author={Dai, Jason Jinquan and Ding, Ding and Shi, Dongjie and Huang, Shengsheng and Wang, Jiao and Qiu, Xin and Huang, Kai and Song, Guoqiong and Wang, Yang and Gong, Qiyuan and Song, Jiaming and Yu, Shan and Zheng, Le and Chen, Yina and Deng, Junwei and Song, Ge},
  booktitle={2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  title={BigDL 2.0: Seamless Scaling of AI Pipelines from Laptops to Distributed Cluster},
  year={2022},
  pages={21407-21414},
  doi={10.1109/CVPR52688.2022.02076}
}
```