restructure tensorflow inference document (#7417)
Parent: c31136df0b
Commit: f903b78711
1 changed file with 84 additions and 81 deletions

BigDL-Nano provides several APIs that help you easily apply optimizations to your inference pipelines to improve latency and throughput. Currently, performance acceleration is achieved by integrating extra runtimes as inference backend engines or by applying quantization methods to full-precision trained models to reduce computation during inference. `InferenceOptimizer` (`bigdl.nano.tf.keras.InferenceOptimizer`) provides the APIs for all the optimizations you need for inference.

## Automatically Choose the Best Optimization

We recommend using `InferenceOptimizer.optimize` to compare the different optimization methods and choose the best one.

Taking MobileNetV2 as an example, you can try all supported optimizations as below:

```python
import tensorflow as tf
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2
import numpy as np
from bigdl.nano.tf.keras import InferenceOptimizer

# step 1: create and compile your model
model = MobileNetV2(weights=None, input_shape=[40, 40, 3], classes=10)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

# step 2: prepare your data and dataset (batched, so it can be fed to `fit`/`predict`)
train_examples = np.random.random((100, 40, 40, 3))
train_labels = np.random.randint(0, 10, size=(100,))
train_dataset = tf.data.Dataset.from_tensor_slices((train_examples, train_labels)).batch(32)

# (Optional) step 3: something else, like training ...
model.fit(train_dataset)

# step 4: try all supported optimizations
opt = InferenceOptimizer()
opt.optimize(model, x=train_dataset)

# get the best optimization
best_model, _option = opt.get_best_model()

# use the best model as you would use the original one
y_hat = best_model(train_examples)
best_model.predict(train_dataset)
```

`InferenceOptimizer.optimize` will try all supported optimizations and choose the best one. For example, the output may look like this:

```
==========================Optimization Results==========================
 -------------------------------- ---------------------- --------------
|             method             |        status        |  latency(ms) |
 -------------------------------- ---------------------- --------------
|            original            |      successful      |    82.109    |
|              int8              |      successful      |     4.398    |
|          openvino_fp32         |      successful      |     3.847    |
|          openvino_int8         |      successful      |     2.177    |
|        onnxruntime_fp32        |      successful      |     3.28     |
|    onnxruntime_int8_qlinear    |      successful      |     3.071    |
|    onnxruntime_int8_integer    |    fail to convert   |     None     |
 -------------------------------- ---------------------- --------------
```

```eval_rst
.. tip::
    ``optimize`` also uses the parameters ``x`` and ``y`` to receive calibration data, just like ``InferenceOptimizer.quantize``.

    There are some other useful parameters:

    - ``includes``: A list of str. If set, ``optimize`` will only try the optimizations in this list.
    - ``excludes``: A list of str. If set, ``optimize`` will try all optimizations (or the optimizations specified by ``includes``) except for those in this list.

    See its `API document <../../PythonAPI/Nano/tensorflow.html#bigdl.nano.tf.keras.InferenceOptimizer.optimize>`_ for more advanced usage.
```

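For instance, a minimal sketch of narrowing the search (reusing `opt`, `model` and `train_dataset` from the example above; the method names passed here are assumed to match the `method` column of the results table):

```python
# only try the OpenVINO-based optimizations
opt.optimize(model, x=train_dataset, includes=["openvino_fp32", "openvino_int8"])

# or try every supported optimization except the ONNXRuntime integer quantization
opt.optimize(model, x=train_dataset, excludes=["onnxruntime_int8_integer"])

best_model, _option = opt.get_best_model()
```
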
Before you go ahead with these APIs, you have to make sure BigDL-Nano is correctly installed for TensorFlow. If not, please follow [this](./install.md) to set up your environment.

We recommend installing all dependencies with ``pip install bigdl-nano[tensorflow,inference]``, because you may run into version issues if you install the dependencies manually.

## Manually Choose Optimizations

### Runtime Acceleration

For runtime acceleration, BigDL-Nano has enabled two kinds of runtime (OpenVINO and ONNXRuntime) for users in `InferenceOptimizer.trace()`.

```eval_rst
.. warning::
    ``model.trace`` will be deprecated in a future release.

    Please use ``bigdl.nano.tf.keras.InferenceOptimizer.trace`` instead.
```

All available runtime accelerations are integrated in `InferenceOptimizer.trace(accelerator='onnxruntime'/'openvino')` with different accelerator values.

Taking the example in [Automatically Choose the Best Optimization](#automatically-choose-the-best-optimization), you can apply runtime acceleration as follows:

```python
# enable runtime acceleration with `OpenVINO`
traced_model = InferenceOptimizer.trace(model, accelerator="openvino")

# or enable runtime acceleration with `ONNXRuntime`
traced_model = InferenceOptimizer.trace(model, accelerator="onnxruntime")

# run simple prediction with the accelerated model
y_hat = traced_model(train_examples)
traced_model.predict(train_dataset)
```

### Quantization

Quantization is widely used to compress models to a lower precision, which not only reduces the model size but also accelerates inference. BigDL-Nano provides the `InferenceOptimizer.quantize()` API for users to quickly obtain a quantized model with accuracy control by specifying a few arguments.

BigDL-Nano currently provides only post-training quantization in `InferenceOptimizer.quantize()` for users to infer with models of 8-bit precision. Quantization-Aware Training is not available for now. Model conversion to 16-bit precision such as BF16 and FP16 is coming soon.

```eval_rst
.. warning::
    ``model.quantize`` will be deprecated in a future release.

    Please use ``bigdl.nano.tf.keras.InferenceOptimizer.quantize`` instead.
```

To use INC (Intel Neural Compressor) as your quantization engine, you can choose `accelerator=None/'onnxruntime'`. Otherwise, `accelerator='openvino'` means using the OpenVINO POT (Post-training Optimization Tool) to do quantization.

#### Quantization without Accuracy Control

Taking the example in [Runtime Acceleration](#runtime-acceleration), you can use quantization as follows:

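A minimal sketch of the call (reusing `model` and `train_dataset` from the earlier example; the variants shown in the comments are based on the accelerator values described above):

```python
from bigdl.nano.tf.keras import InferenceOptimizer

# post-training quantization with the defaults: INT8 precision, INC as the engine
q_model = InferenceOptimizer.quantize(model, x=train_dataset)

# or choose the engine explicitly:
# q_model = InferenceOptimizer.quantize(model, x=train_dataset, accelerator="onnxruntime")  # INC with ONNXRuntime
# q_model = InferenceOptimizer.quantize(model, x=train_dataset, accelerator="openvino")     # OpenVINO POT
```
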
This is the most basic usage: the model is quantized with the defaults, INT8 precision, and without accuracy control.

Here, the data arguments work as follows:

- ``x``: Input data. It could be either Numpy array(s), TensorFlow tensor(s), or a ``tf.data.Dataset`` (in which case it also provides the targets).
- ``y``: Target data. Like the input data ``x``, it could be either Numpy array(s) or TensorFlow tensor(s). Its length should be consistent with ``x``. If ``x`` is a ``Dataset``, ``y`` will be ignored (since the targets will be obtained from ``x``).

#### Quantization with Accuracy Control

By default, `InferenceOptimizer.quantize()` doesn't search the tuning space and returns the fully-quantized model without considering the accuracy drop. If you need to search the quantization tuning space for a model with accuracy control, you may need to specify a few more parameters, as sketched below.
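
A rough sketch (reusing `model` and `train_dataset` from the earlier example; the `metric` and `accuracy_criterion` arguments below are assumptions about the parameter names and accepted values, so confirm them against the [API document](../../PythonAPI/Nano/tensorflow.html) before use):

```python
import tensorflow as tf
from bigdl.nano.tf.keras import InferenceOptimizer

# a sketch of quantization with accuracy control (argument names are assumptions):
# search the tuning space until the relative accuracy drop stays within 1%
q_model = InferenceOptimizer.quantize(
    model,
    x=train_dataset,
    metric=tf.keras.metrics.SparseCategoricalAccuracy(),  # metric used to judge accuracy
    accuracy_criterion={"relative": 0.01,                  # allowed relative accuracy drop
                        "higher_is_better": True},
)
```
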
```python
# run simple prediction with the quantized model
y_hat = q_model(train_examples)
q_model.predict(train_dataset)
```