[Nano] Revise PyTorch Inference key feature doc regarding context manager (#7449)
* Revise PyTorch Inference key feature doc regarding context manager
* Small fixes and revise the installation notes
* Small fix
* Update based on comments
* Update based on comments
This commit is contained in:
parent 670c54c6d7
commit 9c129ec158

1 changed file with 53 additions and 22 deletions
@@ -26,21 +26,25 @@ Before you go ahead with these APIs, you have to make sure BigDL-Nano is correct
.. note::
    You can install all required dependencies by

    ::
    .. code-block:: bash

        pip install --pre --upgrade bigdl-nano[pytorch,inference]

    This will install all dependencies required by BigDL-Nano PyTorch inference.
    This will install all dependencies required by BigDL-Nano PyTorch inference. It's recommended, since it installs everything together and avoids version conflict issues.

    Or if you just want to use one of supported optimizations:
    Or if you just want to use one of the supported optimizations, you could install BigDL-Nano for PyTorch with manually installed dependencies:

    .. code-block:: bash

        pip install --pre --upgrade bigdl-nano[pytorch]

    with

    - `INC (Intel Neural Compressor) <https://github.com/intel/neural-compressor>`_: ``pip install neural-compressor``

    - `OpenVINO <https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html>`_: ``pip install openvino-dev``

    - `ONNXRuntime <https://onnxruntime.ai/>`_: ``pip install onnx onnxruntime onnxruntime-extensions onnxsim neural-compressor``

    We recommend installing all dependencies by ``pip install --pre --upgrade bigdl-nano[pytorch,inference]``, because you may run into version issues if you install dependencies manually.
```
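
As a quick post-install sanity check, here is a minimal sketch; it only verifies that the inference entry point used throughout this guide can be imported:

```python
# Minimal post-install sanity check: if this import succeeds, the
# InferenceOptimizer API used in the examples below is available.
from bigdl.nano.pytorch import InferenceOptimizer

print("BigDL-Nano PyTorch inference is ready:", InferenceOptimizer is not None)
```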

## Graph Mode Acceleration

@@ -71,9 +75,20 @@ You can simply append the following part to enable your [ONNXRuntime](https://on
ort_model = InferenceOptimizer.trace(model, accelerator='onnxruntime', input_sample=x)

# step 5: use returned model for transparent acceleration
# The usage is almost the same as with any PyTorch module
# The usage is almost the same as with any PyTorch module,
# except for the change to wrap the inference process with Nano context manager
with InferenceOptimizer.get_context(ort_model):
    y_hat = ort_model(x)
```

```eval_rst
.. note::
    For all Nano-optimized models, you need to wrap the inference process with the automatic context manager provided by Nano through the API ``InferenceOptimizer.get_context(model=...)``.

    Please note that the context manager is not needed for `multi-instance inference <#multi-instance-acceleration>`_.

    For more details about the context manager, you could refer to the section `Automatic Context Management <#automatic-context-management>`_.
```
### OpenVINO Acceleration
The [OpenVINO](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html) usage is quite similar to ONNXRuntime; the following usage is for OpenVINO:
```python

@@ -82,7 +97,9 @@ The [OpenVINO](https://www.intel.com/content/www/us/en/developer/tools/openvino-
ov_model = InferenceOptimizer.trace(model, accelerator='openvino', input_sample=x)

# step 5: use returned model for transparent acceleration
# The usage is almost the same as with any PyTorch module
# The usage is almost the same as with any PyTorch module,
# except for the change to wrap the inference process with Nano context manager
with InferenceOptimizer.get_context(ov_model):
    y_hat = ov_model(x)
```

@@ -97,7 +114,9 @@ jit_model = InferenceOptimizer.trace(model, accelerator='jit',
                                     use_ipex=True, input_sample=x)

# step 5: use returned model for transparent acceleration
# The usage is almost the same as with any PyTorch module
# The usage is almost the same as with any PyTorch module,
# except for the change to wrap the inference process with Nano context manager
with InferenceOptimizer.get_context(jit_model):
    y_hat = jit_model(x)
```

@@ -118,6 +137,7 @@ Without extra accelerator, `InferenceOptimizer.quantize()` returns a PyTorch mod
```python
q_model = InferenceOptimizer.quantize(model, calib_data=dataloader)
# run simple prediction with transparent acceleration
with InferenceOptimizer.get_context(q_model):
    y_hat = q_model(x)
```
This is the most basic usage: it quantizes a model with the defaults (INT8 precision) and does not search the tuning space to control the accuracy drop.

@@ -129,6 +149,7 @@ Still taking the example in [Runtime Acceleration](pytorch_inference.md#runtime-
```python
ort_q_model = InferenceOptimizer.quantize(model, accelerator='onnxruntime', calib_data=dataloader)
# run simple prediction with transparent acceleration
with InferenceOptimizer.get_context(ort_q_model):
    y_hat = ort_q_model(x)
```

			
			@ -138,6 +159,7 @@ Take the example in [Runtime Acceleration](#runtime-acceleration), and add quant
 | 
			
		|||
```python
 | 
			
		||||
ov_q_model = InferenceOptimizer.quantize(model, accelerator='openvino', calib_data=dataloader)
 | 
			
		||||
# run simple prediction with transparent acceleration
 | 
			
		||||
with InferenceOptimizer.get_context(ov_q_model):
 | 
			
		||||
    y_hat = ov_q_model(x)
 | 
			
		||||
```
 | 
			
		||||
 | 
			
		||||
| 
						 | 
				
			
@@ -204,8 +226,6 @@ with InferenceOptimizer.get_context(bf16_model):
```eval_rst
.. note::
    For BFloat16 quantization, make sure your inference is wrapped in ``with InferenceOptimizer.get_context(bf16_model):``. Otherwise, the whole inference process actually runs in FP32 precision.

    For more details about the context manager provided by ``InferenceOptimizer.get_context()``, you could refer to the related `How-to guide <https://bigdl.readthedocs.io/en/latest/doc/Nano/Howto/Inference/PyTorch/pytorch_context_manager.html>`_.
```
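
As a minimal sketch (assuming the ``bf16_model`` and input ``x`` from the preceding example), you could verify that the context takes effect by checking the output dtype:

```python
# Minimal sketch, assuming bf16_model and x are defined as in the preceding example.
# Inside the Nano context the computation runs under bfloat16 autocast,
# so the output dtype should be torch.bfloat16; outside it stays FP32.
import torch
from bigdl.nano.pytorch import InferenceOptimizer

with InferenceOptimizer.get_context(bf16_model):
    y_hat = bf16_model(x)
    assert y_hat.dtype == torch.bfloat16
```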

#### Channels Last Memory Format

@@ -326,15 +346,22 @@ multi_model = InferenceOptimizer.to_multi_instance(model, num_processes=4, cores
multi_model = InferenceOptimizer.to_multi_instance(model, cpu_for_each_process=[[0], [1], [2,3], [4,5]])
```

```eval_rst
.. note::
    During multi-instance inference, the context manager ``InferenceOptimizer.get_context(model=...)`` does not need to be manually added.
```
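
As a hypothetical usage sketch (the exact input format should be checked against the API reference), the returned ``multi_model`` is called directly on a list of input batches, with no context manager around it:

```python
# Hypothetical sketch: multi_model comes from InferenceOptimizer.to_multi_instance(...)
# above, and x1..x4 stand for separate input batches. The batches are dispatched to the
# worker processes; no InferenceOptimizer.get_context(...) wrapper is needed here.
y_hat_list = multi_model([x1, x2, x3, x4])
```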

## Automatic Context Management
BigDL-Nano provides ``InferenceOptimizer.get_context(model=...)`` API to enable automatic context management for PyTorch inference. With only one line of code change, BigDL-Nano will automatically provide suitable context management for each accelerated model, it usually contains part of or all of following three types of context managers:
BigDL-Nano provides the ``InferenceOptimizer.get_context(model=...)`` API to enable automatic context management for PyTorch inference. With only one line of code change, BigDL-Nano will automatically provide suitable context management for each accelerated model optimized by ``InferenceOptimizer.trace``/``quantize``/``optimize``. It usually contains some or all of the following four types of context managers (a rough conceptual sketch follows the list below):

1. ``torch.no_grad()`` to disable gradients, which will be used for all model
1. ``torch.inference_mode(True)`` to disable gradients, which will be used for all models. For the case when ``torch <= 1.12``, ``torch.no_grad()`` will be used for PyTorch mixed precision inference as a replacement for ``torch.inference_mode(True)``

2. ``torch.cpu.amp.autocast(dtype=torch.bfloat16)`` to run in mixed precision, which will be provided for bf16 related model
2. ``torch.cpu.amp.autocast(dtype=torch.bfloat16)`` to run in mixed precision, which will be provided for bf16-related models

3. ``torch.set_num_threads()`` to control the thread number, which will be used only if you specify ``thread_num`` when applying ``InferenceOptimizer.trace``/``quantize``/``optimize``

4. ``torch.jit.enable_onednn_fusion(True)`` to enable oneDNN fusion for JIT when using ``jit`` as the accelerator
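
To make the list above concrete, here is a rough conceptual sketch of how such a combined context could be composed. This is only an approximation for illustration, not Nano's actual implementation; the function name and parameters are made up for this sketch:

```python
# Rough conceptual sketch only; NOT Nano's actual implementation.
# It approximates how the four managers listed above could be combined.
import contextlib

import torch


@contextlib.contextmanager
def conceptual_nano_context(thread_num=None, use_bf16=False, use_jit=False):
    old_threads = torch.get_num_threads()
    if thread_num is not None:
        torch.set_num_threads(thread_num)        # 3. thread control
    if use_jit:
        torch.jit.enable_onednn_fusion(True)     # 4. oneDNN fusion for JIT models
    try:
        with torch.inference_mode():             # 1. disable gradients
            if use_bf16:
                # 2. mixed precision, only for bf16-related models
                with torch.cpu.amp.autocast(dtype=torch.bfloat16):
                    yield
            else:
                yield
    finally:
        torch.set_num_threads(old_threads)       # restore the previous thread setting
```

In practice you should rely on ``InferenceOptimizer.get_context(model=...)`` itself, which picks the right combination for each accelerated model automatically.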

For a model accelerated by ``InferenceOptimizer.trace``, the usage now looks like the code below; here we just take ``ipex`` as an example:
```python
from bigdl.nano.pytorch import InferenceOptimizer

@@ -369,6 +396,10 @@ with InferenceOptimizer.get_context(ipex_model, classifer):
    assert torch.get_num_threads() == 4  # this line is just to let you know Nano has provided thread control automatically :)
```

```eval_rst
.. seealso::
   You could refer to the related `how-to guide <../Howto/Inference/PyTorch/pytorch_context_manager.nblink>`_ for more detailed usage of the context manager.
```
## One-click Acceleration Without Code Change
```eval_rst
.. note::