From 55fa3e13e6b3cc5e1854433516b4f982f2cb8852 Mon Sep 17 00:00:00 2001
From: Yang Wang
Date: Wed, 25 May 2022 10:26:46 +0800
Subject: [PATCH] Add Nano Known issues and some notebook link (#4667)

* Add Nano Known issues and some notebook link

* fix typo
---
 .../source/doc/Nano/Overview/known_issues.md  | 68 ++++++++++++++++++++
 .../doc/Nano/QuickStart/pytorch_train.md      | 10 +++-
 docs/readthedocs/source/index.rst             |  1 +
 3 files changed, 77 insertions(+), 2 deletions(-)
 create mode 100644 docs/readthedocs/source/doc/Nano/Overview/known_issues.md

diff --git a/docs/readthedocs/source/doc/Nano/Overview/known_issues.md b/docs/readthedocs/source/doc/Nano/Overview/known_issues.md
new file mode 100644
index 00000000..bdef8039
--- /dev/null
+++ b/docs/readthedocs/source/doc/Nano/Overview/known_issues.md
@@ -0,0 +1,68 @@
+# Nano Known Issues
+
+## **PyTorch Issues**
+
+### **AttributeError: module 'distutils' has no attribute 'version'**
+
+This is usually because the latest setuptools is not compatible with PyTorch 1.9.
+
+You can downgrade `setuptools` to 58.0.4 to solve this problem.
+
+For example, if your `setuptools` is installed by conda, you can run:
+
+```bash
+conda install setuptools==58.0.4
+```
+
+### **error while loading shared libraries: libunwind.so.8**
+
+You may see this error message when running `source bigdl-nano-init`:
+```
+Sed: error while loading shared libraries: libunwind.so.8: cannot open shared object file: No such file or directory.
+```
+You can use the following command to fix this issue:
+
+* `apt-get install libunwind8-dev`
+
+### **Bus error (core dumped) in multi-instance training with spawn distributed backend**
+
+This is usually because the shared memory size in your docker container is too small.
+
+You can pass a larger `--shm-size` value (e.g. a few GB) to your `docker run` command, or use `--ipc=host`.
+
+If you are running in k8s, you can mount larger storage in `/dev/shm`. For example, you can add the following `volume` and `volumeMount` to your pod and container definitions.
+
+```yaml
+spec:
+  containers:
+  ...
+    volumeMounts:
+    - mountPath: /dev/shm
+      name: cache-volume
+  volumes:
+  - emptyDir:
+      medium: Memory
+      sizeLimit: 8Gi
+    name: cache-volume
+```
+
+## **TensorFlow Issues**
+
+### **Nano Keras multi-instance training currently does not support TensorFlow dataset.from_generator, numpy_function, py_function**
+
+Nano Keras multi-instance training serializes the TensorFlow dataset object into a `graph.pb` file, which does not work with `dataset.from_generator`, `dataset.numpy_function`, or `dataset.py_function` due to limitations in TensorFlow.
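+
+If your dataset fits in memory, one possible workaround is to materialize it as NumPy arrays or tensors and build the dataset with `from_tensor_slices`, which can be serialized. A minimal sketch (the data shapes below are illustrative):
+
+```python
+import numpy as np
+import tensorflow as tf
+
+# Illustrative in-memory data; replace with your own arrays or tensors.
+x = np.random.rand(100, 28, 28).astype("float32")
+y = np.random.randint(0, 10, size=(100,))
+
+# A dataset built from in-memory tensors serializes correctly, unlike one
+# built with from_generator, numpy_function, or py_function.
+train_ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(32)
+```
\ No newline at end of file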
diff --git a/docs/readthedocs/source/doc/Nano/QuickStart/pytorch_train.md b/docs/readthedocs/source/doc/Nano/QuickStart/pytorch_train.md
index 60549be5..1999aecf 100644
--- a/docs/readthedocs/source/doc/Nano/QuickStart/pytorch_train.md
+++ b/docs/readthedocs/source/doc/Nano/QuickStart/pytorch_train.md
@@ -2,7 +2,7 @@
 
 BigDL-Nano can be used to accelerate PyTorch or PyTorch-Lightning applications on training workloads. The optimizations in BigDL-Nano are delivered through an extended version of PyTorch-Lightning `Trainer`. These optimizations are either enabled by default or can be easily turned on by setting a parameter or calling a method.
 
-We will briefly describe here the major features in BigDL-Nano for PyTorch training. You can find complete examples here [links to be added]().
+We will briefly describe here the major features in BigDL-Nano for PyTorch training. You can find complete examples [here](https://github.com/intel-analytics/BigDL/tree/main/python/nano/notebooks/pytorch).
 
 ### Best Known Configurations
 
@@ -40,13 +40,19 @@ trainer.fit(lightning_module, train_loader)
 
 #### Intel® Extension for PyTorch
 
-Intel Extension for Pytorch (a.k.a. IPEX) extends PyTorch with optimizations for an extra performance boost on Intel hardware. BigDL-Nano integrates IPEX through the `Trainer`. Users can turn on IPEX by setting `use_ipex=True`.
+[Intel Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch) (a.k.a. IPEX) extends PyTorch with optimizations for an extra performance boost on Intel hardware. BigDL-Nano integrates IPEX through the `Trainer`. Users can turn on IPEX by setting `use_ipex=True`.
 
 ```python
 from bigdl.nano.pytorch import Trainer
 trainer = Trainer(max_epochs=10, use_ipex=True)
 ```
 
+Note: BigDL-Nano does not install IPEX by default. You can install IPEX using the following command:
+
+```bash
+python -m pip install torch_ipex==1.9.0 -f https://software.intel.com/ipex-whl-stable
+```
+
 #### Multi-instance Training
 
 When training on a server with dozens of CPU cores, it is often beneficial to use multiple training instances in a data-parallel fashion to make full use of the CPU cores. However, using PyTorch's DDP API is a little cumbersome and error-prone, and if not configured correctly, it will make the training even slower.
diff --git a/docs/readthedocs/source/index.rst b/docs/readthedocs/source/index.rst
index 09b3e410..4e9d0268 100644
--- a/docs/readthedocs/source/index.rst
+++ b/docs/readthedocs/source/index.rst
@@ -52,6 +52,7 @@ BigDL Documentation
    doc/Nano/QuickStart/tensorflow_train.md
    doc/Nano/QuickStart/tensorflow_inference.md
    doc/Nano/QuickStart/hpo.md
+   doc/Nano/Overview/known_issues.md
 
 .. toctree::
    :maxdepth: 1
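As background for the Multi-instance Training section touched by this patch, here is a minimal sketch of how multi-instance training looks with the extended `Trainer`. The `num_processes` argument follows BigDL-Nano's documented multi-instance API (treat the exact name as an assumption), and the toy model and data are placeholders:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning import LightningModule
from bigdl.nano.pytorch import Trainer

# A toy LightningModule, only to keep the sketch self-contained.
class ToyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(16, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

# Random placeholder data standing in for a real dataset.
train_loader = DataLoader(
    TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,))),
    batch_size=32,
)

# num_processes launches several training instances in a data-parallel
# fashion, replacing a hand-rolled PyTorch DDP setup.
trainer = Trainer(max_epochs=10, num_processes=4)
trainer.fit(ToyModel(), train_loader)
```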