[LLM] Improve LLM doc regarding windows gpu related info (#9880)

* Improve runtime configuration for windows

* Add python 310/311 supports for wheel downloading

* Add troubleshooting for windows gpu

* Remove manually import ipex due to auto importer

* Add info regarding cpu_embedding=True on iGPU

* More info for Windows users

* Small updates to API docs

* Python style fix

* Remove tip for loading from saved optimize_model for now

* Updated based on comments

* Update win info for multi-intel gpus selection

* Small fix

* Small fix
Yuwen Hu 2024-01-11 14:37:16 +08:00 committed by GitHub
parent 07485eff5a
commit 0aef35a965
6 changed files with 165 additions and 56 deletions

View file

@ -10,16 +10,17 @@ We also support finetuning LLMs (large language models) using QLoRA with BigDL-L
To help you better understand the finetuning process, here we use model [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) as an example.
**Make sure you have prepared environment following instructions [here](../install_gpu.html). First of all, you need to import `intel_extension_for_pytorch` to run on Intel GPUs**:
**Make sure you have prepared environment following instructions [here](../install_gpu.html).**
```python
import intel_extension_for_pytorch as ipex
```eval_rst
.. note::
If you are using an older version of ``bigdl-llm`` (specifically, older than 2.5.0b20240104), you need to manually add ``import intel_extension_for_pytorch as ipex`` at the beginning of your code.
```
First, load model using `transformers`-style API and **set it to `to('xpu')`**. We specify `load_in_low_bit="nf4"` here to apply 4-bit NormalFloat optimization. According to the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf), using `"nf4"` could yield better model quality than `"int4"`.
```python
import intel_extension_for_pytorch as ipex
from bigdl.llm.transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",

View file

@ -4,10 +4,12 @@ Apart from the significant acceleration capabilites on Intel CPUs, BigDL-LLM als
Compared with running on Intel CPUs, some additional operations are required on Intel GPUs. To help you better understand the process, here we use a popular model [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) as an example.
**Make sure you have prepared environment following instructions [here](../install_gpu.html). First of all, you need to import `intel_extension_for_pytorch` to run on Intel GPUs**:
**Make sure you have prepared environment following instructions [here](../install_gpu.html).**
```python
import intel_extension_for_pytorch as ipex
```eval_rst
.. note::
If you are using an older version of ``bigdl-llm`` (specifically, older than 2.5.0b20240104), you need to manually add ``import intel_extension_for_pytorch as ipex`` at the beginning of your code.
```
## Load and Optimize Model
@ -26,7 +28,6 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-
.. code-block:: python
# Take Llama-2-7b-chat-hf as an example
import intel_extension_for_pytorch as ipex
from transformers import LlamaForCausalLM
from bigdl.llm import optimize_model
@ -35,11 +36,16 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-
model = model.to('xpu') # Important after obtaining the optimized model
.. tip::
When running LLMs on Intel iGPUs on Windows, we recommend setting ``cpu_embedding=True`` in the ``optimize_model`` function. This allows the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
See the `API doc <../../../PythonAPI/LLM/optimize.html#bigdl.llm.optimize_model>`_ for ``optimize_model`` to find more information.
In particular, if you have saved the optimized model following the steps `here <./optimize_model.html#save>`_, the loading process on Intel GPUs may be as follows:
.. code-block:: python
import intel_extension_for_pytorch as ipex
from transformers import LlamaForCausalLM
from bigdl.llm.optimize import low_memory_init, load_low_bit
@ -59,7 +65,6 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-
.. code-block:: python
# Take Llama-2-7b-chat-hf as an example
import intel_extension_for_pytorch as ipex
from bigdl.llm.transformers import AutoModelForCausalLM
# Load model in 4 bit, which converts the relevant layers in the model into INT4 format
@ -67,17 +72,26 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-
model = model.to('xpu') # Important after obtaining the optimized model
.. tip::
When running LLMs on Intel iGPUs on Windows, we recommend setting ``cpu_embedding=True`` in the ``from_pretrained`` function. This allows the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
See the `API doc <../../../PythonAPI/LLM/transformers.html#hugging-face-transformers-automodel>`_ to find more information.
In particular, if you have saved the optimized model following the steps `here <./hugging_face_format.html#save-load>`_, the loading process on Intel GPUs may be as follows:
.. code-block:: python
import intel_extension_for_pytorch as ipex
from bigdl.llm.transformers import AutoModelForCausalLM
saved_dir='./llama-2-bigdl-llm-4-bit'
model = AutoModelForCausalLM.load_low_bit(saved_dir) # Load the optimized model
model = model.to('xpu') # Important after obtaining the optimized model
.. tip::
When running saved optimized models on Intel iGPUs on Windows, we also recommend setting ``cpu_embedding=True`` in the ``load_low_bit`` function.
```
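To make the iGPU recommendation concrete, here is a minimal sketch; the model name and save directory follow the examples above and are illustrative:

```python
from bigdl.llm.transformers import AutoModelForCausalLM

# Keep the memory-intensive embedding layer on the CPU when running on an Intel iGPU
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                             load_in_4bit=True,
                                             cpu_embedding=True)
model = model.to('xpu')

# The same option applies when loading a previously saved low-bit model
saved_dir = './llama-2-bigdl-llm-4-bit'
model = AutoModelForCausalLM.load_low_bit(saved_dir, cpu_embedding=True)
model = model.to('xpu')
```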
## Run Optimized Model
@ -101,6 +115,11 @@ with torch.inference_mode():
The initial generation of optimized LLMs on Intel GPUs could be slow. Therefore, it's recommended to perform a **warm-up** run before the actual generation.
```
```eval_rst
.. note::
If you are a Windows user, please also note that for **the first time** that **each model** runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
```
```eval_rst
.. seealso::

View file

@ -5,10 +5,26 @@ In [Inference on GPU](inference_on_gpu.md) and [Finetune (QLoRA)](finetune.md),
## List devices
The `sycl-ls` tool enumerates the devices available in the system. You can use it after you set up the oneAPI environment:
```bash
source /opt/intel/oneapi/setvars.sh
sycl-ls
```eval_rst
.. tabs::
.. tab:: Windows
Please make sure you are using CMD (Anaconda Prompt if using conda):
.. code-block:: cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
sycl-ls
.. tab:: Linux
.. code-block:: bash
source /opt/intel/oneapi/setvars.sh
sycl-ls
```
If you have two Arc A770 GPUs, you can get output like the following:
```
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
@ -20,7 +36,7 @@ If you have two Arc770 GPUs, you can get something like below:
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:2] Intel(R) Level-Zero, Intel(R) UHD Graphics 770 1.3 [1.3.26241]
```
This output shows there are two Arc A770 GPUs on this machine.
This output shows there are two Arc A770 GPUs as well as an Intel iGPU on this machine.
## Devices selection
To enable XPU, you should convert your model and input to XPU with the code below:
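A minimal sketch of that conversion; `tokenizer` and `prompt` are assumed to be defined as in the generation examples elsewhere in these docs:

```python
model = model.to('xpu')
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
```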
@ -32,6 +48,7 @@ To select the desired devices, there are two ways: one is changing the code, ano
### 1. Select device in python
To specify an XPU device, you can change `to('xpu')` to `to('xpu:[device_id]')`; the `device_id` is counted from zero.
If you want to use the second device, you can change the code like this:
```
model = model.to('xpu:1')
@ -41,11 +58,29 @@ input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu:1')
### 2. OneAPI device selector
The device selection environment variable `ONEAPI_DEVICE_SELECTOR` can be used to limit the choice of Intel GPU devices. As the `sycl-ls` output above shows, the last three lines are three Level Zero GPU devices, so we can use `ONEAPI_DEVICE_SELECTOR=level_zero:[gpu_id]` to select devices.
For example, if you want to use the second A770 GPU, you can run Python like this:
```
ONEAPI_DEVICE_SELECTOR=level_zero:1 python generate.py
```
`ONEAPI_DEVICE_SELECTOR=level_zero:1` in the above command only affects the current Python program. Alternatively, you can export the environment variable and then run your Python script:
```
export ONEAPI_DEVICE_SELECTOR=level_zero:1
python generate.py
```eval_rst
.. tabs::
.. tab:: Windows
.. code-block:: cmd
set ONEAPI_DEVICE_SELECTOR=level_zero:1
python generate.py
Through ``set ONEAPI_DEVICE_SELECTOR=level_zero:1``, only the second A770 GPU will be available for the current environment.
.. tab:: Linux
.. code-block:: bash
ONEAPI_DEVICE_SELECTOR=level_zero:1 python generate.py
``ONEAPI_DEVICE_SELECTOR=level_zero:1`` in the above command only affects the current Python program. Alternatively, you can export the environment variable and then run your Python script:
.. code-block:: bash
export ONEAPI_DEVICE_SELECTOR=level_zero:1
python generate.py
```
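As a quick sanity check after selecting devices, you can list what is visible from Python. This sketch assumes the `torch.xpu` helpers provided by `intel_extension_for_pytorch`:

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' backend

print(torch.xpu.device_count())       # number of XPU devices visible to this process
print(torch.xpu.get_device_name(0))   # name of the first visible device
```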

View file

@ -61,32 +61,74 @@ pip install intel_extension_for_pytorch-2.1.10+xpu-cp39-cp39-win_amd64.whl
pip install --pre --upgrade bigdl-llm[xpu]
```
```eval_rst
.. note::
All the wheel packages mentioned here are for Python 3.9. If you would like to use Python 3.10 or 3.11, you should modify the wheel names for ``torch``, ``torchvision``, and ``intel_extension_for_pytorch`` by replacing ``cp39`` with ``cp310`` or ``cp311``, respectively.
```
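For example, in a Python 3.10 environment the IPEX wheel shown above would instead carry `cp310` tags; the exact file name below is an assumption based on that pattern:

```cmd
pip install intel_extension_for_pytorch-2.1.10+xpu-cp310-cp310-win_amd64.whl
```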
### Runtime Configuration
To use GPU acceleration on Windows, several environment variables are required before running a GPU example.
Make sure you are using CMD as PowerShell is not supported:
Make sure you are using CMD (Anaconda Prompt if using conda) as PowerShell is not supported:
```
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
Please also set the following environment variable for iGPU:
Please also set the following environment variables according to the device you would like to run LLMs on:
```
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```eval_rst
.. tabs::
.. tab:: Intel iGPU
.. code-block:: cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
.. tab:: Intel Arc™ A300-Series or Pro A60
.. code-block:: cmd
set SYCL_CACHE_PERSISTENT=1
.. tab:: Other Intel dGPU Series
There is no need to set further environment variables.
```
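Putting the Windows steps together, a typical iGPU run might start like this; the script name `generate.py` is illustrative:

```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"

set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1

python generate.py
```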
```eval_rst
.. note::
For the first time that **each model** runs on **iGPU**, it may take around several minutes to compile.
For **the first time** that **each model** runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
```
<!-- ### Troubleshooting -->
### Troubleshooting
<!-- todo -->
#### 1. Error loading `intel_extension_for_pytorch`
If you encounter an error when importing `intel_extension_for_pytorch`, please ensure that you have completed the following steps:
* Ensure that you have installed Visual Studio with the "Desktop development with C++" workload.
* Make sure that the correct version of oneAPI, specifically 2024.0, is installed.
* Ensure that `libuv` is installed in your conda environment. This can be done during the creation of the environment with the command:
```cmd
conda create -n llm python=3.9 libuv
```
If you missed `libuv`, you can add it to your existing environment through
```cmd
conda install libuv
```
* Make sure you have configured oneAPI environment variables in your command prompt through
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
Please note that you need to set these environment variables again whenever you open a new command prompt window.
## Linux
@ -131,7 +173,7 @@ BigDL-LLM for GPU supports on Linux has been verified on:
We recommend you use `this offline package <https://registrationcenter-download.intel.com/akdlm/IRC_NAS/20f4e6a1-6b0b-4752-b8c1-e5eacba10e01/l_BaseKit_p_2024.0.0.49564_offline.sh>`_ to install oneAPI.
.. tab:: Pytorch 2.0
.. tab:: PyTorch 2.0
To enable BigDL-LLM for Intel GPUs with PyTorch 2.0, here are several prerequisite steps for tools installation and environment preparation:
@ -164,7 +206,7 @@ We recommend using [miniconda](https://docs.conda.io/en/latest/miniconda.html) t
```eval_rst
.. tabs::
.. tab:: Pytorch 2.1
.. tab:: PyTorch 2.1
.. code-block:: bash
@ -182,7 +224,7 @@ We recommend using [miniconda](https://docs.conda.io/en/latest/miniconda.html) t
pip install --pre --upgrade bigdl-llm[xpu_2.1] -f https://developer.intel.com/ipex-whl-stable-xpu
.. tab:: Pytorch 2.0
.. tab:: PyTorch 2.0
.. code-block:: bash
@ -243,6 +285,12 @@ If you encounter network issues when installing IPEX, you can also install BigDL
```
```eval_rst
.. note::
All the wheel packages mentioned here are for Python 3.9. If you would like to use Python 3.10 or 3.11, you should modify the wheel names for ``torch``, ``torchvision``, and ``intel_extension_for_pytorch`` by replacing ``cp39`` with ``cp310`` or ``cp311``, respectively.
```
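Similarly, on Linux a Python 3.11 environment would use wheels with `cp311` tags, for example (the exact file name is an assumption):

```bash
pip install intel_extension_for_pytorch-2.1.10+xpu-cp311-cp311-linux_x86_64.whl
```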
### Runtime Configuration
To use GPU acceleration on Linux, several environment variables are required or recommended before running a GPU example.
@ -255,7 +303,7 @@ To use GPU acceleration on Linux, several environment variables are required or
.. code-block:: bash
# Required step. Configure OneAPI environment variables
# Required step. Configure oneAPI environment variables
source /opt/intel/oneapi/setvars.sh
# Recommended Environment Variables
@ -268,7 +316,7 @@ To use GPU acceleration on Linux, several environment variables are required or
.. code-block:: bash
# Required step. Configure OneAPI environment variables
# Required step. Configure oneAPI environment variables
source /opt/intel/oneapi/setvars.sh
# Recommended Environment Variables
@ -316,4 +364,4 @@ Error: libmkl_sycl_blas.so.4: cannot open shared object file: No such file or di
The reason for such errors is that oneAPI has not been initialized properly before running BigDL-LLM code or before importing IPEX package.
* Step 1: Make sure you execute setvars.sh of oneAPI Base Toolkit before running BigDL-LLM code.
* Step 2: Make sure you install matching versions of BigDL-LLM/pytorch/IPEX and oneAPI Base Toolkit. BigDL-LLM with Pytorch 2.1 should be used with oneAPI Base Toolkit version 2024.0. BigDL-LLM with Pytorch 2.0 should be used with oneAPI Base Toolkit version 2023.2.
* Step 2: Make sure you install matching versions of BigDL-LLM/pytorch/IPEX and oneAPI Base Toolkit. BigDL-LLM with PyTorch 2.1 should be used with oneAPI Base Toolkit version 2024.0. BigDL-LLM with PyTorch 2.0 should be used with oneAPI Base Toolkit version 2023.2.
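For Step 1, a minimal sketch of the required initialization; the script name is illustrative:

```bash
# Configure oneAPI environment variables before importing IPEX or running BigDL-LLM code
source /opt/intel/oneapi/setvars.sh
python generate.py
```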

View file

@ -199,15 +199,19 @@ def optimize_model(model, low_bit='sym_int4', optimize_llm=True, modules_to_not_
A method to optimize any pytorch model.
:param model: The original PyTorch model (nn.module)
:param low_bit: Supported low-bit options are "sym_int4", "asym_int4", "sym_int5",
"asym_int5" or "sym_int8".
:param optimize_llm: Whether to further optimize llm model.
:param low_bit: str value, options are ``'sym_int4'``, ``'asym_int4'``, ``'sym_int5'``,
``'asym_int5'``, ``'sym_int8'``, ``'nf3'``, ``'nf4'``, ``'fp4'``,
``'fp8'``, ``'fp8_e4m3'``, ``'fp8_e5m2'``, ``'fp16'`` or ``'bf16'``,
``'sym_int4'`` means symmetric int 4, ``'asym_int4'`` means
asymmetric int 4, ``'nf4'`` means 4-bit NormalFloat, etc.
Relevant low bit optimizations will be applied to the model.
:param optimize_llm: Whether to further optimize llm model. Default to be ``True``.
:param modules_to_not_convert: list of str value, modules (nn.Module) that are skipped
when conducting model optimizations. Default to be None.
when conducting model optimizations. Default to be ``None``.
:param cpu_embedding: Whether to replace the Embedding layer, may need to set it
to `True` when running BigDL-LLM on GPU on Windows. Default to be `False`.
to ``True`` when running BigDL-LLM on GPU on Windows. Default to be ``False``.
:param lightweight_bmm: Whether to replace the torch.bmm ops, may need to set it
to `True` when running BigDL-LLM on GPU on Windows. Default to be `False`.
to ``True`` when running BigDL-LLM on GPU on Windows. Default to be ``False``.
:return: The optimized model.
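For reference, a minimal usage sketch of the parameters documented above; the model name and the `'nf4'` choice are illustrative:

```python
from transformers import LlamaForCausalLM
from bigdl.llm import optimize_model

model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
# Apply 4-bit NormalFloat optimization; keep the embedding layer on CPU
# (recommended for Intel iGPUs on Windows, as noted earlier)
model = optimize_model(model, low_bit='nf4', cpu_embedding=True)
model = model.to('xpu')
```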

View file

@ -101,22 +101,24 @@ class _BaseAutoModelClass:
Three new arguments are added to extend Hugging Face's from_pretrained method as follows:
:param load_in_4bit: boolean value, True means loading linear's weight to symmetric int 4 if
the model is a regular fp16/bf16/fp32 model, and to asymmetric int 4
if the model is GPTQ model.
Default to be False.
:param load_in_low_bit: str value, options are sym_int4, asym_int4, sym_int5, asym_int5
, sym_int8, nf3, nf4, fp4, fp8, fp8_e4m3, fp8_e5m2, fp16 or bf16.
sym_int4 means symmetric int 4, asym_int4 means asymmetric int 4,
nf4 means 4-bit NormalFloat, etc. Relevant low bit optimizations
will be applied to the model.
the model is a regular fp16/bf16/fp32 model, and to asymmetric int 4
if the model is GPTQ model.
Default to be ``False``.
:param load_in_low_bit: str value, options are ``'sym_int4'``, ``'asym_int4'``,
``'sym_int5'``, ``'asym_int5'``, ``'sym_int8'``, ``'nf3'``,
``'nf4'``, ``'fp4'``, ``'fp8'``, ``'fp8_e4m3'``, ``'fp8_e5m2'``,
``'fp16'`` or ``'bf16'``, ``'sym_int4'`` means symmetric int 4,
``'asym_int4'`` means asymmetric int 4, ``'nf4'`` means 4-bit
NormalFloat, etc. Relevant low bit optimizations will be applied
to the model.
:param optimize_model: boolean value, Whether to further optimize the low_bit llm model.
Default to be True.
Default to be ``True``.
:param modules_to_not_convert: list of str value, modules (nn.Module) that are skipped when
conducting model optimizations. Default to be None.
conducting model optimizations. Default to be ``None``.
:param cpu_embedding: Whether to replace the Embedding layer, may need to set it
to `True` when running BigDL-LLM on GPU on Windows. Default to be `False`.
to ``True`` when running BigDL-LLM on GPU on Windows. Default to be ``False``.
:param lightweight_bmm: Whether to replace the torch.bmm ops, may need to set it
to `True` when running BigDL-LLM on GPU on Windows. Default to be `False`.
to ``True`` when running BigDL-LLM on GPU on Windows. Default to be ``False``.
:return: a model instance
"""
pretrained_model_name_or_path = kwargs.get("pretrained_model_name_or_path", None) \