[LLM] Improve LLM doc regarding windows gpu related info (#9880)

* Improve runtime configuration for windows

* Add python 310/311 supports for wheel downloading

* Add troubleshooting for windows gpu

* Remove manually import ipex due to auto importer

* Add info regarding cpu_embedding=True on iGPU

* More info for Windows users

* Small updates to API docs

* Python style fix

* Remove tip for loading from saved optimize_model for now

* Updated based on comments

* Update win info for multi-intel gpus selection

* Small fix

* Small fix
Yuwen Hu 2024-01-11 14:37:16 +08:00 committed by GitHub
parent 07485eff5a
commit 0aef35a965
6 changed files with 165 additions and 56 deletions

View file

@ -10,16 +10,17 @@ We also support finetuning LLMs (large language models) using QLoRA with BigDL-L
To help you better understand the finetuning process, here we use model [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) as an example.
**Make sure you have prepared environment following instructions [here](../install_gpu.html). First of all, you need to import `intel_extension_for_pytorch` to run on Intel GPUs**:
**Make sure you have prepared environment following instructions [here](../install_gpu.html).**
```python
import intel_extension_for_pytorch as ipex
```eval_rst
.. note::
If you are using an older version of ``bigdl-llm`` (specifically, older than 2.5.0b20240104), you need to manually add ``import intel_extension_for_pytorch as ipex`` at the beginning of your code.
```
First, load model using `transformers`-style API and **set it to `to('xpu')`**. We specify `load_in_low_bit="nf4"` here to apply 4-bit NormalFloat optimization. According to the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf), using `"nf4"` could yield better model quality than `"int4"`.
```python
import intel_extension_for_pytorch as ipex
from bigdl.llm.transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",

View file

@ -4,10 +4,12 @@ Apart from the significant acceleration capabilites on Intel CPUs, BigDL-LLM als
Compared with running on Intel CPUs, some additional operations are required on Intel GPUs. To help you better understand the process, here we use a popular model [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) as an example.
**Make sure you have prepared environment following instructions [here](../install_gpu.html). First of all, you need to import `intel_extension_for_pytorch` to run on Intel GPUs**:
**Make sure you have prepared environment following instructions [here](../install_gpu.html).**
```python
import intel_extension_for_pytorch as ipex
```eval_rst
.. note::
If you are using an older version of ``bigdl-llm`` (specifically, older than 2.5.0b20240104), you need to manually add ``import intel_extension_for_pytorch as ipex`` at the beginning of your code.
```
## Load and Optimize Model
@ -26,7 +28,6 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-
.. code-block:: python
# Take Llama-2-7b-chat-hf as an example
import intel_extension_for_pytorch as ipex
from transformers import LlamaForCausalLM
from bigdl.llm import optimize_model
@ -35,11 +36,16 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-
model = model.to('xpu') # Important after obtaining the optimized model
.. tip::
When running LLMs on Intel iGPUs on Windows, we recommend setting ``cpu_embedding=True`` in the ``optimize_model`` function. This allows the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
See the `API doc <../../../PythonAPI/LLM/optimize.html#bigdl.llm.optimize_model>`_ for ``optimize_model`` to find more information.
In particular, if you have saved the optimized model following the steps `here <./optimize_model.html#save>`_, the loading process on Intel GPUs may be as follows:
.. code-block:: python
import intel_extension_for_pytorch as ipex
from transformers import LlamaForCausalLM
from bigdl.llm.optimize import low_memory_init, load_low_bit
@ -59,7 +65,6 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-
.. code-block:: python
# Take Llama-2-7b-chat-hf as an example
import intel_extension_for_pytorch as ipex
from bigdl.llm.transformers import AutoModelForCausalLM
# Load model in 4 bit, which converts the relevant layers in the model into INT4 format
@ -67,17 +72,26 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-
model = model.to('xpu') # Important after obtaining the optimized model
.. tip::
When running LLMs on Intel iGPUs on Windows, we recommend setting ``cpu_embedding=True`` in the ``from_pretrained`` function. This allows the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
See the `API doc <../../../PythonAPI/LLM/transformers.html#hugging-face-transformers-automodel>`_ to find more information.
In particular, if you have saved the optimized model following the steps `here <./hugging_face_format.html#save-load>`_, the loading process on Intel GPUs may be as follows:
.. code-block:: python
import intel_extension_for_pytorch as ipex
from bigdl.llm.transformers import AutoModelForCausalLM
saved_dir='./llama-2-bigdl-llm-4-bit'
model = AutoModelForCausalLM.load_low_bit(saved_dir) # Load the optimized model
model = model.to('xpu') # Important after obtaining the optimized model
.. tip::
When running saved optimized models on Intel iGPUs on Windows, we also recommend setting ``cpu_embedding=True`` in the ``load_low_bit`` function.
```
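To make the iGPU recommendation concrete, here is a minimal sketch; the model name and save directory follow the examples above and are illustrative:

```python
from bigdl.llm.transformers import AutoModelForCausalLM

# Keep the memory-intensive embedding layer on the CPU when running on an Intel iGPU
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                             load_in_4bit=True,
                                             cpu_embedding=True)
model = model.to('xpu')

# The same option applies when loading a previously saved low-bit model
saved_dir = './llama-2-bigdl-llm-4-bit'
model = AutoModelForCausalLM.load_low_bit(saved_dir, cpu_embedding=True)
model = model.to('xpu')
```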
## Run Optimized Model
@ -101,6 +115,11 @@ with torch.inference_mode():
The initial generation of optimized LLMs on Intel GPUs could be slow. Therefore, it's recommended to perform a **warm-up** run before the actual generation.
```
```eval_rst
.. note::
If you are a Windows user, please also note that for **the first time** that **each model** runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
```
```eval_rst
.. seealso::

View file

@ -5,10 +5,26 @@ In [Inference on GPU](inference_on_gpu.md) and [Finetune (QLoRA)](finetune.md),
## List devices
The `sycl-ls` tool enumerates the devices available in the system. You can use it after you set up the oneAPI environment:
```bash
source /opt/intel/oneapi/setvars.sh
sycl-ls
```eval_rst
.. tabs::
.. tab:: Windows
Please make sure you are using CMD (Anaconda Prompt if using conda):
.. code-block:: cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
sycl-ls
.. tab:: Linux
.. code-block:: bash
source /opt/intel/oneapi/setvars.sh
sycl-ls
```
If you have two Arc A770 GPUs, you can get output like the following:
```
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
@ -20,7 +36,7 @@ If you have two Arc770 GPUs, you can get something like below:
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:2] Intel(R) Level-Zero, Intel(R) UHD Graphics 770 1.3 [1.3.26241]
```
This output shows there are two Arc A770 GPUs on this machine.
This output shows there are two Arc A770 GPUs as well as an Intel iGPU on this machine.
## Devices selection
To enable XPU, you should convert your model and input to XPU with the code below:
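A minimal sketch of that conversion; `tokenizer` and `prompt` are assumed to be defined as in the generation examples elsewhere in these docs:

```python
model = model.to('xpu')
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
```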
@ -32,6 +48,7 @@ To select the desired devices, there are two ways: one is changing the code, ano
### 1. Select device in python
To specify an XPU device, you can change `to('xpu')` to `to('xpu:[device_id]')`; the `device_id` is counted from zero.
If you want to use the second device, you can change the code like this:
```
model = model.to('xpu:1')
@ -41,11 +58,29 @@ input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu:1')
### 2. OneAPI device selector
The device selection environment variable `ONEAPI_DEVICE_SELECTOR` can be used to limit the choice of Intel GPU devices. As the `sycl-ls` output above shows, the last three lines are three Level Zero GPU devices, so we can use `ONEAPI_DEVICE_SELECTOR=level_zero:[gpu_id]` to select devices.
For example, if you want to use the second A770 GPU, you can run Python like this:
```
ONEAPI_DEVICE_SELECTOR=level_zero:1 python generate.py
```
`ONEAPI_DEVICE_SELECTOR=level_zero:1` in the above command only affects the current Python program. Alternatively, you can export the environment variable and then run your Python script:
```
export ONEAPI_DEVICE_SELECTOR=level_zero:1
python generate.py
```eval_rst
.. tabs::
.. tab:: Windows
.. code-block:: cmd
set ONEAPI_DEVICE_SELECTOR=level_zero:1
python generate.py
Through ``set ONEAPI_DEVICE_SELECTOR=level_zero:1``, only the second A770 GPU will be available for the current environment.
.. tab:: Linux
.. code-block:: bash
ONEAPI_DEVICE_SELECTOR=level_zero:1 python generate.py
``ONEAPI_DEVICE_SELECTOR=level_zero:1`` in the above command only affects the current Python program. Alternatively, you can export the environment variable and then run your Python script:
.. code-block:: bash
export ONEAPI_DEVICE_SELECTOR=level_zero:1
python generate.py
```
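As a quick sanity check after selecting devices, you can list what is visible from Python. This sketch assumes the `torch.xpu` helpers provided by `intel_extension_for_pytorch`:

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' backend

print(torch.xpu.device_count())       # number of XPU devices visible to this process
print(torch.xpu.get_device_name(0))   # name of the first visible device
```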

View file

@ -61,32 +61,74 @@ pip install intel_extension_for_pytorch-2.1.10+xpu-cp39-cp39-win_amd64.whl
pip install --pre --upgrade bigdl-llm[xpu]
```
```eval_rst
.. note::
All the wheel packages mentioned here are for Python 3.9. If you would like to use Python 3.10 or 3.11, you should modify the wheel names for ``torch``, ``torchvision``, and ``intel_extension_for_pytorch`` by replacing ``cp39`` with ``cp310`` or ``cp311``, respectively.
```
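For example, in a Python 3.10 environment the IPEX wheel shown above would instead carry `cp310` tags; the exact file name below is an assumption based on that pattern:

```cmd
pip install intel_extension_for_pytorch-2.1.10+xpu-cp310-cp310-win_amd64.whl
```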
### Runtime Configuration
To use GPU acceleration on Windows, several environment variables are required before running a GPU example.
Make sure you are using CMD as PowerShell is not supported:
Make sure you are using CMD (Anaconda Prompt if using conda) as PowerShell is not supported:
```
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
Please also set the following environment variable for iGPU:
Please also set the following environment variables according to the device you would like to run LLMs on:
```
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```eval_rst
.. tabs::
.. tab:: Intel iGPU
.. code-block:: cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
.. tab:: Intel Arc™ A300-Series or Pro A60
.. code-block:: cmd
set SYCL_CACHE_PERSISTENT=1
.. tab:: Other Intel dGPU Series
There is no need to set further environment variables.
```
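Putting the Windows steps together, a typical iGPU run might start like this; the script name `generate.py` is illustrative:

```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"

set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1

python generate.py
```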
```eval_rst
.. note::
For the first time that **each model** runs on **iGPU**, it may take around several minutes to compile.
For **the first time** that **each model** runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
```
<!-- ### Troubleshooting -->
### Troubleshooting
<!-- todo -->
#### 1. Error loading `intel_extension_for_pytorch`
If you encounter an error when importing `intel_extension_for_pytorch`, please ensure that you have completed the following steps:
* Ensure that you have installed Visual Studio with the "Desktop development with C++" workload.
* Make sure that the correct version of oneAPI, specifically 2024.0, is installed.
* Ensure that `libuv` is installed in your conda environment. This can be done during the creation of the environment with the command:
```cmd
conda create -n llm python=3.9 libuv
```
If you missed `libuv`, you can add it to your existing environment through
```cmd
conda install libuv
```
* Make sure you have configured oneAPI environment variables in your command prompt through
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
Please note that you need to set these environment variables again whenever you open a new command prompt window.
## Linux
@ -131,7 +173,7 @@ BigDL-LLM for GPU supports on Linux has been verified on:
We recommend you use `this offline package <https://registrationcenter-download.intel.com/akdlm/IRC_NAS/20f4e6a1-6b0b-4752-b8c1-e5eacba10e01/l_BaseKit_p_2024.0.0.49564_offline.sh>`_ to install oneAPI.
.. tab:: Pytorch 2.0
.. tab:: PyTorch 2.0
To enable BigDL-LLM for Intel GPUs with PyTorch 2.0, here are several prerequisite steps for tools installation and environment preparation:
@ -164,7 +206,7 @@ We recommend using [miniconda](https://docs.conda.io/en/latest/miniconda.html) t
```eval_rst
.. tabs::
.. tab:: Pytorch 2.1
.. tab:: PyTorch 2.1
.. code-block:: bash
@ -182,7 +224,7 @@ We recommend using [miniconda](https://docs.conda.io/en/latest/miniconda.html) t
pip install --pre --upgrade bigdl-llm[xpu_2.1] -f https://developer.intel.com/ipex-whl-stable-xpu
.. tab:: Pytorch 2.0
.. tab:: PyTorch 2.0
.. code-block:: bash
@ -243,6 +285,12 @@ If you encounter network issues when installing IPEX, you can also install BigDL
```
```eval_rst
.. note::
All the wheel packages mentioned here are for Python 3.9. If you would like to use Python 3.10 or 3.11, you should modify the wheel names for ``torch``, ``torchvision``, and ``intel_extension_for_pytorch`` by replacing ``cp39`` with ``cp310`` or ``cp311``, respectively.
```
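Similarly, on Linux a Python 3.11 environment would use wheels with `cp311` tags, for example (the exact file name is an assumption):

```bash
pip install intel_extension_for_pytorch-2.1.10+xpu-cp311-cp311-linux_x86_64.whl
```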
### Runtime Configuration
To use GPU acceleration on Linux, several environment variables are required or recommended before running a GPU example.
@ -255,7 +303,7 @@ To use GPU acceleration on Linux, several environment variables are required or
.. code-block:: bash
# Required step. Configure OneAPI environment variables
# Required step. Configure oneAPI environment variables
source /opt/intel/oneapi/setvars.sh
# Recommended Environment Variables
@ -268,7 +316,7 @@ To use GPU acceleration on Linux, several environment variables are required or
.. code-block:: bash
# Required step. Configure OneAPI environment variables
# Required step. Configure oneAPI environment variables
source /opt/intel/oneapi/setvars.sh
# Recommended Environment Variables
@ -316,4 +364,4 @@ Error: libmkl_sycl_blas.so.4: cannot open shared object file: No such file or di
The reason for such errors is that oneAPI has not been initialized properly before running BigDL-LLM code or before importing IPEX package.
* Step 1: Make sure you execute setvars.sh of oneAPI Base Toolkit before running BigDL-LLM code.
* Step 2: Make sure you install matching versions of BigDL-LLM/pytorch/IPEX and oneAPI Base Toolkit. BigDL-LLM with Pytorch 2.1 should be used with oneAPI Base Toolkit version 2024.0. BigDL-LLM with Pytorch 2.0 should be used with oneAPI Base Toolkit version 2023.2.
* Step 2: Make sure you install matching versions of BigDL-LLM/pytorch/IPEX and oneAPI Base Toolkit. BigDL-LLM with PyTorch 2.1 should be used with oneAPI Base Toolkit version 2024.0. BigDL-LLM with PyTorch 2.0 should be used with oneAPI Base Toolkit version 2023.2.
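For Step 1, a minimal sketch of the required initialization; the script name is illustrative:

```bash
# Configure oneAPI environment variables before importing IPEX or running BigDL-LLM code
source /opt/intel/oneapi/setvars.sh
python generate.py
```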

View file

@ -199,15 +199,19 @@ def optimize_model(model, low_bit='sym_int4', optimize_llm=True, modules_to_not_
A method to optimize any pytorch model.
:param model: The original PyTorch model (nn.module)
:param low_bit: Supported low-bit options are "sym_int4", "asym_int4", "sym_int5",
"asym_int5" or "sym_int8".
:param optimize_llm: Whether to further optimize llm model.
:param low_bit: str value, options are ``'sym_int4'``, ``'asym_int4'``, ``'sym_int5'``,
``'asym_int5'``, ``'sym_int8'``, ``'nf3'``, ``'nf4'``, ``'fp4'``,
``'fp8'``, ``'fp8_e4m3'``, ``'fp8_e5m2'``, ``'fp16'`` or ``'bf16'``,
``'sym_int4'`` means symmetric int 4, ``'asym_int4'`` means
asymmetric int 4, ``'nf4'`` means 4-bit NormalFloat, etc.
Relevant low bit optimizations will be applied to the model.
:param optimize_llm: Whether to further optimize llm model. Default to be ``True``.
:param modules_to_not_convert: list of str value, modules (nn.Module) that are skipped
when conducting model optimizations. Default to be None.
when conducting model optimizations. Default to be ``None``.
:param cpu_embedding: Whether to replace the Embedding layer, may need to set it
to `True` when running BigDL-LLM on GPU on Windows. Default to be `False`.
to ``True`` when running BigDL-LLM on GPU on Windows. Default to be ``False``.
:param lightweight_bmm: Whether to replace the torch.bmm ops, may need to set it
to `True` when running BigDL-LLM on GPU on Windows. Default to be `False`.
to ``True`` when running BigDL-LLM on GPU on Windows. Default to be ``False``.
:return: The optimized model.
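For reference, a minimal usage sketch of the parameters documented above; the model name and the `'nf4'` choice are illustrative:

```python
from transformers import LlamaForCausalLM
from bigdl.llm import optimize_model

model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
# Apply 4-bit NormalFloat optimization; keep the embedding layer on CPU
# (recommended for Intel iGPUs on Windows, as noted earlier)
model = optimize_model(model, low_bit='nf4', cpu_embedding=True)
model = model.to('xpu')
```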

View file

@ -101,22 +101,24 @@ class _BaseAutoModelClass:
Three new arguments are added to extend Hugging Face's from_pretrained method as follows:
:param load_in_4bit: boolean value, True means loading linear's weight to symmetric int 4 if
the model is a regular fp16/bf16/fp32 model, and to asymmetric int 4
if the model is GPTQ model.
Default to be False.
:param load_in_low_bit: str value, options are sym_int4, asym_int4, sym_int5, asym_int5
, sym_int8, nf3, nf4, fp4, fp8, fp8_e4m3, fp8_e5m2, fp16 or bf16.
sym_int4 means symmetric int 4, asym_int4 means asymmetric int 4,
nf4 means 4-bit NormalFloat, etc. Relevant low bit optimizations
will be applied to the model.
the model is a regular fp16/bf16/fp32 model, and to asymmetric int 4
if the model is GPTQ model.
Default to be ``False``.
:param load_in_low_bit: str value, options are ``'sym_int4'``, ``'asym_int4'``,
``'sym_int5'``, ``'asym_int5'``, ``'sym_int8'``, ``'nf3'``,
``'nf4'``, ``'fp4'``, ``'fp8'``, ``'fp8_e4m3'``, ``'fp8_e5m2'``,
``'fp16'`` or ``'bf16'``, ``'sym_int4'`` means symmetric int 4,
``'asym_int4'`` means asymmetric int 4, ``'nf4'`` means 4-bit
NormalFloat, etc. Relevant low bit optimizations will be applied
to the model.
:param optimize_model: boolean value, Whether to further optimize the low_bit llm model.
Default to be True.
Default to be ``True``.
:param modules_to_not_convert: list of str value, modules (nn.Module) that are skipped when
conducting model optimizations. Default to be None.
conducting model optimizations. Default to be ``None``.
:param cpu_embedding: Whether to replace the Embedding layer, may need to set it
to `True` when running BigDL-LLM on GPU on Windows. Default to be `False`.
to ``True`` when running BigDL-LLM on GPU on Windows. Default to be ``False``.
:param lightweight_bmm: Whether to replace the torch.bmm ops, may need to set it
to `True` when running BigDL-LLM on GPU on Windows. Default to be `False`.
to ``True`` when running BigDL-LLM on GPU on Windows. Default to be ``False``.
:return: a model instance
"""
pretrained_model_name_or_path = kwargs.get("pretrained_model_name_or_path", None) \