[LLM] Improve LLM doc regarding windows gpu related info (#9880)
* Improve runtime configuration for windows
* Add python 310/311 supports for wheel downloading
* Add troubleshooting for windows gpu
* Remove manually import ipex due to auto importer
* Add info regarding cpu_embedding=True on iGPU
* More info for Windows users
* Small updates to API docs
* Python style fix
* Remove tip for loading from saved optimize_model for now
* Updated based on comments
* Update win info for multi-intel gpus selection
* Small fix
* Small fix
parent 07485eff5a, commit 0aef35a965
6 changed files with 165 additions and 56 deletions
@@ -10,16 +10,17 @@ We also support finetuning LLMs (large language models) using QLoRA with BigDL-L
To help you better understand the finetuning process, here we use model [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) as an example.

-**Make sure you have prepared environment following instructions [here](../install_gpu.html). First of all, you need to import `intel_extension_for_pytorch` to run on Intel GPUs**:
+**Make sure you have prepared environment following instructions [here](../install_gpu.html).**

-```python
-import intel_extension_for_pytorch as ipex
+```eval_rst
+.. note::
+
+   If you are using an older version of ``bigdl-llm`` (specifically, older than 2.5.0b20240104), you need to manually add ``import intel_extension_for_pytorch as ipex`` at the beginning of your code.
```

First, load model using `transformers`-style API and **set it to `to('xpu')`**. We specify `load_in_low_bit="nf4"` here to apply 4-bit NormalFloat optimization. According to the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf), using `"nf4"` could yield better model quality than `"int4"`.

```python
-import intel_extension_for_pytorch as ipex
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
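The `from_pretrained` call above is cut off by the hunk boundary. For illustration only (not part of this commit), here is a minimal sketch of how such a load typically completes; the model id and `load_in_low_bit="nf4"` come from the surrounding text, while the tokenizer line and everything else are assumptions:

```python
# Illustrative sketch (not part of the diff): load Llama-2-7b-hf with 4-bit
# NormalFloat optimization and move it to an Intel GPU, as the text above describes.
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import LlamaTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                             load_in_low_bit="nf4")  # 4-bit NormalFloat
model = model.to('xpu')  # run on the Intel GPU

# Assumed for completeness: the matching tokenizer.
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
```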
@@ -4,10 +4,12 @@ Apart from the significant acceleration capabilites on Intel CPUs, BigDL-LLM als
Compared with running on Intel CPUs, some additional operations are required on Intel GPUs. To help you better understand the process, here we use a popular model [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) as an example.

-**Make sure you have prepared environment following instructions [here](../install_gpu.html). First of all, you need to import `intel_extension_for_pytorch` to run on Intel GPUs**:
+**Make sure you have prepared environment following instructions [here](../install_gpu.html).**

-```python
-import intel_extension_for_pytorch as ipex
+```eval_rst
+.. note::
+
+   If you are using an older version of ``bigdl-llm`` (specifically, older than 2.5.0b20240104), you need to manually add ``import intel_extension_for_pytorch as ipex`` at the beginning of your code.
```

## Load and Optimize Model
@@ -26,7 +28,6 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-
.. code-block:: python

   # Take Llama-2-7b-chat-hf as an example
-   import intel_extension_for_pytorch as ipex
   from transformers import LlamaForCausalLM
   from bigdl.llm import optimize_model
@@ -35,11 +36,16 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-
   model = model.to('xpu') # Important after obtaining the optimized model

+.. tip::
+
+   When running LLMs on Intel iGPUs for Windows users, we recommend setting ``cpu_embedding=True`` in the ``optimize_model`` function. This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
+
+   See the `API doc <../../../PythonAPI/LLM/optimize.html#bigdl.llm.optimize_model>`_ for ``optimize_model`` to find more information.
+
Especially, if you have saved the optimized model following steps `here <./optimize_model.html#save>`_, the loading process on Intel GPUs may be as follows:

.. code-block:: python

-   import intel_extension_for_pytorch as ipex
   from transformers import LlamaForCausalLM
   from bigdl.llm.optimize import low_memory_init, load_low_bit
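To make the new tip concrete, here is a hedged sketch of the PyTorch-API flow with `cpu_embedding=True` on a Windows iGPU setup; the model class and checkpoint id are taken from the example above, while the loading keyword arguments are assumptions:

```python
# Hedged sketch (not part of the diff): keep the embedding layer on the CPU when
# optimizing a model that will run on an Intel iGPU under Windows.
from transformers import LlamaForCausalLM
from bigdl.llm import optimize_model

model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                         torch_dtype='auto', low_cpu_mem_usage=True)
model = optimize_model(model, cpu_embedding=True)  # recommended for iGPU on Windows
model = model.to('xpu')  # Important after obtaining the optimized model
```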
@@ -59,7 +65,6 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-
.. code-block:: python

   # Take Llama-2-7b-chat-hf as an example
-   import intel_extension_for_pytorch as ipex
   from bigdl.llm.transformers import AutoModelForCausalLM

   # Load model in 4 bit, which converts the relevant layers in the model into INT4 format
@@ -67,17 +72,26 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-
   model = model.to('xpu') # Important after obtaining the optimized model

+.. tip::
+
+   When running LLMs on Intel iGPUs for Windows users, we recommend setting ``cpu_embedding=True`` in the ``from_pretrained`` function. This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
+
+   See the `API doc <../../../PythonAPI/LLM/transformers.html#hugging-face-transformers-automodel>`_ to find more information.
+
Especially, if you have saved the optimized model following steps `here <./hugging_face_format.html#save-load>`_, the loading process on Intel GPUs may be as follows:

.. code-block:: python

-   import intel_extension_for_pytorch as ipex
   from bigdl.llm.transformers import AutoModelForCausalLM

   saved_dir='./llama-2-bigdl-llm-4-bit'
   model = AutoModelForCausalLM.load_low_bit(saved_dir) # Load the optimized model

   model = model.to('xpu') # Important after obtaining the optimized model

+.. tip::
+
+   When running saved optimized models on Intel iGPUs for Windows users, we also recommend setting ``cpu_embedding=True`` in the ``load_low_bit`` function.
+
```

## Run Optimized Model
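The same recommendation sketched for the `transformers`-style API; the saved directory name comes from the example above, and the rest is illustrative rather than part of this commit:

```python
# Hedged sketch: cpu_embedding=True with the transformers-style API on a Windows iGPU.
from bigdl.llm.transformers import AutoModelForCausalLM

# Loading and converting in one step:
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                             load_in_4bit=True,
                                             cpu_embedding=True)

# Or re-loading a previously saved low-bit model:
saved_dir = './llama-2-bigdl-llm-4-bit'
model = AutoModelForCausalLM.load_low_bit(saved_dir, cpu_embedding=True)
model = model.to('xpu')  # Important after obtaining the optimized model
```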
@@ -101,6 +115,11 @@ with torch.inference_mode():
The initial generation of optimized LLMs on Intel GPUs could be slow. Therefore, it's recommended to perform a **warm-up** run before the actual generation.
```

+```eval_rst
+.. note::
+
+   If you are a Windows user, please also note that for **the first time** that **each model** runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
+```
+
```eval_rst
.. seealso::
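A hedged sketch of the warm-up pattern the note above recommends; it assumes `model` and `tokenizer` from the earlier examples, and the prompt and token counts are arbitrary:

```python
# Hedged sketch: perform a warm-up generation before the measured run, since the
# first run on an Intel GPU may trigger kernel compilation and be slow.
import torch

with torch.inference_mode():
    input_ids = tokenizer.encode("What is AI?", return_tensors="pt").to('xpu')

    _ = model.generate(input_ids, max_new_tokens=32)       # warm-up run
    output = model.generate(input_ids, max_new_tokens=32)  # representative run
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```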
@@ -5,10 +5,26 @@ In [Inference on GPU](inference_on_gpu.md) and [Finetune (QLoRA)](finetune.md),
## List devices

The `sycl-ls` tool enumerates a list of devices available in the system. You can use it after you set up the oneAPI environment:
-```bash
-source /opt/intel/oneapi/setvars.sh
-sycl-ls
+```eval_rst
+.. tabs::
+   .. tab:: Windows
+
+      Please make sure you are using CMD (Anaconda Prompt if using conda):
+
+      .. code-block:: cmd
+
+         call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+         sycl-ls
+
+   .. tab:: Linux
+
+      .. code-block:: bash
+
+         source /opt/intel/oneapi/setvars.sh
+         sycl-ls
```

If you have two Arc770 GPUs, you can get something like below:
```
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
@@ -20,7 +36,7 @@ If you have two Arc770 GPUs, you can get something like below:
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:2] Intel(R) Level-Zero, Intel(R) UHD Graphics 770 1.3 [1.3.26241]
```
-This output shows there are two Arc A770 GPUs on this machine.
+This output shows there are two Arc A770 GPUs as well as an Intel iGPU on this machine.

## Devices selection
To enable xpu, you should convert your model and input to xpu with the code below:
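For illustration, a minimal hedged sketch of converting both the model and its input to xpu, assuming `model` and `tokenizer` have been loaded as in the earlier guides:

```python
# Hedged sketch: move the optimized model and its inputs to the default XPU device.
model = model.to('xpu')
input_ids = tokenizer.encode("What is AI?", return_tensors="pt").to('xpu')
output = model.generate(input_ids, max_new_tokens=32)
```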
@@ -32,6 +48,7 @@ To select the desired devices, there are two ways: one is changing the code, ano
### 1. Select device in python
To specify an xpu device, you can change `to('xpu')` to `to('xpu:[device_id]')`; the device_id is counted from zero.

If you want to use the second device, you can change the code like this:
```
model = model.to('xpu:1')
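Continuing that example, a hedged sketch of targeting the second device explicitly; the prompt is an assumption:

```python
# Hedged sketch: place the model and its inputs on the second XPU device (device_id 1).
model = model.to('xpu:1')
input_ids = tokenizer.encode("What is AI?", return_tensors="pt").to('xpu:1')
```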
@@ -41,11 +58,29 @@ input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu:1')
### 2. OneAPI device selector
The device selection environment variable `ONEAPI_DEVICE_SELECTOR` can be used to limit the choice of Intel GPU devices. As the `sycl-ls` output above shows, the last three lines are three Level Zero GPU devices. So we can use `ONEAPI_DEVICE_SELECTOR=level_zero:[gpu_id]` to select devices.
For example, if you want to use the second A770 GPU, you can run your Python script like this:
-```
-ONEAPI_DEVICE_SELECTOR=level_zero:1 python generate.py
-```
-`ONEAPI_DEVICE_SELECTOR=level_zero:1` in upon command only affect in current python program. Also, you can export the environment, then run your python:
-```
-export ONEAPI_DEVICE_SELECTOR=level_zero:1
-python generate.py
+
+```eval_rst
+.. tabs::
+   .. tab:: Windows
+
+      .. code-block:: cmd
+
+         set ONEAPI_DEVICE_SELECTOR=level_zero:1
+         python generate.py
+
+      Through ``set ONEAPI_DEVICE_SELECTOR=level_zero:1``, only the second A770 GPU will be available for the current environment.
+
+   .. tab:: Linux
+
+      .. code-block:: bash
+
+         ONEAPI_DEVICE_SELECTOR=level_zero:1 python generate.py
+
+      ``ONEAPI_DEVICE_SELECTOR=level_zero:1`` in the above command only affects the current Python program. Alternatively, you can export the environment variable and then run your Python script:
+
+      .. code-block:: bash
+
+         export ONEAPI_DEVICE_SELECTOR=level_zero:1
+         python generate.py
```
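Besides setting the variable in the shell as the tabs above show, it can also be set from inside the script; this is a hedged sketch, and it only works if the variable is set before `torch`/IPEX initialize the Level Zero runtime:

```python
# Hedged sketch: restrict the visible Level Zero devices from within the Python script.
import os
os.environ["ONEAPI_DEVICE_SELECTOR"] = "level_zero:1"  # expose only the second GPU

import torch  # import after setting the variable, before devices are enumerated
# ... then load the model with bigdl-llm and call model.to('xpu') as usual ...
```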
@@ -61,32 +61,74 @@ pip install intel_extension_for_pytorch-2.1.10+xpu-cp39-cp39-win_amd64.whl
pip install --pre --upgrade bigdl-llm[xpu]
```

+```eval_rst
+.. note::
+
+   All the wheel packages mentioned here are for Python 3.9. If you would like to use Python 3.10 or 3.11, you should modify the wheel names for ``torch``, ``torchvision``, and ``intel_extension_for_pytorch`` by replacing ``cp39`` with ``cp310`` or ``cp311``, respectively.
+```
+
### Runtime Configuration

To use GPU acceleration on Windows, several environment variables are required before running a GPU example.

-Make sure you are using CMD as PowerShell is not supported:
+Make sure you are using CMD (Anaconda Prompt if using conda), as PowerShell is not supported:

-```
+```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```

-Please also set the following environment variable for iGPU:
+Please also set the following environment variable according to the device you would like to run LLMs on:

-```
-set SYCL_CACHE_PERSISTENT=1
-set BIGDL_LLM_XMX_DISABLED=1
+```eval_rst
+.. tabs::
+   .. tab:: Intel iGPU
+
+      .. code-block:: cmd
+
+         set SYCL_CACHE_PERSISTENT=1
+         set BIGDL_LLM_XMX_DISABLED=1
+
+   .. tab:: Intel Arc™ A300-Series or Pro A60
+
+      .. code-block:: cmd
+
+         set SYCL_CACHE_PERSISTENT=1
+
+   .. tab:: Other Intel dGPU Series
+
+      There is no need to set further environment variables.
```

```eval_rst
.. note::

-   For the first time that **each model** runs on **iGPU**, it may take around several minutes to compile.
+   For **the first time** that **each model** runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
```

-<!-- ### Troubleshooting -->
+### Troubleshooting

-<!-- todo -->
+#### 1. Error loading `intel_extension_for_pytorch`
+
+If you meet an error when importing `intel_extension_for_pytorch`, please ensure that you have completed the following steps:
+
+* Ensure that you have installed Visual Studio with the "Desktop development with C++" workload.
+
+* Make sure that the correct version of oneAPI, specifically 2024.0, is installed.
+
+* Ensure that `libuv` is installed in your conda environment. This can be done during the creation of the environment with the command:
+  ```cmd
+  conda create -n llm python=3.9 libuv
+  ```
+  If you missed `libuv`, you can add it to your existing environment through
+  ```cmd
+  conda install libuv
+  ```
+
+* Make sure you have configured oneAPI environment variables in your command prompt through
+  ```cmd
+  call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+  ```
+  Please note that you need to set these environment variables again whenever you open a new command prompt window.

## Linux
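As a quick sanity check after the troubleshooting steps above, here is a hedged Python sketch to run from the same CMD window where `setvars.bat` was called; the `torch.xpu` calls assume an IPEX XPU build is installed:

```python
# Hedged sketch: verify that IPEX and the XPU device are visible after setup.
import torch
import intel_extension_for_pytorch as ipex  # fails here if the prerequisites above are missing

print("torch:", torch.__version__, "ipex:", ipex.__version__)
print("XPU available:", torch.xpu.is_available())
print("XPU device count:", torch.xpu.device_count())
```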
@@ -131,7 +173,7 @@ BigDL-LLM for GPU supports on Linux has been verified on:
We recommend you use `this offline package <https://registrationcenter-download.intel.com/akdlm/IRC_NAS/20f4e6a1-6b0b-4752-b8c1-e5eacba10e01/l_BaseKit_p_2024.0.0.49564_offline.sh>`_ to install oneAPI.

-   .. tab:: Pytorch 2.0
+   .. tab:: PyTorch 2.0

To enable BigDL-LLM for Intel GPUs with PyTorch 2.0, here are several prerequisite steps for tools installation and environment preparation:
@@ -164,7 +206,7 @@ We recommend using [miniconda](https://docs.conda.io/en/latest/miniconda.html) t
```eval_rst
.. tabs::
-   .. tab:: Pytorch 2.1
+   .. tab:: PyTorch 2.1

      .. code-block:: bash
@@ -182,7 +224,7 @@ We recommend using [miniconda](https://docs.conda.io/en/latest/miniconda.html) t
         pip install --pre --upgrade bigdl-llm[xpu_2.1] -f https://developer.intel.com/ipex-whl-stable-xpu

-   .. tab:: Pytorch 2.0
+   .. tab:: PyTorch 2.0

      .. code-block:: bash
@@ -243,6 +285,12 @@ If you encounter network issues when installing IPEX, you can also install BigDL
```

+```eval_rst
+.. note::
+
+   All the wheel packages mentioned here are for Python 3.9. If you would like to use Python 3.10 or 3.11, you should modify the wheel names for ``torch``, ``torchvision``, and ``intel_extension_for_pytorch`` by replacing ``cp39`` with ``cp310`` or ``cp311``, respectively.
+```
+
### Runtime Configuration

To use GPU acceleration on Linux, several environment variables are required or recommended before running a GPU example.
@@ -255,7 +303,7 @@ To use GPU acceleration on Linux, several environment variables are required or
      .. code-block:: bash

-         # Required step. Configure OneAPI environment variables
+         # Required step. Configure oneAPI environment variables
         source /opt/intel/oneapi/setvars.sh

         # Recommended Environment Variables
@@ -268,7 +316,7 @@ To use GPU acceleration on Linux, several environment variables are required or
      .. code-block:: bash

-         # Required step. Configure OneAPI environment variables
+         # Required step. Configure oneAPI environment variables
         source /opt/intel/oneapi/setvars.sh

         # Recommended Environment Variables
@@ -316,4 +364,4 @@ Error: libmkl_sycl_blas.so.4: cannot open shared object file: No such file or di
The reason for such errors is that oneAPI has not been initialized properly before running BigDL-LLM code or before importing the IPEX package.

* Step 1: Make sure you execute setvars.sh of oneAPI Base Toolkit before running BigDL-LLM code.
-* Step 2: Make sure you install matching versions of BigDL-LLM/pytorch/IPEX and oneAPI Base Toolkit. BigDL-LLM with Pytorch 2.1 should be used with oneAPI Base Toolkit version 2024.0. BigDL-LLM with Pytorch 2.0 should be used with oneAPI Base Toolkit version 2023.2.
+* Step 2: Make sure you install matching versions of BigDL-LLM/pytorch/IPEX and oneAPI Base Toolkit. BigDL-LLM with PyTorch 2.1 should be used with oneAPI Base Toolkit version 2024.0. BigDL-LLM with PyTorch 2.0 should be used with oneAPI Base Toolkit version 2023.2.
@@ -199,15 +199,19 @@ def optimize_model(model, low_bit='sym_int4', optimize_llm=True, modules_to_not_
A method to optimize any pytorch model.

:param model: The original PyTorch model (nn.module)
-:param low_bit: Supported low-bit options are "sym_int4", "asym_int4", "sym_int5",
-                "asym_int5" or "sym_int8".
-:param optimize_llm: Whether to further optimize llm model.
+:param low_bit: str value, options are ``'sym_int4'``, ``'asym_int4'``, ``'sym_int5'``,
+                ``'asym_int5'``, ``'sym_int8'``, ``'nf3'``, ``'nf4'``, ``'fp4'``,
+                ``'fp8'``, ``'fp8_e4m3'``, ``'fp8_e5m2'``, ``'fp16'`` or ``'bf16'``,
+                ``'sym_int4'`` means symmetric int 4, ``'asym_int4'`` means
+                asymmetric int 4, ``'nf4'`` means 4-bit NormalFloat, etc.
+                Relevant low bit optimizations will be applied to the model.
+:param optimize_llm: Whether to further optimize llm model. Default to be ``True``.
:param modules_to_not_convert: list of str value, modules (nn.Module) that are skipped
-                               when conducting model optimizations. Default to be None.
+                               when conducting model optimizations. Default to be ``None``.
:param cpu_embedding: Whether to replace the Embedding layer, may need to set it
-                      to `True` when running BigDL-LLM on GPU on Windows. Default to be `False`.
+                      to ``True`` when running BigDL-LLM on GPU on Windows. Default to be ``False``.
:param lightweight_bmm: Whether to replace the torch.bmm ops, may need to set it
-                        to `True` when running BigDL-LLM on GPU on Windows. Default to be `False`.
+                        to ``True`` when running BigDL-LLM on GPU on Windows. Default to be ``False``.

:return: The optimized model.
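A hedged usage sketch of the documented `optimize_model` signature; the checkpoint id is an assumption:

```python
# Hedged sketch: apply optimize_model with one of the documented low_bit options.
from transformers import LlamaForCausalLM
from bigdl.llm import optimize_model

model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = optimize_model(model,
                       low_bit='sym_int4',   # any option listed above, e.g. 'nf4'
                       optimize_llm=True,
                       cpu_embedding=False)  # set True for GPU on Windows, per the docstring
```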
@@ -101,22 +101,24 @@ class _BaseAutoModelClass:
Three new arguments are added to extend Hugging Face's from_pretrained method as follows:

:param load_in_4bit: boolean value, True means loading linear's weight to symmetric int 4 if
                     the model is a regular fp16/bf16/fp32 model, and to asymmetric int 4
                     if the model is GPTQ model.
-                     Default to be False.
-:param load_in_low_bit: str value, options are sym_int4, asym_int4, sym_int5, asym_int5
-                        , sym_int8, nf3, nf4, fp4, fp8, fp8_e4m3, fp8_e5m2, fp16 or bf16.
-                        sym_int4 means symmetric int 4, asym_int4 means asymmetric int 4,
-                        nf4 means 4-bit NormalFloat, etc. Relevant low bit optimizations
-                        will be applied to the model.
+                     Default to be ``False``.
+:param load_in_low_bit: str value, options are ``'sym_int4'``, ``'asym_int4'``,
+                        ``'sym_int5'``, ``'asym_int5'``, ``'sym_int8'``, ``'nf3'``,
+                        ``'nf4'``, ``'fp4'``, ``'fp8'``, ``'fp8_e4m3'``, ``'fp8_e5m2'``,
+                        ``'fp16'`` or ``'bf16'``, ``'sym_int4'`` means symmetric int 4,
+                        ``'asym_int4'`` means asymmetric int 4, ``'nf4'`` means 4-bit
+                        NormalFloat, etc. Relevant low bit optimizations will be applied
+                        to the model.
:param optimize_model: boolean value, Whether to further optimize the low_bit llm model.
-                       Default to be True.
+                       Default to be ``True``.
:param modules_to_not_convert: list of str value, modules (nn.Module) that are skipped when
-                               conducting model optimizations. Default to be None.
+                               conducting model optimizations. Default to be ``None``.
:param cpu_embedding: Whether to replace the Embedding layer, may need to set it
-                      to `True` when running BigDL-LLM on GPU on Windows. Default to be `False`.
+                      to ``True`` when running BigDL-LLM on GPU on Windows. Default to be ``False``.
:param lightweight_bmm: Whether to replace the torch.bmm ops, may need to set it
-                        to `True` when running BigDL-LLM on GPU on Windows. Default to be `False`.
+                        to ``True`` when running BigDL-LLM on GPU on Windows. Default to be ``False``.
:return: a model instance
"""
pretrained_model_name_or_path = kwargs.get("pretrained_model_name_or_path", None) \
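A hedged usage sketch of the extended `from_pretrained` arguments documented above; the model id is an assumption:

```python
# Hedged sketch: the documented low-bit loading arguments in use.
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    load_in_low_bit="sym_int4",  # or load_in_4bit=True for the symmetric int 4 default
    optimize_model=True,
    cpu_embedding=False,         # set True for GPU on Windows, per the docstring
)
```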