[LLM] Improve LLM doc regarding windows gpu related info (#9880)
* Improve runtime configuration for windows
* Add python 310/311 supports for wheel downloading
* Add troubleshooting for windows gpu
* Remove manually import ipex due to auto importer
* Add info regarding cpu_embedding=True on iGPU
* More info for Windows users
* Small updates to API docs
* Python style fix
* Remove tip for loading from saved optimize_model for now
* Updated based on comments
* Update win info for multi-intel gpus selection
* Small fix
* Small fix
This commit is contained in:

parent 07485eff5a
commit 0aef35a965

6 changed files with 165 additions and 56 deletions
@@ -10,16 +10,17 @@ We also support finetuning LLMs (large language models) using QLoRA with BigDL-L
 To help you better understand the finetuning process, here we use model [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) as an example.
 
-**Make sure you have prepared environment following instructions [here](../install_gpu.html). First of all, you need to import `intel_extension_for_pytorch` to run on Intel GPUs**:
+**Make sure you have prepared the environment following the instructions [here](../install_gpu.html).**
 
-```python
-import intel_extension_for_pytorch as ipex
-```
+```eval_rst
+.. note::
+
+   If you are using an older version of ``bigdl-llm`` (specifically, older than 2.5.0b20240104), you need to manually add ``import intel_extension_for_pytorch as ipex`` at the beginning of your code.
+```
 
 First, load the model using the `transformers`-style API and **move it to GPU with `to('xpu')`**. We specify `load_in_low_bit="nf4"` here to apply 4-bit NormalFloat optimization. According to the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf), using `"nf4"` could yield better model quality than `"int4"`.
 
 ```python
-import intel_extension_for_pytorch as ipex
 from bigdl.llm.transformers import AutoModelForCausalLM
 
 model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
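The hunk is cut off mid-call by the diff viewport; for reference, a minimal sketch of how such an `nf4` load typically continues (the completed call and the `to('xpu')` move are illustrative assumptions, not part of this diff):

```python
from bigdl.llm.transformers import AutoModelForCausalLM

# Load Llama 2 with 4-bit NormalFloat optimization applied to the relevant layers
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                             load_in_low_bit="nf4")
model = model.to('xpu')  # move the optimized model to the Intel GPU
```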
@@ -4,10 +4,12 @@ Apart from the significant acceleration capabilities on Intel CPUs, BigDL-LLM als
 Compared with running on Intel CPUs, some additional operations are required on Intel GPUs. To help you better understand the process, here we use a popular model [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) as an example.
 
-**Make sure you have prepared environment following instructions [here](../install_gpu.html). First of all, you need to import `intel_extension_for_pytorch` to run on Intel GPUs**:
+**Make sure you have prepared the environment following the instructions [here](../install_gpu.html).**
 
-```python
-import intel_extension_for_pytorch as ipex
-```
+```eval_rst
+.. note::
+
+   If you are using an older version of ``bigdl-llm`` (specifically, older than 2.5.0b20240104), you need to manually add ``import intel_extension_for_pytorch as ipex`` at the beginning of your code.
+```
 
 ## Load and Optimize Model
@@ -26,7 +28,6 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-
       .. code-block:: python
 
          # Take Llama-2-7b-chat-hf as an example
-         import intel_extension_for_pytorch as ipex
          from transformers import LlamaForCausalLM
          from bigdl.llm import optimize_model
 
@@ -35,11 +36,16 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-
 
          model = model.to('xpu') # Important after obtaining the optimized model
 
+      .. tip::
+
+         For Windows users running LLMs on Intel iGPUs, we recommend setting ``cpu_embedding=True`` in the ``optimize_model`` function. This allows the memory-intensive embedding layer to use the CPU instead of the iGPU.
+
+         See the `API doc <../../../PythonAPI/LLM/optimize.html#bigdl.llm.optimize_model>`_ for ``optimize_model`` to find more information.
+
       Especially, if you have saved the optimized model following the steps `here <./optimize_model.html#save>`_, the loading process on Intel GPUs may be as follows:
 
       .. code-block:: python
 
-         import intel_extension_for_pytorch as ipex
          from transformers import LlamaForCausalLM
          from bigdl.llm.optimize import low_memory_init, load_low_bit
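The hunk stops at the imports; a minimal sketch of how loading a saved optimized model typically proceeds from there (`saved_dir` and the exact call sequence are assumptions based on the save/load steps referenced above):

```python
from transformers import LlamaForCausalLM
from bigdl.llm.optimize import low_memory_init, load_low_bit

saved_dir = './llama-2-bigdl-llm-4-bit'  # hypothetical path to a previously saved model

with low_memory_init():  # instantiate the model skeleton without allocating full weights
    model = LlamaForCausalLM.from_pretrained(saved_dir, torch_dtype="auto")
model = load_low_bit(model, saved_dir)  # fill in the saved low-bit weights
model = model.to('xpu')  # important after obtaining the optimized model
```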
@@ -59,7 +65,6 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-
       .. code-block:: python
 
          # Take Llama-2-7b-chat-hf as an example
-         import intel_extension_for_pytorch as ipex
          from bigdl.llm.transformers import AutoModelForCausalLM
 
          # Load model in 4 bit, which converts the relevant layers in the model into INT4 format
@@ -67,17 +72,26 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-
 
          model = model.to('xpu') # Important after obtaining the optimized model
 
+      .. tip::
+
+         For Windows users running LLMs on Intel iGPUs, we recommend setting ``cpu_embedding=True`` in the ``from_pretrained`` function. This allows the memory-intensive embedding layer to use the CPU instead of the iGPU.
+
+         See the `API doc <../../../PythonAPI/LLM/transformers.html#hugging-face-transformers-automodel>`_ to find more information.
+
       Especially, if you have saved the optimized model following the steps `here <./hugging_face_format.html#save-load>`_, the loading process on Intel GPUs may be as follows:
 
       .. code-block:: python
 
-         import intel_extension_for_pytorch as ipex
          from bigdl.llm.transformers import AutoModelForCausalLM
 
          saved_dir='./llama-2-bigdl-llm-4-bit'
          model = AutoModelForCausalLM.load_low_bit(saved_dir) # Load the optimized model
 
          model = model.to('xpu') # Important after obtaining the optimized model
 
+      .. tip::
+
+         When running saved optimized models on Intel iGPUs, we also recommend that Windows users set ``cpu_embedding=True`` in the ``load_low_bit`` function.
 ```
 
 ## Run Optimized Model
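Putting the tips above together, a minimal sketch of a `transformers`-style load for Windows users on an Intel iGPU (the combination of arguments is illustrative):

```python
from bigdl.llm.transformers import AutoModelForCausalLM

# cpu_embedding=True keeps the memory-intensive embedding layer on the CPU (iGPU tip above)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                             load_in_4bit=True,
                                             cpu_embedding=True)
model = model.to('xpu')  # important after obtaining the optimized model
```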
@@ -101,6 +115,11 @@ with torch.inference_mode():
 
    The initial generation of optimized LLMs on Intel GPUs could be slow. Therefore, it's recommended to perform a **warm-up** run before the actual generation.
 ```
 
+```eval_rst
+.. note::
+
+   If you are a Windows user, please also note that for **the first time** that **each model** runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
+```
 
 ```eval_rst
 .. seealso::
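A minimal warm-up sketch for the notes above (assumes `model`, `tokenizer`, and `prompt` are already prepared as in the surrounding docs):

```python
import torch

with torch.inference_mode():
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
    # warm-up run: absorbs the one-time compilation cost mentioned above
    model.generate(input_ids, max_new_tokens=32)
    # actual generation, now at steady-state speed
    output = model.generate(input_ids, max_new_tokens=32)
```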
@@ -5,10 +5,26 @@ In [Inference on GPU](inference_on_gpu.md) and [Finetune (QLoRA)](finetune.md),
 ## List devices
 
 The `sycl-ls` tool enumerates a list of devices available in the system. You can use it after you set up the oneAPI environment:
-```bash
-source /opt/intel/oneapi/setvars.sh
-sycl-ls
-```
+```eval_rst
+.. tabs::
+   .. tab:: Windows
+
+      Please make sure you are using CMD (Anaconda Prompt if using conda):
+
+      .. code-block:: cmd
+
+         call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+         sycl-ls
+
+   .. tab:: Linux
+
+      .. code-block:: bash
+
+         source /opt/intel/oneapi/setvars.sh
+         sycl-ls
+```
 
 If you have two Arc A770 GPUs, you will see output like the following:
 ```
 [opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
@@ -20,7 +36,7 @@ If you have two Arc A770 GPUs, you will see output like the following:
 [ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
 [ext_oneapi_level_zero:gpu:2] Intel(R) Level-Zero, Intel(R) UHD Graphics 770 1.3 [1.3.26241]
 ```
-This output shows there are two Arc A770 GPUs on this machine.
+This output shows there are two Arc A770 GPUs as well as an Intel iGPU on this machine.
 
 ## Device selection
 To run on an XPU device, move your model and input to the XPU with code like the following:
@@ -32,6 +48,7 @@ To select the desired devices, there are two ways: one is changing the code, ano
 
 ### 1. Select device in Python
 To specify an XPU device, change `to('xpu')` to `to('xpu:[device_id]')`; the `device_id` is counted from zero.
+
 If you want to use the second device, you can change the code like this:
 ```
 model = model.to('xpu:1')
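Before picking a `device_id`, you can check how many XPU devices are visible; a small sketch (assumes `intel_extension_for_pytorch` is installed, which exposes the `torch.xpu` namespace):

```python
import torch
import intel_extension_for_pytorch as ipex  # enables the torch.xpu namespace

# enumerate the visible XPU devices and their names
for i in range(torch.xpu.device_count()):
    print(i, torch.xpu.get_device_name(i))
```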
@@ -41,11 +58,29 @@ input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu:1')
 ### 2. OneAPI device selector
 The device selection environment variable `ONEAPI_DEVICE_SELECTOR` can be used to limit the choice of Intel GPU devices. As the `sycl-ls` output above shows, the last three lines are three Level Zero GPU devices, so we can use `ONEAPI_DEVICE_SELECTOR=level_zero:[gpu_id]` to select devices.
 For example, if you want to use the second A770 GPU, you can select the device like this:
-```
-ONEAPI_DEVICE_SELECTOR=level_zero:1 python generate.py
-```
-`ONEAPI_DEVICE_SELECTOR=level_zero:1` in upon command only affect in current python program. Also, you can export the environment, then run your python:
-```
-export ONEAPI_DEVICE_SELECTOR=level_zero:1
-python generate.py
-```
+
+```eval_rst
+.. tabs::
+   .. tab:: Windows
+
+      .. code-block:: cmd
+
+         set ONEAPI_DEVICE_SELECTOR=level_zero:1
+         python generate.py
+
+      Through ``set ONEAPI_DEVICE_SELECTOR=level_zero:1``, only the second A770 GPU will be available to the current environment.
+
+   .. tab:: Linux
+
+      .. code-block:: bash
+
+         ONEAPI_DEVICE_SELECTOR=level_zero:1 python generate.py
+
+      ``ONEAPI_DEVICE_SELECTOR=level_zero:1`` in the above command only affects the current Python program. Alternatively, you can export the environment variable and then run your Python script:
+
+      .. code-block:: bash
+
+         export ONEAPI_DEVICE_SELECTOR=level_zero:1
+         python generate.py
+
+```
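If you prefer to set the selector from inside Python, it must happen before `torch`/IPEX are imported; an illustrative sketch (this mirrors the shell commands above; setting it from Python is an assumption, not covered by the diff):

```python
import os

# must be set before importing torch / intel_extension_for_pytorch,
# otherwise the runtime has already enumerated the devices
os.environ["ONEAPI_DEVICE_SELECTOR"] = "level_zero:1"

import torch
import intel_extension_for_pytorch as ipex
```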
@@ -61,32 +61,74 @@ pip install intel_extension_for_pytorch-2.1.10+xpu-cp39-cp39-win_amd64.whl
 pip install --pre --upgrade bigdl-llm[xpu]
 ```
 
+```eval_rst
+.. note::
+
+   All the wheel packages mentioned here are for Python 3.9. If you would like to use Python 3.10 or 3.11, you should modify the wheel names for ``torch``, ``torchvision``, and ``intel_extension_for_pytorch`` by replacing ``cp39`` with ``cp310`` or ``cp311``, respectively.
+```
 
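For instance, following the note, the IPEX wheel name from the snippet above would change as follows under Python 3.10 (a direct substitution per the note; verify the exact file name against the wheel you downloaded):

```cmd
pip install intel_extension_for_pytorch-2.1.10+xpu-cp310-cp310-win_amd64.whl
```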
 ### Runtime Configuration
 
 To use GPU acceleration on Windows, several environment variables are required before running a GPU example.
 
-Make sure you are using CMD as PowerShell is not supported:
+Make sure you are using CMD (Anaconda Prompt if using conda), as PowerShell is not supported:
 
-```
+```cmd
 call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
 ```
 
-Please also set the following environment variable for iGPU:
+Please also set the following environment variable according to the device you would like to run LLMs on:
 
-```
-set SYCL_CACHE_PERSISTENT=1
-set BIGDL_LLM_XMX_DISABLED=1
-```
+```eval_rst
+.. tabs::
+   .. tab:: Intel iGPU
+
+      .. code-block:: cmd
+
+         set SYCL_CACHE_PERSISTENT=1
+         set BIGDL_LLM_XMX_DISABLED=1
+
+   .. tab:: Intel Arc™ A300-Series or Pro A60
+
+      .. code-block:: cmd
+
+         set SYCL_CACHE_PERSISTENT=1
+
+   .. tab:: Other Intel dGPU Series
+
+      There is no need to set further environment variables.
+```
 
 ```eval_rst
 .. note::
 
-   For the first time that **each model** runs on **iGPU**, it may take around several minutes to compile.
+   For **the first time** that **each model** runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
 ```
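Putting the Windows runtime configuration together, a typical iGPU session might look like the following (`generate.py` stands in for any GPU example script; the sequence combines the commands shown above):

```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
python generate.py
```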
 
-<!-- ### Troubleshooting -->
+### Troubleshooting
 
-<!-- todo -->
+#### 1. Error loading `intel_extension_for_pytorch`
+
+If you encounter an error when importing `intel_extension_for_pytorch`, please ensure that you have completed the following steps:
+
+* Ensure that you have installed Visual Studio with the "Desktop development with C++" workload.
+
+* Make sure that the correct version of oneAPI, specifically 2024.0, is installed.
+
+* Ensure that `libuv` is installed in your conda environment. This can be done during the creation of the environment with the command:
+  ```cmd
+  conda create -n llm python=3.9 libuv
+  ```
+  If you missed `libuv`, you can add it to your existing environment through
+  ```cmd
+  conda install libuv
+  ```
+
+* Make sure you have configured oneAPI environment variables in your command prompt through
+  ```cmd
+  call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+  ```
+  Please note that you need to set these environment variables again whenever you open a new command prompt window.
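Once the steps above are done, a quick sanity check that the import now works (run in the same CMD window where `setvars.bat` was called):

```cmd
python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__, ipex.__version__)"
```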
 
 ## Linux
 
@@ -131,7 +173,7 @@ BigDL-LLM GPU support on Linux has been verified on:
 
            We recommend using `this offline package <https://registrationcenter-download.intel.com/akdlm/IRC_NAS/20f4e6a1-6b0b-4752-b8c1-e5eacba10e01/l_BaseKit_p_2024.0.0.49564_offline.sh>`_ to install oneAPI.
 
-   .. tab:: Pytorch 2.0
+   .. tab:: PyTorch 2.0
 
      To enable BigDL-LLM for Intel GPUs with PyTorch 2.0, here are several prerequisite steps for tools installation and environment preparation:
@@ -164,7 +206,7 @@ We recommend using [miniconda](https://docs.conda.io/en/latest/miniconda.html) t
 
 ```eval_rst
 .. tabs::
-   .. tab:: Pytorch 2.1
+   .. tab:: PyTorch 2.1
 
      .. code-block:: bash
 
@@ -182,7 +224,7 @@ We recommend using [miniconda](https://docs.conda.io/en/latest/miniconda.html) t
            pip install --pre --upgrade bigdl-llm[xpu_2.1] -f https://developer.intel.com/ipex-whl-stable-xpu
 
-   .. tab:: Pytorch 2.0
+   .. tab:: PyTorch 2.0
 
      .. code-block:: bash
@@ -243,6 +285,12 @@ If you encounter network issues when installing IPEX, you can also install BigDL
 
 ```
 
+```eval_rst
+.. note::
+
+   All the wheel packages mentioned here are for Python 3.9. If you would like to use Python 3.10 or 3.11, you should modify the wheel names for ``torch``, ``torchvision``, and ``intel_extension_for_pytorch`` by replacing ``cp39`` with ``cp310`` or ``cp311``, respectively.
+```
+
 ### Runtime Configuration
 
 To use GPU acceleration on Linux, several environment variables are required or recommended before running a GPU example.
@@ -255,7 +303,7 @@ To use GPU acceleration on Linux, several environment variables are required or
 
      .. code-block:: bash
 
-         # Required step. Configure OneAPI environment variables
+         # Required step. Configure oneAPI environment variables
          source /opt/intel/oneapi/setvars.sh
 
          # Recommended Environment Variables
 
@@ -268,7 +316,7 @@ To use GPU acceleration on Linux, several environment variables are required or
 
      .. code-block:: bash
 
-         # Required step. Configure OneAPI environment variables
+         # Required step. Configure oneAPI environment variables
          source /opt/intel/oneapi/setvars.sh
 
         # Recommended Environment Variables
@@ -316,4 +364,4 @@ Error: libmkl_sycl_blas.so.4: cannot open shared object file: No such file or di
 The reason for such errors is that oneAPI has not been initialized properly before running BigDL-LLM code or before importing the IPEX package.
 
 * Step 1: Make sure you execute setvars.sh of oneAPI Base Toolkit before running BigDL-LLM code.
-* Step 2: Make sure you install matching versions of BigDL-LLM/pytorch/IPEX and oneAPI Base Toolkit. BigDL-LLM with Pytorch 2.1 should be used with oneAPI Base Toolkit version 2024.0. BigDL-LLM with Pytorch 2.0 should be used with oneAPI Base Toolkit version 2023.2.
+* Step 2: Make sure you install matching versions of BigDL-LLM/PyTorch/IPEX and oneAPI Base Toolkit. BigDL-LLM with PyTorch 2.1 should be used with oneAPI Base Toolkit version 2024.0. BigDL-LLM with PyTorch 2.0 should be used with oneAPI Base Toolkit version 2023.2.
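A quick way to check Step 1 and Step 2 together (a sanity-check sketch, not part of the diff):

```bash
# Step 1: initialize oneAPI before anything else
source /opt/intel/oneapi/setvars.sh
# Step 2: confirm the installed versions pair correctly with your toolkit
pip list | grep -iE "^(torch|intel-extension-for-pytorch|bigdl-llm) "
```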
@@ -199,15 +199,19 @@ def optimize_model(model, low_bit='sym_int4', optimize_llm=True, modules_to_not_
     A method to optimize any pytorch model.
 
     :param model: The original PyTorch model (nn.module)
-    :param low_bit: Supported low-bit options are "sym_int4", "asym_int4", "sym_int5",
-        "asym_int5" or "sym_int8".
-    :param optimize_llm: Whether to further optimize llm model.
+    :param low_bit: str value, options are ``'sym_int4'``, ``'asym_int4'``, ``'sym_int5'``,
+                    ``'asym_int5'``, ``'sym_int8'``, ``'nf3'``, ``'nf4'``, ``'fp4'``,
+                    ``'fp8'``, ``'fp8_e4m3'``, ``'fp8_e5m2'``, ``'fp16'`` or ``'bf16'``.
+                    ``'sym_int4'`` means symmetric int 4, ``'asym_int4'`` means
+                    asymmetric int 4, ``'nf4'`` means 4-bit NormalFloat, etc.
+                    Relevant low-bit optimizations will be applied to the model.
+    :param optimize_llm: Whether to further optimize the LLM model. Default to be ``True``.
     :param modules_to_not_convert: list of str value, modules (nn.Module) that are skipped
-        when conducting model optimizations. Default to be None.
+        when conducting model optimizations. Default to be ``None``.
     :param cpu_embedding: Whether to replace the Embedding layer, may need to set it
-        to `True` when running BigDL-LLM on GPU on Windows. Default to be `False`.
+        to ``True`` when running BigDL-LLM on GPU on Windows. Default to be ``False``.
     :param lightweight_bmm: Whether to replace the torch.bmm ops, may need to set it
-        to `True` when running BigDL-LLM on GPU on Windows. Default to be `False`.
+        to ``True`` when running BigDL-LLM on GPU on Windows. Default to be ``False``.
 
     :return: The optimized model.
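A short usage sketch of the documented signature (the toy module and argument choices are illustrative; per the docstring, any PyTorch model can be passed):

```python
import torch
from bigdl.llm import optimize_model

# a placeholder model for illustration; in practice this would be an LLM
model = torch.nn.Sequential(torch.nn.Embedding(1000, 64), torch.nn.Linear(64, 1000))

# apply symmetric int4 optimization; cpu_embedding=True follows the Windows-GPU tip
model = optimize_model(model, low_bit='sym_int4', cpu_embedding=True)
```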
@@ -101,22 +101,24 @@ class _BaseAutoModelClass:
         Three new arguments are added to extend Hugging Face's from_pretrained method as follows:
 
         :param load_in_4bit: boolean value, True means loading linear's weight to symmetric int 4 if
-                                the model is a regular fp16/bf16/fp32 model, and to asymmetric int 4
-                                if the model is GPTQ model.
-                             Default to be False.
-        :param load_in_low_bit: str value, options are sym_int4, asym_int4, sym_int5, asym_int5
-                                , sym_int8, nf3, nf4, fp4, fp8, fp8_e4m3, fp8_e5m2, fp16 or bf16.
-                                sym_int4 means symmetric int 4, asym_int4 means asymmetric int 4,
-                                nf4 means 4-bit NormalFloat, etc. Relevant low bit optimizations
-                                will be applied to the model.
+                             the model is a regular fp16/bf16/fp32 model, and to asymmetric int 4
+                             if the model is a GPTQ model.
+                             Default to be ``False``.
+        :param load_in_low_bit: str value, options are ``'sym_int4'``, ``'asym_int4'``,
+                                ``'sym_int5'``, ``'asym_int5'``, ``'sym_int8'``, ``'nf3'``,
+                                ``'nf4'``, ``'fp4'``, ``'fp8'``, ``'fp8_e4m3'``, ``'fp8_e5m2'``,
+                                ``'fp16'`` or ``'bf16'``. ``'sym_int4'`` means symmetric int 4,
+                                ``'asym_int4'`` means asymmetric int 4, ``'nf4'`` means 4-bit
+                                NormalFloat, etc. Relevant low-bit optimizations will be applied
+                                to the model.
         :param optimize_model: boolean value, whether to further optimize the low_bit llm model.
-                               Default to be True.
+                               Default to be ``True``.
         :param modules_to_not_convert: list of str value, modules (nn.Module) that are skipped when
-                                       conducting model optimizations. Default to be None.
+                                       conducting model optimizations. Default to be ``None``.
         :param cpu_embedding: Whether to replace the Embedding layer, may need to set it
-            to `True` when running BigDL-LLM on GPU on Windows. Default to be `False`.
+            to ``True`` when running BigDL-LLM on GPU on Windows. Default to be ``False``.
         :param lightweight_bmm: Whether to replace the torch.bmm ops, may need to set it
-            to `True` when running BigDL-LLM on GPU on Windows. Default to be `False`.
+            to ``True`` when running BigDL-LLM on GPU on Windows. Default to be ``False``.
         :return: a model instance
         """
         pretrained_model_name_or_path = kwargs.get("pretrained_model_name_or_path", None) \
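A usage sketch of the extended `from_pretrained` arguments (the checkpoint and option choices are illustrative):

```python
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",  # illustrative checkpoint
    load_in_low_bit="sym_int4",       # one of the documented low-bit options
    optimize_model=True,              # further LLM-specific optimizations (default)
    cpu_embedding=True,               # Windows-GPU tip from the docs above
)
```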