diff --git a/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/finetune.md b/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/finetune.md
index 69c926eb..70157907 100644
--- a/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/finetune.md
+++ b/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/finetune.md
@@ -10,16 +10,17 @@ We also support finetuning LLMs (large language models) using QLoRA with BigDL-L
 
 To help you better understand the finetuning process, here we use model [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) as an example.
 
-**Make sure you have prepared environment following instructions [here](../install_gpu.html). First of all, you need to import `intel_extension_for_pytorch` to run on Intel GPUs**:
+**Make sure you have prepared the environment following the instructions [here](../install_gpu.html).**
 
-```python
-import intel_extension_for_pytorch as ipex
+```eval_rst
+.. note::
+
+   If you are using an older version of ``bigdl-llm`` (specifically, older than 2.5.0b20240104), you need to manually add ``import intel_extension_for_pytorch as ipex`` at the beginning of your code.
 ```
 
 First, load model using `transformers`-style API and **set it to `to('xpu')`**. We specify `load_in_low_bit="nf4"` here to apply 4-bit NormalFloat optimization. According to the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf), using `"nf4"` could yield better model quality than `"int4"`.
 
 ```python
-import intel_extension_for_pytorch as ipex
 from bigdl.llm.transformers import AutoModelForCausalLM
 
 model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
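For orientation, a minimal sketch of the loading step this hunk documents is shown below. It only relies on the calls visible above (`AutoModelForCausalLM.from_pretrained` with `load_in_low_bit="nf4"` and the move to `'xpu'`); the tokenizer line is an illustrative assumption rather than part of the patch.

```python
# Sketch of the QLoRA loading step described in finetune.md (illustrative, not part of the patch).
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import LlamaTokenizer  # tokenizer choice is an assumption

# Load Llama-2-7b with 4-bit NormalFloat ("nf4") weights, as recommended by the QLoRA paper
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                             load_in_low_bit="nf4")
model = model.to('xpu')  # move the low-bit model to the Intel GPU

tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
```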
diff --git a/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/inference_on_gpu.md b/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/inference_on_gpu.md
index 8ca89d8c..966b6a2e 100644
--- a/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/inference_on_gpu.md
+++ b/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/inference_on_gpu.md
@@ -4,10 +4,12 @@ Apart from the significant acceleration capabilites on Intel CPUs, BigDL-LLM als
 
 Compared with running on Intel CPUs, some additional operations are required on Intel GPUs. To help you better understand the process, here we use a popular model [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) as an example.
 
-**Make sure you have prepared environment following instructions [here](../install_gpu.html). First of all, you need to import `intel_extension_for_pytorch` to run on Intel GPUs**:
+**Make sure you have prepared the environment following the instructions [here](../install_gpu.html).**
 
-```python
-import intel_extension_for_pytorch as ipex
+```eval_rst
+.. note::
+
+   If you are using an older version of ``bigdl-llm`` (specifically, older than 2.5.0b20240104), you need to manually add ``import intel_extension_for_pytorch as ipex`` at the beginning of your code.
 ```
 
 ## Load and Optimize Model
@@ -26,7 +28,6 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-
       .. code-block:: python
 
          # Take Llama-2-7b-chat-hf as an example
-         import intel_extension_for_pytorch as ipex
         from transformers import LlamaForCausalLM
         from bigdl.llm import optimize_model
 
@@ -35,11 +36,16 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-
 
         model = model.to('xpu') # Important after obtaining the optimized model
 
+      .. tip::
+
+         For Windows users running LLMs on Intel iGPUs, we recommend setting ``cpu_embedding=True`` in the ``optimize_model`` function. This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
+
+         See the `API doc <../../../PythonAPI/LLM/optimize.html#bigdl.llm.optimize_model>`_ for ``optimize_model`` to find more information.
+
       Especially, if you have saved the optimized model following setps `here <./optimize_model.html#save>`_, the loading process on Intel GPUs maybe as follows:
 
       .. code-block:: python
 
-         import intel_extension_for_pytorch as ipex
         from transformers import LlamaForCausalLM
         from bigdl.llm.optimize import low_memory_init, load_low_bit
 
@@ -59,7 +65,6 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-
       .. code-block:: python
 
         # Take Llama-2-7b-chat-hf as an example
-        import intel_extension_for_pytorch as ipex
        from bigdl.llm.transformers import AutoModelForCausalLM
 
        # Load model in 4 bit, which convert the relevant layers in the model into INT4 format
@@ -67,17 +72,26 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-
       model = model.to('xpu') # Important after obtaining the optimized model
 
+      .. tip::
+
+         For Windows users running LLMs on Intel iGPUs, we recommend setting ``cpu_embedding=True`` in the ``from_pretrained`` function. This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
+
+         See the `API doc <../../../PythonAPI/LLM/transformers.html#hugging-face-transformers-automodel>`_ to find more information.
+
       Especially, if you have saved the optimized model following setps `here <./hugging_face_format.html#save-load>`_, the loading process on Intel GPUs maybe as follows:
 
       .. code-block:: python
 
-         import intel_extension_for_pytorch as ipex
         from bigdl.llm.transformers import AutoModelForCausalLM
 
        saved_dir='./llama-2-bigdl-llm-4-bit'
        model = AutoModelForCausalLM.load_low_bit(saved_dir) # Load the optimized model
        model = model.to('xpu') # Important after obtaining the optimized model
+
+      .. tip::
+
+         For Windows users running saved optimized models on Intel iGPUs, we also recommend setting ``cpu_embedding=True`` in the ``load_low_bit`` function.
 ```
 
 ## Run Optimized Model
@@ -101,6 +115,11 @@ with torch.inference_mode():
 
 The initial generation of optimized LLMs on Intel GPUs could be slow. Therefore, it's recommended to perform a **warm-up** run before the actual generation.
 ```
 
+```eval_rst
+.. note::
+
+   If you are a Windows user, please also note that for **the first time** that **each model** runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
+```
 
 ```eval_rst
 .. seealso::
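Putting the pieces of this file together, a hedged end-to-end sketch of the load / warm-up / generate flow is shown below; the prompt text and generation arguments are illustrative assumptions rather than part of the patch.

```python
# Illustrative GPU inference flow based on inference_on_gpu.md (not part of the patch).
import torch
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import LlamaTokenizer  # tokenizer choice is an assumption

# Load in 4 bit; cpu_embedding=True is the recommendation for Windows iGPU users
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                             load_in_4bit=True,
                                             cpu_embedding=True)
model = model.to('xpu')  # important after obtaining the optimized model
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

input_ids = tokenizer.encode("What is AI?", return_tensors="pt").to('xpu')
with torch.inference_mode():
    model.generate(input_ids, max_new_tokens=32)           # warm-up run: the first generation can be slow
    output = model.generate(input_ids, max_new_tokens=32)  # actual generation
print(tokenizer.decode(output[0], skip_special_tokens=True))
```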
diff --git a/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/multi_gpus_selection.md b/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/multi_gpus_selection.md
index 3eb48769..88aee516 100644
--- a/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/multi_gpus_selection.md
+++ b/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/multi_gpus_selection.md
@@ -5,10 +5,26 @@ In [Inference on GPU](inference_on_gpu.md) and [Finetune (QLoRA)](finetune.md),
 
 ## List devices
 The `sycl-ls` tool enumerates a list of devices available in the system. You can use it after you setup oneapi environment:
-```bash
-source /opt/intel/oneapi/setvars.sh
-sycl-ls
+
+```eval_rst
+.. tabs::
+   .. tab:: Windows
+
+      Please make sure you are using CMD (Anaconda Prompt if using conda):
+
+      .. code-block:: cmd
+
+         call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+         sycl-ls
+
+   .. tab:: Linux
+
+      .. code-block:: bash
+
+         source /opt/intel/oneapi/setvars.sh
+         sycl-ls
 ```
+
 If you have two Arc770 GPUs, you can get something like below:
 ```
 [opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
@@ -20,7 +36,7 @@ If you have two Arc770 GPUs, you can get something like below:
 [ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
 [ext_oneapi_level_zero:gpu:2] Intel(R) Level-Zero, Intel(R) UHD Graphics 770 1.3 [1.3.26241]
 ```
-This output shows there are two Arc A770 GPUs on this machine.
+This output shows there are two Arc A770 GPUs as well as an Intel iGPU on this machine.
 
 ## Devices selection
 To enable xpu, you should convert your model and input to xpu by below code:
 ```
 model = model.to('xpu')
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
 ```
 To select the desired devices, there are two ways: one is changing the code, another is adding an environment variable. See:
 ### 1. Select device in python
-To specify a xpu, you can change the `to('xpu')` to `to('xpu:[device_id]')`, this device_id is counted from zero.
+To specify an xpu, you can change the `to('xpu')` to `to('xpu:[device_id]')`; the device_id is counted from zero.
+
 If you you want to use the second device, you can change the code like this:
 ```
 model = model.to('xpu:1')
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu:1')
 ```
 
 ### 2. OneAPI device selector
 Device selection environment variable, `ONEAPI_DEVICE_SELECTOR`, can be used to limit the choice of Intel GPU devices. As upon `sycl-ls` shows, the last three lines are three Level Zero GPU devices. So we can use `ONEAPI_DEVICE_SELECTOR=level_zero:[gpu_id]` to select devices. For example, you want to use the second A770 GPU, you can run the python like this:
-```
-ONEAPI_DEVICE_SELECTOR=level_zero:1 python generate.py
-```
-`ONEAPI_DEVICE_SELECTOR=level_zero:1` in upon command only affect in current python program. Also, you can export the environment, then run your python:
-```
-export ONEAPI_DEVICE_SELECTOR=level_zero:1
-python generate.py
+
+```eval_rst
+.. tabs::
+   .. tab:: Windows
+
+      .. code-block:: cmd
+
+         set ONEAPI_DEVICE_SELECTOR=level_zero:1
+         python generate.py
+
+      Through ``set ONEAPI_DEVICE_SELECTOR=level_zero:1``, only the second A770 GPU will be available to the current environment.
+
+   .. tab:: Linux
+
+      .. code-block:: bash
+
+         ONEAPI_DEVICE_SELECTOR=level_zero:1 python generate.py
+
+      ``ONEAPI_DEVICE_SELECTOR=level_zero:1`` in the above command only affects the current Python program. Alternatively, you can export the environment variable and then run your Python script:
+
+      .. code-block:: bash
+
+         export ONEAPI_DEVICE_SELECTOR=level_zero:1
+         python generate.py
+
 ```
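As a quick sanity check before pinning a device, you can also list what PyTorch itself sees. This is a sketch that assumes the `torch.xpu` device API exposed by Intel Extension for PyTorch; it is not part of the patch.

```python
# List XPU devices visible to PyTorch (assumes an IPEX XPU build).
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device backend

if torch.xpu.is_available():
    for device_id in range(torch.xpu.device_count()):
        print(f"xpu:{device_id} -> {torch.xpu.get_device_name(device_id)}")
else:
    print("No XPU device detected; check the oneAPI environment setup described above.")
```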
diff --git a/docs/readthedocs/source/doc/LLM/Overview/install_gpu.md b/docs/readthedocs/source/doc/LLM/Overview/install_gpu.md
index 1febe099..47b7a187 100644
--- a/docs/readthedocs/source/doc/LLM/Overview/install_gpu.md
+++ b/docs/readthedocs/source/doc/LLM/Overview/install_gpu.md
@@ -61,32 +61,74 @@ pip install intel_extension_for_pytorch-2.1.10+xpu-cp39-cp39-win_amd64.whl
 pip install --pre --upgrade bigdl-llm[xpu]
 ```
 
+```eval_rst
+.. note::
+
+   All the wheel packages mentioned here are for Python 3.9. If you would like to use Python 3.10 or 3.11, you should modify the wheel names for ``torch``, ``torchvision``, and ``intel_extension_for_pytorch`` by replacing ``cp39`` with ``cp310`` or ``cp311``, respectively.
+```
+
 ### Runtime Configuration
 
 To use GPU acceleration on Windows, several environment variables are required before running a GPU example.
 
-Make sure you are using CMD as PowerShell is not supported:
-```
+Make sure you are using CMD (Anaconda Prompt if using conda), as PowerShell is not supported:
+
+```cmd
 call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
 ```
 
-Please also set the following environment variable for iGPU:
+Please also set the following environment variables according to the device you would like to run LLMs on:
 
-```
-set SYCL_CACHE_PERSISTENT=1
-set BIGDL_LLM_XMX_DISABLED=1
+```eval_rst
+.. tabs::
+   .. tab:: Intel iGPU
+
+      .. code-block:: cmd
+
+         set SYCL_CACHE_PERSISTENT=1
+         set BIGDL_LLM_XMX_DISABLED=1
+
+   .. tab:: Intel Arc™ A300-Series or Pro A60
+
+      .. code-block:: cmd
+
+         set SYCL_CACHE_PERSISTENT=1
+
+   .. tab:: Other Intel dGPU Series
+
+      There is no need to set further environment variables.
 ```
 
 ```eval_rst
 .. note::
 
-   For the first time that **each model** runs on **iGPU**, it may take around several minutes to compile.
+   For **the first time** that **each model** runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
 ```
 
-
-
+### Troubleshooting
+
+#### 1. Error loading `intel_extension_for_pytorch`
+
+If you meet an error when importing `intel_extension_for_pytorch`, please ensure that you have completed the following steps:
+
+* Ensure that you have installed Visual Studio with "Desktop development with C++" workload.
+
+* Make sure that the correct version of oneAPI, specifically 2024.0, is installed.
+
+* Ensure that `libuv` is installed in your conda environment. This can be done during the creation of the environment with the command:
+  ```cmd
+  conda create -n llm python=3.9 libuv
+  ```
+  If you missed `libuv`, you can add it to your existing environment with:
+  ```cmd
+  conda install libuv
+  ```
+
+* Make sure you have configured oneAPI environment variables in your command prompt by running:
+  ```cmd
+  call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+  ```
+  Please note that you need to set these environment variables again in each new command prompt window.
 
 ## Linux
 
@@ -131,7 +173,7 @@ BigDL-LLM for GPU supports on Linux has been verified on:
 
       We recommend you to use `this offline package `_ to install oneapi.
 
-   .. tab:: Pytorch 2.0
+   .. tab:: PyTorch 2.0
 
       To enable BigDL-LLM for Intel GPUs with PyTorch 2.0, here're several prerequisite steps for tools installation and environment preparation:
 
@@ -164,7 +206,7 @@ We recommend using [miniconda](https://docs.conda.io/en/latest/miniconda.html) t
 
 ```eval_rst
 .. tabs::
-   .. tab:: Pytorch 2.1
+   .. tab:: PyTorch 2.1
 
       .. code-block:: bash
 
@@ -182,7 +224,7 @@ We recommend using [miniconda](https://docs.conda.io/en/latest/miniconda.html) t
 
          pip install --pre --upgrade bigdl-llm[xpu_2.1] -f https://developer.intel.com/ipex-whl-stable-xpu
 
-   .. tab:: Pytorch 2.0
+   .. tab:: PyTorch 2.0
 
       .. code-block:: bash
 
@@ -243,6 +285,12 @@ If you encounter network issues when installing IPEX, you can also install BigDL
 
 ```
 
+```eval_rst
+.. note::
+
+   All the wheel packages mentioned here are for Python 3.9. If you would like to use Python 3.10 or 3.11, you should modify the wheel names for ``torch``, ``torchvision``, and ``intel_extension_for_pytorch`` by replacing ``cp39`` with ``cp310`` or ``cp311``, respectively.
+```
+
 ### Runtime Configuration
 
 To use GPU acceleration on Linux, several environment variables are required or recommended before running a GPU example.
@@ -255,7 +303,7 @@ To use GPU acceleration on Linux, several environment variables are required or
 
       .. code-block:: bash
 
-         # Required step. Configure OneAPI environment variables
+         # Required step. Configure oneAPI environment variables
          source /opt/intel/oneapi/setvars.sh
 
          # Recommended Environment Variables
@@ -268,7 +316,7 @@ To use GPU acceleration on Linux, several environment variables are required or
 
       .. code-block:: bash
 
-         # Required step. Configure OneAPI environment variables
+         # Required step. Configure oneAPI environment variables
          source /opt/intel/oneapi/setvars.sh
 
          # Recommended Environment Variables
@@ -316,4 +364,4 @@ Error: libmkl_sycl_blas.so.4: cannot open shared object file: No such file or di
 The reason for such errors is that oneAPI has not been initialized properly before running BigDL-LLM code or before importing IPEX package.
 
 * Step 1: Make sure you execute setvars.sh of oneAPI Base Toolkit before running BigDL-LLM code.
-* Step 2: Make sure you install matching versions of BigDL-LLM/pytorch/IPEX and oneAPI Base Toolkit. BigDL-LLM with Pytorch 2.1 should be used with oneAPI Base Toolkit version 2024.0. BigDL-LLM with Pytorch 2.0 should be used with oneAPI Base Toolkit version 2023.2.
\ No newline at end of file
+* Step 2: Make sure you install matching versions of BigDL-LLM/PyTorch/IPEX and oneAPI Base Toolkit. BigDL-LLM with PyTorch 2.1 should be used with oneAPI Base Toolkit version 2024.0. BigDL-LLM with PyTorch 2.0 should be used with oneAPI Base Toolkit version 2023.2.
diff --git a/python/llm/src/bigdl/llm/optimize.py b/python/llm/src/bigdl/llm/optimize.py
index 71fce857..75b1760b 100644
--- a/python/llm/src/bigdl/llm/optimize.py
+++ b/python/llm/src/bigdl/llm/optimize.py
@@ -199,15 +199,19 @@ def optimize_model(model, low_bit='sym_int4', optimize_llm=True, modules_to_not_
     A method to optimize any pytorch model.
 
     :param model: The original PyTorch model (nn.module)
-    :param low_bit: Supported low-bit options are "sym_int4", "asym_int4", "sym_int5",
-                    "asym_int5" or "sym_int8".
-    :param optimize_llm: Whether to further optimize llm model.
+    :param low_bit: str value, options are ``'sym_int4'``, ``'asym_int4'``, ``'sym_int5'``,
+                    ``'asym_int5'``, ``'sym_int8'``, ``'nf3'``, ``'nf4'``, ``'fp4'``,
+                    ``'fp8'``, ``'fp8_e4m3'``, ``'fp8_e5m2'``, ``'fp16'`` or ``'bf16'``,
+                    ``'sym_int4'`` means symmetric int 4, ``'asym_int4'`` means
+                    asymmetric int 4, ``'nf4'`` means 4-bit NormalFloat, etc.
+                    Relevant low bit optimizations will be applied to the model.
+    :param optimize_llm: Whether to further optimize the LLM model. Default to be ``True``.
     :param modules_to_not_convert: list of str value, modules (nn.Module) that are skipped
-                                   when conducting model optimizations. Default to be None.
+                                   when conducting model optimizations. Default to be ``None``.
     :param cpu_embedding: Whether to replace the Embedding layer, may need to set it
-                          to `True` when running BigDL-LLM on GPU on Windows. Default to be `False`.
+                          to ``True`` when running BigDL-LLM on GPU on Windows. Default to be ``False``.
     :param lightweight_bmm: Whether to replace the torch.bmm ops, may need to set it
-                            to `True` when running BigDL-LLM on GPU on Windows. Default to be `False`.
+                            to ``True`` when running BigDL-LLM on GPU on Windows. Default to be ``False``.
 
     :return: The optimized model.
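For readers skimming the docstring change above, here is a short usage sketch that exercises the documented parameters of `optimize_model`; the model choice and the `'nf4'` option are illustrative, not part of the patch.

```python
# Usage sketch for optimize_model's documented parameters (illustrative only).
from transformers import LlamaForCausalLM
from bigdl.llm import optimize_model

model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
# 'nf4' is one of the low_bit options listed above; cpu_embedding=True is the
# recommendation for running BigDL-LLM on GPU on Windows.
model = optimize_model(model, low_bit='nf4', optimize_llm=True, cpu_embedding=True)
model = model.to('xpu')
```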
diff --git a/python/llm/src/bigdl/llm/transformers/model.py b/python/llm/src/bigdl/llm/transformers/model.py
index 1cf21936..f9af3faa 100644
--- a/python/llm/src/bigdl/llm/transformers/model.py
+++ b/python/llm/src/bigdl/llm/transformers/model.py
@@ -101,22 +101,24 @@ class _BaseAutoModelClass:
         Three new arguments are added to extend Hugging Face's from_pretrained method as follows:
 
         :param load_in_4bit: boolean value, True means loading linear's weight to symmetric int 4 if
-                              the model is a regular fp16/bf16/fp32 model, and to asymmetric int 4
-                              if the model is GPTQ model.
-                              Default to be False.
-        :param load_in_low_bit: str value, options are sym_int4, asym_int4, sym_int5, asym_int5
-                                , sym_int8, nf3, nf4, fp4, fp8, fp8_e4m3, fp8_e5m2, fp16 or bf16.
-                                sym_int4 means symmetric int 4, asym_int4 means asymmetric int 4,
-                                nf4 means 4-bit NormalFloat, etc. Relevant low bit optimizations
-                                will be applied to the model.
+                             the model is a regular fp16/bf16/fp32 model, and to asymmetric int 4
+                             if the model is GPTQ model.
+                             Default to be ``False``.
+        :param load_in_low_bit: str value, options are ``'sym_int4'``, ``'asym_int4'``,
+                                ``'sym_int5'``, ``'asym_int5'``, ``'sym_int8'``, ``'nf3'``,
+                                ``'nf4'``, ``'fp4'``, ``'fp8'``, ``'fp8_e4m3'``, ``'fp8_e5m2'``,
+                                ``'fp16'`` or ``'bf16'``, ``'sym_int4'`` means symmetric int 4,
+                                ``'asym_int4'`` means asymmetric int 4, ``'nf4'`` means 4-bit
+                                NormalFloat, etc. Relevant low bit optimizations will be applied
+                                to the model.
         :param optimize_model: boolean value, Whether to further optimize the low_bit llm model.
-                               Default to be True.
+                               Default to be ``True``.
         :param modules_to_not_convert: list of str value, modules (nn.Module) that are skipped when
-                                       conducting model optimizations. Default to be None.
+                                       conducting model optimizations. Default to be ``None``.
         :param cpu_embedding: Whether to replace the Embedding layer, may need to set it
-                              to `True` when running BigDL-LLM on GPU on Windows. Default to be `False`.
+                              to ``True`` when running BigDL-LLM on GPU on Windows. Default to be ``False``.
         :param lightweight_bmm: Whether to replace the torch.bmm ops, may need to set it
-                                to `True` when running BigDL-LLM on GPU on Windows. Default to be `False`.
+                                to ``True`` when running BigDL-LLM on GPU on Windows. Default to be ``False``.
 
         :return: a model instance
         """
         pretrained_model_name_or_path = kwargs.get("pretrained_model_name_or_path", None) \
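Similarly, a short usage sketch for the extended `from_pretrained` arguments documented above; the checkpoint name and the chosen precision are illustrative, not part of the patch.

```python
# Usage sketch for the extended from_pretrained arguments (illustrative only).
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    load_in_low_bit="sym_int4",  # one of the documented low-bit options
    optimize_model=True,
    cpu_embedding=True,          # recommended when running BigDL-LLM on GPU on Windows
)
model = model.to('xpu')
```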