LLM: improve gpu supports key feature doc page (#9212)
parent 9dc76f19c0
commit 7e96d3e79a
7 changed files with 198 additions and 49 deletions
@ -47,6 +47,10 @@ subtrees:
- file: doc/LLM/Overview/KeyFeatures/langchain_api
# - file: doc/LLM/Overview/KeyFeatures/cli
- file: doc/LLM/Overview/KeyFeatures/gpu_supports
  subtrees:
    - entries:
        - file: doc/LLM/Overview/KeyFeatures/inference_on_gpu
        - file: doc/LLM/Overview/KeyFeatures/finetune
- file: doc/LLM/Overview/examples
  title: "Examples"
  subtrees:

@ -0,0 +1,63 @@

# Finetune (QLoRA)

We also support finetuning of LLMs (large language models) using QLoRA with BigDL-LLM 4-bit optimizations on Intel GPUs.

```eval_rst
.. note::

   Currently, QLoRA finetuning is only supported for Hugging Face Transformers models.
```

To help you better understand the finetuning process, here we use the model [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) as an example.

**Make sure you have prepared the environment following the instructions [here](../install_gpu.html). First of all, you need to import `intel_extension_for_pytorch` to run on Intel GPUs**:

```python
import intel_extension_for_pytorch as ipex
```

First, load the model using the `transformers`-style API and **move it to the device with `to('xpu')`**. We specify `load_in_low_bit="nf4"` here to apply 4-bit NormalFloat optimization. According to the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf), using `"nf4"` could yield better model quality than `"int4"`.

```python
import torch  # needed for torch.float16 below
import intel_extension_for_pytorch as ipex
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                             load_in_low_bit="nf4",
                                             optimize_model=False,
                                             torch_dtype=torch.float16,
                                             modules_to_not_convert=["lm_head"])
model = model.to('xpu')
```

Then, we apply some preprocessing to the model to prepare it for training.

```python
from bigdl.llm.transformers.qlora import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
```

Next, we can obtain a Peft model from the optimized model and a configuration object containing the parameters as follows:

```python
from bigdl.llm.transformers.qlora import get_peft_model
from peft import LoraConfig

config = LoraConfig(r=8,
                    lora_alpha=32,
                    target_modules=["q_proj", "k_proj", "v_proj"],
                    lora_dropout=0.05,
                    bias="none",
                    task_type="CAUSAL_LM")
model = get_peft_model(model, config)
```
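
As a quick sanity check (not part of the original walkthrough), you can inspect how many parameters are actually trainable; this assumes the returned object exposes the usual `peft` helper:

```python
# Assumed helper from peft: LoRA should leave only a small fraction of weights trainable
model.print_trainable_parameters()
```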

```eval_rst
.. important::

   Instead of ``from peft import prepare_model_for_kbit_training, get_peft_model`` as we would do for regular QLoRA using bitsandbytes and CUDA, we import them from ``bigdl.llm.transformers.qlora`` here to get a BigDL-LLM compatible Peft model. The rest is the same as the regular LoRA finetuning process using ``peft``, for example the training loop sketched below.
```
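
To make that concrete, below is a minimal training sketch under stated assumptions: the dataset (`Abirate/english_quotes`), the tokenizer handling, and the hyperparameters are illustrative choices, not part of the original example; adapt them to your own data.

```python
import transformers
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed tokenizer and dataset setup, for illustration only
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token
data = load_dataset("Abirate/english_quotes", split="train")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)

trainer = transformers.Trainer(
    model=model,  # the BigDL-LLM compatible Peft model obtained above
    train_dataset=data,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=20,
        max_steps=200,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # avoid cache-related warnings during training
trainer.train()
```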

```eval_rst
.. seealso::

   See the complete examples `here <https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU>`_
```
@ -1,47 +0,0 @@

# GPU Supports

You may apply INT4 optimizations to any Hugging Face *Transformers* model on devices with Intel GPUs as follows:

```python
# import ipex
import intel_extension_for_pytorch as ipex

# load Hugging Face Transformers model with INT4 optimizations on Intel GPUs
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('/path/to/model/',
                                             load_in_4bit=True,
                                             optimize_model=False)
model = model.to('xpu')
```

```eval_rst
.. note::

   You may apply INT8 optimizations as follows:

   .. code-block:: python

      model = AutoModelForCausalLM.from_pretrained('/path/to/model/',
                                                   load_in_low_bit="sym_int8",
                                                   optimize_model=False)
      model = model.to('xpu')
```

After loading the Hugging Face *Transformers* model, you may easily run the optimized model as follows:

```python
# run the optimized model
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer.encode(input_str, ...).to('xpu')
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids)
```

```eval_rst
.. seealso::

   See the complete examples `here <https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/transformers/transformers_int4/GPU>`_
```
@ -0,0 +1,10 @@

GPU Supports
================================

BigDL-LLM not only supports running large language models for inference, but also supports QLoRA finetuning on Intel GPUs.

* |inference_on_gpu|_
* `Finetune (QLoRA) <./finetune.html>`_

.. |inference_on_gpu| replace:: Inference on GPU
.. _inference_on_gpu: ./inference_on_gpu.html
@ -10,10 +10,19 @@ You may run the LLMs using ``bigdl-llm`` through one of the following APIs:

* `Native Format <./native_format.html>`_

* `LangChain API <./langchain_api.html>`_
* `GPU Supports <./gpu_supports.html>`_
* |gpu_supports|_

  * |inference_on_gpu|_
  * `Finetune (QLoRA) <./finetune.html>`_

.. |transformers_style_api| replace:: ``transformers``-style API
.. _transformers_style_api: ./transformers_style_api.html

.. |hugging_face_transformers_format| replace:: Hugging Face ``transformers`` Format
.. _hugging_face_transformers_format: ./hugging_face_format.html

.. |gpu_supports| replace:: GPU Supports
.. _gpu_supports: ./gpu_supports.html

.. |inference_on_gpu| replace:: Inference on GPU
.. _inference_on_gpu: ./inference_on_gpu.html

@ -0,0 +1,109 @@

# Inference on GPU

Apart from its significant acceleration capabilities on Intel CPUs, BigDL-LLM also supports optimizations and acceleration for running LLMs (large language models) on Intel GPUs. With BigDL-LLM, PyTorch models (in FP16/BF16/FP32) can be optimized with low-bit quantizations (supported precisions include INT4, INT5, INT8, etc.).

Compared with running on Intel CPUs, some additional operations are required on Intel GPUs. To help you better understand the process, here we use a popular model, [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), as an example.

**Make sure you have prepared the environment following the instructions [here](../install_gpu.html). First of all, you need to import `intel_extension_for_pytorch` to run on Intel GPUs**:

```python
import intel_extension_for_pytorch as ipex
```

## Load and Optimize Model

You could choose to use the [PyTorch API](./optimize_model.html) or the [`transformers`-style API](./transformers_style_api.html) on Intel GPUs, according to your preference.

**Once you have the model with BigDL-LLM low-bit optimization, move it to the device with `to('xpu')`**.

```eval_rst
.. tabs::

   .. tab:: PyTorch API

      You could optimize any PyTorch model with a "one-line code change", and the loading and optimizing process on Intel GPUs may look as follows:

      .. code-block:: python

         # Take Llama-2-7b-chat-hf as an example
         import intel_extension_for_pytorch as ipex
         from transformers import LlamaForCausalLM
         from bigdl.llm import optimize_model

         model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', torch_dtype='auto', low_cpu_mem_usage=True)
         model = optimize_model(model) # With only one line to enable BigDL-LLM INT4 optimization

         model = model.to('xpu') # Important after obtaining the optimized model

      In particular, if you have saved the optimized model following the steps `here <./optimize_model.html#save>`_, the loading process on Intel GPUs may look as follows:

      .. code-block:: python

         import intel_extension_for_pytorch as ipex
         from transformers import LlamaForCausalLM
         from bigdl.llm.optimize import low_memory_init, load_low_bit

         saved_dir = './llama-2-bigdl-llm-4-bit'
         with low_memory_init(): # Fast and low-cost: load the model on the meta device
             model = LlamaForCausalLM.from_pretrained(saved_dir,
                                                      torch_dtype="auto",
                                                      trust_remote_code=True)
         model = load_low_bit(model, saved_dir) # Load the optimized model

         model = model.to('xpu') # Important after obtaining the optimized model

   .. tab:: ``transformers``-style API

      You could run any Hugging Face Transformers model with the ``transformers``-style API, and the loading and optimizing process on Intel GPUs may look as follows:

      .. code-block:: python

         # Take Llama-2-7b-chat-hf as an example
         import intel_extension_for_pytorch as ipex
         from bigdl.llm.transformers import AutoModelForCausalLM

         # Load the model in 4 bit, which converts the relevant layers in the model into INT4 format
         model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', load_in_4bit=True)

         model = model.to('xpu') # Important after obtaining the optimized model

      In particular, if you have saved the optimized model following the steps `here <./hugging_face_format.html#save-load>`_, the loading process on Intel GPUs may look as follows:

      .. code-block:: python

         import intel_extension_for_pytorch as ipex
         from bigdl.llm.transformers import AutoModelForCausalLM

         saved_dir = './llama-2-bigdl-llm-4-bit'
         model = AutoModelForCausalLM.load_low_bit(saved_dir) # Load the optimized model

         model = model.to('xpu') # Important after obtaining the optimized model
```
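
Besides the default INT4 quantization, the `transformers`-style API also accepts other low-bit precisions through the `load_in_low_bit` parameter. The snippet below is a minimal sketch adapted from the INT8 example on the page this one replaces; the precision string `"sym_int8"` and `optimize_model=False` are carried over from there, while the model name is simply reused from the example above.

```python
from bigdl.llm.transformers import AutoModelForCausalLM

# Request symmetric INT8 quantization instead of the default INT4
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf',
                                             load_in_low_bit="sym_int8",
                                             optimize_model=False)
model = model.to('xpu')  # Important after obtaining the optimized model
```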

## Run Optimized Model

You could then do inference with the optimized model on Intel GPUs in almost the same way as on CPUs. **The only difference is to call `to('xpu')` on the input tensors.**

Continuing with the [Llama-2-7b-chat-hf example](#load-and-optimize-model), you can run inference as follows:

```python
import torch
from transformers import AutoTokenizer

# Load the tokenizer for Llama-2-7b-chat-hf
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')

with torch.inference_mode():
    prompt = 'Q: What is CPU?\nA:'
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu') # With .to('xpu') specifically for inference on Intel GPUs
    output = model.generate(input_ids, max_new_tokens=32)
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
```

```eval_rst
.. note::

   The initial generation of optimized LLMs on Intel GPUs could be slow. Therefore, it is recommended to perform a **warm-up** run before the actual generation.
```
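
For instance, a warm-up can be as simple as running one generation first and discarding its result before timing or serving requests; the snippet below is only an illustrative sketch built on the example above.

```python
# Illustrative warm-up: the first generation is slower, so run and discard it
with torch.inference_mode():
    warmup_ids = tokenizer.encode('warm up', return_tensors="pt").to('xpu')
    _ = model.generate(warmup_ids, max_new_tokens=32)
```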

```eval_rst
.. seealso::

   See the complete examples `here <https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU>`_
```
@ -26,6 +26,7 @@ BigDL-LLM for GPU supports has been verified on:

* Intel Arc™ A-Series Graphics
* Intel Data Center GPU Flex Series
* Intel Data Center GPU Max Series

```eval_rst
.. note::