LLM: improve gpu supports key feature doc page (#9212)
parent 9dc76f19c0 · commit 7e96d3e79a
7 changed files with 198 additions and 49 deletions

@@ -47,6 +47,10 @@ subtrees:
      - file: doc/LLM/Overview/KeyFeatures/langchain_api
      # - file: doc/LLM/Overview/KeyFeatures/cli
      - file: doc/LLM/Overview/KeyFeatures/gpu_supports
        subtrees:
          - entries:
              - file: doc/LLM/Overview/KeyFeatures/inference_on_gpu
              - file: doc/LLM/Overview/KeyFeatures/finetune
      - file: doc/LLM/Overview/examples
        title: "Examples"
        subtrees:

@@ -0,0 +1,63 @@
# Finetune (QLoRA)

We also support finetuning LLMs (large language models) using QLoRA with BigDL-LLM 4-bit optimizations on Intel GPUs.

```eval_rst
.. note::

   Currently, only Hugging Face Transformers models are supported for QLoRA finetuning.
```

To help you better understand the finetuning process, here we use the [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) model as an example.

**Make sure you have prepared your environment by following the instructions [here](../install_gpu.html). First of all, you need to import `intel_extension_for_pytorch` to run on Intel GPUs**:

```python
import intel_extension_for_pytorch as ipex
```

First, load the model using the `transformers`-style API and **move it to the Intel GPU with `to('xpu')`**. We specify `load_in_low_bit="nf4"` here to apply 4-bit NormalFloat optimization. According to the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf), using `"nf4"` could yield better model quality than `"int4"`.

```python
import torch
import intel_extension_for_pytorch as ipex
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                             load_in_low_bit="nf4",
                                             optimize_model=False,
                                             torch_dtype=torch.float16,
                                             modules_to_not_convert=["lm_head"])
model = model.to('xpu')
```

Then, we have to apply some preprocessing to the model to prepare it for training.

```python
from bigdl.llm.transformers.qlora import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
```

Next, we can obtain a Peft model from the optimized model and a configuration object containing the LoRA parameters as follows:

```python
from bigdl.llm.transformers.qlora import get_peft_model
from peft import LoraConfig

config = LoraConfig(r=8,
                    lora_alpha=32,
                    target_modules=["q_proj", "k_proj", "v_proj"],
                    lora_dropout=0.05,
                    bias="none",
                    task_type="CAUSAL_LM")
model = get_peft_model(model, config)
```
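
As a quick sanity check, you could verify that only a small fraction of the parameters is now trainable. The call below is the standard `peft` helper for this; we assume the returned BigDL-LLM compatible Peft model exposes it as well.

```python
# LoRA should leave only the low-rank adapter weights trainable;
# the vast majority of the base model parameters stay frozen.
model.print_trainable_parameters()
```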

```eval_rst
.. important::

   Instead of ``from peft import prepare_model_for_kbit_training, get_peft_model`` as we did for regular QLoRA using bitsandbytes and CUDA, we import them from ``bigdl.llm.transformers.qlora`` here to get a BigDL-LLM compatible Peft model. The rest is the same as the regular LoRA finetuning process using ``peft``.
```
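
To give a sense of how those remaining steps might look, below is a minimal sketch of a standard `peft`/`transformers` training loop continuing from the model above. The tokenizer and the tokenized `train_data` dataset are assumed to have been prepared separately, and the hyperparameters are purely illustrative.

```python
import transformers

# `tokenizer` and `train_data` (a tokenized dataset) are assumed to exist already.
trainer = transformers.Trainer(
    model=model,                              # the BigDL-LLM compatible Peft model from above
    train_dataset=train_data,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=20,
        max_steps=200,
        learning_rate=2e-4,
        logging_steps=20,
        output_dir="outputs",
    ),
)
model.config.use_cache = False                # avoid warnings during training; re-enable for inference
trainer.train()
```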

```eval_rst
.. seealso::

   See the complete examples `here <https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU>`_
```

@@ -1,47 +0,0 @@
# GPU Supports

You may apply INT4 optimizations to any Hugging Face *Transformers* models on devices with Intel GPUs as follows:

```python
# import ipex
import intel_extension_for_pytorch as ipex

# load Hugging Face Transformers model with INT4 optimizations on Intel GPUs
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('/path/to/model/',
                                             load_in_4bit=True,
                                             optimize_model=False)
model = model.to('xpu')
```

```eval_rst
.. note::

   You may apply INT8 optimizations as follows:

   .. code-block:: python

      model = AutoModelForCausalLM.from_pretrained('/path/to/model/',
                                                   load_in_low_bit="sym_int8",
                                                   optimize_model=False)
      model = model.to('xpu')
```

After loading the Hugging Face *Transformers* model, you may easily run the optimized model as follows:

```python
# run the optimized model
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer.encode(input_str, ...).to('xpu')
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids)
```

```eval_rst
.. seealso::

   See the complete examples `here <https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/transformers/transformers_int4/GPU>`_
```

@@ -0,0 +1,10 @@
GPU Supports
================================

BigDL-LLM not only supports running large language models for inference, but also supports QLoRA finetuning on Intel GPUs.

* |inference_on_gpu|_
* `Finetune (QLoRA) <./finetune.html>`_

.. |inference_on_gpu| replace:: Inference on GPU
.. _inference_on_gpu: ./inference_on_gpu.html

@@ -10,10 +10,19 @@ You may run the LLMs using ``bigdl-llm`` through one of the following APIs:
* `Native Format <./native_format.html>`_

* `LangChain API <./langchain_api.html>`_
* `GPU Supports <./gpu_supports.html>`_
* |gpu_supports|_

  * |inference_on_gpu|_
  * `Finetune (QLoRA) <./finetune.html>`_

.. |transformers_style_api| replace:: ``transformers``-style API
.. _transformers_style_api: ./transformers_style_api.html

.. |hugging_face_transformers_format| replace:: Hugging Face ``transformers`` Format
.. _hugging_face_transformers_format: ./hugging_face_format.html

.. |gpu_supports| replace:: GPU Supports
.. _gpu_supports: ./gpu_supports.html

.. |inference_on_gpu| replace:: Inference on GPU
.. _inference_on_gpu: ./inference_on_gpu.html

@@ -0,0 +1,109 @@
# Inference on GPU

Apart from the significant acceleration capabilities on Intel CPUs, BigDL-LLM also supports optimizations and acceleration for running LLMs (large language models) on Intel GPUs. With BigDL-LLM, PyTorch models (in FP16/BF16/FP32) can be optimized with low-bit quantizations (supported precisions include INT4, INT5, INT8, etc.).

Compared with running on Intel CPUs, some additional operations are required on Intel GPUs. To help you better understand the process, here we use a popular model [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) as an example.

**Make sure you have prepared your environment by following the instructions [here](../install_gpu.html). First of all, you need to import `intel_extension_for_pytorch` to run on Intel GPUs**:

```python
import intel_extension_for_pytorch as ipex
```

## Load and Optimize Model

You could choose to use the [PyTorch API](./optimize_model.html) or the [`transformers`-style API](./transformers_style_api.html) on Intel GPUs according to your preference.

**Once you have the model with BigDL-LLM low-bit optimization, move it to the Intel GPU with `to('xpu')`**.

```eval_rst
.. tabs::

   .. tab:: PyTorch API

      You could optimize any PyTorch model with "one-line code change", and the loading and optimizing process on Intel GPUs may be as follows:

      .. code-block:: python

         # Take Llama-2-7b-chat-hf as an example
         import intel_extension_for_pytorch as ipex
         from transformers import LlamaForCausalLM
         from bigdl.llm import optimize_model

         model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', torch_dtype='auto', low_cpu_mem_usage=True)
         model = optimize_model(model) # With only one line to enable BigDL-LLM INT4 optimization

         model = model.to('xpu') # Important after obtaining the optimized model

      Especially, if you have saved the optimized model following the steps `here <./optimize_model.html#save>`_, the loading process on Intel GPUs may be as follows:

      .. code-block:: python

         import intel_extension_for_pytorch as ipex
         from transformers import LlamaForCausalLM
         from bigdl.llm.optimize import low_memory_init, load_low_bit

         saved_dir = './llama-2-bigdl-llm-4-bit'
         with low_memory_init(): # Fast and low cost by loading model on meta device
             model = LlamaForCausalLM.from_pretrained(saved_dir,
                                                      torch_dtype="auto",
                                                      trust_remote_code=True)
         model = load_low_bit(model, saved_dir) # Load the optimized model

         model = model.to('xpu') # Important after obtaining the optimized model

   .. tab:: ``transformers``-style API

      You could run any Hugging Face Transformers model with the ``transformers``-style API, and the loading and optimizing process on Intel GPUs may be as follows:

      .. code-block:: python

         # Take Llama-2-7b-chat-hf as an example
         import intel_extension_for_pytorch as ipex
         from bigdl.llm.transformers import AutoModelForCausalLM

         # Load the model in 4-bit, which converts the relevant layers in the model into INT4 format
         model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', load_in_4bit=True)

         model = model.to('xpu') # Important after obtaining the optimized model

      Especially, if you have saved the optimized model following the steps `here <./hugging_face_format.html#save-load>`_, the loading process on Intel GPUs may be as follows:

      .. code-block:: python

         import intel_extension_for_pytorch as ipex
         from bigdl.llm.transformers import AutoModelForCausalLM

         saved_dir = './llama-2-bigdl-llm-4-bit'
         model = AutoModelForCausalLM.load_low_bit(saved_dir) # Load the optimized model

         model = model.to('xpu') # Important after obtaining the optimized model
```

## Run Optimized Model

You could then do inference using the optimized model on Intel GPUs almost the same way as on CPUs. **The only difference is to set `to('xpu')` for input tensors.**

Continuing with the [Llama-2-7b-chat-hf example](#load-and-optimize-model) above, you could run the optimized model as follows:

```python
import torch
from transformers import AutoTokenizer

# load the tokenizer matching the optimized model (not created in the earlier steps)
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')

with torch.inference_mode():
    prompt = 'Q: What is CPU?\nA:'
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu') # With .to('xpu') specifically for inference on Intel GPUs
    output = model.generate(input_ids, max_new_tokens=32)
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
```

```eval_rst
.. note::

   The initial generation of optimized LLMs on Intel GPUs could be slow. Therefore, it's recommended to perform a **warm-up** run before the actual generation.
```
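
For instance, a warm-up could be as simple as running one short, throwaway generation before the one you care about. The snippet below is only a sketch of that pattern, reusing `model`, `tokenizer` and `input_ids` from the example above; the token counts and timing code are illustrative.

```python
import time

with torch.inference_mode():
    warmup_ids = tokenizer.encode('warm up', return_tensors="pt").to('xpu')
    _ = model.generate(warmup_ids, max_new_tokens=8)   # warm-up run; output is discarded

    start = time.perf_counter()
    output = model.generate(input_ids, max_new_tokens=32)
    print(f"Generation took {time.perf_counter() - start:.2f}s")
```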

```eval_rst
.. seealso::

   See the complete examples `here <https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU>`_
```

@@ -26,6 +26,7 @@ BigDL-LLM for GPU supports has been verified on:

* Intel Arc™ A-Series Graphics
* Intel Data Center GPU Flex Series
* Intel Data Center GPU Max Series
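
To quickly confirm that one of these GPUs is actually visible to PyTorch once the environment is set up, a check along the following lines could be used. This is only a sketch; `torch.xpu` becomes available after `intel_extension_for_pytorch` is imported.

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device with PyTorch

print(torch.xpu.is_available())   # True if at least one Intel GPU is usable
print(torch.xpu.device_count())   # number of detected Intel GPU devices
```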

```eval_rst
.. note::