LLM: improve gpu supports key feature doc page (#9212)

Parent: 9dc76f19c0
Commit: 7e96d3e79a

7 changed files with 198 additions and 49 deletions

@@ -47,6 +47,10 @@ subtrees:
                - file: doc/LLM/Overview/KeyFeatures/langchain_api
                # - file: doc/LLM/Overview/KeyFeatures/cli
                - file: doc/LLM/Overview/KeyFeatures/gpu_supports
+                  subtrees:
+                    - entries:
+                      - file: doc/LLM/Overview/KeyFeatures/inference_on_gpu
+                      - file: doc/LLM/Overview/KeyFeatures/finetune
          - file: doc/LLM/Overview/examples
            title: "Examples"
            subtrees:

@@ -0,0 +1,63 @@
# Finetune (QLoRA)

We also support finetuning LLMs (large language models) using QLoRA with BigDL-LLM 4-bit optimizations on Intel GPUs.

```eval_rst
.. note::

   Currently, only Hugging Face Transformers models are supported for QLoRA finetuning.
```

To help you better understand the finetuning process, here we use the model [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) as an example.

**Make sure you have prepared your environment by following the instructions [here](../install_gpu.html). First of all, you need to import `intel_extension_for_pytorch` to run on Intel GPUs**:

```python
import intel_extension_for_pytorch as ipex
```

First, load the model using the `transformers`-style API and **move it to `'xpu'` with `to('xpu')`**. We specify `load_in_low_bit="nf4"` here to apply 4-bit NormalFloat optimization. According to the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf), using `"nf4"` could yield better model quality than `"int4"`.

```python
import torch
import intel_extension_for_pytorch as ipex
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                             load_in_low_bit="nf4",
                                             optimize_model=False,
                                             torch_dtype=torch.float16,
                                             modules_to_not_convert=["lm_head"])
model = model.to('xpu')
```

Then, we need to apply some preprocessing to the model to prepare it for training.

```python
from bigdl.llm.transformers.qlora import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
```

Next, we can obtain a PEFT model from the optimized model and a configuration object containing the parameters as follows:

```python
from bigdl.llm.transformers.qlora import get_peft_model
from peft import LoraConfig

config = LoraConfig(r=8,
                    lora_alpha=32,
                    target_modules=["q_proj", "k_proj", "v_proj"],
                    lora_dropout=0.05,
                    bias="none",
                    task_type="CAUSAL_LM")
model = get_peft_model(model, config)
```

```eval_rst
.. important::

   Instead of ``from peft import prepare_model_for_kbit_training, get_peft_model`` as we would do for regular QLoRA with bitsandbytes and CUDA, we import them from ``bigdl.llm.transformers.qlora`` here to get a BigDL-LLM compatible PEFT model. The rest is just the same as the regular LoRA finetuning process using ``peft``, as sketched below.
```
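
For illustration only, the remaining steps might look like the following minimal sketch of a standard `peft`/`transformers` training loop. The `tokenizer`, `tokenized_dataset` and the hyperparameter values below are hypothetical placeholders, not part of this example; adapt them to your own data.

```python
import transformers

# A minimal, illustrative training setup; dataset and hyperparameters are placeholders.
trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_dataset,  # hypothetical pre-tokenized training split
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=20,
        max_steps=200,
        learning_rate=2e-4,
        output_dir="outputs",  # hypothetical output directory
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence warnings during training; re-enable for inference
trainer.train()
```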

```eval_rst
.. seealso::

   See the complete examples `here <https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU>`_
```

@@ -1,47 +0,0 @@
# GPU Supports

You may apply INT4 optimizations to any Hugging Face *Transformers* models on device with Intel GPUs as follows:

```python
# import ipex
import intel_extension_for_pytorch as ipex

# load Hugging Face Transformers model with INT4 optimizations on Intel GPUs
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('/path/to/model/',
                                             load_in_4bit=True,
                                             optimize_model=False)
model = model.to('xpu')
```

```eval_rst
.. note::

   You may apply INT8 optimizations as follows:

   .. code-block:: python

      model = AutoModelForCausalLM.from_pretrained('/path/to/model/',
                                                   load_in_low_bit="sym_int8",
                                                   optimize_model=False)
      model = model.to('xpu')
```

After loading the Hugging Face *Transformers* model, you may easily run the optimized model as follows:

```python
# run the optimized model
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer.encode(input_str, ...).to('xpu')
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids)
```

```eval_rst
.. seealso::

   See the complete examples `here <https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/transformers/transformers_int4/GPU>`_
```

@@ -0,0 +1,10 @@
GPU Supports
================================

BigDL-LLM not only supports running large language models for inference, but also supports QLoRA finetuning on Intel GPUs.

* |inference_on_gpu|_
* `Finetune (QLoRA) <./finetune.html>`_

.. |inference_on_gpu| replace:: Inference on GPU
.. _inference_on_gpu: ./inference_on_gpu.html

@@ -10,10 +10,19 @@ You may run the LLMs using ``bigdl-llm`` through one of the following APIs:
  * `Native Format <./native_format.html>`_

* `LangChain API <./langchain_api.html>`_
-* `GPU Supports <./gpu_supports.html>`_
+* |gpu_supports|_
+
+  * |inference_on_gpu|_
+  * `Finetune (QLoRA) <./finetune.html>`_

.. |transformers_style_api| replace:: ``transformers``-style API
.. _transformers_style_api: ./transformers_style_api.html

.. |hugging_face_transformers_format| replace:: Hugging Face ``transformers`` Format
.. _hugging_face_transformers_format: ./hugging_face_format.html
+
+.. |gpu_supports| replace:: GPU Supports
+.. _gpu_supports: ./gpu_supports.html
+
+.. |inference_on_gpu| replace:: Inference on GPU
+.. _inference_on_gpu: ./inference_on_gpu.html

@@ -0,0 +1,109 @@
# Inference on GPU

Apart from its significant acceleration capabilities on Intel CPUs, BigDL-LLM also supports optimizations and acceleration for running LLMs (large language models) on Intel GPUs. With BigDL-LLM, PyTorch models (in FP16/BF16/FP32) can be optimized with low-bit quantizations (supported precisions include INT4, INT5, INT8, etc.).
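
As a quick illustration (a minimal sketch, not part of the original walkthrough), the precision is chosen when loading the model; here we assume the `transformers`-style API covered later on this page, with `load_in_low_bit="sym_int8"` (symmetric INT8) as used elsewhere in these docs:

```python
import intel_extension_for_pytorch as ipex
from bigdl.llm.transformers import AutoModelForCausalLM

# Illustrative only: request symmetric INT8 instead of the default INT4 optimization
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf',
                                             load_in_low_bit="sym_int8",
                                             optimize_model=False)
model = model.to('xpu')  # move the optimized model to the Intel GPU
```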

Compared with running on Intel CPUs, some additional operations are required on Intel GPUs. To help you better understand the process, here we use a popular model, [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), as an example.

**Make sure you have prepared your environment by following the instructions [here](../install_gpu.html). First of all, you need to import `intel_extension_for_pytorch` to run on Intel GPUs**:

```python
import intel_extension_for_pytorch as ipex
```

## Load and Optimize Model

You could choose to use the [PyTorch API](./optimize_model.html) or the [`transformers`-style API](./transformers_style_api.html) on Intel GPUs, according to your preference.

**Once you have the model with BigDL-LLM low-bit optimization, move it to the GPU with `to('xpu')`**.

```eval_rst
.. tabs::

   .. tab:: PyTorch API

      You could optimize any PyTorch model with a "one-line code change", and the loading and optimizing process on Intel GPUs may be as follows:

      .. code-block:: python

         # Take Llama-2-7b-chat-hf as an example
         import intel_extension_for_pytorch as ipex
         from transformers import LlamaForCausalLM
         from bigdl.llm import optimize_model

         model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', torch_dtype='auto', low_cpu_mem_usage=True)
         model = optimize_model(model) # Only one line is needed to enable BigDL-LLM INT4 optimization

         model = model.to('xpu') # Important after obtaining the optimized model

      In particular, if you have saved the optimized model following the steps `here <./optimize_model.html#save>`_, the loading process on Intel GPUs may be as follows:

      .. code-block:: python

         import intel_extension_for_pytorch as ipex
         from transformers import LlamaForCausalLM
         from bigdl.llm.optimize import low_memory_init, load_low_bit

         saved_dir='./llama-2-bigdl-llm-4-bit'
         with low_memory_init(): # Fast and low-cost: load the model on the meta device
            model = LlamaForCausalLM.from_pretrained(saved_dir,
                                                     torch_dtype="auto",
                                                     trust_remote_code=True)
         model = load_low_bit(model, saved_dir) # Load the optimized model

         model = model.to('xpu') # Important after obtaining the optimized model

   .. tab:: ``transformers``-style API

      You could run any Hugging Face Transformers model with the ``transformers``-style API, and the loading and optimizing process on Intel GPUs may be as follows:

      .. code-block:: python

         # Take Llama-2-7b-chat-hf as an example
         import intel_extension_for_pytorch as ipex
         from bigdl.llm.transformers import AutoModelForCausalLM

         # Load the model in 4 bit, which converts the relevant layers in the model into INT4 format
         model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', load_in_4bit=True)

         model = model.to('xpu') # Important after obtaining the optimized model

      In particular, if you have saved the optimized model following the steps `here <./hugging_face_format.html#save-load>`_, the loading process on Intel GPUs may be as follows:

      .. code-block:: python

         import intel_extension_for_pytorch as ipex
         from bigdl.llm.transformers import AutoModelForCausalLM

         saved_dir='./llama-2-bigdl-llm-4-bit'
         model = AutoModelForCausalLM.load_low_bit(saved_dir) # Load the optimized model

         model = model.to('xpu') # Important after obtaining the optimized model
```

## Run Optimized Model

You could then run inference with the optimized model on Intel GPUs in almost the same way as on CPUs. **The only difference is that the input tensors need to be moved to `'xpu'` with `to('xpu')`.**

Continuing with the [Llama-2-7b-chat-hf example](#load-and-optimize-model) above, inference can be run as follows:

```python
import torch
from transformers import AutoTokenizer

# Load the tokenizer for the model used above
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')

with torch.inference_mode():
   prompt = 'Q: What is CPU?\nA:'
   input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu') # Move the input tensors to the Intel GPU with .to('xpu')
   output = model.generate(input_ids, max_new_tokens=32)
   output_str = tokenizer.decode(output[0], skip_special_tokens=True)
```

```eval_rst
.. note::

   The initial generation of an optimized LLM on Intel GPUs could be slow. Therefore, it's recommended to perform a **warm-up** run before the actual generation, for example as sketched below.
```
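
A possible warm-up, as a minimal sketch (not part of the original example), reusing the `tokenizer` and `model` from above; the prompt and token count are arbitrary placeholders:

```python
import torch

with torch.inference_mode():
    # Warm-up run: a short generation whose output is discarded, so one-time
    # startup costs on the Intel GPU are not paid during the real generation
    warmup_ids = tokenizer.encode('Hello', return_tensors="pt").to('xpu')
    _ = model.generate(warmup_ids, max_new_tokens=1)

    # ... then perform the actual generation as shown in the example above
```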

```eval_rst
.. seealso::

   See the complete examples `here <https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU>`_
```

@@ -26,6 +26,7 @@ BigDL-LLM for GPU supports has been verified on:

* Intel Arc™ A-Series Graphics
* Intel Data Center GPU Flex Series
+* Intel Data Center GPU Max Series

```eval_rst
.. note::