Update mddocs for part of Overview (2/2) and Inference (#11377)
* updated link
* converted to md format, need to be reviewed
* converted to md format, need to be reviewed; deleted some leftover texts
* converted to md file type, need to be reviewed
* testing GitHub Tags
* added GitHub Tags
* Small fix
* Further fix
* Fix index
* Fix

---------

Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>
parent 33b9a9c4c9
commit 1a1a97c9e4

11 changed files with 135 additions and 209 deletions
@@ -5,33 +5,34 @@

### How to use GGUF format with IPEX-LLM?

IPEX-LLM supports running GGUF/AWQ/GPTQ models on both [CPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Advanced-Quantizations) and [GPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations).

Please also refer to [here](https://github.com/intel-analytics/ipex-llm?tab=readme-ov-file#latest-update-) for our latest support.
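
As a rough illustration only (the exact per-format steps, including GGUF loading, are covered in the example folders linked above), loading such a quantized checkpoint generally follows the usual `transformers`-style pattern; the model path below is a placeholder:

```python
# Minimal sketch, assuming a local AWQ/GPTQ checkpoint at a hypothetical path;
# see the CPU/GPU Advanced-Quantizations examples linked above for the exact per-format steps.
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "/path/to/awq-or-gptq-model"  # placeholder path
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model = model.to('xpu')  # only needed when running on an Intel GPU
```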

## How to Resolve Errors

### Fail to install `ipex-llm` through `pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/` or `pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/`

You could try to install the IPEX-LLM dependencies for Intel XPU from source archives instead:

- For Windows systems, refer to [here](../install_gpu.md#install-ipex-llm-from-wheel) for the steps.
- For Linux systems, refer to [here](../install_gpu.md#prerequisites-1) for the steps.

### PyTorch is not linked with support for xpu devices

1. Before running on Intel GPUs, please make sure you've prepared the environment following the [installation instructions](../install_gpu.md).
2. If you are using an older version of `ipex-llm` (specifically, older than 2.5.0b20240104), you need to manually add `import intel_extension_for_pytorch as ipex` at the beginning of your code.
3. After optimizing the model with IPEX-LLM, you need to move the model to GPU through `model = model.to('xpu')`.
4. If you have multiple GPUs, you could refer to [here](../KeyFeatures/multi_gpus_selection.md) for details about GPU selection.
5. If you do inference using the optimized model on Intel GPUs, you also need to move the input tensors with `to('xpu')` (a combined sketch follows this list).
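
To make the checklist concrete, here is a minimal sketch (the model path and prompt are placeholders, not an official snippet) that applies steps 2 to 5 together:

```python
# Combined sketch of the checklist above; the model path is a placeholder.
import torch
# import intel_extension_for_pytorch as ipex  # only needed for ipex-llm older than 2.5.0b20240104
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("/path/to/model", load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained("/path/to/model")

model = model.to('xpu')  # move the optimized model to the GPU

input_ids = tokenizer("Hello", return_tensors="pt").input_ids.to('xpu')  # inputs on the GPU as well
with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=16)
```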

### Import `intel_extension_for_pytorch` error on Windows GPU

Please refer to [here](../install_gpu.md#1-error-loading-intel_extension_for_pytorch) for a detailed guide, which lists the missing environment requirements that could lead to this error.

### XPU device count is zero

It's recommended to reinstall the GPU driver (a quick diagnostic sketch follows these steps):

- For Windows systems, refer to [here](../install_gpu.md#windows) for the steps.
- For Linux systems, refer to [here](../install_gpu.md#prerequisites-1) for the steps.
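
Before reinstalling, a quick check (a sketch assuming `ipex-llm[xpu]` and its runtime libraries are already installed) can confirm whether PyTorch sees any XPU device:

```python
# Quick diagnostic: prints whether PyTorch can see any XPU device.
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' backend

print(torch.xpu.is_available())   # expected: True
print(torch.xpu.device_count())   # expected: >= 1; zero usually points to a driver issue
```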

### Error such as `The size of tensor a (33) must match the size of tensor b (17) at non-singleton dimension 2` during attention forward function
@@ -1,11 +1,7 @@

# CLI (Command Line Interface) Tool

> [!NOTE]
> Currently the `ipex-llm` CLI supports the *LLaMA* (e.g., vicuna), *GPT-NeoX* (e.g., redpajama), *BLOOM* (e.g., phoenix) and *GPT2* (e.g., starcoder) model architectures; for other models, you may use the `transformers`-style or LangChain APIs.

## Convert Model
@@ -2,21 +2,15 @@

We also support finetuning LLMs (large language models) using QLoRA with IPEX-LLM 4-bit optimizations on Intel GPUs.

> [!NOTE]
> Currently, only Hugging Face Transformers models are supported for QLoRA finetuning.

To help you better understand the finetuning process, here we use the model [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) as an example.

**Make sure you have prepared the environment following the instructions [here](../install_gpu.md).**

> [!NOTE]
> If you are using an older version of `ipex-llm` (specifically, older than 2.5.0b20240104), you need to manually add `import intel_extension_for_pytorch as ipex` at the beginning of your code.

First, load the model using the `transformers`-style API and **set it to `to('xpu')`**. We specify `load_in_low_bit="nf4"` here to apply 4-bit NormalFloat optimization. According to the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf), using `"nf4"` could yield better model quality than `"int4"`.

@@ -32,6 +26,7 @@ model = model.to('xpu')
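
The loading code for this step is elided in the diff hunk above; a minimal sketch of what it typically looks like (the repository's QLoRA example may pass additional keyword arguments) is:

```python
# Sketch of the elided loading step; extra arguments used by the official example may differ.
from ipex_llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    load_in_low_bit="nf4",   # 4-bit NormalFloat quantization
)
model = model.to('xpu')      # move the low-bit model to the Intel GPU
```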
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Then, we have to apply some preprocessing to the model to prepare it for training.
 | 
					Then, we have to apply some preprocessing to the model to prepare it for training.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
```python
 | 
					```python
 | 
				
			||||||
from ipex_llm.transformers.qlora import prepare_model_for_kbit_training
 | 
					from ipex_llm.transformers.qlora import prepare_model_for_kbit_training
 | 
				
			||||||
model.gradient_checkpointing_enable()
 | 
					model.gradient_checkpointing_enable()
 | 
				
			||||||
| 
						 | 
					@ -39,6 +34,7 @@ model = prepare_model_for_kbit_training(model)
 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Next, we can obtain a Peft model from the optimized model and a configuration object containing the parameters as follows:
 | 
					Next, we can obtain a Peft model from the optimized model and a configuration object containing the parameters as follows:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
```python
 | 
					```python
 | 
				
			||||||
from ipex_llm.transformers.qlora import get_peft_model
 | 
					from ipex_llm.transformers.qlora import get_peft_model
 | 
				
			||||||
from peft import LoraConfig
 | 
					from peft import LoraConfig
 | 
				
			||||||
| 
						 | 
					@ -51,14 +47,8 @@ config = LoraConfig(r=8,
 | 
				
			||||||
model = get_peft_model(model, config)
 | 
					model = get_peft_model(model, config)
 | 
				
			||||||
```
 | 
					```
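
The remaining `LoraConfig` arguments are elided in the diff; purely for illustration, a typical configuration for a LLaMA-style model (the values below are examples, not the repository's exact settings) could be:

```python
# Illustrative LoRA configuration; values are examples, not the repository's exact settings.
from peft import LoraConfig

config = LoraConfig(
    r=8,                                           # LoRA rank (shown in the diff context above)
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```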

> [!IMPORTANT]
> Instead of `from peft import prepare_model_for_kbit_training, get_peft_model` as we would do for regular QLoRA using bitsandbytes and CUDA, we import them from `ipex_llm.transformers.qlora` here to get an IPEX-LLM compatible Peft model. The rest is just the same as the regular LoRA finetuning process using `peft` (a brief illustrative sketch follows the tip below).

> [!TIP]
> See the complete examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU).
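
Since the remaining steps follow the standard `peft`/`transformers` flow, here is a compact illustrative sketch of a training run; the tiny in-memory dataset and the hyperparameters are placeholders rather than the repository's example settings:

```python
# Illustrative training step with the standard transformers Trainer; `model` is the
# IPEX-LLM compatible Peft model from above and `tokenizer` is the matching tokenizer.
import transformers
from datasets import Dataset

tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers have no pad token by default

tiny_data = Dataset.from_dict({"text": ["Hello IPEX-LLM", "QLoRA finetuning sketch"]})
tiny_data = tiny_data.map(lambda sample: tokenizer(sample["text"]), batched=True)

trainer = transformers.Trainer(
    model=model,
    train_dataset=tiny_data,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,
        max_steps=2,              # placeholder; real runs train much longer
        learning_rate=2e-4,
        output_dir="outputs",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```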

docs/mddocs/Overview/KeyFeatures/gpu_supports.md (new file, 7 lines)

@@ -0,0 +1,7 @@

# GPU Supports

IPEX-LLM not only supports running large language models for inference, but also supports QLoRA finetuning on Intel GPUs.

* [Inference on GPU](./inference_on_gpu.md)
* [Finetune (QLoRA)](./finetune.md)
* [Multi GPUs selection](./multi_gpus_selection.md)
@@ -1,14 +0,0 @@

GPU Supports
================================

IPEX-LLM not only supports running large language models for inference, but also supports QLoRA finetuning on Intel GPUs.

* |inference_on_gpu|_
* `Finetune (QLoRA) <./finetune.html>`_
* `Multi GPUs selection <./multi_gpus_selection.html>`_

.. |inference_on_gpu| replace:: Inference on GPU
.. _inference_on_gpu: ./inference_on_gpu.html

.. |multi_gpus_selection| replace:: Multi GPUs selection
.. _multi_gpus_selection: ./multi_gpus_selection.html
@@ -22,21 +22,18 @@ output_ids = model.generate(input_ids, ...)

```python
output = tokenizer.batch_decode(output_ids)
```

> [!TIP]
> See the complete CPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels) and GPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels).

> [!NOTE]
> You may apply more low-bit optimizations (including INT8, INT5 and INT4) as follows:
>
> ```python
> model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int5")
> ```
>
> See the CPU example [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/More-Data-Types) and GPU example [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/More-Data-Types).

## Save & Load

After the model is optimized using INT4 (or INT8/INT5), you may save and load the optimized model as follows:

@@ -47,8 +44,5 @@ model.save_low_bit(model_path)

```python
new_model = AutoModelForCausalLM.load_low_bit(model_path)
```
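
Putting the elided pieces together, a small illustrative round trip (the paths are placeholders) could look like:

```python
# Illustrative save/load round trip with an IPEX-LLM low-bit model; paths are placeholders.
from ipex_llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)

model_path = './llama-2-ipex-llm-4-bit'   # placeholder save directory
model.save_low_bit(model_path)            # save the optimized (low-bit) weights

new_model = AutoModelForCausalLM.load_low_bit(model_path)  # reload without re-quantizing
```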

> [!TIP]
> See the complete CPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Save-Load) and GPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Save-Load).

docs/mddocs/Overview/KeyFeatures/index.md (new file, 13 lines)

@@ -0,0 +1,13 @@

# IPEX-LLM Key Features

You may run the LLMs using `ipex-llm` through one of the following APIs:

* [PyTorch API](./optimize_model.md)
* [`transformers`-style API](./transformers_style_api.md)
  * [Hugging Face `transformers` Format](./hugging_face_format.md)
  * [Native Format](./native_format.md)
* [LangChain API](./langchain_api.md)
* [GPU Supports](./gpu_supports.md)
  * [Inference on GPU](./inference_on_gpu.md)
  * [Finetune (QLoRA)](./finetune.md)
  * [Multi GPUs selection](./multi_gpus_selection.md)
@@ -1,33 +0,0 @@

IPEX-LLM Key Features
================================

You may run the LLMs using ``ipex-llm`` through one of the following APIs:

* `PyTorch API <./optimize_model.html>`_
* |transformers_style_api|_

  * |hugging_face_transformers_format|_
  * `Native Format <./native_format.html>`_

* `LangChain API <./langchain_api.html>`_
* |gpu_supports|_

  * |inference_on_gpu|_
  * `Finetune (QLoRA) <./finetune.html>`_
  * `Multi GPUs selection <./multi_gpus_selection.html>`_

.. |transformers_style_api| replace:: ``transformers``-style API
.. _transformers_style_api: ./transformers_style_api.html

.. |hugging_face_transformers_format| replace:: Hugging Face ``transformers`` Format
.. _hugging_face_transformers_format: ./hugging_face_format.html

.. |gpu_supports| replace:: GPU Supports
.. _gpu_supports: ./gpu_supports.html

.. |inference_on_gpu| replace:: Inference on GPU
.. _inference_on_gpu: ./inference_on_gpu.html

.. |multi_gpus_selection| replace:: Multi GPUs selection
.. _multi_gpus_selection: ./multi_gpus_selection.html
@@ -4,29 +4,22 @@ Apart from the significant acceleration capabilities on Intel CPUs, IPEX-LLM also

Compared with running on Intel CPUs, some additional operations are required on Intel GPUs. To help you better understand the process, here we use a popular model [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) as an example.

**Make sure you have prepared the environment following the instructions [here](../install_gpu.md).**

> [!NOTE]
> If you are using an older version of `ipex-llm` (specifically, older than 2.5.0b20240104), you need to manually add `import intel_extension_for_pytorch as ipex` at the beginning of your code.

## Load and Optimize Model

You could choose to use the [PyTorch API](./optimize_model.md) or the [`transformers`-style API](./transformers_style_api.md) on Intel GPUs according to your preference.

**Once you have the model with IPEX-LLM low-bit optimization, set it to `to('xpu')`**.

- For **PyTorch API**:

  You could optimize any PyTorch model with a "one-line code change", and the loading and optimizing process on Intel GPUs may be as follows:

  ```python
  # Take Llama-2-7b-chat-hf as an example
  from transformers import LlamaForCausalLM
  from ipex_llm import optimize_model

  # (model loading line elided in this diff hunk: @@ -35,17 +28,17 @@)
  model = optimize_model(model) # With only one line to enable IPEX-LLM INT4 optimization

  model = model.to('xpu') # Important after obtaining the optimized model
  ```

  > **Tip**:
  >
  > When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the `optimize_model` function. This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
  >
  > See the [API doc](https://ipex-llm.readthedocs.io/en/latest/doc/PythonAPI/LLM/optimize.html) for `optimize_model` to find more information.

  In particular, if you have saved the optimized model following the steps [here](./optimize_model.md#save), the loading process on Intel GPUs may be as follows:

  ```python
  from transformers import LlamaForCausalLM
  from ipex_llm.optimize import low_memory_init, load_low_bit

  # (saved-model loading lines elided in this diff hunk: @@ -57,13 +50,13 @@)
  model = load_low_bit(model, saved_dir) # Load the optimized model

  model = model.to('xpu') # Important after obtaining the optimized model
  ```

- For **`transformers`-style API**:

  You could run any Hugging Face Transformers model with the `transformers`-style API, and the loading and optimizing process on Intel GPUs may be as follows:

  ```python
  # Take Llama-2-7b-chat-hf as an example
  from ipex_llm.transformers import AutoModelForCausalLM

  # (loading comment elided in this diff hunk: @@ -71,28 +64,26 @@)
  model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', load_in_4bit=True)

  model = model.to('xpu') # Important after obtaining the optimized model
  ```

  > [!TIP]
  > When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the `from_pretrained` function. This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
  >
  > See the [API doc](https://ipex-llm.readthedocs.io/en/latest/doc/PythonAPI/LLM/transformers.html) to find more information.

  In particular, if you have saved the optimized model following the steps [here](./hugging_face_format.md#save--load), the loading process on Intel GPUs may be as follows:

  ```python
  from ipex_llm.transformers import AutoModelForCausalLM

  saved_dir='./llama-2-ipex-llm-4-bit'
  model = AutoModelForCausalLM.load_low_bit(saved_dir) # Load the optimized model

  model = model.to('xpu') # Important after obtaining the optimized model
  ```

  > [!TIP]
  > When running saved optimized models on Intel iGPUs for Windows users, we also recommend setting `cpu_embedding=True` in the `load_low_bit` function.

## Run Optimized Model

@@ -109,20 +100,11 @@ with torch.inference_mode():

```python
   output_str = tokenizer.decode(output[0], skip_special_tokens=True)
```
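
The beginning of this block is elided in the diff hunk above; a minimal sketch of the full generation step (the tokenizer choice, prompt and `max_new_tokens` are illustrative) is:

```python
# Illustrative end-to-end generation on an Intel GPU; `model` is the optimized model from above.
import torch
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')
input_ids = tokenizer.encode("What is AI?", return_tensors="pt").to('xpu')  # inputs on the GPU too

with torch.inference_mode():
    # A short warm-up run is recommended before timing the actual generation (see the note below).
    output = model.generate(input_ids, max_new_tokens=32)
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)

print(output_str)
```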

> [!NOTE]
> The initial generation of optimized LLMs on Intel GPUs could be slow. Therefore, it's recommended to perform a **warm-up** run before the actual generation.

> [!NOTE]
> If you are a Windows user, please also note that for **the first time** that **each model** runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.

> [!TIP]
> See the complete examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU).
@@ -18,23 +18,16 @@ doc_chain = load_qa_chain(ipex_llm, ...)

```python
output = doc_chain.run(...)
```
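
The chain setup above this fragment is elided in the diff; purely as an illustration, and assuming the `TransformersLLM` wrapper and `from_model_id` arguments shown below (they are not verified against the current `ipex_llm.langchain` API), the question-answering chain might be wired up as follows:

```python
# Hedged sketch only: the ipex_llm.langchain wrapper name and arguments are assumptions.
from langchain.chains.question_answering import load_qa_chain
from langchain.docstore.document import Document
from ipex_llm.langchain.llms import TransformersLLM

ipex_llm = TransformersLLM.from_model_id(
    model_id="meta-llama/Llama-2-7b-chat-hf",          # placeholder model id
    model_kwargs={"temperature": 0, "trust_remote_code": True},
)

docs = [Document(page_content="IPEX-LLM accelerates LLMs on Intel CPUs and GPUs.")]
doc_chain = load_qa_chain(ipex_llm, chain_type="stuff")
output = doc_chain.run(input_documents=docs, question="What does IPEX-LLM do?")
```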

> [!TIP]
> See the examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/LangChain/transformers_int4).

## Using Native INT4 Format

You may also convert Hugging Face *Transformers* models into native INT4 format, and then run the converted models using the LangChain API as follows.

> [!NOTE]
> - Currently only the llama/bloom/gptneox/starcoder model families are supported; for other models, you may use the Hugging Face `transformers` INT4 format as described [above](./langchain_api.md#using-hugging-face-transformers-int4-format).
> - You may choose the corresponding API developed for specific native models to load the converted model.

```python
from ipex_llm.langchain.llms import LlamaLLM

# (model conversion and chain construction elided in this diff hunk: @@ -50,8 +43,5 @@)
doc_chain.run(...)
```

> [!TIP]
> See the examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/LangChain/native_int4) for more information.