LLM: GPU Example Updates for Windows (#9992)
* modify aquila
* modify aquila2
* add baichuan
* modify baichuan2
* modify blue-lm
* modify chatglm3
* modify chinese-llama2
* modify codellama
* modify distil-whisper
* modify dolly-v1
* modify dolly-v2
* modify falcon
* modify flan-t5
* modify gpt-j
* modify internlm
* modify llama2
* modify mistral
* modify mixtral
* modify mpt
* modify phi-1_5
* modify qwen
* modify qwen-vl
* modify replit
* modify solar
* modify starcoder
* modify vicuna
* modify voiceassistant
* modify whisper
* modify yi
* modify aquila2
* modify baichuan
* modify baichuan2
* modify blue-lm
* modify chatglm2
* modify chatglm3
* modify codellama
* modify distil-whisper
* modify dolly-v1
* modify dolly-v2
* modify flan-t5
* modify llama2
* modify llava
* modify mistral
* modify mixtral
* modify phi-1_5
* modify qwen-vl
* modify replit
* modify solar
* modify starcoder
* modify yi
* correct the comments
* remove cpu_embedding in code for whisper and distil-whisper
* remove comment
* remove cpu_embedding for voice assistant
* revert modify voice assistant
* modify for voice assistant
* add comment for voice assistant
* fix comments
* fix comments
This commit is contained in:

parent c6d4f91777
commit 440cfe18ed

100 changed files with 3786 additions and 130 deletions
@@ -1,6 +1,6 @@
 # Aquila
 
-In this directory, you will find examples on how you could apply BigDL-LLM INT4 optimizations on Aquila models. For illustration purposes, we utilize the [BAAI/AquilaChat-7B](https://huggingface.co/BAAI/AquilaChat-7B) as a reference Aquila model.
+In this directory, you will find examples on how you could apply BigDL-LLM INT4 optimizations on Aquila models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [BAAI/AquilaChat-7B](https://huggingface.co/BAAI/AquilaChat-7B) as a reference Aquila model.
 
 > **Note**: If you want to download the Hugging Face *Transformers* model, please refer to [here](https://huggingface.co/docs/hub/models-downloading#using-git).
 >
@@ -13,6 +13,7 @@ To run these examples with BigDL-LLM, we have some recommended requirements for
 In the example [generate.py](./generate.py), we show a basic use case for a Aquila model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations.
 
 ### 1. Install
+#### 1.1 Installation on Linux
 We suggest using conda to manage environment:
 ```bash
 conda create -n llm python=3.9
@@ -20,20 +21,86 @@ conda activate llm
 # below command will install intel_extension_for_pytorch==2.1.10+xpu as default
 pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
 ```
+
+#### 1.2 Installation on Windows
+We suggest using conda to manage environment:
+```bash
+conda create -n llm python=3.9 libuv
+conda activate llm
+# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
+pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+```
+
 ### 2. Configures OneAPI environment variables
+#### 2.1 Configurations for Linux
 ```bash
 source /opt/intel/oneapi/setvars.sh
 ```
 
-### 3. Run
+#### 2.2 Configurations for Windows
+```cmd
+call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+```
+> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
 
-For optimal performance on Arc, it is recommended to set several environment variables.
+### 3. Runtime Configurations
+For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
+#### 3.1 Configurations for Linux
+<details>
+
+<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
+
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
+
+</details>
+
+<details>
+
+<summary>For Intel Data Center GPU Max Series</summary>
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
+</details>
+
+#### 3.2 Configurations for Windows
+<details>
+
+<summary>For Intel iGPU</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+</details>
+
+<details>
+
+<summary>For Intel Arc™ A300-Series or Pro A60</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+</details>
+
+<details>
+
+<summary>For other Intel dGPU Series</summary>
+
+There is no need to set further environment variables.
+
+</details>
+
+> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
+### 4. Running examples
+
 ```
 python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
 ```
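Both install paths in the hunk above install `bigdl-llm[xpu]`, which pulls in `intel_extension_for_pytorch`. As a quick sanity check before running the examples, the XPU device can be queried as below; this snippet is not part of the commit, just a sketch assuming the standard IPEX XPU build is installed:

```python
# Sanity check (not part of this diff): confirm the Intel GPU is visible to
# PyTorch before running generate.py. Assumes bigdl-llm[xpu] installed
# intel_extension_for_pytorch as described in the install steps above.
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device

print(torch.xpu.is_available())   # True if an Intel iGPU/dGPU is usable
print(torch.xpu.device_count())   # number of visible XPU devices
```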
@@ -41,6 +41,8 @@ if __name__ == '__main__':
 
     # Load model in 4 bit,
     # which convert the relevant layers in the model into INT4 format
+    # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
+    # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
     model = AutoModelForCausalLM.from_pretrained(model_path,
                                                  load_in_4bit=True,
                                                  trust_remote_code=True)
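The comments added in this hunk recommend `cpu_embedding=True` but the visible context does not show the parameter in use. A minimal sketch of what the recommended iGPU load looks like follows; the model path and the `.to('xpu')` move are taken from the surrounding examples, so treat this as illustrative rather than the committed code:

```python
# Illustrative sketch, not the committed example code.
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("BAAI/AquilaChat-7B",
                                             load_in_4bit=True,
                                             trust_remote_code=True,
                                             cpu_embedding=True)  # keep the embedding layer on CPU for iGPU runs
model = model.to('xpu')  # move the rest of the model to the Intel GPU
```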
@@ -1,6 +1,6 @@
 # Aquila2
 
-In this directory, you will find examples on how you could apply BigDL-LLM INT4 optimizations on Aquila2 models. For illustration purposes, we utilize the [BAAI/AquilaChat2-7B](https://huggingface.co/BAAI/AquilaChat2-7B) as a reference Aquila2 model.
+In this directory, you will find examples on how you could apply BigDL-LLM INT4 optimizations on Aquila2 models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [BAAI/AquilaChat2-7B](https://huggingface.co/BAAI/AquilaChat2-7B) as a reference Aquila2 model.
 
 > **Note**: If you want to download the Hugging Face *Transformers* model, please refer to [here](https://huggingface.co/docs/hub/models-downloading#using-git).
 >
@@ -13,6 +13,7 @@ To run these examples with BigDL-LLM, we have some recommended requirements for
 In the example [generate.py](./generate.py), we show a basic use case for a Aquila2 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations.
 
 ### 1. Install
+#### 1.1 Installation on Linux
 We suggest using conda to manage environment:
 ```bash
 conda create -n llm python=3.9
@@ -20,20 +21,87 @@ conda activate llm
 # below command will install intel_extension_for_pytorch==2.1.10+xpu as default
 pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
 ```
+
+#### 1.2 Installation on Windows
+We suggest using conda to manage environment:
+```bash
+conda create -n llm python=3.9 libuv
+conda activate llm
+# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
+pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+```
+
 ### 2. Configures OneAPI environment variables
+#### 2.1 Configurations for Linux
 ```bash
 source /opt/intel/oneapi/setvars.sh
 ```
 
-### 3. Run
+#### 2.2 Configurations for Windows
+```cmd
+call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+```
+> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
 
-For optimal performance on Arc, it is recommended to set several environment variables.
+### 3. Runtime Configurations
+For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
+#### 3.1 Configurations for Linux
+<details>
+
+<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
+
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
+
+</details>
+
+<details>
+
+<summary>For Intel Data Center GPU Max Series</summary>
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
+</details>
+
+#### 3.2 Configurations for Windows
+<details>
+
+<summary>For Intel iGPU</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+</details>
+
+<details>
+
+<summary>For Intel Arc™ A300-Series or Pro A60</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+</details>
+
+<details>
+
+<summary>For other Intel dGPU Series</summary>
+
+There is no need to set further environment variables.
+
+</details>
+
+> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
+### 4. Running examples
+
 ```
 python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
 ```
@@ -41,6 +41,8 @@ if __name__ == '__main__':
 
     # Load model in 4 bit,
     # which convert the relevant layers in the model into INT4 format
+    # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
+    # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
     model = AutoModelForCausalLM.from_pretrained(model_path,
                                                  load_in_4bit=True,
                                                  trust_remote_code=True)
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
 ## Example: Predict Tokens using `generate()` API
 In the example [generate.py](./generate.py), we show a basic use case for a Baichuan model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
 ### 1. Install
+#### 1.1 Installation on Linux
 We suggest using conda to manage environment:
 ```bash
 conda create -n llm python=3.9
@@ -16,20 +17,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
 pip install transformers_stream_generator  # additional package required for Baichuan-13B-Chat to conduct generation
 ```
+
+#### 1.2 Installation on Windows
+We suggest using conda to manage environment:
+```bash
+conda create -n llm python=3.9 libuv
+conda activate llm
+# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
+pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+pip install transformers_stream_generator  # additional package required for Baichuan-13B-Chat to conduct generation
+```
+
 ### 2. Configures OneAPI environment variables
+#### 2.1 Configurations for Linux
 ```bash
 source /opt/intel/oneapi/setvars.sh
 ```
 
-### 3. Run
+#### 2.2 Configurations for Windows
+```cmd
+call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+```
+> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
 
-For optimal performance on Arc, it is recommended to set several environment variables.
+### 3. Runtime Configurations
+For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
+#### 3.1 Configurations for Linux
+<details>
+
+<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
+
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
+
+</details>
+
+<details>
+
+<summary>For Intel Data Center GPU Max Series</summary>
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
+</details>
+
+#### 3.2 Configurations for Windows
+<details>
+
+<summary>For Intel iGPU</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+</details>
+
+<details>
+
+<summary>For Intel Arc™ A300-Series or Pro A60</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+</details>
+
+<details>
+
+<summary>For other Intel dGPU Series</summary>
+
+There is no need to set further environment variables.
+
+</details>
+
+> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
+### 4. Running examples
+
 ```
 python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
 ```
@@ -39,6 +39,8 @@ if __name__ == '__main__':
 
     # Load model in 4 bit,
     # which convert the relevant layers in the model into INT4 format
+    # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
+    # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
     model = AutoModelForCausalLM.from_pretrained(model_path,
                                                  load_in_4bit=True,
                                                  trust_remote_code=True,
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
 ## Example: Predict Tokens using `generate()` API
 In the example [generate.py](./generate.py), we show a basic use case for a Baichuan model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
 ### 1. Install
+#### 1.1 Installation on Linux
 We suggest using conda to manage environment:
 ```bash
 conda create -n llm python=3.9
@@ -16,20 +17,87 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
 pip install transformers_stream_generator  # additional package required for Baichuan-7B-Chat to conduct generation
 ```
+
+#### 1.2 Installation on Windows
+We suggest using conda to manage environment:
+```bash
+conda create -n llm python=3.9 libuv
+conda activate llm
+# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
+pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+pip install transformers_stream_generator  # additional package required for Baichuan-7B-Chat to conduct generation
+```
+
 ### 2. Configures OneAPI environment variables
+#### 2.1 Configurations for Linux
 ```bash
 source /opt/intel/oneapi/setvars.sh
 ```
 
-### 3. Run
+#### 2.2 Configurations for Windows
+```cmd
+call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+```
+> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
 
-For optimal performance on Arc, it is recommended to set several environment variables.
+### 3. Runtime Configurations
+For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
+#### 3.1 Configurations for Linux
+<details>
+
+<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
+
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
+
+</details>
+
+<details>
+
+<summary>For Intel Data Center GPU Max Series</summary>
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
+</details>
+
+#### 3.2 Configurations for Windows
+<details>
+
+<summary>For Intel iGPU</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+</details>
+
+<details>
+
+<summary>For Intel Arc™ A300-Series or Pro A60</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+</details>
+
+<details>
+
+<summary>For other Intel dGPU Series</summary>
+
+There is no need to set further environment variables.
+
+</details>
+
+> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
+### 4. Running examples
+
 ```
 python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
 ```
@@ -43,6 +43,8 @@ if __name__ == '__main__':
     # to enhance decoding speed, but has `"use_cache": false` in its model config,
     # it is important to set `use_cache=True` explicitly in the `generate` function
     # to obtain optimal performance with BigDL-LLM INT4 optimizations
+    # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
+    # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
     model = AutoModelForCausalLM.from_pretrained(model_path,
                                                  load_in_4bit=True,
                                                  trust_remote_code=True,
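The context lines above note that this model ships with `"use_cache": false` in its config, so the KV cache must be re-enabled explicitly at generation time. A hedged sketch of what that looks like, continuing from a `model` and `tokenizer` loaded as in the hunk above (the prompt is a placeholder):

```python
# Illustrative sketch: re-enable the KV cache at generate() time, since the
# model config carries "use_cache": false (see the comments above).
import torch

input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids.to('xpu')
with torch.inference_mode():
    output = model.generate(input_ids,
                            max_new_tokens=32,
                            use_cache=True)  # explicit override of the config default
print(tokenizer.decode(output[0], skip_special_tokens=True))
```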
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
 ## Example: Predict Tokens using `generate()` API
 In the example [generate.py](./generate.py), we show a basic use case for a BlueLM model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
 ### 1. Install
+#### 1.1 Installation on Linux
 We suggest using conda to manage environment:
 ```bash
 conda create -n llm python=3.9
@@ -15,20 +16,86 @@ conda activate llm
 pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
 ```
+
+#### 1.2 Installation on Windows
+We suggest using conda to manage environment:
+```bash
+conda create -n llm python=3.9 libuv
+conda activate llm
+# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
+pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+```
+
 ### 2. Configures OneAPI environment variables
+#### 2.1 Configurations for Linux
 ```bash
 source /opt/intel/oneapi/setvars.sh
 ```
 
-### 3. Run
+#### 2.2 Configurations for Windows
+```cmd
+call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+```
+> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
 
-For optimal performance on Arc, it is recommended to set several environment variables.
+### 3. Runtime Configurations
+For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
+#### 3.1 Configurations for Linux
+<details>
+
+<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
+
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
+
+</details>
+
+<details>
+
+<summary>For Intel Data Center GPU Max Series</summary>
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
+</details>
+
+#### 3.2 Configurations for Windows
+<details>
+
+<summary>For Intel iGPU</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+</details>
+
+<details>
+
+<summary>For Intel Arc™ A300-Series or Pro A60</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+</details>
+
+<details>
+
+<summary>For other Intel dGPU Series</summary>
+
+There is no need to set further environment variables.
+
+</details>
+
+> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
+### 4. Running examples
+
 ```bash
 python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
 ```
@@ -39,6 +39,8 @@ if __name__ == '__main__':
 
     # Load model in 4 bit,
     # which convert the relevant layers in the model into INT4 format
+    # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
+    # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
     model = AutoModelForCausalLM.from_pretrained(model_path,
                                                  load_in_4bit=True,
                                                  trust_remote_code=True,
@@ -8,6 +8,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
 ## Example 1: Predict Tokens using `generate()` API
 In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM3 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
 ### 1. Install
+#### 1.1 Installation on Linux
 We suggest using conda to manage environment:
 ```bash
 conda create -n llm python=3.9
@@ -16,20 +17,85 @@ conda activate llm
 pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
 ```
+
+#### 1.2 Installation on Windows
+We suggest using conda to manage environment:
+```bash
+conda create -n llm python=3.9 libuv
+conda activate llm
+# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
+pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+```
+
 ### 2. Configures OneAPI environment variables
+#### 2.1 Configurations for Linux
 ```bash
 source /opt/intel/oneapi/setvars.sh
 ```
 
-### 3. Run
+#### 2.2 Configurations for Windows
+```cmd
+call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+```
+> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
 
-For optimal performance on Arc, it is recommended to set several environment variables.
+### 3. Runtime Configurations
+For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
+#### 3.1 Configurations for Linux
+<details>
+
+<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
+
+</details>
+
+<details>
+
+<summary>For Intel Data Center GPU Max Series</summary>
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
+</details>
+
+#### 3.2 Configurations for Windows
+<details>
+
+<summary>For Intel iGPU</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+</details>
+
+<details>
+
+<summary>For Intel Arc™ A300-Series or Pro A60</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+</details>
+
+<details>
+
+<summary>For other Intel dGPU Series</summary>
+
+There is no need to set further environment variables.
+
+</details>
+
+> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
+### 4. Running examples
+
 ```
 python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
 ```
@@ -69,6 +135,7 @@ AI stands for Artificial Intelligence. It refers to the development of computer
 ## Example 2: Stream Chat using `stream_chat()` API
 In the example [streamchat.py](./streamchat.py), we show a basic use case for a ChatGLM3 model to stream chat, with BigDL-LLM INT4 optimizations.
 ### 1. Install
+#### 1.1 Installation on Linux
 We suggest using conda to manage environment:
 ```bash
 conda create -n llm python=3.9
@@ -77,20 +144,85 @@ conda activate llm
 pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
 ```
+
+#### 1.2 Installation on Windows
+We suggest using conda to manage environment:
+```bash
+conda create -n llm python=3.9 libuv
+conda activate llm
+# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
+pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+```
+
 ### 2. Configures OneAPI environment variables
+#### 2.1 Configurations for Linux
 ```bash
 source /opt/intel/oneapi/setvars.sh
 ```
+#### 2.2 Configurations for Windows
+```cmd
+call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+```
+> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
 
-### 3. Run
+### 3. Runtime Configurations
+For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
+#### 3.1 Configurations for Linux
+<details>
+
-For optimal performance on Arc, it is recommended to set several environment variables.
+<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
+
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
+
+</details>
+
+<details>
+
+<summary>For Intel Data Center GPU Max Series</summary>
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
+</details>
+
+#### 3.2 Configurations for Windows
+<details>
+
+<summary>For Intel iGPU</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+</details>
+
+<details>
+
+<summary>For Intel Arc™ A300-Series or Pro A60</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+</details>
+
+<details>
+
+<summary>For other Intel dGPU Series</summary>
+
+There is no need to set further environment variables.
+
+</details>
+
+> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
+### 4. Running examples
+
 **Stream Chat using `stream_chat()` API**:
 ```
 python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION
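The Example 2 section above drives ChatGLM3's model-specific `stream_chat()` method, which yields partial responses as tokens are generated. A rough sketch of the loop streamchat.py refers to; the repo id and question are placeholders, and `stream_chat()` comes from the ChatGLM remote code rather than from BigDL-LLM itself:

```python
# Rough sketch of the stream_chat() loop used by streamchat.py.
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer

model_path = "THUDM/chatglm3-6b"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path,
                                  load_in_4bit=True,
                                  trust_remote_code=True).to('xpu')

# stream_chat() yields (partial_response, history) pairs as tokens arrive
for response, history in model.stream_chat(tokenizer, "What is AI?", history=[]):
    print(response)  # prints the accumulated response so far on each step
```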
@@ -41,6 +41,8 @@ if __name__ == '__main__':
 
     # Load model in 4 bit,
     # which convert the relevant layers in the model into INT4 format
+    # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
+    # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
     model = AutoModel.from_pretrained(model_path,
                                       load_in_4bit=True,
                                       optimize_model=True,
@@ -39,6 +39,8 @@ if __name__ == '__main__':
 
     # Load model in 4 bit,
     # which convert the relevant layers in the model into INT4 format
+    # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
+    # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
     model = AutoModel.from_pretrained(model_path,
                                       load_in_4bit=True,
                                       trust_remote_code=True,
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
 ## Example: Predict Tokens using `generate()` API
 In the example [generate.py](./generate.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
 ### 1. Install
+#### 1.1 Installation on Linux
 We suggest using conda to manage environment:
 ```bash
 conda create -n llm python=3.9
@@ -14,20 +15,85 @@ conda activate llm
 # below command will install intel_extension_for_pytorch==2.1.10+xpu as default
 pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
 ```
+
+#### 1.2 Installation on Windows
+We suggest using conda to manage environment:
+```bash
+conda create -n llm python=3.9 libuv
+conda activate llm
+# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
+pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+```
+
 ### 2. Configures OneAPI environment variables
+#### 2.1 Configurations for Linux
 ```bash
 source /opt/intel/oneapi/setvars.sh
 ```
+#### 2.2 Configurations for Windows
+```cmd
+call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+```
+> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
+### 3. Runtime Configurations
+For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
+#### 3.1 Configurations for Linux
+<details>
+
-### 3. Run
+<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
 
-For optimal performance on Arc, it is recommended to set several environment variables.
+
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
+
+</details>
+
+<details>
+
+<summary>For Intel Data Center GPU Max Series</summary>
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
 | 
				
			||||||
 | 
					export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 | 
				
			||||||
 | 
					export ENABLE_SDP_FUSION=1
 | 
				
			||||||
 | 
					```
 | 
				
			||||||
 | 
					> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
 | 
				
			||||||
 | 
					</details>
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					#### 3.2 Configurations for Windows
 | 
				
			||||||
 | 
					<details>
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					<summary>For Intel iGPU</summary>
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					```cmd
 | 
				
			||||||
 | 
					set SYCL_CACHE_PERSISTENT=1
 | 
				
			||||||
 | 
					set BIGDL_LLM_XMX_DISABLED=1
 | 
				
			||||||
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					</details>
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					<details>
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					<summary>For Intel Arc™ A300-Series or Pro A60</summary>
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					```cmd
 | 
				
			||||||
 | 
					set SYCL_CACHE_PERSISTENT=1
 | 
				
			||||||
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					</details>
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					<details>
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					<summary>For other Intel dGPU Series</summary>
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					There is no need to set further environment variables.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					</details>
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
 | 
				
			||||||
 | 
					### 4. Running examples
 | 
				
			||||||
 | 
					
 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
 | 
					python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
| 
						 | 
					
 | 
				
			||||||
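Put end to end, the generate flow this README drives looks roughly like the following. It is a hedged sketch only: the model path and prompt are placeholders, and the actual generate.py adds argument parsing and timing on top:

```python
import torch
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import LlamaTokenizer

model_path = "meta-llama/Llama-2-7b-chat-hf"  # placeholder repo id or local path

# Load with BigDL-LLM INT4 optimizations, then move to the Intel GPU
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             optimize_model=True)
model = model.to('xpu')
tokenizer = LlamaTokenizer.from_pretrained(model_path)

input_ids = tokenizer.encode("What is AI?", return_tensors="pt").to('xpu')
with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0].cpu(), skip_special_tokens=True))
```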
@ -58,6 +58,8 @@ if __name__ == '__main__':
    # to enhance decoding speed, but has `"use_cache": false` in its model config,
    # it is important to set `use_cache=True` explicitly to obtain optimal
    # performance with BigDL-LLM INT4 optimizations
    # For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the from_pretrained function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 optimize_model=True,
@ -40,6 +40,8 @@ if __name__ == '__main__':

    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    # For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the from_pretrained function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 optimize_model=False,
@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a CodeLlama model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9

@ -16,20 +17,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install transformers==4.34.1 # CodeLlamaTokenizer is supported in higher versions of transformers
```

#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install transformers==4.34.1 # CodeLlamaTokenizer is supported in higher versions of transformers
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.

</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.

### 4. Running examples

```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
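Before launching any of these examples, it can be worth confirming that the XPU device is actually visible to PyTorch. A quick hedged check, assuming `intel_extension_for_pytorch` was installed as in section 1:

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device with PyTorch

# If this prints False, revisit the OneAPI setup in section 2
print("XPU available:", torch.xpu.is_available())
if torch.xpu.is_available():
    print("Device name:", torch.xpu.get_device_name(0))
```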
@ -8,6 +8,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Recognize Tokens using `generate()` API
In the example [recognize.py](./recognize.py), we show a basic use case for a Distil-Whisper model to conduct transcription using `pipeline()` API for long audio input, with BigDL-LLM INT4 optimizations.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).

After installing conda, create a Python environment for BigDL-LLM:

@ -19,19 +20,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install datasets soundfile librosa # required by audio processing
```

#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install datasets soundfile librosa # required by audio processing
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.

</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.

### 4. Running examples

```
python ./recognize.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --repo-id-or-data-path REPO_ID_OR_DATA_PATH --language LANGUAGE --chunk-length CHUNK_LENGTH --batch-size BATCH_SIZE
```
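The recognize example builds on the `pipeline()` API. Here is a hedged sketch of that flow; it assumes BigDL-LLM's `AutoModelForSpeechSeq2Seq` (as used in its Whisper examples), and the model id and audio file are placeholders:

```python
from bigdl.llm.transformers import AutoModelForSpeechSeq2Seq
from transformers import AutoProcessor, pipeline

model_path = "distil-whisper/distil-large-v2"  # placeholder repo id or local path

# INT4-optimized speech model, moved to the Intel GPU
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_path, load_in_4bit=True)
model = model.to('xpu')
processor = AutoProcessor.from_pretrained(model_path)

# chunk_length_s lets pipeline() transcribe long audio in chunks
pipe = pipeline("automatic-speech-recognition",
                model=model,
                tokenizer=processor.tokenizer,
                feature_extractor=processor.feature_extractor,
                chunk_length_s=15)
print(pipe("audio.wav")["text"])  # placeholder audio file
```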
@ -9,6 +9,7 @@ In the example [generate.py](./generate.py), we show a basic use case for a Doll

### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9

@ -16,20 +17,87 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.

</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.

### 4. Running examples

```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
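Dolly models expect an instruction-style prompt. Below is a sketch of the template commonly used with databricks/dolly checkpoints; treat the exact wording as an assumption if your checkpoint differs:

```python
# Instruction-following prompt template for Dolly-style models
DOLLY_PROMPT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)
prompt = DOLLY_PROMPT.format(instruction="What is AI?")
print(prompt)
```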
@ -47,6 +47,8 @@ if __name__ == '__main__':

    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    # For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the from_pretrained function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True)
    model = model.to('xpu')
@ -7,6 +7,7 @@ To run these examples with BigDL-LLM, we have some recommended requirements for
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Dolly v2 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9

@ -15,20 +16,85 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.

</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.

### 4. Running examples

```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
@ -47,6 +47,8 @@ if __name__ == '__main__':

    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    # For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the from_pretrained function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 trust_remote_code=True)
@ -8,6 +8,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Falcon model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9

@ -17,6 +18,16 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install einops # additional package required for falcon-7b-instruct to conduct generation
```

#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install einops # additional package required for falcon-7b-instruct to conduct generation
```

### 2. (Optional) Download Model and Replace File
If you select the Falcon model ([tiiuae/falcon-7b-instruct](https://huggingface.co/tiiuae/falcon-7b-instruct)), please note that its code (`modelling_RW.py`) does not support KV cache at the moment. To address this issue, we have provided an updated file ([falcon-7b-instruct/modelling_RW.py](./falcon-7b-instruct/modelling_RW.py)), which can be used to achieve the best performance using BigDL-LLM INT4 optimizations with KV cache support.
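If you want to script that replacement, here is a hedged sketch; it assumes `huggingface_hub` is installed, and the local directory name is a placeholder:

```python
import shutil
from huggingface_hub import snapshot_download

# Download the checkpoint locally, then swap in the KV-cache-enabled
# modelling file shipped alongside this example
local_dir = snapshot_download("tiiuae/falcon-7b-instruct",
                              local_dir="./falcon-7b-instruct-local")  # placeholder dir
shutil.copy("./falcon-7b-instruct/modelling_RW.py",
            f"{local_dir}/modelling_RW.py")
```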
@ -39,19 +50,75 @@ For `tiiuae/falcon-7b-instruct`, you should replace the `modelling_RW.py` with [

### 3. Configure OneAPI environment variables
#### 3.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 3.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 4. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 4.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.

</details>

#### 4.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.

### 5. Running examples

```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
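When launching from a Python script rather than a shell, the Windows `set` commands above can be mirrored with `os.environ`. A sketch, on the assumption that the variables must be set before `torch`/IPEX are imported so the SYCL runtime picks them up:

```python
import os

# Equivalent to `set SYCL_CACHE_PERSISTENT=1` and `set BIGDL_LLM_XMX_DISABLED=1`
# in section 4.2; do this before importing torch or intel_extension_for_pytorch
os.environ["SYCL_CACHE_PERSISTENT"] = "1"
os.environ["BIGDL_LLM_XMX_DISABLED"] = "1"
```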
@ -41,6 +41,8 @@ if __name__ == '__main__':

    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    # For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the from_pretrained function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 trust_remote_code=True,
@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Flan-t5 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).

After installing conda, create a Python environment for BigDL-LLM:

@ -18,20 +19,85 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.

</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.

### 4. Running examples

```bash
python ./generate.py --prompt 'Translate to German: My name is Arthur'
```
@ -42,6 +42,8 @@ if __name__ == '__main__':
    # "wo" module is not converted due to some issues of the T5 model
    # (https://github.com/huggingface/transformers/issues/20287),
    # "lm_head" module is not converted so that outputs are generated with better quality
    # For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the from_pretrained function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
    model = AutoModelForSeq2SeqLM.from_pretrained(model_path,
                                                  load_in_4bit=True,
                                                  optimize_model=False,
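For the Flan-t5 case the end-to-end flow is seq2seq rather than causal. A hedged sketch follows; the model path is a placeholder, and `cpu_embedding=True` is included on the assumption described in the Windows-iGPU comment above:

```python
from bigdl.llm.transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer

model_path = "google/flan-t5-xxl"  # placeholder repo id or local path

model = AutoModelForSeq2SeqLM.from_pretrained(model_path,
                                              load_in_4bit=True,
                                              optimize_model=False,
                                              cpu_embedding=True)
model = model.to('xpu')
tokenizer = AutoTokenizer.from_pretrained(model_path)

inputs = tokenizer("Translate to German: My name is Arthur",
                   return_tensors="pt").to('xpu')
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0].cpu(), skip_special_tokens=True))
```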
@ -39,6 +39,8 @@ if __name__ == '__main__':

    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    # For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the from_pretrained function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 trust_remote_code=True,
| 
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a GPT-J model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
@@ -14,20 +15,87 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU or an Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples

```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
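Putting the README steps above together, a sketch of what a `generate.py` example does end to end. This is a hedged illustration, not the script itself: `model_path`, `prompt`, and `max_new_tokens` stand in for the `--repo-id-or-model-path`, `--prompt`, and `--n-predict` arguments.

```python
import torch
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "REPO_ID_OR_MODEL_PATH"  # placeholder
prompt = "What is AI?"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True)
model = model.to('xpu')  # move the INT4-optimized model to the Intel GPU

with torch.inference_mode():
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
    # The first run may be slow while kernels compile (see the note above);
    # any timing should therefore be taken on a second, warmed-up run.
    output = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output[0].cpu(), skip_special_tokens=True))
```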
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for an InternLM model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
@@ -15,20 +16,85 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU or an Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples

```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
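The Windows runtime configurations above are ordinary environment variables, so when the example is launched from a wrapper script rather than a pre-configured shell, they can also be set from Python. A convenience sketch, under the assumption that the variables only need to be visible before the XPU runtime initializes:

```python
import os

# Equivalent to `set ...` in CMD; set these before importing torch /
# intel_extension_for_pytorch so the SYCL runtime sees them.
os.environ.setdefault("SYCL_CACHE_PERSISTENT", "1")   # persist compiled kernels
os.environ.setdefault("BIGDL_LLM_XMX_DISABLED", "1")  # iGPU-only setting

from bigdl.llm.transformers import AutoModelForCausalLM  # imported afterwards
```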
@@ -40,6 +40,8 @@ if __name__ == '__main__':

    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    # When running LLMs on Intel iGPUs on Windows, we recommend setting `cpu_embedding=True` in the from_pretrained function.
    # This allows the memory-intensive embedding layer to run on the CPU instead of the iGPU.
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 optimize_model=False,
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Llama2 model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
@@ -14,20 +15,85 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU or an Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples

```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
@@ -54,6 +54,8 @@ if __name__ == '__main__':

    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    # When running LLMs on Intel iGPUs on Windows, we recommend setting `cpu_embedding=True` in the from_pretrained function.
    # This allows the memory-intensive embedding layer to run on the CPU instead of the iGPU.
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 optimize_model=True,
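Note that the example scripts leave `optimize_model` at different values per model: `True` here for Llama2, `False` for InternLM and MPT. A rough illustration of the flag (hedged: its exact effect is model-specific, and the per-model defaults in these diffs presumably reflect where the extra optimizations apply):

```python
from bigdl.llm.transformers import AutoModelForCausalLM

model_path = "REPO_ID_OR_MODEL_PATH"  # placeholder

# optimize_model=True asks BigDL-LLM for additional model-level
# optimizations on top of INT4 quantization; some examples pass
# False, presumably for architectures where this does not apply.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             optimize_model=True,
                                             trust_remote_code=True)
```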
@@ -9,6 +9,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Mistral model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).

After installing conda, create a Python environment for BigDL-LLM:
@@ -23,20 +24,88 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install transformers==4.34.0
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu

# Refer to https://huggingface.co/mistralai/Mistral-7B-v0.1#troubleshooting and make sure you are using a stable version of Transformers, 4.34.0 or newer.
pip install transformers==4.34.0
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU or an Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples

```bash
python ./generate.py --prompt 'What is AI?'
```
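Since the Mistral example pins `transformers==4.34.0`, a quick runtime guard can catch an environment that was set up with an older version. A convenience sketch (not part of the example itself; `packaging` ships with standard pip installs):

```python
import transformers
from packaging import version

# Mistral support requires transformers 4.34.0 or newer (see the README above).
if version.parse(transformers.__version__) < version.parse("4.34.0"):
    raise RuntimeError(
        f"transformers {transformers.__version__} is too old; "
        "install 4.34.0 or newer, e.g. `pip install transformers==4.34.0`")
```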
@@ -40,6 +40,8 @@ if __name__ == '__main__':

    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    # When running LLMs on Intel iGPUs on Windows, we recommend setting `cpu_embedding=True` in the from_pretrained function.
    # This allows the memory-intensive embedding layer to run on the CPU instead of the iGPU.
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 optimize_model=True,
@@ -9,6 +9,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Mixtral model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).

After installing conda, create a Python environment for BigDL-LLM:
@@ -23,20 +24,88 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install transformers==4.36.0
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu

# Please make sure you are using a stable version of Transformers, 4.36.0 or newer.
pip install transformers==4.36.0
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU or an Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples

```bash
python ./generate.py --prompt 'What is AI?'
```
@@ -40,6 +40,8 @@ if __name__ == '__main__':

    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    # When running LLMs on Intel iGPUs on Windows, we recommend setting `cpu_embedding=True` in the from_pretrained function.
    # This allows the memory-intensive embedding layer to run on the CPU instead of the iGPU.
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 optimize_model=True,
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for an MPT model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
@@ -15,20 +16,86 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install einops  # additional package required for mpt-7b-chat and mpt-30b-chat to conduct generation
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU or an Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples

```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
@@ -41,6 +41,8 @@ if __name__ == '__main__':

    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    # When running LLMs on Intel iGPUs on Windows, we recommend setting `cpu_embedding=True` in the from_pretrained function.
    # This allows the memory-intensive embedding layer to run on the CPU instead of the iGPU.
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 optimize_model=False,
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a phi-1_5 model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
@@ -16,20 +17,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install einops # additional package required for phi-1_5 to conduct generation
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install einops # additional package required for phi-1_5 to conduct generation
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU or an Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples

```
python ./generate.py --prompt 'What is AI?'
```
@@ -42,6 +42,8 @@ if __name__ == '__main__':

    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    # When running LLMs on Intel iGPUs on Windows, we recommend setting `cpu_embedding=True` in the from_pretrained function.
    # This allows the memory-intensive embedding layer to run on the CPU instead of the iGPU.
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 trust_remote_code=True)
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Multimodal chat using `chat()` API
In the example [chat.py](./chat.py), we show a basic use case for a Qwen-VL model to start a multimodal chat using the `chat()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).

After installing conda, create a Python environment for BigDL-LLM:
@@ -18,19 +19,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional package required for Qwen-VL-Chat to conduct generation
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional package required for Qwen-VL-Chat to conduct generation
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU or an Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples

```
python ./chat.py
```
					@ -38,6 +38,8 @@ if __name__ == '__main__':
 | 
				
			||||||
        
 | 
					        
 | 
				
			||||||
    # Load model
 | 
					    # Load model
 | 
				
			||||||
    # For successful BigDL-LLM optimization on Qwen-VL-Chat, skip the 'c_fc' and 'out_proj' modules during optimization
 | 
					    # For successful BigDL-LLM optimization on Qwen-VL-Chat, skip the 'c_fc' and 'out_proj' modules during optimization
 | 
				
			||||||
 | 
					    # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
 | 
				
			||||||
 | 
					    # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
 | 
				
			||||||
    model = AutoModelForCausalLM.from_pretrained(model_path, 
 | 
					    model = AutoModelForCausalLM.from_pretrained(model_path, 
 | 
				
			||||||
                                                 load_in_4bit=True, 
 | 
					                                                 load_in_4bit=True, 
 | 
				
			||||||
                                                 trust_remote_code=True, 
 | 
					                                                 trust_remote_code=True, 
 | 
				
			||||||
| 
						 | 
					
 | 
				
			||||||
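The hunk above ends before the keyword its comment describes. A minimal sketch of what the completed Windows-iGPU call might look like (the `cpu_embedding=True` argument is the assumption; the other arguments come from the hunk itself):

```python
from bigdl.llm.transformers import AutoModelForCausalLM

# Sketch for Windows iGPU users: keep the memory-intensive embedding layer on the CPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True,
                                             cpu_embedding=True)  # assumed keyword, per the comment above
model = model.to('xpu')
```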
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Qwen model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
@@ -16,20 +17,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install tiktoken einops transformers_stream_generator  # additional package required for Qwen-7B-Chat to conduct generation
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# The command below installs intel_extension_for_pytorch==2.1.10+xpu by default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install tiktoken einops transformers_stream_generator  # additional package required for Qwen-7B-Chat to conduct generation
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.

</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, compilation may take several minutes.

### 4. Running examples

```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
@@ -47,6 +47,8 @@ if __name__ == '__main__':

    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    # For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the from_pretrained function.
    # This allows the memory-intensive embedding layer to run on the CPU instead of the iGPU.
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 optimize_model=True,
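As in the previous hunk, a hedged sketch of the completed call for Windows iGPU users; only the assumed `cpu_embedding=True` keyword is new relative to the hunk:

```python
# Sketch: same call as the hunk, with the embedding layer pinned to the CPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             optimize_model=True,
                                             cpu_embedding=True)  # assumption: Windows iGPU only
```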
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Replit model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).

After installing conda, create a Python environment for BigDL-LLM:
@@ -17,20 +18,86 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# The command below installs intel_extension_for_pytorch==2.1.10+xpu by default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.

</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, compilation may take several minutes.

### 4. Running examples

```
python ./generate.py --prompt 'def print_hello_world():'
```
@@ -39,6 +39,8 @@ if __name__ == '__main__':

    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    # For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the from_pretrained function.
    # This allows the memory-intensive embedding layer to run on the CPU instead of the iGPU.
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 optimize_model=True,
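One way to express the same recommendation without editing the call per platform is a conditional keyword (a sketch under the same assumption that `from_pretrained` accepts `cpu_embedding`):

```python
import platform

load_kwargs = dict(load_in_4bit=True, optimize_model=True)
if platform.system() == "Windows":
    load_kwargs["cpu_embedding"] = True  # assumed keyword; recommended for Intel iGPUs only
model = AutoModelForCausalLM.from_pretrained(model_path, **load_kwargs)
```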
@@ -1,5 +1,5 @@
# SOLAR
In this directory, you will find examples on how you could apply BigDL-LLM INT4 optimizations on SOLAR models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [upstage/SOLAR-10.7B-Instruct-v1.0](https://huggingface.co/upstage/SOLAR-10.7B-Instruct-v1.0) as a reference SOLAR model.

## 0. Requirements
To run these examples with BigDL-LLM on Intel GPUs, we have some recommended requirements for your machine; please refer to [here](../../../README.md#requirements) for more information.
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a SOLAR model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
@@ -16,20 +17,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install transformers==4.35.2 # required by SOLAR
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# The command below installs intel_extension_for_pytorch==2.1.10+xpu by default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install transformers==4.35.2 # required by SOLAR
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.

</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, compilation may take several minutes.

### 4. Running examples

```bash
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
@@ -42,6 +42,8 @@ if __name__ == '__main__':

    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    # For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the from_pretrained function.
    # This allows the memory-intensive embedding layer to run on the CPU instead of the iGPU.
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 trust_remote_code=True,
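A sketch of the completed SOLAR call under the same assumption (`cpu_embedding=True` added; the remaining arguments are from the hunk):

```python
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True,
                                             cpu_embedding=True)  # assumed, Windows iGPU only
```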
@@ -39,6 +39,8 @@ if __name__ == '__main__':

    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    # For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the from_pretrained function.
    # This allows the memory-intensive embedding layer to run on the CPU instead of the iGPU.
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 optimize_model=False,
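Note that this variant loads with `optimize_model=False`; the assumed Windows-iGPU keyword would sit alongside it, e.g.:

```python
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             optimize_model=False,
                                             cpu_embedding=True)  # assumption; independent of the INT4 load path
```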
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a StarCoder model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
@@ -15,20 +16,85 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# The command below installs intel_extension_for_pytorch==2.1.10+xpu by default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.

</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, compilation may take several minutes.

### 4. Running examples

```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
@@ -9,6 +9,7 @@ In the example [generate.py](./generate.py), we show a basic use case for a Vicu


### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
@@ -16,20 +17,87 @@ conda activate llm
# The command below installs intel_extension_for_pytorch==2.1.10+xpu by default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# The command below installs intel_extension_for_pytorch==2.1.10+xpu by default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.

</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, compilation may take several minutes.

### 4. Running examples

```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
@@ -40,6 +40,8 @@ if __name__ == '__main__':

    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    # For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the from_pretrained function.
    # This allows the memory-intensive embedding layer to run on the CPU instead of the iGPU.
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True)
    model = model.to('xpu')
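Here the call is already complete, so the Windows-iGPU variant would simply append the assumed keyword:

```python
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             cpu_embedding=True)  # assumed keyword for Intel iGPUs on Windows
model = model.to('xpu')
```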
@@ -9,6 +9,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Whisper model to conduct transcription using `generate()` API, then use the recognized text as the input for a Llama2 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
@@ -22,20 +23,89 @@ pip install SpeechRecognition sentencepiece colorama
pip install PyAudio inquirer sounddevice
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# The command below installs intel_extension_for_pytorch==2.1.10+xpu by default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install librosa soundfile datasets
pip install accelerate
pip install SpeechRecognition sentencepiece colorama
pip install PyAudio inquirer
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.

</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, compilation may take several minutes.

### 4. Running examples

```
python ./generate.py --llama2-repo-id-or-model-path REPO_ID_OR_MODEL_PATH --whisper-repo-id-or-model-path REPO_ID_OR_MODEL_PATH --n-predict N_PREDICT
```
@@ -20,6 +20,8 @@ import time
import argparse
import numpy as np
import inquirer

# For Windows users, please remove `import sounddevice`
import sounddevice

from bigdl.llm.transformers import AutoModelForCausalLM
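Rather than deleting the import by hand on Windows, one alternative (a sketch, not what the shipped example does) is to guard it by platform:

```python
import platform

# `sounddevice` is not used on Windows in this example;
# skipping the import there mirrors the manual edit the comment asks for.
if platform.system() != "Windows":
    import sounddevice
```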
@@ -92,6 +94,8 @@ if __name__ == '__main__':
    whisper.config.forced_decoder_ids = None
    whisper = whisper.to('xpu')

    # For Windows users running Llama models on Intel iGPUs, we recommend setting `cpu_embedding=True` in the from_pretrained function.
    # This allows the memory-intensive embedding layer to run on the CPU instead of the iGPU.
    llama_model = AutoModelForCausalLM.from_pretrained(llama_model_path, load_in_4bit=True, trust_remote_code=True, optimize_model=False, use_cache=True)
    llama_model = llama_model.to('xpu')
    tokenizer = LlamaTokenizer.from_pretrained(llama_model_path)
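For completeness, the one-line Llama2 load with the assumed keyword appended would read:

```python
llama_model = AutoModelForCausalLM.from_pretrained(llama_model_path, load_in_4bit=True, trust_remote_code=True,
                                                   optimize_model=False, use_cache=True,
                                                   cpu_embedding=True)  # assumption, per the comment above
```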
@@ -8,6 +8,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Recognize Tokens using `generate()` API
In the example [recognize.py](./recognize.py), we show a basic use case for a Whisper model to conduct transcription using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
@@ -17,12 +18,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install datasets soundfile librosa # required by audio processing
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# The command below installs intel_extension_for_pytorch==2.1.10+xpu by default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install datasets soundfile librosa # required by audio processing
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.

</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, compilation may take several minutes.

### 4. Running examples

```
python ./recognize.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --repo-id-or-data-path REPO_ID_OR_DATA_PATH --language LANGUAGE
```
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Yi model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).

After installing conda, create a Python environment for BigDL-LLM:
| 
						 | 
					@ -19,20 +20,87 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
 | 
				
			||||||
pip install einops # additional package required for Yi-6B to conduct generation
 | 
					pip install einops # additional package required for Yi-6B to conduct generation
 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					#### 1.2 Installation on Windows
 | 
				
			||||||
 | 
					We suggest using conda to manage environment:
 | 
				
			||||||
 | 
					```bash
 | 
				
			||||||
 | 
					conda create -n llm python=3.9 libuv
 | 
				
			||||||
 | 
					conda activate llm
 | 
				
			||||||
 | 
					# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
 | 
				
			||||||
 | 
					pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
 | 
				
			||||||
 | 
					pip install einops # additional package required for Yi-6B to conduct generation
 | 
				
			||||||
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU or on Intel Arc™ A300-Series or Pro A60 graphics, it may take several minutes to compile.
### 4. Running examples

```bash
python ./generate.py
```

@@ -45,6 +45,8 @@ if __name__ == '__main__':

    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 optimize_model=True,
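Taken together with the rest of the example script, the call that these comments describe could look like the following minimal sketch for a Windows iGPU; the `model_path` value and the `trust_remote_code` argument are illustrative assumptions, not lines taken from the diff:

```python
from bigdl.llm.transformers import AutoModelForCausalLM

model_path = "01-ai/Yi-6B"  # illustrative repo id; a local checkpoint path works too

# Load the model with BigDL-LLM INT4 optimization, keeping the memory-intensive
# embedding layer on the CPU as recommended above for Intel iGPUs on Windows.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             optimize_model=True,
                                             trust_remote_code=True,  # assumption: model ships remote code
                                             cpu_embedding=True)
model = model.to('xpu')
```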

@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for an Aquila2 model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).

After installing conda, create a Python environment for BigDL-LLM:

@@ -18,20 +19,85 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU or on Intel Arc™ A300-Series or Pro A60 graphics, it may take several minutes to compile.
### 4. Running examples

```bash
python ./generate.py --prompt 'AI是什么?'
```

@@ -45,6 +45,8 @@ if __name__ == '__main__':
                                                 low_cpu_mem_usage=True)

    # With only one line to enable BigDL-LLM optimization on the model
    # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
    model = optimize_model(model)

    model = model.to('xpu')
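Put together, the `optimize_model` path that these comments describe could look like the following minimal sketch for a Windows iGPU; the repo id and the `trust_remote_code` flag are illustrative assumptions:

```python
from transformers import AutoModelForCausalLM
from bigdl.llm import optimize_model

model_path = "BAAI/AquilaChat2-7B"  # illustrative repo id; a local checkpoint path works too
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             trust_remote_code=True,
                                             low_cpu_mem_usage=True)

# One line enables BigDL-LLM INT4 optimization; cpu_embedding=True keeps the
# embedding layer on the CPU, as recommended above for Intel iGPUs on Windows.
model = optimize_model(model, cpu_embedding=True)
model = model.to('xpu')
```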

@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Baichuan model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).

After installing conda, create a Python environment for BigDL-LLM:

@@ -19,20 +20,85 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install transformers_stream_generator  # additional package required for Baichuan-13B-Chat to conduct generation
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install transformers_stream_generator  # additional package required for Baichuan-13B-Chat to conduct generation
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU or on Intel Arc™ A300-Series or Pro A60 graphics, it may take several minutes to compile.
### 4. Running examples

```bash
python ./generate.py --prompt 'AI是什么?'
```

@@ -43,7 +109,7 @@ In the example, several arguments can be passed to satisfy your requirements:
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'AI是什么?'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
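For instance, a filled-in run could look like the following; the `--repo-id-or-model-path` argument is assumed from the script's usual interface (it is not shown in this hunk), and the values are arbitrary illustrations:

```bash
# hypothetical invocation with an explicit model id, prompt, and token budget
python ./generate.py --repo-id-or-model-path 'baichuan-inc/Baichuan-13B-Chat' --prompt 'What is AI?' --n-predict 64
```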
#### 4.1 Sample Output
#### [baichuan-inc/Baichuan-13B-Chat](https://huggingface.co/baichuan-inc/Baichuan-13B-Chat)
```log
Inference time: xxxx s
```

@@ -44,6 +44,8 @@ if __name__ == '__main__':
                                                 low_cpu_mem_usage=True)

    # With only one line to enable BigDL-LLM optimization on the model
    # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
    model = optimize_model(model)

    model = model.to('xpu')

@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Baichuan2 model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).

After installing conda, create a Python environment for BigDL-LLM:

@@ -19,20 +20,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install transformers_stream_generator  # additional package required for Baichuan2-7B-Chat to conduct generation
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install transformers_stream_generator  # additional package required for Baichuan2-7B-Chat to conduct generation
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU or on Intel Arc™ A300-Series or Pro A60 graphics, it may take several minutes to compile.
### 4. Running examples

```bash
python ./generate.py --prompt 'AI是什么?'
```

@@ -43,7 +110,7 @@ In the example, several arguments can be passed to satisfy your requirements:
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'What is AI?'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
#### 4.1 Sample Output
#### [baichuan-inc/Baichuan2-7B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat)
```log
Inference time: xxxx s
```

@@ -44,6 +44,8 @@ if __name__ == '__main__':
                                                 low_cpu_mem_usage=True)

    # With only one line to enable BigDL-LLM optimization on the model
    # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
    model = optimize_model(model)

    model = model.to('xpu')

@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a BlueLM model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).

After installing conda, create a Python environment for BigDL-LLM:

@@ -18,20 +19,85 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU or on Intel Arc™ A300-Series or Pro A60 graphics, it may take several minutes to compile.
### 4. Running examples

```bash
python ./generate.py --prompt 'AI是什么?'
```

@@ -42,7 +108,7 @@ In the example, several arguments can be passed to satisfy your requirements:
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'AI是什么?'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
#### 4.1 Sample Output
#### [vivo-ai/BlueLM-7B-Chat](https://huggingface.co/vivo-ai/BlueLM-7B-Chat)
```log
Inference time: xxxx s
```

@@ -44,6 +44,8 @@ if __name__ == '__main__':
                                                 low_cpu_mem_usage=True)

    # With only one line to enable BigDL-LLM optimization on the model
    # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
    model = optimize_model(model)

    model = model.to('xpu')

@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM2 model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).

After installing conda, create a Python environment for BigDL-LLM:

@@ -18,20 +19,85 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU or on Intel Arc™ A300-Series or Pro A60 graphics, it may take several minutes to compile.
### 4. Running examples

```bash
python ./generate.py --prompt 'AI是什么?'
```

@@ -42,7 +108,7 @@ In the example, several arguments can be passed to satisfy your requirements:
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'AI是什么?'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
#### 4.1 Sample Output
#### [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b)
```log
Inference time: xxxx s
```

@@ -65,6 +131,7 @@ Inference time: xxxx s
## Example 2: Stream Chat using `stream_chat()` API
In the example [streamchat.py](./streamchat.py), we show a basic use case for a ChatGLM2 model to stream chat, with BigDL-LLM INT4 optimizations.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).

After installing conda, create a Python environment for BigDL-LLM:

@@ -76,20 +143,84 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU or on Intel Arc™ A300-Series or Pro A60 graphics, it may take several minutes to compile.
### 4. Running examples

**Stream Chat using `stream_chat()` API**:
```
python ./streamchat.py
```
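For reference, the core loop of such a stream chat could be sketched roughly as follows; it assumes `model` and `tokenizer` are created as in the streamchat.py diff shown below, and that `stream_chat()` (provided by the ChatGLM2 remote code) yields progressively longer `(response, history)` pairs:

```python
import torch

# Assumes `model` (optimized and moved to 'xpu') and `tokenizer` already exist.
question = "What is AI?"
printed = ""
with torch.inference_mode():
    for response, history in model.stream_chat(tokenizer, question, history=[]):
        # each `response` is the full text so far; print only the new suffix
        print(response[len(printed):], end="", flush=True)
        printed = response
print()
```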

@@ -45,6 +45,8 @@ if __name__ == '__main__':
                                      low_cpu_mem_usage=True)

    # With only one line to enable BigDL-LLM optimization on the model
    # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
    model = optimize_model(model)

    model = model.to('xpu')

@@ -44,6 +44,8 @@ if __name__ == '__main__':
                                      low_cpu_mem_usage=True)

    # With only one line to enable BigDL-LLM optimization on the model
    # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
    model = optimize_model(model)

    model.to('xpu')

@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM3 model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).

After installing conda, create a Python environment for BigDL-LLM:

@@ -18,20 +19,85 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU or on Intel Arc™ A300-Series or Pro A60 graphics, it may take several minutes to compile.
### 4. Running examples
			||||||
 | 
					
 | 
				
			||||||
```bash
 | 
					```bash
 | 
				
			||||||
python ./generate.py --prompt 'AI是什么?'
 | 
					python ./generate.py --prompt 'AI是什么?'
 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
| 
						 | 
@@ -42,7 +108,7 @@ In the example, several arguments can be passed to satisfy your requirements:
 - `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'AI是什么?'`.
 - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
 
-#### 2.3 Sample Output
+#### 4.1 Sample Output
 #### [THUDM/chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b)
 ```log
 Inference time: xxxx s
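As a quick usage illustration, the two arguments documented above can be combined in one invocation (the values here are arbitrary):

```bash
# Generate up to 64 tokens for a custom prompt.
python ./generate.py --prompt 'What is AI?' --n-predict 64
```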
@@ -64,6 +130,7 @@ AI stands for Artificial Intelligence. It refers to the development of computer
 ## Example 2: Stream Chat using `stream_chat()` API
 In the example [streamchat.py](./streamchat.py), we show a basic use case for a ChatGLM3 model to stream chat, with BigDL-LLM INT4 optimizations.
 ### 1. Install
+#### 1.1 Installation on Linux
 We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
 
 After installing conda, create a Python environment for BigDL-LLM:
@@ -75,20 +142,83 @@ conda activate llm
 pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
 ```
+
+#### 1.2 Installation on Windows
+We suggest using conda to manage the environment:
+```bash
+conda create -n llm python=3.9 libuv
+conda activate llm
+# the below command will install intel_extension_for_pytorch==2.1.10+xpu by default
+pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+```
+
 ### 2. Configure OneAPI environment variables
+#### 2.1 Configurations for Linux
 ```bash
 source /opt/intel/oneapi/setvars.sh
 ```
+
-### 3. Run
+#### 2.2 Configurations for Windows
+```cmd
+call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+```
+> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
+### 3. Runtime Configurations
+For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
+#### 3.1 Configurations for Linux
+<details>
+
-For optimal performance on Arc, it is recommended to set several environment variables.
+<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
+
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
+
+</details>
+
+<details>
+
+<summary>For Intel Data Center GPU Max Series</summary>
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+</details>
+
+#### 3.2 Configurations for Windows
+<details>
+
+<summary>For Intel iGPU</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+</details>
+
+<details>
+
+<summary>For Intel Arc™ A300-Series or Pro A60</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+</details>
+
+<details>
+
+<summary>For other Intel dGPU Series</summary>
+
+There is no need to set further environment variables.
+
+</details>
+
+> Note: The first time each model runs on an Intel iGPU, Intel Arc™ A300-Series, or Pro A60, it may take several minutes to compile.
+### 4. Running examples
 **Stream Chat using `stream_chat()` API**:
 ```
 python ./streamchat.py
 ```
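Since `stream_chat()` is specific to the ChatGLM family, a minimal sketch of what streamchat.py exercises may help (the checkpoint choice is an assumption, and the exact script arguments live in streamchat.py itself; the `stream_chat` generator API is the ChatGLM one):

```python
from transformers import AutoTokenizer, AutoModel
from bigdl.llm import optimize_model

model_path = "THUDM/chatglm3-6b"  # hypothetical checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True,
                                  low_cpu_mem_usage=True)
model = optimize_model(model)  # BigDL-LLM INT4 optimization
model = model.to('xpu')

# stream_chat() yields (cumulative response, history) pairs as tokens arrive.
for response, history in model.stream_chat(tokenizer, "AI是什么?", history=[]):
    print(response)
```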
@@ -45,6 +45,8 @@ if __name__ == '__main__':
                                       low_cpu_mem_usage=True)
 
     # With only one line to enable BigDL-LLM optimization on model
+    # For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the `optimize_model` function.
+    # This allows the memory-intensive embedding layer to run on the CPU instead of the iGPU.
     model = optimize_model(model)
 
     model = model.to('xpu')
@@ -44,6 +44,8 @@ if __name__ == '__main__':
                                       low_cpu_mem_usage=True)
     
     # With only one line to enable BigDL-LLM optimization on model
+    # For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the `optimize_model` function.
+    # This allows the memory-intensive embedding layer to run on the CPU instead of the iGPU.
     model = optimize_model(model)
 
     model.to('xpu')
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
 ## Example: Predict Tokens using `generate()` API
 In the example [generate.py](./generate.py), we show a basic use case for a CodeLlama model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
 ### 1. Install
+#### 1.1 Installation on Linux
 We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
 
 After installing conda, create a Python environment for BigDL-LLM:
@@ -19,20 +20,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
 pip install transformers==4.34.1 # CodeLlamaTokenizer is supported in newer versions of transformers
 ```
+
+#### 1.2 Installation on Windows
+We suggest using conda to manage the environment:
+```bash
+conda create -n llm python=3.9 libuv
+conda activate llm
+# the below command will install intel_extension_for_pytorch==2.1.10+xpu by default
+pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+pip install transformers==4.34.1 # CodeLlamaTokenizer is supported in newer versions of transformers
+```
+
 ### 2. Configure OneAPI environment variables
+#### 2.1 Configurations for Linux
 ```bash
 source /opt/intel/oneapi/setvars.sh
 ```
+
-### 3. Run
+#### 2.2 Configurations for Windows
+```cmd
+call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+```
+> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
+### 3. Runtime Configurations
+For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
+#### 3.1 Configurations for Linux
+<details>
+
-For optimal performance on Arc, it is recommended to set several environment variables.
+<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
+
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
+
+</details>
+
+<details>
+
+<summary>For Intel Data Center GPU Max Series</summary>
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+</details>
+
+#### 3.2 Configurations for Windows
+<details>
+
+<summary>For Intel iGPU</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+</details>
+
+<details>
+
+<summary>For Intel Arc™ A300-Series or Pro A60</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+</details>
+
+<details>
+
+<summary>For other Intel dGPU Series</summary>
+
+There is no need to set further environment variables.
+
+</details>
+
+> Note: The first time each model runs on an Intel iGPU, Intel Arc™ A300-Series, or Pro A60, it may take several minutes to compile.
+### 4. Running examples
+
 ```bash
 python ./generate.py --prompt 'def print_hello_world():'
 ```
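Putting the Windows pieces above together, an iGPU session for this example reduces to the following sketch, which simply chains the commands documented in the hunk:

```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
python ./generate.py --prompt 'def print_hello_world():'
```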
@@ -43,7 +110,7 @@ In the example, several arguments can be passed to satisfy your requirements:
 - `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'def print_hello_world():'`.
 - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
 
-#### 2.3 Sample Output
+#### 4.1 Sample Output
 #### [codellama/CodeLlama-7b-hf](https://huggingface.co/codellama/CodeLlama-7b-hf)
 ```log
 Inference time: xxxx s
@@ -46,6 +46,8 @@ if __name__ == '__main__':
                                                  low_cpu_mem_usage=True)
 
     # With only one line to enable BigDL-LLM optimization on model
+    # For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the `optimize_model` function.
+    # This allows the memory-intensive embedding layer to run on the CPU instead of the iGPU.
     model = optimize_model(model)
 
     model = model.to('xpu')
@@ -8,6 +8,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
 ## Example: Recognize Tokens using `generate()` API
 In the example [recognize.py](./recognize.py), we show a basic use case for a Distil-Whisper model to conduct transcription using `pipeline()` API for long audio input, with BigDL-LLM INT4 optimizations.
 ### 1. Install
+#### 1.1 Installation on Linux
 We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
 
 After installing conda, create a Python environment for BigDL-LLM:
@@ -19,19 +20,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
 pip install datasets soundfile librosa # required by audio processing
 ```
+
+#### 1.2 Installation on Windows
+We suggest using conda to manage the environment:
+```bash
+conda create -n llm python=3.9 libuv
+conda activate llm
+# the below command will install intel_extension_for_pytorch==2.1.10+xpu by default
+pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+pip install datasets soundfile librosa # required by audio processing
+```
+
 ### 2. Configure OneAPI environment variables
+#### 2.1 Configurations for Linux
 ```bash
 source /opt/intel/oneapi/setvars.sh
 ```
+
-### 3. Run
-For optimal performance on Arc, it is recommended to set several environment variables.
+#### 2.2 Configurations for Windows
+```cmd
+call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+```
+> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
+### 3. Runtime Configurations
+For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
+#### 3.1 Configurations for Linux
+<details>
+
+<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
+
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
+
+</details>
+
+<details>
+
+<summary>For Intel Data Center GPU Max Series</summary>
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+</details>
+
+#### 3.2 Configurations for Windows
+<details>
+
+<summary>For Intel iGPU</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+</details>
+
+<details>
+
+<summary>For Intel Arc™ A300-Series or Pro A60</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+</details>
+
+<details>
+
+<summary>For other Intel dGPU Series</summary>
+
+There is no need to set further environment variables.
+
+</details>
+
+> Note: The first time each model runs on an Intel iGPU, Intel Arc™ A300-Series, or Pro A60, it may take several minutes to compile.
+### 4. Running examples
+
 ```
 python ./recognize.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --repo-id-or-data-path REPO_ID_OR_DATA_PATH --language LANGUAGE --chunk-length CHUNK_LENGTH --batch-size BATCH_SIZE
 ```
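For readers of the Distil-Whisper README, a concrete invocation of the placeholder command above might look like this; the model, dataset, and parameter values are hypothetical choices, while the flags themselves are the ones documented in the hunk:

```bash
# Hypothetical values for the documented flags.
python ./recognize.py --repo-id-or-model-path 'distil-whisper/distil-large-v2' \
                      --repo-id-or-data-path 'hf-internal-testing/librispeech_asr_dummy' \
                      --language english --chunk-length 15 --batch-size 2
```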
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
 ## Example: Predict Tokens using `generate()` API
 In the example [generate.py](./generate.py), we show a basic use case for a Dolly v1 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
 ### 1. Install
+#### 1.1 Installation on Linux
 We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
 
 After installing conda, create a Python environment for BigDL-LLM:
@@ -18,20 +19,85 @@ conda activate llm
 pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
 ```
+
+#### 1.2 Installation on Windows
+We suggest using conda to manage the environment:
+```bash
+conda create -n llm python=3.9 libuv
+conda activate llm
+# the below command will install intel_extension_for_pytorch==2.1.10+xpu by default
+pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+```
+
 ### 2. Configure OneAPI environment variables
+#### 2.1 Configurations for Linux
 ```bash
 source /opt/intel/oneapi/setvars.sh
 ```
+
-### 3. Run
+#### 2.2 Configurations for Windows
+```cmd
+call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+```
+> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
+### 3. Runtime Configurations
+For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
+#### 3.1 Configurations for Linux
+<details>
+
-For optimal performance on Arc, it is recommended to set several environment variables.
+<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
+
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
+
+</details>
+
+<details>
+
+<summary>For Intel Data Center GPU Max Series</summary>
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+</details>
+
+#### 3.2 Configurations for Windows
+<details>
+
+<summary>For Intel iGPU</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+</details>
+
+<details>
+
+<summary>For Intel Arc™ A300-Series or Pro A60</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+</details>
+
+<details>
+
+<summary>For other Intel dGPU Series</summary>
+
+There is no need to set further environment variables.
+
+</details>
+
+> Note: The first time each model runs on an Intel iGPU, Intel Arc™ A300-Series, or Pro A60, it may take several minutes to compile.
+### 4. Running examples
+
 ```bash
 python ./generate.py --prompt 'What is AI?'
 ```
@@ -42,7 +108,7 @@ In the example, several arguments can be passed to satisfy your requirements:
 - `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'What is AI?'`.
 - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
 
-#### 2.3 Sample Output
+#### 4.1 Sample Output
 #### [databricks/dolly-v1-6b](https://huggingface.co/databricks/dolly-v1-6b)
 ```log
 Inference time: xxxx s
@@ -51,6 +51,8 @@ if __name__ == '__main__':
                                                  low_cpu_mem_usage=True)
 
     # With only one line to enable BigDL-LLM optimization on model
+    # For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the `optimize_model` function.
+    # This allows the memory-intensive embedding layer to run on the CPU instead of the iGPU.
     model = optimize_model(model)
 
     model = model.to('xpu')
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
 ## Example: Predict Tokens using `generate()` API
 In the example [generate.py](./generate.py), we show a basic use case for a Dolly v2 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
 ### 1. Install
+#### 1.1 Installation on Linux
 We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
 
 After installing conda, create a Python environment for BigDL-LLM:
@@ -18,20 +19,85 @@ conda activate llm
 pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
 ```
+
+#### 1.2 Installation on Windows
+We suggest using conda to manage the environment:
+```bash
+conda create -n llm python=3.9 libuv
+conda activate llm
+# the below command will install intel_extension_for_pytorch==2.1.10+xpu by default
+pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+```
+
 ### 2. Configure OneAPI environment variables
+#### 2.1 Configurations for Linux
 ```bash
 source /opt/intel/oneapi/setvars.sh
 ```
+
-### 3. Run
+#### 2.2 Configurations for Windows
+```cmd
+call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+```
+> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
+### 3. Runtime Configurations
+For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
+#### 3.1 Configurations for Linux
+<details>
+
-For optimal performance on Arc, it is recommended to set several environment variables.
+<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
+
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
+
+</details>
+
+<details>
+
+<summary>For Intel Data Center GPU Max Series</summary>
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+</details>
+
+#### 3.2 Configurations for Windows
+<details>
+
+<summary>For Intel iGPU</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+</details>
+
+<details>
+
+<summary>For Intel Arc™ A300-Series or Pro A60</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+</details>
+
+<details>
+
+<summary>For other Intel dGPU Series</summary>
+
+There is no need to set further environment variables.
+
+</details>
+
+> Note: The first time each model runs on an Intel iGPU, Intel Arc™ A300-Series, or Pro A60, it may take several minutes to compile.
+### 4. Running examples
+
 ```bash
 python ./generate.py --prompt 'What is AI?'
 ```
@@ -42,7 +108,7 @@ In the example, several arguments can be passed to satisfy your requirements:
 - `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'What is AI?'`.
 - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
 
-#### 2.3 Sample Output
+#### 4.1 Sample Output
 
 #### [databricks/dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b)
 ```log
@@ -51,6 +51,8 @@ if __name__ == '__main__':
                                                  low_cpu_mem_usage=True)
 
     # With only one line to enable BigDL-LLM optimization on model
+    # For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the `optimize_model` function.
+    # This allows the memory-intensive embedding layer to run on the CPU instead of the iGPU.
     model = optimize_model(model)
 
     model = model.to('xpu')
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
 ## Example: Predict Tokens using `generate()` API
 In the example [generate.py](./generate.py), we show a basic use case for a Flan-t5 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
 ### 1. Install
+#### 1.1 Installation on Linux
 We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
 
 After installing conda, create a Python environment for BigDL-LLM:
@@ -18,20 +19,85 @@ conda activate llm
 pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
 ```
+
+#### 1.2 Installation on Windows
+We suggest using conda to manage the environment:
+```bash
+conda create -n llm python=3.9 libuv
+conda activate llm
+# the below command will install intel_extension_for_pytorch==2.1.10+xpu by default
+pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+```
+
 ### 2. Configure OneAPI environment variables
+#### 2.1 Configurations for Linux
 ```bash
 source /opt/intel/oneapi/setvars.sh
 ```
+
-### 3. Run
+#### 2.2 Configurations for Windows
+```cmd
+call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+```
+> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
+### 3. Runtime Configurations
+For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
+#### 3.1 Configurations for Linux
+<details>
+
-For optimal performance on Arc, it is recommended to set several environment variables.
+<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
+
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
+
+</details>
+
+<details>
+
+<summary>For Intel Data Center GPU Max Series</summary>
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+</details>
+
+#### 3.2 Configurations for Windows
+<details>
+
+<summary>For Intel iGPU</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+</details>
+
+<details>
+
+<summary>For Intel Arc™ A300-Series or Pro A60</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+</details>
+
+<details>
+
+<summary>For other Intel dGPU Series</summary>
+
+There is no need to set further environment variables.
+
+</details>
+
+> Note: The first time each model runs on an Intel iGPU, Intel Arc™ A300-Series, or Pro A60, it may take several minutes to compile.
+### 4. Running examples
+
 ```bash
 python ./generate.py --prompt 'Translate to German: My name is Arthur'
 ```
@@ -47,6 +47,8 @@ if __name__ == '__main__':
     # "wo" module is not converted due to some issues of T5 model
     # (https://github.com/huggingface/transformers/issues/20287),
     # "lm_head" module is not converted to generate outputs with better quality
+    # For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the `optimize_model` function.
+    # This allows the memory-intensive embedding layer to run on the CPU instead of the iGPU.
     model = optimize_model(model, modules_to_not_convert=["wo", "lm_head"])
 
     model = model.to('xpu')
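If the iGPU recommendation is combined with the T5-specific exclusions above, the call would look like the following sketch (the checkpoint is a hypothetical choice; `modules_to_not_convert` and `cpu_embedding` are the parameters named in the diff):

```python
from transformers import AutoModelForSeq2SeqLM
from bigdl.llm import optimize_model

# Hypothetical Flan-T5 checkpoint, for illustration only.
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl",
                                              low_cpu_mem_usage=True)

# Skip "wo" and "lm_head" (the T5 quirks noted above) and, on an Intel iGPU,
# keep the memory-intensive embedding layer on the CPU.
model = optimize_model(model,
                       modules_to_not_convert=["wo", "lm_head"],
                       cpu_embedding=True)
model = model.to('xpu')
```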
					@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
 | 
				
			||||||
## Example 1 - Basic Version: Predict Tokens using `generate()` API
 | 
					## Example 1 - Basic Version: Predict Tokens using `generate()` API
 | 
				
			||||||
In the example [generate.py](./generate.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
 | 
					In the example [generate.py](./generate.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
 | 
				
			||||||
### 1. Install
 | 
					### 1. Install
 | 
				
			||||||
 | 
					#### 1.1 Installation on Linux
 | 
				
			||||||
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
 | 
					We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
 | 
				
			||||||
 | 
					
 | 
				
			||||||
After installing conda, create a Python environment for BigDL-LLM:
 | 
					After installing conda, create a Python environment for BigDL-LLM:
 | 
				
			||||||
| 
						 | 
@ -18,20 +19,84 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# the below command will install intel_extension_for_pytorch==2.1.10+xpu by default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.

</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU, Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.

### 4. Running examples

```bash
python ./generate.py --prompt 'What is AI?'
```
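Before running the example, you can sanity-check that PyTorch sees the XPU device. A minimal sketch, assuming the `bigdl-llm[xpu]` install above pulled in the XPU build of `intel_extension_for_pytorch`:

```python
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401 -- importing registers the 'xpu' backend

# Expected to print True on a correctly configured Intel GPU setup;
# if it prints False, re-check the OneAPI and driver configuration above.
print(torch.xpu.is_available())
```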
@ -49,6 +49,8 @@ if __name__ == '__main__':
                                                 low_cpu_mem_usage=True)

    # With only one line to enable BigDL-LLM optimization on model
    # For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the `optimize_model` function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
    model = optimize_model(model)

    model = model.to('xpu')
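For Windows users on an Intel iGPU, the recommendation in the comments above would look roughly like the sketch below. The `cpu_embedding` keyword follows the comment's description (an assumption about the `optimize_model` signature), and the repo id is a stand-in model path:

```python
from transformers import AutoModelForCausalLM
from bigdl.llm import optimize_model

model_path = 'meta-llama/Llama-2-7b-chat-hf'  # stand-in model path
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             trust_remote_code=True,
                                             low_cpu_mem_usage=True)

# Keep the memory-intensive embedding layer on the CPU, as recommended
# above for Windows users on Intel iGPUs (assumed `cpu_embedding` keyword).
model = optimize_model(model, cpu_embedding=True)
model = model.to('xpu')
```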
@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req

## Example: Multi-turn chat centered around an image using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a LLaVA model to start a multi-turn chat centered around an image using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).

After installing conda, create a Python environment for BigDL-LLM:
@ -23,20 +24,89 @@ cp generate.py ./LLaVA/ # copy our example to the LLaVA folder
cd LLaVA # change the working directory to the LLaVA folder
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# the below command will install intel_extension_for_pytorch==2.1.10+xpu by default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu

git clone -b v1.1.1 --depth=1 https://github.com/haotian-liu/LLaVA.git # clone the llava library
pip install einops # install dependencies required by llava
cp generate.py ./LLaVA/ # copy our example to the LLaVA folder
cd LLaVA # change the working directory to the LLaVA folder
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.

</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU, Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.

### 4. Running examples

```bash
python ./generate.py --image-path-or-url 'https://llava-vl.github.io/static/images/monalisa.jpg'
```
@ -65,9 +135,9 @@ The sample input image is:

<a href="https://llava-vl.github.io/static/images/monalisa.jpg"><img width=400px src="https://llava-vl.github.io/static/images/monalisa.jpg" ></a>

### 5 Troubleshooting

#### 5.1 SSLError
If you encounter the following output, it means your machine has some trouble accessing huggingface.co.
```log
requests.exceptions.SSLError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /openai/clip-vit-large-patch14-336/resolve/main/config.json (Caused by SSLError(SSLZeroReturnError(6, 'TLS/SSL connection has been closed (EOF) (_ssl.c:1129)')))"),
@ -292,6 +292,8 @@ if __name__ == '__main__':
                                                                 model_name=model_name)

    # With only one line to enable BigDL-LLM optimization on model
    # For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the `optimize_model` function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
    model = optimize_model(model).to('xpu')

    # Generate image tensor
@ -9,6 +9,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req

## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Mistral model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).

After installing conda, create a Python environment for BigDL-LLM:
@ -23,20 +24,85 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install transformers==4.34.0
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# the below command will install intel_extension_for_pytorch==2.1.10+xpu by default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install transformers==4.34.0
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.

</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU, Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.

### 4. Running examples

```bash
python ./generate.py --prompt 'What is AI?'
```
@ -47,7 +113,7 @@ In the example, several arguments can be passed to satisfy your requirements:
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'What is AI?'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.

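For reference, options like these are typically wired up with `argparse`; the sketch below is illustrative only (not the exact parser in generate.py), mirroring the names and defaults listed above:

```python
import argparse

# Illustrative parser: mirrors the documented option names and defaults.
parser = argparse.ArgumentParser(description='Mistral token prediction with BigDL-LLM INT4')
parser.add_argument('--prompt', type=str, default='What is AI?',
                    help='prompt to be inferred (with integrated chat prompt format)')
parser.add_argument('--n-predict', type=int, default=32,
                    help='max number of tokens to predict')
args = parser.parse_args()

# argparse converts '--n-predict' to the attribute name 'n_predict'
print(args.prompt, args.n_predict)
```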
#### 4.1 Sample Output
#### [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1)
```log
Inference time: xxxx s
@ -45,6 +45,8 @@ if __name__ == '__main__':
                                                 low_cpu_mem_usage=True)

    # With only one line to enable BigDL-LLM optimization on model
    # For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the `optimize_model` function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
    model = optimize_model(model)

    model = model.to('xpu')
@ -9,6 +9,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req

## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Mixtral model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).

After installing conda, create a Python environment for BigDL-LLM:
@ -23,20 +24,87 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install transformers==4.36.0
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# the below command will install intel_extension_for_pytorch==2.1.10+xpu by default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu

# Please make sure you are using a stable version of Transformers, 4.36.0 or newer.
pip install transformers==4.36.0
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.

</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU, Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.

### 4. Running examples

```bash
python ./generate.py --prompt 'What is AI?'
```
@ -45,6 +45,8 @@ if __name__ == '__main__':
                                                 low_cpu_mem_usage=True)

    # With only one line to enable BigDL-LLM optimization on model
    # For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the `optimize_model` function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
    model = optimize_model(model)

    model = model.to('xpu')
@ -7,6 +7,7 @@ To run these examples with BigDL-LLM, we have some recommended requirements for

## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a phi-1_5 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).

After installing conda, create a Python environment for BigDL-LLM:
@ -18,20 +19,85 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install einops # additional package required for phi-1_5 to conduct generation
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# the below command will install intel_extension_for_pytorch==2.1.10+xpu by default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install einops # additional package required for phi-1_5 to conduct generation
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.

</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU, Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.

### 4. Running examples

```bash
python ./generate.py --prompt 'What is AI?'
```
@ -43,6 +43,8 @@ if __name__ == '__main__':
    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)

    # With only one line to enable BigDL-LLM optimization on model
    # For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the `optimize_model` function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
    model = optimize_model(model)
    model = model.to('xpu')
@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req

## Example: Multimodal chat using `chat()` API
In the example [chat.py](./chat.py), we show a basic use case for a Qwen-VL model to start a multimodal chat using `chat()` API, with BigDL-LLM's `optimize_model` API on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).

After installing conda, create a Python environment for BigDL-LLM:
@ -18,19 +19,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional packages required for Qwen-VL-Chat to conduct generation
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# the below command will install intel_extension_for_pytorch==2.1.10+xpu by default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional packages required for Qwen-VL-Chat to conduct generation
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.

</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU, Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.

### 4. Running examples

```bash
python ./chat.py
```
@ -41,6 +41,8 @@ if __name__ == '__main__':

    # With only one line to enable BigDL-LLM optimization on model
    # For successful BigDL-LLM optimization on Qwen-VL-Chat, skip the 'c_fc' and 'out_proj' modules during optimization
    # For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the `optimize_model` function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
    model = optimize_model(model,
                           low_bit='sym_int4',
                           modules_to_not_convert=['c_fc', 'out_proj'])
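Combining both comments, the full call would look roughly like the sketch below; `cpu_embedding=True` is the assumed keyword from the recommendation above, added alongside the module-skipping arguments already shown:

```python
from bigdl.llm import optimize_model

# Sketch: `model` is the Qwen-VL-Chat model loaded earlier in chat.py.
# Skip 'c_fc' and 'out_proj' during INT4 conversion (required for
# Qwen-VL-Chat, per the comment above) and, for Windows users on Intel
# iGPUs, keep the embedding layer on the CPU (assumed `cpu_embedding` keyword).
model = optimize_model(model,
                       low_bit='sym_int4',
                       modules_to_not_convert=['c_fc', 'out_proj'],
                       cpu_embedding=True)
model = model.to('xpu')
```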
@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req

## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Replit model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).

After installing conda, create a Python environment for BigDL-LLM:
@ -18,11 +19,33 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# the below command will install intel_extension_for_pytorch==2.1.10+xpu by default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
@ -32,6 +55,52 @@ export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.

</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU, Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.

### 4. Running examples

```bash
python ./generate.py --prompt 'def print_hello_world():'
```
@ -42,7 +111,7 @@ In the example, several arguments can be passed to satisfy your requirements:
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'def print_hello_world():'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.

#### 4.1 Sample Output
#### [replit/replit-code-v1-3b](https://huggingface.co/replit/replit-code-v1-3b)
```log
Inference time: xxxx s
@ -44,6 +44,8 @@ if __name__ == '__main__':
                                                 low_cpu_mem_usage=True)

    # With only one line to enable BigDL-LLM optimization on model
    # For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the `optimize_model` function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
    model = optimize_model(model)

    model = model.to('xpu')
@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req

## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a SOLAR model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).

After installing conda, create a Python environment for BigDL-LLM:
@ -19,20 +20,85 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install transformers==4.35.2 # required by SOLAR
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# the below command will install intel_extension_for_pytorch==2.1.10+xpu by default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install transformers==4.35.2 # required by SOLAR
```

### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.

</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU, Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.

### 4. Running examples
 | 
				
			||||||
```bash
 | 
					```bash
 | 
				
			||||||
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
 | 
					python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
| 
						 | 
					@ -43,7 +109,7 @@ In the example, several arguments can be passed to satisfy your requirements:
 | 
				
			||||||
- `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'What is AI?'`.
 | 
					- `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'What is AI?'`.
 | 
				
			||||||
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`.
 | 
					- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
#### 2.3 Sample Output
 | 
					#### 4.1 Sample Output
 | 
				
			||||||
#### [upstage/SOLAR-10.7B-Instruct-v1.0](https://huggingface.co/upstage/SOLAR-10.7B-Instruct-v1.0) 
 | 
					#### [upstage/SOLAR-10.7B-Instruct-v1.0](https://huggingface.co/upstage/SOLAR-10.7B-Instruct-v1.0) 
 | 
				
			||||||
```log
 | 
					```log
 | 
				
			||||||
Inference time: XXXX s
 | 
					Inference time: XXXX s
 | 
				
			||||||
| 
						 | 
					
 | 
				
			||||||
@@ -47,6 +47,8 @@ if __name__ == '__main__':
                                                  low_cpu_mem_usage=True)
     
     # With only one line to enable BigDL-LLM optimization on model
+    # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
+    # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
     model = optimize_model(model)
     model = model.to('xpu')
 
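For context on how the commented lines fit into the surrounding example, here is a condensed sketch of the whole flow, assuming `AutoTokenizer` and a simplified prompt format (the repository's generate.py handles both in more detail); the timing mirrors the `Inference time: XXXX s` line in the sample output:

```python
import time
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device
from transformers import AutoModelForCausalLM, AutoTokenizer
from bigdl.llm import optimize_model

model_path = 'upstage/SOLAR-10.7B-Instruct-v1.0'  # repo id from the README above

model = AutoModelForCausalLM.from_pretrained(model_path,
                                             trust_remote_code=True,
                                             use_cache=True,
                                             low_cpu_mem_usage=True)
model = optimize_model(model)  # one line to enable BigDL-LLM INT4 optimization
model = model.to('xpu')

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
input_ids = tokenizer.encode('What is AI?', return_tensors='pt').to('xpu')

with torch.inference_mode():
    st = time.time()
    output = model.generate(input_ids, max_new_tokens=32)  # --n-predict default
    torch.xpu.synchronize()  # wait for the XPU so the timing is meaningful
    print(f'Inference time: {time.time() - st:.2f} s')
print(tokenizer.decode(output[0], skip_special_tokens=True))
```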
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
 ## Example: Predict Tokens using `generate()` API
 In the example [generate.py](./generate.py), we show a basic use case for a StarCoder model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
 ### 1. Install
+#### 1.1 Installation on Linux
 We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
 
 After installing conda, create a Python environment for BigDL-LLM:
@@ -18,20 +19,85 @@ conda activate llm
 pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
 ```
+
+#### 1.2 Installation on Windows
+We suggest using conda to manage environment:
+```bash
+conda create -n llm python=3.9 libuv
+conda activate llm
+# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
+pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+```
+
 ### 2. Configures OneAPI environment variables
+#### 2.1 Configurations for Linux
 ```bash
 source /opt/intel/oneapi/setvars.sh
 ```
 
-### 3. Run
+#### 2.2 Configurations for Windows
+```cmd
+call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+```
+> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
+### 3. Runtime Configurations
+For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
+#### 3.1 Configurations for Linux
+<details>
+
-For optimal performance on Arc, it is recommended to set several environment variables.
+<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
+
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
+
+</details>
+
+<details>
+
+<summary>For Intel Data Center GPU Max Series</summary>
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
+</details>
+
+#### 3.2 Configurations for Windows
+<details>
+
+<summary>For Intel iGPU</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+</details>
+
+<details>
+
+<summary>For Intel Arc™ A300-Series or Pro A60</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+</details>
+
+<details>
+
+<summary>For other Intel dGPU Series</summary>
+
+There is no need to set further environment variables.
+
+</details>
+
+> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
+### 4. Running examples
+
 ```bash
 python ./generate.py --prompt 'def print_hello_world():'
 ```
@@ -42,7 +108,7 @@ In the example, several arguments can be passed to satisfy your requirements:
 - `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `def print_hello_world():'`.
 - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`.
 
-#### 2.3 Sample Output
+#### 4.1 Sample Output
 #### [bigcode/starcoder](https://huggingface.co/bigcode/starcoder)
 ```log
 Inference time: xxxx s
@@ -44,6 +44,8 @@ if __name__ == '__main__':
                                                  low_cpu_mem_usage=True)
 
     # With only one line to enable BigDL-LLM optimization on model
+    # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
+    # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
     model = optimize_model(model)
 
     model = model.to('xpu')
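Because the Windows path is new in this commit, a quick sanity check that the XPU device is actually visible can save a confusing first run. The snippet below is not part of the diff; it relies on the standard XPU API that `intel_extension_for_pytorch` (installed above as `2.1.10+xpu`) exposes:

```python
# Sanity check (not part of this commit): run after `call ...setvars.bat`
# (Windows) or `source ...setvars.sh` (Linux) in the activated llm env.
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device

print('XPU available:', torch.xpu.is_available())
if torch.xpu.is_available():
    print('Device:', torch.xpu.get_device_name(0))
```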
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
 ## Example: Predict Tokens using `generate()` API
 In the example [generate.py](./generate.py), we show a basic use case for a Yi model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
 ### 1. Install
+#### 1.1 Installation on Linux
 We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
 
 After installing conda, create a Python environment for BigDL-LLM:
@@ -19,20 +20,85 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
 pip install einops # additional package required for Yi-6B to conduct generation
 ```
+
+#### 1.2 Installation on Windows
+We suggest using conda to manage environment:
+```bash
+conda create -n llm python=3.9 libuv
+conda activate llm
+# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
+pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+pip install einops # additional package required for Yi-6B to conduct generation
+```
+
 ### 2. Configures OneAPI environment variables
+#### 2.1 Configurations for Linux
 ```bash
 source /opt/intel/oneapi/setvars.sh
 ```
 
-### 3. Run
+#### 2.2 Configurations for Windows
+```cmd
-For optimal performance on Arc, it is recommended to set several environment variables.
+call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+```
+> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
+### 3. Runtime Configurations
+For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
+#### 3.1 Configurations for Linux
+<details>
+
+<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
+
+</details>
+
+<details>
+
+<summary>For Intel Data Center GPU Max Series</summary>
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
+</details>
+
+#### 3.2 Configurations for Windows
+<details>
+
+<summary>For Intel iGPU</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+</details>
+
+<details>
+
+<summary>For Intel Arc™ A300-Series or Pro A60</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+</details>
+
+<details>
+
+<summary>For other Intel dGPU Series</summary>
+
+There is no need to set further environment variables.
+
+</details>
+
+> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
+### 4. Running examples
+
 ```bash
 python ./generate.py
 ```
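The runtime configurations above are shell-level (`export` on Linux, `set` in CMD). If a reader prefers to keep them next to the script, the same variables can be set from Python; this is a sketch, assuming it runs before `torch` and IPEX are first imported so the SYCL runtime sees the values at initialization:

```python
# Sketch only (not from this commit): the iGPU runtime variables from the
# README, set programmatically. Must run before importing torch/IPEX.
import os

os.environ['SYCL_CACHE_PERSISTENT'] = '1'   # iGPU, Arc A300-Series, Pro A60
os.environ['BIGDL_LLM_XMX_DISABLED'] = '1'  # iGPU only

import torch                                 # noqa: E402
import intel_extension_for_pytorch as ipex   # noqa: E402
```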
@@ -37,11 +37,13 @@ if __name__ == '__main__':
     args = parser.parse_args()
     model_path = args.repo_id_or_model_path
 
-    # Load model in 4 bit,
-    # which convert the relevant layers in the model into INT4 format
     model = AutoModelForCausalLM.from_pretrained(model_path,
                                                  trust_remote_code=True,
                                                  use_cache=True)
+    
+    # With only one line to enable BigDL-LLM optimization on model
+    # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
+    # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
     model = optimize_model(model)
     model = model.to('xpu')
 
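This hunk removes the old "Load model in 4 bit" comments because the INT4 conversion now happens inside `optimize_model`, whose default is 4-bit. For readers who want that default spelled out, `optimize_model` also takes a `low_bit` argument; the call below is illustrative and assumes the documented BigDL-LLM signature:

```python
from transformers import AutoModelForCausalLM
from bigdl.llm import optimize_model

# Placeholder path; generate.py receives the real one via --repo-id-or-model-path.
model = AutoModelForCausalLM.from_pretrained('path/to/yi-model',
                                             trust_remote_code=True,
                                             use_cache=True)
# sym_int4 is the documented default, written out explicitly here.
model = optimize_model(model, low_bit='sym_int4')
model = model.to('xpu')
```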