LLM: GPU Example Updates for Windows (#9992)

* modify aquila

* modify aquila2

* add baichuan

* modify baichuan2

* modify blue-lm

* modify chatglm3

* modify chinese-llama2

* modify codellama

* modify distil-whisper

* modify dolly-v1

* modify dolly-v2

* modify falcon

* modify flan-t5

* modify gpt-j

* modify internlm

* modify llama2

* modify mistral

* modify mixtral

* modify mpt

* modify phi-1_5

* modify qwen

* modify qwen-vl

* modify replit

* modify solar

* modify starcoder

* modify vicuna

* modify voiceassistant

* modify whisper

* modify yi

* modify aquila2

* modify baichuan

* modify baichuan2

* modify blue-lm

* modify chatglm2

* modify chatglm3

* modify codellama

* modify distil-whisper

* modify dolly-v1

* modify dolly-v2

* modify flan-t5

* modify llama2

* modify llava

* modify mistral

* modify mixtral

* modify phi-1_5

* modify qwen-vl

* modify replit

* modify solar

* modify starcoder

* modify yi

* correct the comments

* remove cpu_embedding in code for whisper and distil-whisper

* remove comment

* remove cpu_embedding for voice assistant

* revert modify voice assistant

* modify for voice assistant

* add comment for voice assistant

* fix comments

* fix comments
Jin Qiao 2024-01-29 11:25:11 +08:00 committed by GitHub
parent c6d4f91777
commit 440cfe18ed
100 changed files with 3786 additions and 130 deletions

View file

@@ -1,6 +1,6 @@
# Aquila
In this directory, you will find examples on how you could apply BigDL-LLM INT4 optimizations on Aquila models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [BAAI/AquilaChat-7B](https://huggingface.co/BAAI/AquilaChat-7B) as a reference Aquila model.
> **Note**: If you want to download the Hugging Face *Transformers* model, please refer to [here](https://huggingface.co/docs/hub/models-downloading#using-git).
>
@@ -13,6 +13,7 @@ To run these examples with BigDL-LLM, we have some recommended requirements for
In the example [generate.py](./generate.py), we show a basic use case for an Aquila model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9
@@ -20,20 +21,86 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
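If you want to verify the setup before moving on, a quick sanity check (not part of the original README; it assumes PyTorch and IPEX were already installed via `bigdl-llm[xpu]`) is to ask PyTorch whether the XPU device is visible:

```python
# Illustrative sanity check: confirm the oneAPI environment and the XPU device
# are visible to PyTorch after running setvars.
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device

print(torch.xpu.is_available())       # expect True if setvars ran correctly
print(torch.xpu.get_device_name(0))   # e.g. your Arc or iGPU device name
```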
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```

View file

@@ -41,6 +41,8 @@ if __name__ == '__main__':
# Load model in 4 bit,
# which converts the relevant layers in the model into INT4 format
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True)
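To make the iGPU recommendation above concrete, here is a minimal sketch of the load call with the embedding layer kept on the CPU (`model_path` is a placeholder supplied by the surrounding script; the exact call shape follows the example's pattern):

```python
from bigdl.llm.transformers import AutoModelForCausalLM

# Sketch: keep the memory-intensive embedding layer on the CPU while the
# INT4-converted layers run on the XPU device (Windows iGPU recommendation).
model = AutoModelForCausalLM.from_pretrained(model_path,          # placeholder
                                             load_in_4bit=True,
                                             trust_remote_code=True,
                                             cpu_embedding=True)  # iGPU setting
model = model.to('xpu')
```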

View file

@@ -1,6 +1,6 @@
# Aquila2
In this directory, you will find examples on how you could apply BigDL-LLM INT4 optimizations on Aquila2 models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [BAAI/AquilaChat2-7B](https://huggingface.co/BAAI/AquilaChat2-7B) as a reference Aquila2 model.
> **Note**: If you want to download the Hugging Face *Transformers* model, please refer to [here](https://huggingface.co/docs/hub/models-downloading#using-git).
>
@@ -13,6 +13,7 @@ To run these examples with BigDL-LLM, we have some recommended requirements for
In the example [generate.py](./generate.py), we show a basic use case for an Aquila2 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9
@@ -20,20 +21,87 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```

View file

@@ -41,6 +41,8 @@ if __name__ == '__main__':
# Load model in 4 bit,
# which converts the relevant layers in the model into INT4 format
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True)

View file

@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Baichuan model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9
@@ -16,20 +17,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install transformers_stream_generator # additional package required for Baichuan-13B-Chat to conduct generation
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install transformers_stream_generator # additional package required for Baichuan-13B-Chat to conduct generation
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```

View file

@@ -39,6 +39,8 @@ if __name__ == '__main__':
# Load model in 4 bit,
# which converts the relevant layers in the model into INT4 format
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True,

View file

@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Baichuan model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9
@@ -16,20 +17,87 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install transformers_stream_generator # additional package required for Baichuan-7B-Chat to conduct generation
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install transformers_stream_generator # additional package required for Baichuan-7B-Chat to conduct generation
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```

View file

@@ -43,6 +43,8 @@ if __name__ == '__main__':
# to enhance decoding speed, but has `"use_cache": false` in its model config,
# it is important to set `use_cache=True` explicitly in the `generate` function
# to obtain optimal performance with BigDL-LLM INT4 optimizations
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True,
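Because the `use_cache` comment above only takes effect at generation time, a brief hedged sketch of the explicit override (`tokenizer` and `prompt` are placeholders supplied elsewhere in the script):

```python
# Sketch: this model ships with "use_cache": false in its config, so pass
# use_cache=True explicitly to generate() to keep KV caching (and speed).
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to('xpu')
output = model.generate(input_ids,
                        use_cache=True,      # override the model config
                        max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```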

View file

@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a BlueLM model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9
@@ -15,20 +16,86 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
```bash
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```

View file

@@ -39,6 +39,8 @@ if __name__ == '__main__':
# Load model in 4 bit,
# which converts the relevant layers in the model into INT4 format
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True,

View file

@@ -8,6 +8,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM3 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9
@@ -16,20 +17,85 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
@@ -69,6 +135,7 @@ AI stands for Artificial Intelligence. It refers to the development of computer
## Example 2: Stream Chat using `stream_chat()` API
In the example [streamchat.py](./streamchat.py), we show a basic use case for a ChatGLM3 model to stream chat, with BigDL-LLM INT4 optimizations.
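Before the setup steps, here is a minimal sketch of what such a stream-chat loop looks like, assuming ChatGLM3's `stream_chat(tokenizer, query, history)` interface and the reference `THUDM/chatglm3-6b` checkpoint (details may differ from the actual streamchat.py):

```python
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer

model_path = "THUDM/chatglm3-6b"  # assumed reference checkpoint
model = AutoModel.from_pretrained(model_path, load_in_4bit=True,
                                  trust_remote_code=True).to('xpu')
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# stream_chat yields progressively longer responses as tokens are generated
for response, history in model.stream_chat(tokenizer, "What is AI?", history=[]):
    print(response)
```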
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9
@@ -77,20 +144,85 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
**Stream Chat using `stream_chat()` API**:
```
python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION

View file

@@ -41,6 +41,8 @@ if __name__ == '__main__':
# Load model in 4 bit,
# which converts the relevant layers in the model into INT4 format
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = AutoModel.from_pretrained(model_path,
                                  load_in_4bit=True,
                                  optimize_model=True,

View file

@@ -39,6 +39,8 @@ if __name__ == '__main__':
# Load model in 4 bit,
# which converts the relevant layers in the model into INT4 format
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = AutoModel.from_pretrained(model_path,
                                  load_in_4bit=True,
                                  trust_remote_code=True,

View file

@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9
@@ -14,20 +15,85 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```

View file

@@ -58,6 +58,8 @@ if __name__ == '__main__':
# to enhance decoding speed, but has `"use_cache": false` in its model config,
# it is important to set `use_cache=True` explicitly to obtain optimal
# performance with BigDL-LLM INT4 optimizations
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             optimize_model=True,

View file

@@ -40,6 +40,8 @@ if __name__ == '__main__':
# Load model in 4 bit,
# which converts the relevant layers in the model into INT4 format
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             optimize_model=False,

View file

@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a CodeLlama model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9
@@ -16,20 +17,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install transformers==4.34.1 # CodeLlamaTokenizer is supported in higher versions of transformers
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install transformers==4.34.1 # CodeLlamaTokenizer is supported in higher version of transformers
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```

View file

@@ -8,6 +8,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Recognize Tokens using `generate()` API
In the example [recognize.py](./recognize.py), we show a basic use case for a Distil-Whisper model to conduct transcription using `pipeline()` API for long audio input, with BigDL-LLM INT4 optimizations.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@@ -19,19 +20,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install datasets soundfile librosa # required by audio processing
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install datasets soundfile librosa # required by audio processing
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
```
python ./recognize.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --repo-id-or-data-path REPO_ID_OR_DATA_PATH --language LANGUAGE --chunk-length CHUNK_LENGTH --batch-size BATCH_SIZE
```
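As a rough illustration of what recognize.py does, here is a hedged sketch built on two assumptions: that BigDL-LLM exposes an `AutoModelForSpeechSeq2Seq` wrapper as used in its Whisper examples, and that the standard transformers `pipeline()` accepts the loaded model and processor; the checkpoint name and `audio.wav` are placeholders:

```python
# Illustrative sketch only: INT4 Distil-Whisper transcription of long audio
# through the transformers pipeline() API, chunked to fit the model window.
from bigdl.llm.transformers import AutoModelForSpeechSeq2Seq  # assumed wrapper
from transformers import AutoProcessor, pipeline

model_id = "distil-whisper/distil-large-v2"                   # placeholder
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id,
                                                  load_in_4bit=True).to('xpu')
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline("automatic-speech-recognition",
                model=model,
                tokenizer=processor.tokenizer,
                feature_extractor=processor.feature_extractor,
                chunk_length_s=15)       # chunking enables long audio input
print(pipe("audio.wav")["text"])         # placeholder audio file
```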

View file

@@ -9,6 +9,7 @@ In the example [generate.py](./generate.py), we show a basic use case for a Doll
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9
@@ -16,20 +17,87 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```

View file

@@ -47,6 +47,8 @@ if __name__ == '__main__':
# Load model in 4 bit,
# which converts the relevant layers in the model into INT4 format
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True)
model = model.to('xpu')
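Once the model lives on the XPU, inputs must be placed on the same device before generation; a brief illustrative sketch (placeholder `tokenizer`/`prompt`, following the pattern these examples use):

```python
# Sketch: tensors passed to generate() must be on the same 'xpu' device as the
# model; move the output back to the CPU before decoding.
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to('xpu')
output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0].cpu(), skip_special_tokens=True))
```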

View file

@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM, we have some recommended requirements for
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Dolly v2 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9
@@ -15,20 +16,85 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```

View file

@@ -47,6 +47,8 @@ if __name__ == '__main__':
# Load model in 4 bit,
# which converts the relevant layers in the model into INT4 format
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True)

View file

@@ -8,6 +8,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Falcon model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9
@@ -17,6 +18,16 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install einops # additional package required for falcon-7b-instruct to conduct generation
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install einops # additional package required for falcon-7b-instruct to conduct generation
```
### 2. (Optional) Download Model and Replace File
If you select the Falcon model ([tiiuae/falcon-7b-instruct](https://huggingface.co/tiiuae/falcon-7b-instruct)), please note that their code (`modelling_RW.py`) does not support KV cache at the moment. To address this issue, we have provided an updated file ([falcon-7b-instruct/modelling_RW.py](./falcon-7b-instruct/modelling_RW.py)), which can be used to achieve the best performance using BigDL-LLM INT4 optimizations with KV cache support.
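For illustration, the replacement step itself can be as simple as copying the patched file over the downloaded one; a hypothetical sketch (the download directory is a placeholder for wherever your local model copy lives):

```python
# Hypothetical helper: overwrite the downloaded checkpoint's modelling_RW.py
# with the patched version shipped alongside this example.
import shutil

downloaded_model_dir = "path/to/local/falcon-7b-instruct"   # placeholder path
shutil.copyfile("falcon-7b-instruct/modelling_RW.py",       # patched file
                f"{downloaded_model_dir}/modelling_RW.py")
```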
@ -39,19 +50,75 @@ For `tiiuae/falcon-7b-instruct`, you should replace the `modelling_RW.py` with [
### 3. Configures OneAPI environment variables ### 3. Configures OneAPI environment variables
#### 3.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 3.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 4. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 4.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 4.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 5. Running examples
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
@@ -41,6 +41,8 @@ if __name__ == '__main__':
# Load model in 4 bit,
# which converts the relevant layers in the model into INT4 format
# For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True,
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Flan-t5 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@@ -18,20 +19,85 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
```bash
python ./generate.py --prompt 'Translate to German: My name is Arthur'
```
@@ -42,6 +42,8 @@ if __name__ == '__main__':
# "wo" module is not converted due to some issues of T5 model # "wo" module is not converted due to some issues of T5 model
# (https://github.com/huggingface/transformers/issues/20287), # (https://github.com/huggingface/transformers/issues/20287),
# "lm_head" module is not converted to generate outputs with better quality # "lm_head" module is not converted to generate outputs with better quality
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = AutoModelForSeq2SeqLM.from_pretrained(model_path, model = AutoModelForSeq2SeqLM.from_pretrained(model_path,
load_in_4bit=True, load_in_4bit=True,
optimize_model=False, optimize_model=False,
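Put together, the Flan-T5 load described above would look roughly like the sketch below; the checkpoint name is a placeholder, and the `cpu_embedding=True` argument reflects the Windows-iGPU recommendation rather than a line visible in this hunk:

```python
# Sketch of the Flan-T5 loading call assembled from the comments above.
from bigdl.llm.transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-xxl",   # placeholder checkpoint
    load_in_4bit=True,      # INT4-convert eligible layers; per the comments, "wo" and "lm_head" stay unconverted
    optimize_model=False,
    cpu_embedding=True)     # recommended for Windows iGPU users, per the comment above
```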
@@ -39,6 +39,8 @@ if __name__ == '__main__':
# Load model in 4 bit,
# which converts the relevant layers in the model into INT4 format
# For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True,
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a GPT-J model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 conda create -n llm python=3.9
@@ -14,20 +15,87 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for an InternLM model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 conda create -n llm python=3.9
@@ -15,20 +16,85 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
@@ -40,6 +40,8 @@ if __name__ == '__main__':
# Load model in 4 bit,
# which converts the relevant layers in the model into INT4 format
# For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             optimize_model=False,
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 conda create -n llm python=3.9
@@ -14,20 +15,85 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
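The `generate.py` scripts in these examples all follow the same shape. A minimal sketch of that flow, assuming the standard BigDL-LLM XPU pattern (the checkpoint, prompt, and token count are placeholders):

```python
import torch
from transformers import LlamaTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM

model_path = "meta-llama/Llama-2-7b-chat-hf"  # placeholder; a local path also works
tokenizer = LlamaTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             optimize_model=True,
                                             trust_remote_code=True)
model = model.to('xpu')  # move the INT4 model to the Intel GPU

input_ids = tokenizer.encode("What is AI?", return_tensors="pt").to('xpu')
with torch.inference_mode():
    # the first generate() on a device triggers kernel compilation (see the note above)
    output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```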
@@ -54,6 +54,8 @@ if __name__ == '__main__':
# Load model in 4 bit,
# which converts the relevant layers in the model into INT4 format
# For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             optimize_model=True,
@@ -9,6 +9,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Mistral model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@@ -23,20 +24,88 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install transformers==4.34.0
```
#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
# Refer to https://huggingface.co/mistralai/Mistral-7B-v0.1#troubleshooting, and make sure you are using a stable version of Transformers, 4.34.0 or newer.
pip install transformers==4.34.0
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
```bash
python ./generate.py --prompt 'What is AI?'
```
@@ -40,6 +40,8 @@ if __name__ == '__main__':
# Load model in 4 bit,
# which converts the relevant layers in the model into INT4 format
# For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             optimize_model=True,
@@ -9,6 +9,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Mixtral model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@@ -23,20 +24,88 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install transformers==4.36.0
```
#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
# Please make sure you are using a stable version of Transformers, 4.36.0 or newer.
pip install transformers==4.36.0
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
```bash
python ./generate.py --prompt 'What is AI?'
```
@@ -40,6 +40,8 @@ if __name__ == '__main__':
# Load model in 4 bit,
# which converts the relevant layers in the model into INT4 format
# For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             optimize_model=True,
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for an MPT model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 conda create -n llm python=3.9
@@ -15,20 +16,86 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install einops # additional package required for mpt-7b-chat and mpt-30b-chat to conduct generation
```
#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
@@ -41,6 +41,8 @@ if __name__ == '__main__':
# Load model in 4 bit,
# which converts the relevant layers in the model into INT4 format
# For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             optimize_model=False,
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a phi-1_5 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 conda create -n llm python=3.9
@@ -16,20 +17,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install einops # additional package required for phi-1_5 to conduct generation
```
#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install einops # additional package required for phi-1_5 to conduct generation
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
```
python ./generate.py --prompt 'What is AI?'
```
@@ -42,6 +42,8 @@ if __name__ == '__main__':
# Load model in 4 bit,
# which converts the relevant layers in the model into INT4 format
# For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True)
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Multimodal chat using `chat()` API
In the example [chat.py](./chat.py), we show a basic use case for a Qwen-VL model to start a multimodal chat using `chat()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@@ -18,19 +19,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional package required for Qwen-VL-Chat to conduct generation
```
#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional package required for Qwen-VL-Chat to conduct generation
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
```
python ./chat.py
```
@@ -38,6 +38,8 @@ if __name__ == '__main__':
# Load model
# For successful BigDL-LLM optimization on Qwen-VL-Chat, skip the 'c_fc' and 'out_proj' modules during optimization
# For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True,
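For reference, a single multimodal round with Qwen-VL's `chat()` looks roughly like this; the image URL and question are placeholders, and the call shapes follow Qwen-VL's documented interface rather than lines shown in this hunk:

```python
# Hypothetical single-turn multimodal chat, following Qwen-VL's documented API.
query = tokenizer.from_list_format([
    {'image': 'https://example.com/demo.jpeg'},  # placeholder image URL
    {'text': 'What is in the image?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```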
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Qwen model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 conda create -n llm python=3.9
@@ -16,20 +17,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install tiktoken einops transformers_stream_generator # additional package required for Qwen-7B-Chat to conduct generation
```
#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install tiktoken einops transformers_stream_generator # additional package required for Qwen-7B-Chat to conduct generation
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
@@ -47,6 +47,8 @@ if __name__ == '__main__':
# Load model in 4 bit,
# which converts the relevant layers in the model into INT4 format
# For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             optimize_model=True,
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Replit model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@@ -17,20 +18,86 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
```
python ./generate.py --prompt 'def print_hello_world():'
```
@@ -39,6 +39,8 @@ if __name__ == '__main__':
# Load model in 4 bit,
# which converts the relevant layers in the model into INT4 format
# For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             optimize_model=True,
@@ -1,5 +1,5 @@
# SOLAR
In this directory, you will find examples on how you could apply BigDL-LLM INT4 optimizations on SOLAR models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [upstage/SOLAR-10.7B-Instruct-v1.0](https://huggingface.co/upstage/SOLAR-10.7B-Instruct-v1.0) as a reference SOLAR model.
## 0. Requirements
To run these examples with BigDL-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information.
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a SOLAR model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 conda create -n llm python=3.9
@@ -16,20 +17,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install transformers==4.35.2 # required by SOLAR
```
#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install transformers==4.35.2 # required by SOLAR
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
```bash
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
@@ -42,6 +42,8 @@ if __name__ == '__main__':
# Load model in 4 bit, # Load model in 4 bit,
# which convert the relevant layers in the model into INT4 format # which convert the relevant layers in the model into INT4 format
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = AutoModelForCausalLM.from_pretrained(model_path, model = AutoModelForCausalLM.from_pretrained(model_path,
load_in_4bit=True, load_in_4bit=True,
trust_remote_code=True, trust_remote_code=True,
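For reference, a minimal runnable sketch of the Windows-iGPU variant of this load step, assuming the same `model_path` variable as in the example (the `cpu_embedding` argument is the one the comment above recommends):
```python
from bigdl.llm.transformers import AutoModelForCausalLM

# Keep the memory-intensive embedding layer on the CPU when running on an Intel iGPU
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True,
                                             cpu_embedding=True)
model = model.to('xpu')
```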
View file
@@ -39,6 +39,8 @@ if __name__ == '__main__':
# Load model in 4 bit,
# which convert the relevant layers in the model into INT4 format
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             optimize_model=False,
View file
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a StarCoder model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9
@@ -15,20 +16,85 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configures OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
View file
@@ -9,6 +9,7 @@ In the example [generate.py](./generate.py), we show a basic use case for a Vicu
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9
@@ -16,20 +17,87 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configures OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
View file
@@ -40,6 +40,8 @@ if __name__ == '__main__':
# Load model in 4 bit,
# which convert the relevant layers in the model into INT4 format
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True)
model = model.to('xpu')
View file
@@ -9,6 +9,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Whisper model to conduct transcription using `generate()` API, then use the recognized text as the input for the Llama2 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9
@@ -22,20 +23,89 @@ pip install SpeechRecognition sentencepiece colorama
pip install PyAudio inquirer sounddevice
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install librosa soundfile datasets
pip install accelerate
pip install SpeechRecognition sentencepiece colorama
pip install PyAudio inquirer
```
### 2. Configures OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
```
python ./generate.py --llama2-repo-id-or-model-path REPO_ID_OR_MODEL_PATH --whisper-repo-id-or-model-path REPO_ID_OR_MODEL_PATH --n-predict N_PREDICT
```
@@ -142,4 +212,4 @@ Whisper :
BigDL-LLM:
Intel is a well-known technology company that specializes in designing, manufacturing, and selling computer hardware components and semiconductor products.
```
View file
@@ -20,6 +20,8 @@ import time
import argparse
import numpy as np
import inquirer
# For Windows users, please remove `import sounddevice`
import sounddevice
from bigdl.llm.transformers import AutoModelForCausalLM
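Windows users could also keep the file portable instead of deleting the line by hand; a small sketch of a platform guard (our suggestion, not part of the example):
```python
import platform

# sounddevice backs the Linux recording path of this example only, so skip it on Windows
if platform.system() != "Windows":
    import sounddevice
```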
@@ -92,6 +94,8 @@ if __name__ == '__main__':
whisper.config.forced_decoder_ids = None
whisper = whisper.to('xpu')
# When running Llama models on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
llama_model = AutoModelForCausalLM.from_pretrained(llama_model_path, load_in_4bit=True, trust_remote_code=True, optimize_model=False, use_cache=True)
llama_model = llama_model.to('xpu')
tokenizer = LlamaTokenizer.from_pretrained(llama_model_path)
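On Windows iGPUs, the same load with the recommended flag would look roughly like this (a sketch; every parameter except `cpu_embedding` is unchanged from the line above):
```python
# Keep the embedding layer on the CPU so iGPU memory goes to the compute layers
llama_model = AutoModelForCausalLM.from_pretrained(llama_model_path,
                                                   load_in_4bit=True,
                                                   trust_remote_code=True,
                                                   optimize_model=False,
                                                   use_cache=True,
                                                   cpu_embedding=True)
llama_model = llama_model.to('xpu')
```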
View file
@@ -8,6 +8,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Recognize Tokens using `generate()` API
In the example [recognize.py](./recognize.py), we show a basic use case for a Whisper model to conduct transcription using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9
@@ -17,12 +18,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install datasets soundfile librosa # required by audio processing
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install datasets soundfile librosa # required by audio processing
```
### 2. Configures OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
```
python ./recognize.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --repo-id-or-data-path REPO_ID_OR_DATA_PATH --language LANGUAGE
```
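For orientation, a minimal sketch of the transcription flow that recognize.py implements, assuming a `model_path` pointing at a Whisper checkpoint, an illustrative `audio.wav`, and the `AutoModelForSpeechSeq2Seq` wrapper exported by `bigdl.llm.transformers`:
```python
import librosa
from transformers import WhisperProcessor
from bigdl.llm.transformers import AutoModelForSpeechSeq2Seq  # assumed export

processor = WhisperProcessor.from_pretrained(model_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_path, load_in_4bit=True)
model = model.to('xpu')

# Whisper expects 16 kHz mono audio
speech, _ = librosa.load("audio.wav", sr=16000)
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")
predicted_ids = model.generate(inputs.input_features.to('xpu'))
print(processor.batch_decode(predicted_ids, skip_special_tokens=True))
```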
View file
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Yi model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@@ -19,20 +20,87 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install einops # additional package required for Yi-6B to conduct generation
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install einops # additional package required for Yi-6B to conduct generation
```
### 2. Configures OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
```bash
python ./generate.py
```
View file
@@ -45,6 +45,8 @@ if __name__ == '__main__':
# Load model in 4 bit,
# which convert the relevant layers in the model into INT4 format
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             optimize_model=True,
View file
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for an Aquila2 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@@ -18,20 +19,85 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configures OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
```bash
python ./generate.py --prompt 'AI是什么'
```
View file
@@ -45,6 +45,8 @@ if __name__ == '__main__':
                  low_cpu_mem_usage=True)
# With only one line to enable BigDL-LLM optimization on model
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = optimize_model(model)
model = model.to('xpu')
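A minimal sketch of the Windows-iGPU variant of this optimization step, assuming `model_path` as in the example and that `optimize_model` is imported from `bigdl.llm`:
```python
from transformers import AutoModelForCausalLM
from bigdl.llm import optimize_model  # assumed import path

model = AutoModelForCausalLM.from_pretrained(model_path,
                                             trust_remote_code=True,
                                             low_cpu_mem_usage=True)
# On an Intel iGPU, keep the memory-intensive embedding layer on the CPU
model = optimize_model(model, cpu_embedding=True)
model = model.to('xpu')
```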
View file
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Baichuan model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@@ -19,20 +20,85 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install transformers_stream_generator # additional package required for Baichuan-13B-Chat to conduct generation
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install transformers_stream_generator # additional package required for Baichuan-13B-Chat to conduct generation
```
### 2. Configures OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
```bash
python ./generate.py --prompt 'AI是什么'
```
@@ -43,7 +109,7 @@ In the example, several arguments can be passed to satisfy your requirements:
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'AI是什么'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
#### 4.1 Sample Output
#### [baichuan-inc/Baichuan-13B-Chat](https://huggingface.co/baichuan-inc/Baichuan-13B-Chat)
```log
Inference time: xxxx s
View file
@@ -44,6 +44,8 @@ if __name__ == '__main__':
                  low_cpu_mem_usage=True)
# With only one line to enable BigDL-LLM optimization on model
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = optimize_model(model)
model = model.to('xpu')
View file
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Baichuan2 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@@ -19,20 +20,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install transformers_stream_generator # additional package required for Baichuan2-7B-Chat to conduct generation
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install transformers_stream_generator # additional package required for Baichuan2-7B-Chat to conduct generation
```
### 2. Configures OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
```bash
python ./generate.py --prompt 'AI是什么'
```
@@ -43,7 +110,7 @@ In the example, several arguments can be passed to satisfy your requirements:
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'What is AI?'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
#### 4.1 Sample Output
#### [baichuan-inc/Baichuan2-7B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat)
```log
Inference time: xxxx s
View file
@@ -44,6 +44,8 @@ if __name__ == '__main__':
                  low_cpu_mem_usage=True)
# With only one line to enable BigDL-LLM optimization on model
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = optimize_model(model)
model = model.to('xpu')
View file
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a BlueLM model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@@ -18,20 +19,85 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configures OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
```bash
python ./generate.py --prompt 'AI是什么'
```
@@ -42,7 +108,7 @@ In the example, several arguments can be passed to satisfy your requirements:
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'AI是什么'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
#### 4.1 Sample Output
#### [vivo-ai/BlueLM-7B-Chat](https://huggingface.co/vivo-ai/BlueLM-7B-Chat)
```log
Inference time: xxxx s
View file
@@ -44,6 +44,8 @@ if __name__ == '__main__':
                  low_cpu_mem_usage=True)
# With only one line to enable BigDL-LLM optimization on model
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = optimize_model(model)
model = model.to('xpu')
View file
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM2 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@@ -18,20 +19,85 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configures OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
```bash
python ./generate.py --prompt 'AI是什么'
```
@@ -42,7 +108,7 @@ In the example, several arguments can be passed to satisfy your requirements:
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'AI是什么'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
#### 4.1 Sample Output
#### [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b)
```log
Inference time: xxxx s
@@ -65,6 +131,7 @@ Inference time: xxxx s
## Example 2: Stream Chat using `stream_chat()` API
In the example [streamchat.py](./streamchat.py), we show a basic use case for a ChatGLM2 model to stream chat, with BigDL-LLM INT4 optimizations.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@@ -76,20 +143,84 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configures OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
**Stream Chat using `stream_chat()` API**:
```
python ./streamchat.py
```
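For orientation, a rough sketch of the loop streamchat.py builds around this API; the `stream_chat` method and its signature come from the upstream ChatGLM2 remote code rather than from this diff, so treat them as assumptions:
```python
from transformers import AutoTokenizer
from bigdl.llm.transformers import AutoModel

model = AutoModel.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True)
model = model.to('xpu')
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# stream_chat yields progressively longer responses as new tokens arrive
for response, history in model.stream_chat(tokenizer, "AI是什么", history=[]):
    print(response)
```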
View file
@@ -45,6 +45,8 @@ if __name__ == '__main__':
                  low_cpu_mem_usage=True)
# With only one line to enable BigDL-LLM optimization on model
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = optimize_model(model)
model = model.to('xpu')
View file
@@ -44,6 +44,8 @@ if __name__ == '__main__':
                  low_cpu_mem_usage=True)
# With only one line to enable BigDL-LLM optimization on model
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = optimize_model(model)
model.to('xpu')
View file
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM3 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@@ -18,20 +19,85 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configures OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
```bash
python ./generate.py --prompt 'AI是什么'
```
@@ -42,7 +108,7 @@ In the example, several arguments can be passed to satisfy your requirements:
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'AI是什么'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
#### 4.1 Sample Output
#### [THUDM/chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b)
```log
Inference time: xxxx s
@@ -64,6 +130,7 @@ AI stands for Artificial Intelligence. It refers to the development of computer
## Example 2: Stream Chat using `stream_chat()` API
In the example [streamchat.py](./streamchat.py), we show a basic use case for a ChatGLM3 model to stream chat, with BigDL-LLM INT4 optimizations.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@@ -75,20 +142,83 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
**Stream Chat using `stream_chat()` API**:
```
python ./streamchat.py
View file
@@ -45,6 +45,8 @@ if __name__ == '__main__':
low_cpu_mem_usage=True)
# With only one line to enable BigDL-LLM optimization on model
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = optimize_model(model)
model = model.to('xpu')
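For readers who want to see the recommended flag in place, here is a minimal sketch of the iGPU-friendly variant. The `AutoModel` usage and model id mirror the surrounding ChatGLM3 example; the sketch itself is an assumption, not part of the diff:
```python
from bigdl.llm import optimize_model
from transformers import AutoModel

# Load ChatGLM3 as in the surrounding example (model id is illustrative).
model = AutoModel.from_pretrained("THUDM/chatglm3-6b",
                                  trust_remote_code=True,
                                  low_cpu_mem_usage=True)

# cpu_embedding=True keeps the memory-intensive embedding layer on the CPU,
# which the comment above recommends for Windows users on Intel iGPUs.
model = optimize_model(model, cpu_embedding=True)
model = model.to('xpu')
```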
View file
@@ -44,6 +44,8 @@ if __name__ == '__main__':
low_cpu_mem_usage=True)
# With only one line to enable BigDL-LLM optimization on model
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = optimize_model(model)
model.to('xpu')
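Since this hunk touches the stream-chat example, a sketch of how the optimized model is then typically driven may help — assuming ChatGLM3's `stream_chat()` interface and the example's existing `model` and `tokenizer` (not a verbatim excerpt from the diff):
```python
# Hypothetical streaming loop: ChatGLM3's stream_chat() yields the response
# accumulated so far, so we print only the newly generated suffix each time.
question = "AI是什么"
printed = ""
for response, history in model.stream_chat(tokenizer, question, history=[]):
    print(response[len(printed):], end="", flush=True)
    printed = response
```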
View file
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a CodeLlama model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@@ -19,20 +20,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install transformers==4.34.1 # CodeLlamaTokenizer is supported in higher versions of transformers
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install transformers==4.34.1 # CodeLlamaTokenizer is supported in higher version of transformers
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
```bash
python ./generate.py --prompt 'def print_hello_world():'
```
@@ -43,7 +110,7 @@ In the example, several arguments can be passed to satisfy your requirements:
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'def print_hello_world():'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`. A combined invocation is sketched below.
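For instance, a longer code completion might be requested like this; the `--repo-id-or-model-path` flag and the chosen values are illustrative assumptions (mirroring sibling examples), not defaults from this diff:
```bash
python ./generate.py --repo-id-or-model-path codellama/CodeLlama-7b-hf --prompt 'def quicksort(arr):' --n-predict 128
```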
#### 4.1 Sample Output
#### [codellama/CodeLlama-7b-hf](https://huggingface.co/codellama/CodeLlama-7b-hf)
```log
Inference time: xxxx s
View file
@@ -46,6 +46,8 @@ if __name__ == '__main__':
low_cpu_mem_usage=True)
# With only one line to enable BigDL-LLM optimization on model
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = optimize_model(model)
model = model.to('xpu')
View file
@@ -8,6 +8,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Recognize Tokens using `generate()` API
In the example [recognize.py](./recognize.py), we show a basic use case for a Distil-Whisper model to conduct transcription using `pipeline()` API for long audio input, with BigDL-LLM INT4 optimizations.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@@ -19,19 +20,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install datasets soundfile librosa # required by audio processing
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install datasets soundfile librosa # required by audio processing
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
```
python ./recognize.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --repo-id-or-data-path REPO_ID_OR_DATA_PATH --language LANGUAGE --chunk-length CHUNK_LENGTH --batch-size BATCH_SIZE
```
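To make the placeholders concrete, one plausible invocation is shown below; the model id, dataset id, and parameter values are illustrative assumptions, not requirements of the script:
```
python ./recognize.py --repo-id-or-model-path distil-whisper/distil-large-v2 --repo-id-or-data-path distil-whisper/librispeech_long --language english --chunk-length 15 --batch-size 4
```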
View file
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Dolly v1 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@@ -18,20 +19,85 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
```bash
python ./generate.py --prompt 'What is AI?'
```
@@ -42,7 +108,7 @@ In the example, several arguments can be passed to satisfy your requirements:
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'What is AI?'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
#### 4.1 Sample Output
#### [databricks/dolly-v1-6b](https://huggingface.co/databricks/dolly-v1-6b)
```log
Inference time: xxxx s
View file
@@ -51,6 +51,8 @@ if __name__ == '__main__':
low_cpu_mem_usage=True)
# With only one line to enable BigDL-LLM optimization on model
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = optimize_model(model)
model = model.to('xpu')
View file
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Dolly v2 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@@ -18,20 +19,85 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
```bash
python ./generate.py --prompt 'What is AI?'
```
@@ -42,7 +108,7 @@ In the example, several arguments can be passed to satisfy your requirements:
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'What is AI?'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
#### 4.1 Sample Output
#### [databricks/dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b)
```log
View file
@@ -51,6 +51,8 @@ if __name__ == '__main__':
low_cpu_mem_usage=True)
# With only one line to enable BigDL-LLM optimization on model
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = optimize_model(model)
model = model.to('xpu')
View file
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Flan-t5 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@@ -18,20 +19,85 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
```bash
python ./generate.py --prompt 'Translate to German: My name is Arthur'
```
View file
@@ -47,6 +47,8 @@ if __name__ == '__main__':
# "wo" module is not converted due to some issues of T5 model
# (https://github.com/huggingface/transformers/issues/20287),
# "lm_head" module is not converted to generate outputs with better quality
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = optimize_model(model, modules_to_not_convert=["wo", "lm_head"])
model = model.to('xpu')
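Combining this with the iGPU advice in the new comments, a plausible Windows-iGPU variant of the call is sketched below; the model id is illustrative and the sketch is not a verbatim excerpt from the example:
```python
from bigdl.llm import optimize_model
from transformers import AutoModelForSeq2SeqLM

# Load a Flan-T5 checkpoint (model id is illustrative).
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

# Skip the "wo" and "lm_head" modules as the comments above require, and
# keep the memory-intensive embedding layer on the CPU for Windows iGPUs.
model = optimize_model(model,
                       modules_to_not_convert=["wo", "lm_head"],
                       cpu_embedding=True)
model = model.to('xpu')
```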
View file
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example 1 - Basic Version: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@@ -18,20 +19,84 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
```bash
python ./generate.py --prompt 'What is AI?'
```
View file
@@ -49,6 +49,8 @@ if __name__ == '__main__':
low_cpu_mem_usage=True)
# With only one line to enable BigDL-LLM optimization on model
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = optimize_model(model)
model = model.to('xpu')
View file
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Multi-turn chat centered around an image using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a LLaVA model to start a multi-turn chat centered around an image using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@@ -23,20 +24,89 @@ cp generate.py ./LLaVA/ # copy our example to the LLaVA folder
cd LLaVA # change the working directory to the LLaVA folder
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
git clone -b v1.1.1 --depth=1 https://github.com/haotian-liu/LLaVA.git # clone the llava library
pip install einops # install dependencies required by llava
cp generate.py ./LLaVA/ # copy our example to the LLaVA folder
cd LLaVA # change the working directory to the LLaVA folder
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
```bash
python ./generate.py --image-path-or-url 'https://llava-vl.github.io/static/images/monalisa.jpg'
```
@@ -65,9 +135,9 @@ The sample input image is:
<a href="https://llava-vl.github.io/static/images/monalisa.jpg"><img width=400px src="https://llava-vl.github.io/static/images/monalisa.jpg" ></a>
### 5. Troubleshooting
#### 5.1 SSLError
If you encounter the following output, it means your machine has some trouble accessing huggingface.co.
```log
requests.exceptions.SSLError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /openai/clip-vit-large-patch14-336/resolve/main/config.json (Caused by SSLError(SSLZeroReturnError(6, 'TLS/SSL connection has been closed (EOF) (_ssl.c:1129)')))"),
View file
@@ -292,6 +292,8 @@ if __name__ == '__main__':
model_name=model_name)
# With only one line to enable BigDL-LLM optimization on model
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = optimize_model(model).to('xpu')
# Generate image tensor
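Following the new comments in this hunk, the iGPU-friendly variant of the chained call would plausibly read as follows (a one-line sketch against the example's existing `model`, not part of the diff):
```python
# Hypothetical Windows-iGPU variant: the embedding layer stays on the CPU
# while the rest of the optimized model moves to the XPU.
model = optimize_model(model, cpu_embedding=True).to('xpu')
```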
View file
@@ -9,6 +9,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Mistral model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@@ -23,20 +24,85 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install transformers==4.34.0
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install transformers==4.34.0
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
```bash
python ./generate.py --prompt 'What is AI?'
```
@@ -47,7 +113,7 @@ In the example, several arguments can be passed to satisfy your requirements:
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'What is AI?'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
#### 4.1 Sample Output
#### [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1)
```log
Inference time: xxxx s
View file
@@ -45,6 +45,8 @@ if __name__ == '__main__':
low_cpu_mem_usage=True)
# With only one line to enable BigDL-LLM optimization on model
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = optimize_model(model)
model = model.to('xpu')
View file
@@ -9,6 +9,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Mixtral model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@@ -23,20 +24,87 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install transformers==4.36.0
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
# Please make sure you are using a stable version of Transformers, 4.36.0 or newer.
pip install transformers==4.36.0
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
```bash
python ./generate.py --prompt 'What is AI?'
```
View file
@@ -45,6 +45,8 @@ if __name__ == '__main__':
low_cpu_mem_usage=True)
# With only one line to enable BigDL-LLM optimization on model
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = optimize_model(model)
model = model.to('xpu')
View file
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM, we have some recommended requirements for
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a phi-1_5 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@@ -18,20 +19,85 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install einops # additional package required for phi-1_5 to conduct generation
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install einops # additional package required for phi-1_5 to conduct generation
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
```
python ./generate.py --prompt 'What is AI?'
```
View file
@@ -43,6 +43,8 @@ if __name__ == '__main__':
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
# With only one line to enable BigDL-LLM optimization on model
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = optimize_model(model)
model = model.to('xpu')
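Read together, the hunk suggests a Windows-iGPU variant along these lines; the model id is illustrative and the sketch assumes the example's imports rather than quoting the diff:
```python
from bigdl.llm import optimize_model
from transformers import AutoModelForCausalLM

# Load phi-1_5 as in the example; trust_remote_code is required by this model.
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5",
                                             trust_remote_code=True)

# cpu_embedding=True keeps the embedding layer on the CPU, as the
# comment above recommends for Windows iGPU users.
model = optimize_model(model, cpu_embedding=True)
model = model.to('xpu')
```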
View file
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Multimodal chat using `chat()` API
In the example [chat.py](./chat.py), we show a basic use case for a Qwen-VL model to start a multimodal chat using `chat()` API, with BigDL-LLM 'optimize_model' API on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@@ -18,19 +19,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional package required for Qwen-VL-Chat to conduct generation
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional package required for Qwen-VL-Chat to conduct generation
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series / Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
```
python ./chat.py
```
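
For reference, a single multimodal chat turn follows roughly the pattern below. This is a condensed sketch rather than the full script: the real [chat.py](./chat.py) parses command-line arguments and keeps a running chat history, and the image path used here is purely hypothetical.

```python
from bigdl.llm import optimize_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = 'Qwen/Qwen-VL-Chat'
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)

# skip the 'c_fc' and 'out_proj' modules during optimization, as recommended for Qwen-VL-Chat
model = optimize_model(model, low_bit='sym_int4',
                       modules_to_not_convert=['c_fc', 'out_proj'])
model = model.to('xpu')

# Qwen-VL's tokenizer can pack image references and text into a single query
query = tokenizer.from_list_format([
    {'image': 'demo.jpg'},                # hypothetical local image path
    {'text': 'What is in this picture?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```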
@ -41,6 +41,8 @@ if __name__ == '__main__':
# With only one line to enable BigDL-LLM optimization on model
# For successful BigDL-LLM optimization on Qwen-VL-Chat, skip the 'c_fc' and 'out_proj' modules during optimization
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = optimize_model(model,
                       low_bit='sym_int4',
                       modules_to_not_convert=['c_fc', 'out_proj'])
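
On a Windows iGPU, the same call would additionally pass `cpu_embedding=True`, per the comments above; a sketch of that variant (dGPU users can keep the original call):

```python
# iGPU variant: keep the memory-intensive embedding layer on the CPU
model = optimize_model(model,
                       low_bit='sym_int4',
                       cpu_embedding=True,
                       modules_to_not_convert=['c_fc', 'out_proj'])
```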
@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Replit model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@ -18,11 +19,33 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
@ -32,6 +55,52 @@ export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series / Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
```bash
python ./generate.py --prompt 'def print_hello_world():'
```
@ -42,7 +111,7 @@ In the example, several arguments can be passed to satisfy your requirements:
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'def print_hello_world():'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
#### 4.1 Sample Output
#### [replit/replit-code-v1-3b](https://huggingface.co/replit/replit-code-v1-3b)
```log
Inference time: xxxx s
@ -44,6 +44,8 @@ if __name__ == '__main__':
low_cpu_mem_usage=True)
# With only one line to enable BigDL-LLM optimization on model
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = optimize_model(model)
model = model.to('xpu')
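
For context, the generation step that follows in [generate.py](./generate.py) typically looks like the sketch below (abbreviated; `tokenizer` and `args` are assumed to come from earlier in the script, and the very first `generate()` call also absorbs one-time kernel compilation):

```python
import time
import torch

with torch.inference_mode():
    input_ids = tokenizer.encode(args.prompt, return_tensors='pt').to('xpu')
    st = time.time()
    output = model.generate(input_ids, max_new_tokens=args.n_predict)
    torch.xpu.synchronize()  # wait for XPU kernels to finish before stopping the clock
    print(f'Inference time: {time.time() - st:.2f} s')
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```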
@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a SOLAR model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@ -19,20 +20,85 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install transformers==4.35.2 # required by SOLAR
```
#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install transformers==4.35.2 # required by SOLAR
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series / Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
```bash
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
@ -43,7 +109,7 @@ In the example, several arguments can be passed to satisfy your requirements:
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'What is AI?'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
#### 4.1 Sample Output
#### [upstage/SOLAR-10.7B-Instruct-v1.0](https://huggingface.co/upstage/SOLAR-10.7B-Instruct-v1.0)
```log
Inference time: XXXX s
@ -47,6 +47,8 @@ if __name__ == '__main__':
low_cpu_mem_usage=True)
# With only one line to enable BigDL-LLM optimization on model
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = optimize_model(model)
model = model.to('xpu')
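
For Windows iGPU users, the two lines above would become the following (a sketch; `cpu_embedding=True` is unnecessary on dGPUs):

```python
model = optimize_model(model, cpu_embedding=True)  # keep the embedding layer on the CPU for iGPU
model = model.to('xpu')
```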
@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a StarCoder model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@ -18,20 +19,85 @@ conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series / Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
```bash
python ./generate.py --prompt 'def print_hello_world():'
```
@ -42,7 +108,7 @@ In the example, several arguments can be passed to satisfy your requirements:
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'def print_hello_world():'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
#### 4.1 Sample Output
#### [bigcode/starcoder](https://huggingface.co/bigcode/starcoder)
```log
Inference time: xxxx s
@ -44,6 +44,8 @@ if __name__ == '__main__':
low_cpu_mem_usage=True)
# With only one line to enable BigDL-LLM optimization on model
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = optimize_model(model)
model = model.to('xpu')
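
Because the first run on an iGPU or certain Arc parts includes one-time compilation (see the note above), simple benchmarks often add an untimed warm-up pass first. A sketch, assuming `tokenizer` and `args` from [generate.py](./generate.py):

```python
import torch

with torch.inference_mode():
    input_ids = tokenizer.encode(args.prompt, return_tensors='pt').to('xpu')
    # untimed warm-up pass: absorbs the one-time GPU kernel compilation
    model.generate(input_ids, max_new_tokens=args.n_predict)
    # subsequent calls reflect steady-state inference speed
    output = model.generate(input_ids, max_new_tokens=args.n_predict)
```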
@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Yi model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
@ -19,20 +20,85 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
pip install einops # additional package required for Yi-6B to conduct generation
```
#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install einops # additional package required for Yi-6B to conduct generation
```
### 2. Configure OneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series / Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples
```bash
python ./generate.py
```
@ -37,11 +37,13 @@ if __name__ == '__main__':
args = parser.parse_args()
model_path = args.repo_id_or_model_path
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             trust_remote_code=True,
                                             use_cache=True)
# With only one line to enable BigDL-LLM optimization on model
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model = optimize_model(model)
model = model.to('xpu')
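
Putting the pieces together, a condensed version of the flow in [generate.py](./generate.py) might look like this (a sketch: the real script reads the model path, prompt, and token count from command-line arguments, and `'01-ai/Yi-6B'` is just an illustrative default):

```python
import torch
from bigdl.llm import optimize_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = '01-ai/Yi-6B'  # illustrative default; pass your own path in practice
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             trust_remote_code=True,
                                             use_cache=True)

model = optimize_model(model)  # add cpu_embedding=True here on a Windows iGPU
model = model.to('xpu')

with torch.inference_mode():
    input_ids = tokenizer.encode('What is AI?', return_tensors='pt').to('xpu')
    output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```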