diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila/README.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila/README.md
index e9b4bd31..6e9cbee8 100644
--- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila/README.md
+++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila/README.md
@@ -1,6 +1,6 @@
 # Aquila
-In this directory, you will find examples on how you could apply BigDL-LLM INT4 optimizations on Aquila models. For illustration purposes, we utilize the [BAAI/AquilaChat-7B](https://huggingface.co/BAAI/AquilaChat-7B) as a reference Aquila model.
+In this directory, you will find examples of how to apply BigDL-LLM INT4 optimizations on Aquila models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [BAAI/AquilaChat-7B](https://huggingface.co/BAAI/AquilaChat-7B) as a reference Aquila model.
 > **Note**: If you want to download the Hugging Face *Transformers* model, please refer to [here](https://huggingface.co/docs/hub/models-downloading#using-git).
 >
@@ -13,6 +13,7 @@ To run these examples with BigDL-LLM, we have some recommended requirements for
 In the example [generate.py](./generate.py), we show a basic use case for a Aquila model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations.
 ### 1. Install
+#### 1.1 Installation on Linux
 We suggest using conda to manage environment:
 ```bash
 conda create -n llm python=3.9
@@ -20,20 +21,86 @@ conda activate llm
 # below command will install intel_extension_for_pytorch==2.1.10+xpu as default
 pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
 ```
+#### 1.2 Installation on Windows
+We suggest using conda to manage environment:
+```bash
+conda create -n llm python=3.9 libuv
+conda activate llm
+# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
+pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+```
+
 ### 2. Configures OneAPI environment variables
+#### 2.1 Configurations for Linux
 ```bash
 source /opt/intel/oneapi/setvars.sh
 ```
-### 3. Run
+#### 2.2 Configurations for Windows
+```cmd
+call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+```
+> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
 
-For optimal performance on Arc, it is recommended to set several environment variables.
+### 3. Runtime Configurations
+For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
+#### 3.1 Configurations for Linux
+
+
+For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
+
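Seen together, the two Linux variants above amount to a small device-to-environment-variable table. A minimal Python sketch of that mapping (the device labels are illustrative, not a BigDL-LLM API; `LD_PRELOAD` is omitted because the dynamic loader reads it before the process starts):

```python
import os

# Recommended Linux environment variables per GPU family, per the README above.
LINUX_ENV = {
    "arc_or_flex": {  # Intel Arc A-Series / Data Center GPU Flex Series
        "USE_XETLA": "OFF",
        "SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS": "1",
    },
    "max": {  # Intel Data Center GPU Max Series
        "SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS": "1",
        "ENABLE_SDP_FUSION": "1",
    },
}

def apply_env(device: str) -> None:
    """Apply the recommended variables for `device` to the current process."""
    for key, value in LINUX_ENV[device].items():
        os.environ[key] = value

apply_env("arc_or_flex")
```

Setting these inside Python only affects the current process and its children, so it must happen before the first XPU operation.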
+ +#### 3.2 Configurations for Windows +
+
+For Intel iGPU
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+ +
+
+For Intel Arc™ A300-Series or Pro A60
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+ +
+
+For other Intel dGPU Series
+
+There is no need to set further environment variables.
+
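The three Windows cases above reduce to a small lookup. A hedged sketch (the device labels are made up for illustration; they are not BigDL-LLM identifiers):

```python
def windows_flags(device: str) -> dict:
    """Recommended Windows environment variables per device class (sketch)."""
    if device == "igpu":
        return {"SYCL_CACHE_PERSISTENT": "1", "BIGDL_LLM_XMX_DISABLED": "1"}
    if device == "arc_a300_or_pro_a60":
        return {"SYCL_CACHE_PERSISTENT": "1"}
    return {}  # other Intel dGPUs need no extra variables

print(windows_flags("igpu"))
```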
+
+> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series/Pro A60 GPU, it may take several minutes to compile.
+### 4. Running examples
+
 ```
 python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
 ```
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila/generate.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila/generate.py
index 82157269..0f233796 100644
--- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila/generate.py
+++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila/generate.py
@@ -41,6 +41,8 @@ if __name__ == '__main__':
 
     # Load model in 4 bit,
     # which convert the relevant layers in the model into INT4 format
+    # For Windows users running LLMs on Intel iGPUs, we recommend setting `cpu_embedding=True` in the from_pretrained function.
+    # This allows the memory-intensive embedding layer to run on the CPU instead of the iGPU.
     model = AutoModelForCausalLM.from_pretrained(model_path,
                                                  load_in_4bit=True,
                                                  trust_remote_code=True)
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila2/README.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila2/README.md
index 38df3a0b..8cae2285 100644
--- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila2/README.md
+++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila2/README.md
@@ -1,6 +1,6 @@
 # Aquila2
-In this directory, you will find examples on how you could apply BigDL-LLM INT4 optimizations on Aquila2 models. For illustration purposes, we utilize the [BAAI/AquilaChat2-7B](https://huggingface.co/BAAI/AquilaChat2-7B) as a reference Aquila2 model.
+In this directory, you will find examples of how to apply BigDL-LLM INT4 optimizations on Aquila2 models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [BAAI/AquilaChat2-7B](https://huggingface.co/BAAI/AquilaChat2-7B) as a reference Aquila2 model.
 > **Note**: If you want to download the Hugging Face *Transformers* model, please refer to [here](https://huggingface.co/docs/hub/models-downloading#using-git).
 >
@@ -13,6 +13,7 @@ To run these examples with BigDL-LLM, we have some recommended requirements for
 In the example [generate.py](./generate.py), we show a basic use case for a Aquila2 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations.
 ### 1. Install
+#### 1.1 Installation on Linux
 We suggest using conda to manage environment:
 ```bash
 conda create -n llm python=3.9
@@ -20,20 +21,87 @@ conda activate llm
 # below command will install intel_extension_for_pytorch==2.1.10+xpu as default
 pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
 ```
+
+#### 1.2 Installation on Windows
+We suggest using conda to manage environment:
+```bash
+conda create -n llm python=3.9 libuv
+conda activate llm
+# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
+pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+```
+
 ### 2. Configures OneAPI environment variables
+#### 2.1 Configurations for Linux
 ```bash
 source /opt/intel/oneapi/setvars.sh
 ```
-### 3. Run
+#### 2.2 Configurations for Windows
+```cmd
+call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+```
+> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
 
-For optimal performance on Arc, it is recommended to set several environment variables.
+### 3. Runtime Configurations
+For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
+#### 3.1 Configurations for Linux
+
+
+For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
+
+ +#### 3.2 Configurations for Windows +
+
+For Intel iGPU
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+ +
+
+For Intel Arc™ A300-Series or Pro A60
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+ +
+
+For other Intel dGPU Series
+
+There is no need to set further environment variables.
+
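The `generate.py` scripts run in the "Running examples" step all accept the same three flags. A minimal `argparse` sketch of that command-line interface (the default values shown here are illustrative assumptions, not necessarily the script's exact defaults):

```python
import argparse

parser = argparse.ArgumentParser(
    description="Predict tokens with BigDL-LLM INT4 optimizations")
parser.add_argument("--repo-id-or-model-path", type=str,
                    default="BAAI/AquilaChat2-7B",
                    help="Hugging Face repo id or a local model path")
parser.add_argument("--prompt", type=str, default="What is AI?",
                    help="prompt used for inference")
parser.add_argument("--n-predict", type=int, default=32,
                    help="number of tokens to predict")

# argparse maps dashes to underscores: args.repo_id_or_model_path, args.n_predict
args = parser.parse_args(["--prompt", "Hello", "--n-predict", "64"])
print(args.prompt, args.n_predict)
```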
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ``` python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT ``` diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila2/generate.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila2/generate.py index 44398b9e..c423904b 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila2/generate.py +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila2/generate.py @@ -41,6 +41,8 @@ if __name__ == '__main__': # Load model in 4 bit, # which convert the relevant layers in the model into INT4 format + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True) diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan/README.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan/README.md index 71548c5d..4c45053a 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan/README.md +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a Baichuan model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. 
Install +#### 1.1 Installation on Linux We suggest using conda to manage environment: ```bash conda create -n llm python=3.9 @@ -16,20 +17,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w pip install transformers_stream_generator # additional package required for Baichuan-13B-Chat to conduct generation ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +pip install transformers_stream_generator # additional package required for Baichuan-13B-Chat to conduct generation +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. -For optimal performance on Arc, it is recommended to set several environment variables. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
+
+For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
+
+#### 3.2 Configurations for Windows +
+
+For Intel iGPU
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+ +
+
+For Intel Arc™ A300-Series or Pro A60
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+ +
+
+For other Intel dGPU Series
+
+There is no need to set further environment variables.
+
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ``` python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT ``` diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan/generate.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan/generate.py index 168b8ca0..18c0e10d 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan/generate.py +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan/generate.py @@ -39,6 +39,8 @@ if __name__ == '__main__': # Load model in 4 bit, # which convert the relevant layers in the model into INT4 format + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True, diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan2/README.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan2/README.md index c557af03..1ad1b4eb 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan2/README.md +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan2/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a Baichuan model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. 
Install +#### 1.1 Installation on Linux We suggest using conda to manage environment: ```bash conda create -n llm python=3.9 @@ -16,20 +17,87 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w pip install transformers_stream_generator # additional package required for Baichuan-7B-Chat to conduct generation ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +pip install transformers_stream_generator # additional package required for Baichuan-7B-Chat to conduct generation +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. -For optimal performance on Arc, it is recommended to set several environment variables. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
+
+For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
+
+ +#### 3.2 Configurations for Windows +
+
+For Intel iGPU
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+ +
+
+For Intel Arc™ A300-Series or Pro A60
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+ +
+
+For other Intel dGPU Series
+
+There is no need to set further environment variables.
+
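Several of the `generate.py` changes in this patch recommend `cpu_embedding=True` on Windows iGPUs. One way to express that conditionally is to assemble the `from_pretrained` keyword arguments up front; a sketch (the helper and its parameters are hypothetical, not part of BigDL-LLM):

```python
def load_kwargs(os_name: str, is_igpu: bool) -> dict:
    """Collect keyword arguments for AutoModelForCausalLM.from_pretrained (sketch)."""
    kwargs = {"load_in_4bit": True, "trust_remote_code": True, "use_cache": True}
    if os_name == "Windows" and is_igpu:
        # Keep the memory-intensive embedding layer on the CPU instead of the iGPU.
        kwargs["cpu_embedding"] = True
    return kwargs

print(load_kwargs("Windows", True))
```

The resulting dict would be splatted into the real call, e.g. `AutoModelForCausalLM.from_pretrained(model_path, **kwargs)`.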
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ``` python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT ``` diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan2/generate.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan2/generate.py index cf03dbab..88d7ea40 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan2/generate.py +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan2/generate.py @@ -43,6 +43,8 @@ if __name__ == '__main__': # to enhance decoding speed, but has `"use_cache": false` in its model config, # it is important to set `use_cache=True` explicitly in the `generate` function # to obtain optimal performance with BigDL-LLM INT4 optimizations + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True, diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/bluelm/README.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/bluelm/README.md index 87888feb..4c112ed7 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/bluelm/README.md +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/bluelm/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a BlueLM model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. 
Install +#### 1.1 Installation on Linux We suggest using conda to manage environment: ```bash conda create -n llm python=3.9 @@ -15,20 +16,86 @@ conda activate llm pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. -For optimal performance on Arc, it is recommended to set several environment variables. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
+
+For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
+
+ +#### 3.2 Configurations for Windows +
+
+For Intel iGPU
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+ +
+
+For Intel Arc™ A300-Series or Pro A60
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+ +
+
+For other Intel dGPU Series
+
+There is no need to set further environment variables.
+
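Since the OneAPI setup step differs per OS, a wrapper script that launches the examples could pick the activation command programmatically. A small sketch using only the commands quoted in section 2 of these READMEs:

```python
def oneapi_command(os_name: str) -> str:
    """Return the OneAPI activation command for the given OS (sketch)."""
    if os_name == "Linux":
        return "source /opt/intel/oneapi/setvars.sh"
    if os_name == "Windows":
        # Must be run from CMD (or Anaconda Prompt); PowerShell is not supported.
        return 'call "C:\\Program Files (x86)\\Intel\\oneAPI\\setvars.bat"'
    raise ValueError(f"unsupported OS: {os_name}")

print(oneapi_command("Linux"))
```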
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ```bash python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT ``` diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/bluelm/generate.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/bluelm/generate.py index 4f0a4514..d34ff2ef 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/bluelm/generate.py +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/bluelm/generate.py @@ -39,6 +39,8 @@ if __name__ == '__main__': # Load model in 4 bit, # which convert the relevant layers in the model into INT4 format + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True, diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3/README.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3/README.md index edec164c..663c5478 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3/README.md +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3/README.md @@ -8,6 +8,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example 1: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM3 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. 
Install +#### 1.1 Installation on Linux We suggest using conda to manage environment: ```bash conda create -n llm python=3.9 @@ -16,20 +17,85 @@ conda activate llm pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. -For optimal performance on Arc, it is recommended to set several environment variables. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
+For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
+
+ +#### 3.2 Configurations for Windows +
+
+For Intel iGPU
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+ +
+
+For Intel Arc™ A300-Series or Pro A60
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+ +
+
+For other Intel dGPU Series
+
+There is no need to set further environment variables.
+
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ``` python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT ``` @@ -69,6 +135,7 @@ AI stands for Artificial Intelligence. It refers to the development of computer ## Example 2: Stream Chat using `stream_chat()` API In the example [streamchat.py](./streamchat.py), we show a basic use case for a ChatGLM3 model to stream chat, with BigDL-LLM INT4 optimizations. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage environment: ```bash conda create -n llm python=3.9 @@ -77,20 +144,85 @@ conda activate llm pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. -### 3. Run +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables.
+For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
+
+ +#### 3.2 Configurations for Windows +
+
+For Intel iGPU
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+ +
+
+For Intel Arc™ A300-Series or Pro A60
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+ +
+
+For other Intel dGPU Series
+
+There is no need to set further environment variables.
+
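In the `stream_chat()` example that follows, the model commonly yields a progressively longer response each step, so a caller typically prints only the newly generated suffix. A stand-in generator (not the ChatGLM3 API) illustrates the consumption pattern:

```python
def fake_stream_chat(question: str):
    """Stand-in for model.stream_chat(): yields a growing response string."""
    response = ""
    for token in ["AI ", "stands ", "for ", "Artificial ", "Intelligence."]:
        response += token
        yield response

printed = ""
for response in fake_stream_chat("What is AI?"):
    print(response[len(printed):], end="")  # print only the new suffix
    printed = response
print()
```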
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + **Stream Chat using `stream_chat()` API**: ``` python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3/generate.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3/generate.py index d79da518..35ecfb49 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3/generate.py +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3/generate.py @@ -41,6 +41,8 @@ if __name__ == '__main__': # Load model in 4 bit, # which convert the relevant layers in the model into INT4 format + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = AutoModel.from_pretrained(model_path, load_in_4bit=True, optimize_model=True, diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3/streamchat.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3/streamchat.py index 8a63804c..9fbbd16c 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3/streamchat.py +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3/streamchat.py @@ -39,6 +39,8 @@ if __name__ == '__main__': # Load model in 4 bit, # which convert the relevant layers in the model into INT4 format + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. 
model = AutoModel.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True, diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chinese-llama2/README.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chinese-llama2/README.md index cf01c1dc..f91cf4d9 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chinese-llama2/README.md +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chinese-llama2/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage environment: ```bash conda create -n llm python=3.9 @@ -14,20 +15,85 @@ conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu ``` + +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. 
Please check out the suggestions based on your device.
+#### 3.1 Configurations for Linux
+
-### 3. Run
-
-For optimal performance on Arc, it is recommended to set several environment variables.
+For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
+
+ +#### 3.2 Configurations for Windows +
+
+For Intel iGPU
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+ +
+
+For Intel Arc™ A300-Series or Pro A60
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+ +
+
+For other Intel dGPU Series
+
+There is no need to set further environment variables.
+
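The chinese-llama2 `generate.py` in this patch notes that some checkpoints ship `"use_cache": false` in their model config, so `use_cache=True` must be passed explicitly to `generate()` to keep KV caching (and decoding speed) with BigDL-LLM INT4 optimizations. A sketch of assembling those call kwargs (the helper is hypothetical):

```python
def generate_call_kwargs(n_predict: int) -> dict:
    """Collect kwargs for model.generate() (sketch).

    use_cache=True overrides a checkpoint config that disables KV caching.
    """
    return {"max_new_tokens": n_predict, "use_cache": True}

print(generate_call_kwargs(32))
```

In the real script this would be used as `model.generate(input_ids, **generate_call_kwargs(args.n_predict))`.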
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ``` python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT ``` diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chinese-llama2/generate.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chinese-llama2/generate.py index 865270a9..da977404 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chinese-llama2/generate.py +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chinese-llama2/generate.py @@ -58,6 +58,8 @@ if __name__ == '__main__': # to enhance decoding speed, but has `"use_cache": false` in its model config, # it is important to set `use_cache=True` explicitly to obtain optimal # performance with BigDL-LLM INT4 optimizations + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, optimize_model=True, diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/codellama/generate.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/codellama/generate.py index 9624458f..6de2f2d7 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/codellama/generate.py +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/codellama/generate.py @@ -40,6 +40,8 @@ if __name__ == '__main__': # Load model in 4 bit, # which convert the relevant layers in the model into INT4 format + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. 
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, optimize_model=False, diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/codellama/readme.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/codellama/readme.md index 6241cc61..902678b4 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/codellama/readme.md +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/codellama/readme.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for an CodeLlama model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage environment: ```bash conda create -n llm python=3.9 @@ -16,20 +17,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w pip install transformers==4.34.1 # CodeLlamaTokenizer is supported in higher version of transformers ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +pip install transformers==4.34.1 # CodeLlamaTokenizer is supported in higher version of transformers +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. 
Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ``` python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT ``` diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/distil-whisper/README.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/distil-whisper/README.md index d9fde457..76c2bfb7 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/distil-whisper/README.md +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/distil-whisper/README.md @@ -8,6 +8,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Recognize Tokens using `generate()` API In the example [recognize.py](./recognize.py), we show a basic use case for a Distil-Whisper model to conduct transcription using `pipeline()` API for long audio input, with BigDL-LLM INT4 optimizations. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#). After installing conda, create a Python environment for BigDL-LLM: @@ -19,19 +20,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w pip install datasets soundfile librosa # required by audio processing ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +pip install datasets soundfile librosa # required by audio processing +``` + ### 2. 
Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run -For optimal performance on Arc, it is recommended to set several environment variables. +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
+ +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ``` python ./recognize.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --repo-id-or-data-path REPO_ID_OR_DATA_PATH --language LANGUAGE --chunk-length CHUNK_LENGTH --batch-size BATCH_SIZE ``` diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v1/README.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v1/README.md index 2196ab13..1ff5ab28 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v1/README.md +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v1/README.md @@ -9,6 +9,7 @@ In the example [generate.py](./generate.py), we show a basic use case for a Doll ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage environment: ```bash conda create -n llm python=3.9 @@ -16,20 +17,87 @@ conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu ``` + +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux + ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. 
Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ``` python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT ``` diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v1/generate.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v1/generate.py index bf45d549..b4a9c439 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v1/generate.py +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v1/generate.py @@ -47,6 +47,8 @@ if __name__ == '__main__': # Load model in 4 bit, # which convert the relevant layers in the model into INT4 format + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True) model = model.to('xpu') diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v2/README.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v2/README.md index d5547452..82b741cf 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v2/README.md +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v2/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM, we have some recommended requirements for ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a Dolly v2 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations. ### 1. 
Install +#### 1.1 Installation on Linux We suggest using conda to manage environment: ```bash conda create -n llm python=3.9 @@ -15,20 +16,85 @@ conda activate llm pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ``` python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT ``` diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v2/generate.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v2/generate.py index 5d6f4da4..93729f74 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v2/generate.py +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v2/generate.py @@ -47,6 +47,8 @@ if __name__ == '__main__': # Load model in 4 bit, # which convert the relevant layers in the model into INT4 format + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True) diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon/README.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon/README.md index ab977811..d02ce8a8 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon/README.md +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon/README.md @@ -8,6 +8,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a Falcon model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. 
Install +#### 1.1 Installation on Linux We suggest using conda to manage environment: ```bash conda create -n llm python=3.9 @@ -17,6 +18,16 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w pip install einops # additional package required for falcon-7b-instruct to conduct generation ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +pip install einops # additional package required for falcon-7b-instruct to conduct generation +``` + ### 2. (Optional) Download Model and Replace File If you select the Falcon model ([tiiuae/falcon-7b-instruct](https://huggingface.co/tiiuae/falcon-7b-instruct)), please note that their code (`modelling_RW.py`) does not support KV cache at the moment. To address issue, we have provided updated file ([falcon-7b-instruct/modelling_RW.py](./falcon-7b-instruct/modelling_RW.py)), which can be used to achieve the best performance using BigDL-LLM INT4 optimizations with KV cache support. @@ -39,19 +50,75 @@ For `tiiuae/falcon-7b-instruct`, you should replace the `modelling_RW.py` with [ ### 3. Configures OneAPI environment variables +#### 3.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 4. Run +#### 3.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 4. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 4.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+
+ +#### 4.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 5. Running examples + ``` python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT ``` diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon/generate.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon/generate.py index c7ed31cb..81229c5e 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon/generate.py +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon/generate.py @@ -41,6 +41,8 @@ if __name__ == '__main__': # Load model in 4 bit, # which convert the relevant layers in the model into INT4 format + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True, diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/flan-t5/README.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/flan-t5/README.md index fdbf4e60..c0168eb0 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/flan-t5/README.md +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/flan-t5/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a Flan-t5 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage the Python environment. 
For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#). After installing conda, create a Python environment for BigDL-LLM: @@ -18,20 +19,85 @@ conda activate llm pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ```bash python ./generate.py --prompt 'Translate to German: My name is Arthur' ``` diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/flan-t5/generate.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/flan-t5/generate.py index 1b8cddf6..7ebfc1f2 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/flan-t5/generate.py +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/flan-t5/generate.py @@ -42,6 +42,8 @@ if __name__ == '__main__': # "wo" module is not converted due to some issues of T5 model # (https://github.com/huggingface/transformers/issues/20287), # "lm_head" module is not converted to generate outputs with better quality + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = AutoModelForSeq2SeqLM.from_pretrained(model_path, load_in_4bit=True, optimize_model=False, diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/gpt-j/generate.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/gpt-j/generate.py index 0ab6cb3b..d9da4b5c 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/gpt-j/generate.py +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/gpt-j/generate.py @@ -39,6 +39,8 @@ if __name__ == '__main__': # Load model in 4 bit, # which convert the relevant layers in the model into INT4 format + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. 
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True, diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/gpt-j/readme.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/gpt-j/readme.md index 3e6bb8db..100c5b15 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/gpt-j/readme.md +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/gpt-j/readme.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a GPT-J model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage environment: ```bash conda create -n llm python=3.9 @@ -14,20 +15,87 @@ conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu ``` + +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. -For optimal performance on Arc, it is recommended to set several environment variables. +### 3. 
Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
+ +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ``` python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT ``` diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm/README.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm/README.md index 8fdce310..e81ea33a 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm/README.md +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a InternLM model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage environment: ```bash conda create -n llm python=3.9 @@ -15,20 +16,85 @@ conda activate llm pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. 
Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ``` python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT ``` diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm/generate.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm/generate.py index 3144ce3b..99e5b52f 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm/generate.py +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm/generate.py @@ -40,6 +40,8 @@ if __name__ == '__main__': # Load model in 4 bit, # which convert the relevant layers in the model into INT4 format + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, optimize_model=False, diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2/README.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2/README.md index 821715da..26b1a638 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2/README.md +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. 
Install +#### 1.1 Installation on Linux We suggest using conda to manage environment: ```bash conda create -n llm python=3.9 @@ -14,20 +15,85 @@ conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu ``` + +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-### 3. Run - -For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ``` python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT ``` diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2/generate.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2/generate.py index e9004612..e9095acc 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2/generate.py +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2/generate.py @@ -54,6 +54,8 @@ if __name__ == '__main__': # Load model in 4 bit, # which convert the relevant layers in the model into INT4 format + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, optimize_model=True, diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral/README.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral/README.md index 64510e45..8cd40a15 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral/README.md +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral/README.md @@ -9,6 +9,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a Mistral model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage the Python environment. 
For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#). After installing conda, create a Python environment for BigDL-LLM: @@ -23,20 +24,88 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w pip install transformers==4.34.0 ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu + +# Refer to https://huggingface.co/mistralai/Mistral-7B-v0.1#troubleshooting, please make sure you are using a stable version of Transformers, 4.34.0 or newer. +pip install transformers==4.34.0 +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ```bash python ./generate.py --prompt 'What is AI?' ``` diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral/generate.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral/generate.py index 7f01b358..faecbcf3 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral/generate.py +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral/generate.py @@ -40,6 +40,8 @@ if __name__ == '__main__': # Load model in 4 bit, # which convert the relevant layers in the model into INT4 format + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, optimize_model=True, diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mixtral/README.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mixtral/README.md index 7526c928..309e4e25 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mixtral/README.md +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mixtral/README.md @@ -9,6 +9,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a Mixtral model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#). 
After installing conda, create a Python environment for BigDL-LLM: @@ -23,20 +24,88 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w pip install transformers==4.36.0 ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu + +# Please make sure you are using a stable version of Transformers, 4.36.0 or newer. +pip install transformers==4.36.0 +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ```bash python ./generate.py --prompt 'What is AI?' ``` diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mixtral/generate.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mixtral/generate.py index 60421d97..79cc9995 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mixtral/generate.py +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mixtral/generate.py @@ -40,6 +40,8 @@ if __name__ == '__main__': # Load model in 4 bit, # which convert the relevant layers in the model into INT4 format + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, optimize_model=True, diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mpt/README.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mpt/README.md index ea479018..a073ce02 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mpt/README.md +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mpt/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for an MPT model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. 
Install +#### 1.1 Installation on Linux We suggest using conda to manage environment: ```bash conda create -n llm python=3.9 @@ -15,20 +16,86 @@ conda activate llm pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu pip install einops # additional package required for mpt-7b-chat and mpt-30b-chat to conduct generation ``` + +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ``` python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT ``` diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mpt/generate.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mpt/generate.py index 8b6d833e..ff9b4b06 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mpt/generate.py +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mpt/generate.py @@ -41,6 +41,8 @@ if __name__ == '__main__': # Load model in 4 bit, # which convert the relevant layers in the model into INT4 format + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, optimize_model=False, diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-1_5/README.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-1_5/README.md index f83903fa..8e108b1f 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-1_5/README.md +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-1_5/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a phi-1_5 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. 
Install +#### 1.1 Installation on Linux We suggest using conda to manage environment: ```bash conda create -n llm python=3.9 @@ -16,20 +17,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w pip install einops # additional package required for phi-1_5 to conduct generation ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +pip install einops # additional package required for phi-1_5 to conduct generation +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ``` python ./generate.py --prompt 'What is AI?' ``` diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-1_5/generate.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-1_5/generate.py index bb591336..0437fa5f 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-1_5/generate.py +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-1_5/generate.py @@ -42,6 +42,8 @@ if __name__ == '__main__': # Load model in 4 bit, # which convert the relevant layers in the model into INT4 format + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True) diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen-vl/README.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen-vl/README.md index 0cfa93a2..d5c54ff6 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen-vl/README.md +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen-vl/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Multimodal chat using `chat()` API In the example [chat.py](./chat.py), we show a basic use case for a Qwen-VL model to start a multimodal chat using `chat()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#). 
After installing conda, create a Python environment for BigDL-LLM: @@ -18,19 +19,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional package required for Qwen-VL-Chat to conduct generation ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional package required for Qwen-VL-Chat to conduct generation +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` + +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ``` python ./chat.py ``` diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen-vl/chat.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen-vl/chat.py index 4781a853..9df8cf46 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen-vl/chat.py +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen-vl/chat.py @@ -38,6 +38,8 @@ if __name__ == '__main__': # Load model # For successful BigDL-LLM optimization on Qwen-VL-Chat, skip the 'c_fc' and 'out_proj' modules during optimization + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True, diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen/README.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen/README.md index 4e82c205..ee1d162e 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen/README.md +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a Qwen model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. 
Install +#### 1.1 Installation on Linux We suggest using conda to manage environment: ```bash conda create -n llm python=3.9 @@ -16,20 +17,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w pip install tiktoken einops transformers_stream_generator # additional package required for Qwen-7B-Chat to conduct generation ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +pip install tiktoken einops transformers_stream_generator # additional package required for Qwen-7B-Chat to conduct generation +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ``` python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT ``` diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen/generate.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen/generate.py index e3b95fba..182a093f 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen/generate.py +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen/generate.py @@ -47,6 +47,8 @@ if __name__ == '__main__': # Load model in 4 bit, # which convert the relevant layers in the model into INT4 format + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, optimize_model=True, diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/replit/README.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/replit/README.md index ea72a304..f869f46f 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/replit/README.md +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/replit/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for an Replit model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#). 
After installing conda, create a Python environment for BigDL-LLM: @@ -17,20 +18,86 @@ conda activate llm pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` + +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ``` python ./generate.py --prompt 'def print_hello_world():' ``` diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/replit/generate.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/replit/generate.py index e025249b..5720a3eb 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/replit/generate.py +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/replit/generate.py @@ -39,6 +39,8 @@ if __name__ == '__main__': # Load model in 4 bit, # which convert the relevant layers in the model into INT4 format + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, optimize_model=True, diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/solar/README.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/solar/README.md index cd591d34..957b0736 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/solar/README.md +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/solar/README.md @@ -1,5 +1,5 @@ # SOLAR -In this directory, you will find examples on how you could apply BigDL-LLM INT4 optimizations on SOLAR models on [Intel GPUs](../README.md). For illustration purposes, we utilize the [upstage/SOLAR-10.7B-Instruct-v1.0](https://huggingface.co/upstage/SOLAR-10.7B-Instruct-v1.0) as a reference SOLAR model. +In this directory, you will find examples on how you could apply BigDL-LLM INT4 optimizations on SOLAR models on [Intel GPUs](../../../README.md). 
For illustration purposes, we utilize the [upstage/SOLAR-10.7B-Instruct-v1.0](https://huggingface.co/upstage/SOLAR-10.7B-Instruct-v1.0) as a reference SOLAR model. ## 0. Requirements To run these examples with BigDL-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information. @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a SOLAR model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage environment: ```bash conda create -n llm python=3.9 @@ -16,20 +17,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w pip install transformers==4.35.2 # required by SOLAR ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +pip install transformers==4.35.2 # required by SOLAR +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
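
As a convenience sketch (not part of this patch), the Linux runtime variables recommended above can also be set from Python before any XPU work starts; the variable names are taken verbatim from the README, and exporting them in the shell before launching Python remains the safer, documented option:

```python
import os

# Runtime variables recommended above for Intel Arc A-Series Graphics and
# Intel Data Center GPU Flex Series. setdefault() leaves any value already
# exported in the shell untouched.
os.environ.setdefault("USE_XETLA", "OFF")
os.environ.setdefault("SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS", "1")
```
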
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ```bash python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT ``` diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/solar/generate.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/solar/generate.py index 6b105bad..7dd54586 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/solar/generate.py +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/solar/generate.py @@ -42,6 +42,8 @@ if __name__ == '__main__': # Load model in 4 bit, # which convert the relevant layers in the model into INT4 format + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True, diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/starcoder/generate.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/starcoder/generate.py index 762aea17..8be048ff 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/starcoder/generate.py +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/starcoder/generate.py @@ -39,6 +39,8 @@ if __name__ == '__main__': # Load model in 4 bit, # which convert the relevant layers in the model into INT4 format + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. 
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, optimize_model=False, diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/starcoder/readme.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/starcoder/readme.md index f61a4518..3b8b46c3 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/starcoder/readme.md +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/starcoder/readme.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for an StarCoder model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage environment: ```bash conda create -n llm python=3.9 @@ -15,20 +16,85 @@ conda activate llm pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ``` python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT ``` diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/vicuna/README.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/vicuna/README.md index 7a83e347..de227e03 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/vicuna/README.md +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/vicuna/README.md @@ -9,6 +9,7 @@ In the example [generate.py](./generate.py), we show a basic use case for a Vicu ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage environment: ```bash conda create -n llm python=3.9 @@ -16,20 +17,87 @@ conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu ``` + +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. -For optimal performance on Arc, it is recommended to set several environment variables. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. 
Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
+ +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ``` python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT ``` diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/vicuna/generate.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/vicuna/generate.py index 064e9358..387d815f 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/vicuna/generate.py +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/vicuna/generate.py @@ -40,6 +40,8 @@ if __name__ == '__main__': # Load model in 4 bit, # which convert the relevant layers in the model into INT4 format + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True) model = model.to('xpu') diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/voiceassistant/README.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/voiceassistant/README.md index 353a861a..e5965073 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/voiceassistant/README.md +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/voiceassistant/README.md @@ -9,6 +9,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a Whisper model to conduct transcription using `generate()` API, then use the recoginzed text as the input for Llama2 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. 
Install +#### 1.1 Installation on Linux We suggest using conda to manage environment: ```bash conda create -n llm python=3.9 @@ -22,20 +23,89 @@ pip install SpeechRecognition sentencepiece colorama pip install PyAudio inquirer sounddevice ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +pip install librosa soundfile datasets +pip install accelerate +pip install SpeechRecognition sentencepiece colorama +pip install PyAudio inquirer +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
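
For reference, the Windows `set` commands above are ordinary process environment variables; a minimal Python sketch (not part of this patch) mirroring the Intel iGPU settings, for users who prefer to configure them programmatically rather than in the CMD session:

```python
import os

# Mirror of the CMD settings recommended above for Intel iGPUs on Windows
# (`set SYCL_CACHE_PERSISTENT=1` and `set BIGDL_LLM_XMX_DISABLED=1`).
# In practice these are set in CMD before running the script.
igpu_settings = {
    "SYCL_CACHE_PERSISTENT": "1",   # persist the SYCL kernel cache across runs
    "BIGDL_LLM_XMX_DISABLED": "1",  # disable XMX paths on iGPU
}
os.environ.update(igpu_settings)
```
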
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ``` python ./generate.py --llama2-repo-id-or-model-path REPO_ID_OR_MODEL_PATH --whisper-repo-id-or-model-path REPO_ID_OR_MODEL_PATH --n-predict N_PREDICT ``` @@ -142,4 +212,4 @@ Whisper : BigDL-LLM: Intel is a well-known technology company that specializes in designing, manufacturing, and selling computer hardware components and semiconductor products. -``` +``` \ No newline at end of file diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/voiceassistant/generate.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/voiceassistant/generate.py index 3231d7b4..57db00f2 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/voiceassistant/generate.py +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/voiceassistant/generate.py @@ -20,6 +20,8 @@ import time import argparse import numpy as np import inquirer + +# For Windows users, please remove `import sounddevice` import sounddevice from bigdl.llm.transformers import AutoModelForCausalLM @@ -92,6 +94,8 @@ if __name__ == '__main__': whisper.config.forced_decoder_ids = None whisper = whisper.to('xpu') + # When running Llama models on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. 
llama_model = AutoModelForCausalLM.from_pretrained(llama_model_path, load_in_4bit=True, trust_remote_code=True, optimize_model=False, use_cache=True) llama_model = llama_model.to('xpu') tokenizer = LlamaTokenizer.from_pretrained(llama_model_path) diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/whisper/readme.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/whisper/readme.md index 57d497a7..17046ef9 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/whisper/readme.md +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/whisper/readme.md @@ -8,6 +8,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Recognize Tokens using `generate()` API In the example [recognize.py](./recognize.py), we show a basic use case for a Whisper model to conduct transcription using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage environment: ```bash conda create -n llm python=3.9 @@ -17,12 +18,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w pip install datasets soundfile librosa # required by audio processing ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +pip install datasets soundfile librosa # required by audio processing +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. 
Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. + +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
+ +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series + +```bash +export USE_XETLA=OFF +export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 +``` + +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples ``` python ./recognize.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --repo-id-or-data-path REPO_ID_OR_DATA_PATH --language LANGUAGE ``` diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/yi/README.md b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/yi/README.md index 6997da85..1ef23e2b 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/yi/README.md +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/yi/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a Yi model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#). After installing conda, create a Python environment for BigDL-LLM: @@ -19,20 +20,87 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w pip install einops # additional package required for Yi-6B to conduct generation ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +pip install einops # additional package required for Yi-6B to conduct generation +``` + ### 2. 
Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` + +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
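
Beyond environment variables, the `generate.py` changes in this patch recommend `cpu_embedding=True` on Windows iGPUs. A minimal sketch of the keyword arguments involved (the actual call, `AutoModelForCausalLM.from_pretrained(model_path, **kwargs)`, requires `bigdl-llm` and a real model path):

```python
# Keyword arguments suggested by the comments added in generate.py;
# `cpu_embedding=True` keeps the memory-intensive embedding layer on the
# CPU instead of the iGPU.
kwargs = dict(
    load_in_4bit=True,    # convert relevant layers into INT4 format
    optimize_model=True,
    cpu_embedding=True,   # recommended for Windows users on Intel iGPUs
)
```
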
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ```bash python ./generate.py ``` diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/yi/generate.py b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/yi/generate.py index 00f281cb..4af84098 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/yi/generate.py +++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/yi/generate.py @@ -45,6 +45,8 @@ if __name__ == '__main__': # Load model in 4 bit, # which convert the relevant layers in the model into INT4 format + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, optimize_model=True, diff --git a/python/llm/example/GPU/PyTorch-Models/Model/aquila2/README.md b/python/llm/example/GPU/PyTorch-Models/Model/aquila2/README.md index c5392676..66b702ed 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/aquila2/README.md +++ b/python/llm/example/GPU/PyTorch-Models/Model/aquila2/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a Aquila2 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#). 
After installing conda, create a Python environment for BigDL-LLM: @@ -18,20 +19,85 @@ conda activate llm pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ```bash python ./generate.py --prompt 'AI是什么?' ``` diff --git a/python/llm/example/GPU/PyTorch-Models/Model/aquila2/generate.py b/python/llm/example/GPU/PyTorch-Models/Model/aquila2/generate.py index 3845d842..5733e2a3 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/aquila2/generate.py +++ b/python/llm/example/GPU/PyTorch-Models/Model/aquila2/generate.py @@ -45,6 +45,8 @@ if __name__ == '__main__': low_cpu_mem_usage=True) # With only one line to enable BigDL-LLM optimization on model + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = optimize_model(model) model = model.to('xpu') diff --git a/python/llm/example/GPU/PyTorch-Models/Model/baichuan/README.md b/python/llm/example/GPU/PyTorch-Models/Model/baichuan/README.md index 8e96c533..c0a098af 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/baichuan/README.md +++ b/python/llm/example/GPU/PyTorch-Models/Model/baichuan/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a Baichuan model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#). 
After installing conda, create a Python environment for BigDL-LLM: @@ -19,20 +20,85 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w pip install transformers_stream_generator # additional package required for Baichuan-13B-Chat to conduct generation ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +pip install transformers_stream_generator # additional package required for Baichuan-13B-Chat to conduct generation +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
+
+#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ```bash python ./generate.py --prompt 'AI是什么?' ``` @@ -43,7 +109,7 @@ In the example, several arguments can be passed to satisfy your requirements: - `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'AI是什么?'`. - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`. -#### 2.3 Sample Output +#### 4.1 Sample Output #### [baichuan-inc/Baichuan-13B-Chat](https://huggingface.co/baichuan-inc/Baichuan-13B-Chat) ```log Inference time: xxxx s diff --git a/python/llm/example/GPU/PyTorch-Models/Model/baichuan/generate.py b/python/llm/example/GPU/PyTorch-Models/Model/baichuan/generate.py index 05ec2eca..108ccb00 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/baichuan/generate.py +++ b/python/llm/example/GPU/PyTorch-Models/Model/baichuan/generate.py @@ -44,6 +44,8 @@ if __name__ == '__main__': low_cpu_mem_usage=True) # With only one line to enable BigDL-LLM optimization on model + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. 
model = optimize_model(model) model = model.to('xpu') diff --git a/python/llm/example/GPU/PyTorch-Models/Model/baichuan2/README.md b/python/llm/example/GPU/PyTorch-Models/Model/baichuan2/README.md index b7f2e8f2..7c0fecd0 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/baichuan2/README.md +++ b/python/llm/example/GPU/PyTorch-Models/Model/baichuan2/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a Baichuan2 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#). After installing conda, create a Python environment for BigDL-LLM: @@ -19,20 +20,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w pip install transformers_stream_generator # additional package required for Baichuan2-7B-Chat to conduct generation ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +pip install transformers_stream_generator # additional package required for Baichuan2-7B-Chat to conduct generation +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. 
Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+ +For Intel Data Center GPU Max Series + +```bash +export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so +export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 +export ENABLE_SDP_FUSION=1 +``` +> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`. +
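For readers who drive these examples from a Python wrapper instead of a shell, the same Linux variables can also be set with `os.environ` (a minimal standard-library sketch; the variable names are taken from the block above, and they must be set before `torch`/`intel_extension_for_pytorch` are imported, since the GPU runtime reads them at initialization):

```python
import os

# Set these before importing torch / intel_extension_for_pytorch:
# the runtime reads them once, at initialization.
os.environ["USE_XETLA"] = "OFF"
os.environ["SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS"] = "1"

print(os.environ["USE_XETLA"])
```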
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ```bash python ./generate.py --prompt 'AI是什么?' ``` @@ -43,7 +110,7 @@ In the example, several arguments can be passed to satisfy your requirements: - `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'What is AI?'`. - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`. -#### 2.3 Sample Output +#### 4.1 Sample Output #### [baichuan-inc/Baichuan2-7B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat) ```log Inference time: xxxx s diff --git a/python/llm/example/GPU/PyTorch-Models/Model/baichuan2/generate.py b/python/llm/example/GPU/PyTorch-Models/Model/baichuan2/generate.py index d9e0b9a8..8f27d1bc 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/baichuan2/generate.py +++ b/python/llm/example/GPU/PyTorch-Models/Model/baichuan2/generate.py @@ -44,6 +44,8 @@ if __name__ == '__main__': low_cpu_mem_usage=True) # With only one line to enable BigDL-LLM optimization on model + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. 
model = optimize_model(model) model = model.to('xpu') diff --git a/python/llm/example/GPU/PyTorch-Models/Model/bluelm/README.md b/python/llm/example/GPU/PyTorch-Models/Model/bluelm/README.md index eac57f83..a99a99ca 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/bluelm/README.md +++ b/python/llm/example/GPU/PyTorch-Models/Model/bluelm/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a BlueLM model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#). After installing conda, create a Python environment for BigDL-LLM: @@ -18,20 +19,85 @@ conda activate llm pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. 
Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+ +For Intel Data Center GPU Max Series + +```bash +export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so +export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 +export ENABLE_SDP_FUSION=1 +``` +> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`. +
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ```bash python ./generate.py --prompt 'AI是什么?' ``` @@ -42,7 +108,7 @@ In the example, several arguments can be passed to satisfy your requirements: - `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'AI是什么?'`. - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`. -#### 2.3 Sample Output +#### 4.1 Sample Output #### [vivo-ai/BlueLM-7B-Chat](https://huggingface.co/vivo-ai/BlueLM-7B-Chat) ```log Inference time: xxxx s diff --git a/python/llm/example/GPU/PyTorch-Models/Model/bluelm/generate.py b/python/llm/example/GPU/PyTorch-Models/Model/bluelm/generate.py index aa2369f7..d62eea25 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/bluelm/generate.py +++ b/python/llm/example/GPU/PyTorch-Models/Model/bluelm/generate.py @@ -44,6 +44,8 @@ if __name__ == '__main__': low_cpu_mem_usage=True) # With only one line to enable BigDL-LLM optimization on model + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. 
model = optimize_model(model) model = model.to('xpu') diff --git a/python/llm/example/GPU/PyTorch-Models/Model/chatglm2/README.md b/python/llm/example/GPU/PyTorch-Models/Model/chatglm2/README.md index 384824a0..887d88bc 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/chatglm2/README.md +++ b/python/llm/example/GPU/PyTorch-Models/Model/chatglm2/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example 1: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM2 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#). After installing conda, create a Python environment for BigDL-LLM: @@ -18,20 +19,85 @@ conda activate llm pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. 
Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+ +For Intel Data Center GPU Max Series + +```bash +export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so +export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 +export ENABLE_SDP_FUSION=1 +``` +> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`. +
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ```bash python ./generate.py --prompt 'AI是什么?' ``` @@ -42,7 +108,7 @@ In the example, several arguments can be passed to satisfy your requirements: - `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'AI是什么?'`. - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`. -#### 2.3 Sample Output +#### 4.1 Sample Output #### [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b) ```log Inference time: xxxx s @@ -65,6 +131,7 @@ Inference time: xxxx s ## Example 2: Stream Chat using `stream_chat()` API In the example [streamchat.py](./streamchat.py), we show a basic use case for a ChatGLM2 model to stream chat, with BigDL-LLM INT4 optimizations. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#). After installing conda, create a Python environment for BigDL-LLM: @@ -76,20 +143,84 @@ conda activate llm pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. 
Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+ +For Intel Data Center GPU Max Series + +```bash +export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so +export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 +export ENABLE_SDP_FUSION=1 +``` +> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`. +
+#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + **Stream Chat using `stream_chat()` API**: ``` python ./streamchat.py diff --git a/python/llm/example/GPU/PyTorch-Models/Model/chatglm2/generate.py b/python/llm/example/GPU/PyTorch-Models/Model/chatglm2/generate.py index ae97ef25..bd7ac596 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/chatglm2/generate.py +++ b/python/llm/example/GPU/PyTorch-Models/Model/chatglm2/generate.py @@ -45,6 +45,8 @@ if __name__ == '__main__': low_cpu_mem_usage=True) # With only one line to enable BigDL-LLM optimization on model + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = optimize_model(model) model = model.to('xpu') diff --git a/python/llm/example/GPU/PyTorch-Models/Model/chatglm2/streamchat.py b/python/llm/example/GPU/PyTorch-Models/Model/chatglm2/streamchat.py index 2ef9c150..78bf75d9 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/chatglm2/streamchat.py +++ b/python/llm/example/GPU/PyTorch-Models/Model/chatglm2/streamchat.py @@ -44,6 +44,8 @@ if __name__ == '__main__': low_cpu_mem_usage=True) # With only one line to enable BigDL-LLM optimization on model + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. 
model = optimize_model(model) model.to('xpu') diff --git a/python/llm/example/GPU/PyTorch-Models/Model/chatglm3/README.md b/python/llm/example/GPU/PyTorch-Models/Model/chatglm3/README.md index 18ab470a..84d148cc 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/chatglm3/README.md +++ b/python/llm/example/GPU/PyTorch-Models/Model/chatglm3/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example 1: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM3 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#). After installing conda, create a Python environment for BigDL-LLM: @@ -18,20 +19,85 @@ conda activate llm pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. 
Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+ +For Intel Data Center GPU Max Series + +```bash +export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so +export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 +export ENABLE_SDP_FUSION=1 +``` +> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`. +
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ```bash python ./generate.py --prompt 'AI是什么?' ``` @@ -42,7 +108,7 @@ In the example, several arguments can be passed to satisfy your requirements: - `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'AI是什么?'`. - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`. -#### 2.3 Sample Output +#### 4.1 Sample Output #### [THUDM/chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b) ```log Inference time: xxxx s @@ -64,6 +130,7 @@ AI stands for Artificial Intelligence. It refers to the development of computer ## Example 2: Stream Chat using `stream_chat()` API In the example [streamchat.py](./streamchat.py), we show a basic use case for a ChatGLM3 model to stream chat, with BigDL-LLM INT4 optimizations. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#). After installing conda, create a Python environment for BigDL-LLM: @@ -75,20 +142,83 @@ conda activate llm pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. 
Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+ +For Intel Data Center GPU Max Series + +```bash +export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so +export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 +export ENABLE_SDP_FUSION=1 +``` +> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`. +
+#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples **Stream Chat using `stream_chat()` API**: ``` python ./streamchat.py diff --git a/python/llm/example/GPU/PyTorch-Models/Model/chatglm3/generate.py b/python/llm/example/GPU/PyTorch-Models/Model/chatglm3/generate.py index f8e41d98..a6432ef3 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/chatglm3/generate.py +++ b/python/llm/example/GPU/PyTorch-Models/Model/chatglm3/generate.py @@ -45,6 +45,8 @@ if __name__ == '__main__': low_cpu_mem_usage=True) # With only one line to enable BigDL-LLM optimization on model + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = optimize_model(model) model = model.to('xpu') diff --git a/python/llm/example/GPU/PyTorch-Models/Model/chatglm3/streamchat.py b/python/llm/example/GPU/PyTorch-Models/Model/chatglm3/streamchat.py index 293a02e1..569f6ec7 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/chatglm3/streamchat.py +++ b/python/llm/example/GPU/PyTorch-Models/Model/chatglm3/streamchat.py @@ -44,6 +44,8 @@ if __name__ == '__main__': low_cpu_mem_usage=True) # With only one line to enable BigDL-LLM optimization on model + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. 
model = optimize_model(model) model.to('xpu') diff --git a/python/llm/example/GPU/PyTorch-Models/Model/codellama/README.md b/python/llm/example/GPU/PyTorch-Models/Model/codellama/README.md index 607fb274..e1c9f7bc 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/codellama/README.md +++ b/python/llm/example/GPU/PyTorch-Models/Model/codellama/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a CodeLlama model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#). After installing conda, create a Python environment for BigDL-LLM: @@ -19,20 +20,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w pip install transformers==4.34.1 # CodeLlamaTokenizer is supported in higher version of transformers ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +pip install transformers==4.34.1 # CodeLlamaTokenizer is supported in higher version of transformers +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. 
Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+ +For Intel Data Center GPU Max Series + +```bash +export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so +export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 +export ENABLE_SDP_FUSION=1 +``` +> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`. +
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ```bash python ./generate.py --prompt 'def print_hello_world():' ``` @@ -43,7 +110,7 @@ In the example, several arguments can be passed to satisfy your requirements: - `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `def print_hello_world():'`. - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`. -#### 2.3 Sample Output +#### 4.1 Sample Output #### [codellama/CodeLlama-7b-hf](https://huggingface.co/codellama/CodeLlama-7b-hf) ```log Inference time: xxxx s diff --git a/python/llm/example/GPU/PyTorch-Models/Model/codellama/generate.py b/python/llm/example/GPU/PyTorch-Models/Model/codellama/generate.py index a95d7d8c..c65b3079 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/codellama/generate.py +++ b/python/llm/example/GPU/PyTorch-Models/Model/codellama/generate.py @@ -46,6 +46,8 @@ if __name__ == '__main__': low_cpu_mem_usage=True) # With only one line to enable BigDL-LLM optimization on model + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. 
model = optimize_model(model) model = model.to('xpu') diff --git a/python/llm/example/GPU/PyTorch-Models/Model/distil-whisper/README.md b/python/llm/example/GPU/PyTorch-Models/Model/distil-whisper/README.md index ad041f25..a3201f70 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/distil-whisper/README.md +++ b/python/llm/example/GPU/PyTorch-Models/Model/distil-whisper/README.md @@ -8,6 +8,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Recognize Tokens using `generate()` API In the example [recognize.py](./recognize.py), we show a basic use case for a Distil-Whisper model to conduct transcription using `pipeline()` API for long audio input, with BigDL-LLM INT4 optimizations. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#). After installing conda, create a Python environment for BigDL-LLM: @@ -19,19 +20,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w pip install datasets soundfile librosa # required by audio processing ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +pip install datasets soundfile librosa # required by audio processing +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run -For optimal performance on Arc, it is recommended to set several environment variables. 
+#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
+ +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+ +For Intel Data Center GPU Max Series + +```bash +export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so +export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 +export ENABLE_SDP_FUSION=1 +``` +> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`. +
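Since `LD_PRELOAD` only affects processes launched after it is exported, a small check inside the launched script can confirm the variable carried over (a sketch; the helper name `tcmalloc_preloaded` is ours, and it inspects the environment variable rather than the loader's actual state):

```python
import os

def tcmalloc_preloaded(env=None):
    """Return True if libtcmalloc appears in LD_PRELOAD."""
    env = os.environ if env is None else env
    return "libtcmalloc" in env.get("LD_PRELOAD", "")

if tcmalloc_preloaded():
    print("tcmalloc is listed in LD_PRELOAD")
else:
    print("LD_PRELOAD does not mention libtcmalloc; re-run the exports above")
```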
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ``` python ./recognize.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --repo-id-or-data-path REPO_ID_OR_DATA_PATH --language LANGUAGE --chunk-length CHUNK_LENGTH --batch-size BATCH_SIZE ``` diff --git a/python/llm/example/GPU/PyTorch-Models/Model/dolly-v1/README.md b/python/llm/example/GPU/PyTorch-Models/Model/dolly-v1/README.md index d19931e3..28c04d8d 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/dolly-v1/README.md +++ b/python/llm/example/GPU/PyTorch-Models/Model/dolly-v1/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a Dolly v1 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#). After installing conda, create a Python environment for BigDL-LLM: @@ -18,20 +19,85 @@ conda activate llm pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. 
Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed via `conda install -c conda-forge -y gperftools=2.10`.
+
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ```bash python ./generate.py --prompt 'What is AI?' ``` @@ -42,7 +108,7 @@ In the example, several arguments can be passed to satisfy your requirements: - `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'What is AI?'`. - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`. -#### 2.3 Sample Output +#### 4.1 Sample Output #### [databricks/dolly-v1-6b](https://huggingface.co/databricks/dolly-v1-6b) ```log Inference time: xxxx s diff --git a/python/llm/example/GPU/PyTorch-Models/Model/dolly-v1/generate.py b/python/llm/example/GPU/PyTorch-Models/Model/dolly-v1/generate.py index 8e93646b..403135de 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/dolly-v1/generate.py +++ b/python/llm/example/GPU/PyTorch-Models/Model/dolly-v1/generate.py @@ -51,6 +51,8 @@ if __name__ == '__main__': low_cpu_mem_usage=True) # With only one line to enable BigDL-LLM optimization on model + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. 
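+    # A hypothetical sketch of that recommendation (verify that the
+    # optimize_model signature in your BigDL-LLM version accepts `cpu_embedding`):
+    #     model = optimize_model(model, cpu_embedding=True)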
model = optimize_model(model) model = model.to('xpu') diff --git a/python/llm/example/GPU/PyTorch-Models/Model/dolly-v2/README.md b/python/llm/example/GPU/PyTorch-Models/Model/dolly-v2/README.md index 73a5b72a..485212a0 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/dolly-v2/README.md +++ b/python/llm/example/GPU/PyTorch-Models/Model/dolly-v2/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a Dolly v2 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#). After installing conda, create a Python environment for BigDL-LLM: @@ -18,20 +19,85 @@ conda activate llm pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. 
Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed via `conda install -c conda-forge -y gperftools=2.10`.
+
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ```bash python ./generate.py --prompt 'What is AI?' ``` @@ -42,7 +108,7 @@ In the example, several arguments can be passed to satisfy your requirements: - `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'What is AI?'`. - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`. -#### 2.3 Sample Output +#### 4.1 Sample Output #### [databricks/dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b) ```log diff --git a/python/llm/example/GPU/PyTorch-Models/Model/dolly-v2/generate.py b/python/llm/example/GPU/PyTorch-Models/Model/dolly-v2/generate.py index 2dba7581..a9f218bd 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/dolly-v2/generate.py +++ b/python/llm/example/GPU/PyTorch-Models/Model/dolly-v2/generate.py @@ -51,6 +51,8 @@ if __name__ == '__main__': low_cpu_mem_usage=True) # With only one line to enable BigDL-LLM optimization on model + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. 
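+    # For example (sketch only; confirm that your BigDL-LLM version's
+    # optimize_model accepts the `cpu_embedding` argument):
+    #     model = optimize_model(model, cpu_embedding=True)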
model = optimize_model(model) model = model.to('xpu') diff --git a/python/llm/example/GPU/PyTorch-Models/Model/flan-t5/README.md b/python/llm/example/GPU/PyTorch-Models/Model/flan-t5/README.md index 5d462ec2..ae6ce8cb 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/flan-t5/README.md +++ b/python/llm/example/GPU/PyTorch-Models/Model/flan-t5/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a Flan-t5 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#). After installing conda, create a Python environment for BigDL-LLM: @@ -18,20 +19,85 @@ conda activate llm pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. 
Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed via `conda install -c conda-forge -y gperftools=2.10`.
+
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ```bash python ./generate.py --prompt 'Translate to German: My name is Arthur' ``` diff --git a/python/llm/example/GPU/PyTorch-Models/Model/flan-t5/generate.py b/python/llm/example/GPU/PyTorch-Models/Model/flan-t5/generate.py index c89ecbf1..c9ded902 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/flan-t5/generate.py +++ b/python/llm/example/GPU/PyTorch-Models/Model/flan-t5/generate.py @@ -47,6 +47,8 @@ if __name__ == '__main__': # "wo" module is not converted due to some issues of T5 model # (https://github.com/huggingface/transformers/issues/20287), # "lm_head" module is not converted to generate outputs with better quality + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = optimize_model(model, modules_to_not_convert=["wo", "lm_head"]) model = model.to('xpu') diff --git a/python/llm/example/GPU/PyTorch-Models/Model/llama2/README.md b/python/llm/example/GPU/PyTorch-Models/Model/llama2/README.md index 76002ace..844db7cf 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/llama2/README.md +++ b/python/llm/example/GPU/PyTorch-Models/Model/llama2/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example 1 - Basic Version: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage the Python environment. 
For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#). After installing conda, create a Python environment for BigDL-LLM: @@ -18,20 +19,84 @@ conda activate llm pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run - -For optimal performance on Arc, it is recommended to set several environment variables. +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
+For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed via `conda install -c conda-forge -y gperftools=2.10`.
+
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ```bash python ./generate.py --prompt 'What is AI?' ``` diff --git a/python/llm/example/GPU/PyTorch-Models/Model/llama2/generate.py b/python/llm/example/GPU/PyTorch-Models/Model/llama2/generate.py index 1fc9028a..81042d06 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/llama2/generate.py +++ b/python/llm/example/GPU/PyTorch-Models/Model/llama2/generate.py @@ -49,6 +49,8 @@ if __name__ == '__main__': low_cpu_mem_usage=True) # With only one line to enable BigDL-LLM optimization on model + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = optimize_model(model) model = model.to('xpu') diff --git a/python/llm/example/GPU/PyTorch-Models/Model/llava/README.md b/python/llm/example/GPU/PyTorch-Models/Model/llava/README.md index 04fddc28..3909a8eb 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/llava/README.md +++ b/python/llm/example/GPU/PyTorch-Models/Model/llava/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Multi-turn chat centered around an image using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a LLaVA model to start a multi-turn chat centered around an image using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#). 
After installing conda, create a Python environment for BigDL-LLM:
@@ -23,20 +24,89 @@ cp generate.py ./LLaVA/ # copy our example to the LLaVA folder
 cd LLaVA # change the working directory to the LLaVA folder
 ```
+#### 1.2 Installation on Windows
+We suggest using conda to manage environment:
+```bash
+conda create -n llm python=3.9 libuv
+conda activate llm
+# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
+pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+
+git clone -b v1.1.1 --depth=1 https://github.com/haotian-liu/LLaVA.git # clone the llava library
+pip install einops # install dependencies required by llava
+cp generate.py ./LLaVA/ # copy our example to the LLaVA folder
+cd LLaVA # change the working directory to the LLaVA folder
+```
+
 ### 2. Configures OneAPI environment variables
+#### 2.1 Configurations for Linux
 ```bash
 source /opt/intel/oneapi/setvars.sh
 ```
-### 3. Run
-
-For optimal performance on Arc, it is recommended to set several environment variables.
+#### 2.2 Configurations for Windows
+```cmd
+call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+```
+> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
+### 3. Runtime Configurations
+For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
+#### 3.1 Configurations for Linux
+
+For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed via `conda install -c conda-forge -y gperftools=2.10`.
+
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ```bash python ./generate.py --image-path-or-url 'https://llava-vl.github.io/static/images/monalisa.jpg' ``` @@ -65,9 +135,9 @@ The sample input image is: -### 4 Trouble shooting +### 5 Trouble shooting -#### 4.1 SSLError +#### 5.1 SSLError If you encounter the following output, it means your machine has some trouble accessing huggingface.co. ```log requests.exceptions.SSLError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /openai/clip-vit-large-patch14-336/resolve/main/config.json (Caused by SSLError(SSLZeroReturnError(6, 'TLS/SSL connection has been closed (EOF) (_ssl.c:1129)')))"), diff --git a/python/llm/example/GPU/PyTorch-Models/Model/llava/generate.py b/python/llm/example/GPU/PyTorch-Models/Model/llava/generate.py index 597814e0..0bf3f23d 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/llava/generate.py +++ b/python/llm/example/GPU/PyTorch-Models/Model/llava/generate.py @@ -292,6 +292,8 @@ if __name__ == '__main__': model_name=model_name) # With only one line to enable BigDL-LLM optimization on model + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. 
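+    # Sketch of that recommendation (assumes the optimize_model function in
+    # your BigDL-LLM version accepts `cpu_embedding`):
+    #     model = optimize_model(model, cpu_embedding=True).to('xpu')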
model = optimize_model(model).to('xpu') # Generate image tensor diff --git a/python/llm/example/GPU/PyTorch-Models/Model/mistral/README.md b/python/llm/example/GPU/PyTorch-Models/Model/mistral/README.md index 65cbcf78..bbbefbb1 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/mistral/README.md +++ b/python/llm/example/GPU/PyTorch-Models/Model/mistral/README.md @@ -9,6 +9,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a Mistral model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#). After installing conda, create a Python environment for BigDL-LLM: @@ -23,20 +24,85 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w pip install transformers==4.34.0 ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +pip install transformers==4.34.0 +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. 
Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-### 3. Run - -For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed via `conda install -c conda-forge -y gperftools=2.10`.
+
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ```bash python ./generate.py --prompt 'What is AI?' ``` @@ -47,7 +113,7 @@ In the example, several arguments can be passed to satisfy your requirements: - `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'What is AI?'`. - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`. -#### 2.3 Sample Output +#### 4.1 Sample Output #### [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) ```log Inference time: xxxx s diff --git a/python/llm/example/GPU/PyTorch-Models/Model/mistral/generate.py b/python/llm/example/GPU/PyTorch-Models/Model/mistral/generate.py index 937a6dcb..d05ee560 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/mistral/generate.py +++ b/python/llm/example/GPU/PyTorch-Models/Model/mistral/generate.py @@ -45,6 +45,8 @@ if __name__ == '__main__': low_cpu_mem_usage=True) # With only one line to enable BigDL-LLM optimization on model + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. 
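+    # e.g., a sketch (verify `cpu_embedding` support in your installed
+    # BigDL-LLM version before relying on it):
+    #     model = optimize_model(model, cpu_embedding=True)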
model = optimize_model(model) model = model.to('xpu') diff --git a/python/llm/example/GPU/PyTorch-Models/Model/mixtral/README.md b/python/llm/example/GPU/PyTorch-Models/Model/mixtral/README.md index 7f3a9958..d214ad6c 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/mixtral/README.md +++ b/python/llm/example/GPU/PyTorch-Models/Model/mixtral/README.md @@ -9,6 +9,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a Mixtral model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#). After installing conda, create a Python environment for BigDL-LLM: @@ -23,20 +24,87 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w pip install transformers==4.36.0 ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu + +# Please make sure you are using a stable version of Transformers, 4.36.0 or newer. +pip install transformers==4.36.0 +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run - -For optimal performance on Arc, it is recommended to set several environment variables. 
+#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
+For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed via `conda install -c conda-forge -y gperftools=2.10`.
+
+ +#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ```bash python ./generate.py --prompt 'What is AI?' ``` diff --git a/python/llm/example/GPU/PyTorch-Models/Model/mixtral/generate.py b/python/llm/example/GPU/PyTorch-Models/Model/mixtral/generate.py index aabfadcc..83d74959 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/mixtral/generate.py +++ b/python/llm/example/GPU/PyTorch-Models/Model/mixtral/generate.py @@ -45,6 +45,8 @@ if __name__ == '__main__': low_cpu_mem_usage=True) # With only one line to enable BigDL-LLM optimization on model + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = optimize_model(model) model = model.to('xpu') diff --git a/python/llm/example/GPU/PyTorch-Models/Model/phi-1_5/README.md b/python/llm/example/GPU/PyTorch-Models/Model/phi-1_5/README.md index 32cff9f6..1f90c48b 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/phi-1_5/README.md +++ b/python/llm/example/GPU/PyTorch-Models/Model/phi-1_5/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM, we have some recommended requirements for ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a phi-1_5 model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#). 
After installing conda, create a Python environment for BigDL-LLM: @@ -18,20 +19,85 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w pip install einops # additional package required for phi-1_5 to conduct generation ``` +#### 1.2 Installation on Windows +We suggest using conda to manage environment: +```bash +conda create -n llm python=3.9 libuv +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +pip install einops # additional package required for phi-1_5 to conduct generation +``` + ### 2. Configures OneAPI environment variables +#### 2.1 Configurations for Linux ```bash source /opt/intel/oneapi/setvars.sh ``` -### 3. Run +#### 2.2 Configurations for Windows +```cmd +call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" +``` +> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported. +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +#### 3.1 Configurations for Linux +
-For optimal performance on Arc, it is recommended to set several environment variables. +For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series ```bash export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ``` +
+ +
+
+For Intel Data Center GPU Max Series
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed via `conda install -c conda-forge -y gperftools=2.10`.
+
+#### 3.2 Configurations for Windows +
+ +For Intel iGPU + +```cmd +set SYCL_CACHE_PERSISTENT=1 +set BIGDL_LLM_XMX_DISABLED=1 +``` + +
+ +
+ +For Intel Arc™ A300-Series or Pro A60 + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` + +
+ +
+ +For other Intel dGPU Series + +There is no need to set further environment variables. + +
+ +> Note: For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +### 4. Running examples + ``` python ./generate.py --prompt 'What is AI?' ``` diff --git a/python/llm/example/GPU/PyTorch-Models/Model/phi-1_5/generate.py b/python/llm/example/GPU/PyTorch-Models/Model/phi-1_5/generate.py index 833922e9..743192e2 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/phi-1_5/generate.py +++ b/python/llm/example/GPU/PyTorch-Models/Model/phi-1_5/generate.py @@ -43,6 +43,8 @@ if __name__ == '__main__': model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True) # With only one line to enable BigDL-LLM optimization on model + # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function. + # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. model = optimize_model(model) model = model.to('xpu') diff --git a/python/llm/example/GPU/PyTorch-Models/Model/qwen-vl/README.md b/python/llm/example/GPU/PyTorch-Models/Model/qwen-vl/README.md index 2922d3d0..a88ca3a4 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/qwen-vl/README.md +++ b/python/llm/example/GPU/PyTorch-Models/Model/qwen-vl/README.md @@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req ## Example: Multimodal chat using `chat()` API In the example [chat.py](./chat.py), we show a basic use case for a Qwen-VL model to start a multimodal chat using `chat()` API, with BigDL-LLM 'optimize_model' API on Intel GPUs. ### 1. Install +#### 1.1 Installation on Linux We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#). 
 
 After installing conda, create a Python environment for BigDL-LLM:
@@ -18,19 +19,86 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
 pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional package required for Qwen-VL-Chat to conduct generation
 ```
 
+#### 1.2 Installation on Windows
+We suggest using conda to manage the environment:
+```bash
+conda create -n llm python=3.9 libuv
+conda activate llm
+# below command will install intel_extension_for_pytorch==2.1.10+xpu by default
+pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional package required for Qwen-VL-Chat to conduct generation
+```
+
 ### 2. Configures OneAPI environment variables
+#### 2.1 Configurations for Linux
 ```bash
 source /opt/intel/oneapi/setvars.sh
 ```
 
-### 3. Run
+#### 2.2 Configurations for Windows
+```cmd
+call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+```
+> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
+### 3. Runtime Configurations
+For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
+#### 3.1 Configurations for Linux
+<details>
+
-For optimal performance on Arc, it is recommended to set several environment variables.
+<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
 
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
 
+</details>
+
+<details>
+
+<summary>For Intel Data Center GPU Max Series</summary>
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
+</details>
+
+#### 3.2 Configurations for Windows
+<details>
+
+<summary>For Intel iGPU</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+</details>
+
+<details>
+
+<summary>For Intel Arc™ A300-Series or Pro A60</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+</details>
+
+<details>
+
+<summary>For other Intel dGPU Series</summary>
+
+There is no need to set further environment variables.
+
+</details>
+
+> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
+### 4. Running examples
 ```
 python ./chat.py
 ```
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/qwen-vl/chat.py b/python/llm/example/GPU/PyTorch-Models/Model/qwen-vl/chat.py
index b51ffa71..df869916 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/qwen-vl/chat.py
+++ b/python/llm/example/GPU/PyTorch-Models/Model/qwen-vl/chat.py
@@ -41,6 +41,8 @@ if __name__ == '__main__':
 
     # With only one line to enable BigDL-LLM optimization on model
     # For successful BigDL-LLM optimization on Qwen-VL-Chat, skip the 'c_fc' and 'out_proj' modules during optimization
+    # When running LLMs on Intel iGPUs on Windows, we recommend setting `cpu_embedding=True` in the `optimize_model` function.
+    # This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
     model = optimize_model(model,
                            low_bit='sym_int4',
                            modules_to_not_convert=['c_fc', 'out_proj'])
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/replit/README.md b/python/llm/example/GPU/PyTorch-Models/Model/replit/README.md
index fddb251d..55786ed6 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/replit/README.md
+++ b/python/llm/example/GPU/PyTorch-Models/Model/replit/README.md
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
 ## Example: Predict Tokens using `generate()` API
 In the example [generate.py](./generate.py), we show a basic use case for a Replit model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
 ### 1. Install
+#### 1.1 Installation on Linux
 We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
 
 After installing conda, create a Python environment for BigDL-LLM:
@@ -18,11 +19,30 @@ conda activate llm
 pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
 ```
 
+#### 1.2 Installation on Windows
+We suggest using conda to manage the environment:
+```bash
+conda create -n llm python=3.9 libuv
+conda activate llm
+# below command will install intel_extension_for_pytorch==2.1.10+xpu by default
+pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+```
+
 ### 2. Configures OneAPI environment variables
+#### 2.1 Configurations for Linux
 ```bash
 source /opt/intel/oneapi/setvars.sh
 ```
 
+#### 2.2 Configurations for Windows
+```cmd
+call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+```
+> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
+### 3. Runtime Configurations
+For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
+#### 3.1 Configurations for Linux
+<details>
+
+<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
+
-### 3. Run
-
-For optimal performance on Arc, it is recommended to set several environment variables.
@@ -32,6 +52,52 @@ export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
 
+</details>
+
+<details>
+
+<summary>For Intel Data Center GPU Max Series</summary>
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
+</details>
+
+#### 3.2 Configurations for Windows
+<details>
+
+<summary>For Intel iGPU</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+</details>
+
+<details>
+
+<summary>For Intel Arc™ A300-Series or Pro A60</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+</details>
+
+<details>
+
+<summary>For other Intel dGPU Series</summary>
+
+There is no need to set further environment variables.
+
+</details>
+
+> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
+### 4. Running examples
 ```bash
 python ./generate.py --prompt 'def print_hello_world():'
 ```
@@ -42,7 +108,7 @@ In the example, several arguments can be passed to satisfy your requirements:
 - `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `def print_hello_world():'`.
 - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`.
 
-#### 2.3 Sample Output
+#### 4.1 Sample Output
 #### [replit/replit-code-v1-3b](https://huggingface.co/replit/replit-code-v1-3b)
 ```log
 Inference time: xxxx s
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/replit/generate.py b/python/llm/example/GPU/PyTorch-Models/Model/replit/generate.py
index c43dc32d..57f2c9f6 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/replit/generate.py
+++ b/python/llm/example/GPU/PyTorch-Models/Model/replit/generate.py
@@ -44,6 +44,8 @@ if __name__ == '__main__':
                                                  low_cpu_mem_usage=True)
 
     # With only one line to enable BigDL-LLM optimization on model
+    # When running LLMs on Intel iGPUs on Windows, we recommend setting `cpu_embedding=True` in the `optimize_model` function.
+    # This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
     model = optimize_model(model)
 
     model = model.to('xpu')
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/solar/README.md b/python/llm/example/GPU/PyTorch-Models/Model/solar/README.md
index 0a0de502..e18edbbe 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/solar/README.md
+++ b/python/llm/example/GPU/PyTorch-Models/Model/solar/README.md
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
 ## Example: Predict Tokens using `generate()` API
 In the example [generate.py](./generate.py), we show a basic use case for a SOLAR model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
 ### 1. Install
+#### 1.1 Installation on Linux
 We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
 
 After installing conda, create a Python environment for BigDL-LLM:
@@ -19,20 +20,85 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
 pip install transformers==4.35.2 # required by SOLAR
 ```
 
+#### 1.2 Installation on Windows
+We suggest using conda to manage the environment:
+```bash
+conda create -n llm python=3.9 libuv
+conda activate llm
+# below command will install intel_extension_for_pytorch==2.1.10+xpu by default
+pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+pip install transformers==4.35.2 # required by SOLAR
+```
+
 ### 2. Configures OneAPI environment variables
+#### 2.1 Configurations for Linux
 ```bash
 source /opt/intel/oneapi/setvars.sh
 ```
 
-### 3. Run
+#### 2.2 Configurations for Windows
+```cmd
+call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+```
+> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
+### 3. Runtime Configurations
+For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
+#### 3.1 Configurations for Linux
+<details>
+
-For optimal performance on Arc, it is recommended to set several environment variables.
+<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
 
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
 
+</details>
+
+<details>
+
+<summary>For Intel Data Center GPU Max Series</summary>
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
+</details>
+
+#### 3.2 Configurations for Windows
+<details>
+
+<summary>For Intel iGPU</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+</details>
+
+<details>
+
+<summary>For Intel Arc™ A300-Series or Pro A60</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+</details>
+
+<details>
+
+<summary>For other Intel dGPU Series</summary>
+
+There is no need to set further environment variables.
+
+</details>
+
+> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
+### 4. Running examples
 ```bash
 python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
 ```
@@ -43,7 +109,7 @@ In the example, several arguments can be passed to satisfy your requirements:
 - `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'What is AI?'`.
 - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`.
 
-#### 2.3 Sample Output
+#### 4.1 Sample Output
 #### [upstage/SOLAR-10.7B-Instruct-v1.0](https://huggingface.co/upstage/SOLAR-10.7B-Instruct-v1.0)
 ```log
 Inference time: XXXX s
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/solar/generate.py b/python/llm/example/GPU/PyTorch-Models/Model/solar/generate.py
index af9b2844..930a3881 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/solar/generate.py
+++ b/python/llm/example/GPU/PyTorch-Models/Model/solar/generate.py
@@ -47,6 +47,8 @@ if __name__ == '__main__':
                                                  low_cpu_mem_usage=True)
 
     # With only one line to enable BigDL-LLM optimization on model
+    # When running LLMs on Intel iGPUs on Windows, we recommend setting `cpu_embedding=True` in the `optimize_model` function.
+    # This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
     model = optimize_model(model)
 
     model = model.to('xpu')
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/starcoder/README.md b/python/llm/example/GPU/PyTorch-Models/Model/starcoder/README.md
index 8506db84..a1fb0a66 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/starcoder/README.md
+++ b/python/llm/example/GPU/PyTorch-Models/Model/starcoder/README.md
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
 ## Example: Predict Tokens using `generate()` API
 In the example [generate.py](./generate.py), we show a basic use case for a StarCoder model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
 ### 1. Install
+#### 1.1 Installation on Linux
 We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
 
 After installing conda, create a Python environment for BigDL-LLM:
@@ -18,20 +19,85 @@ conda activate llm
 pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
 ```
 
+#### 1.2 Installation on Windows
+We suggest using conda to manage the environment:
+```bash
+conda create -n llm python=3.9 libuv
+conda activate llm
+# below command will install intel_extension_for_pytorch==2.1.10+xpu by default
+pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+```
+
 ### 2. Configures OneAPI environment variables
+#### 2.1 Configurations for Linux
 ```bash
 source /opt/intel/oneapi/setvars.sh
 ```
 
-### 3. Run
+#### 2.2 Configurations for Windows
+```cmd
+call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+```
+> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
+### 3. Runtime Configurations
+For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
+#### 3.1 Configurations for Linux
+<details>
+
-For optimal performance on Arc, it is recommended to set several environment variables.
+<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
 
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
 
+</details>
+
+<details>
+
+<summary>For Intel Data Center GPU Max Series</summary>
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
+</details>
+
+#### 3.2 Configurations for Windows
+<details>
+
+<summary>For Intel iGPU</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+</details>
+
+<details>
+
+<summary>For Intel Arc™ A300-Series or Pro A60</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+</details>
+
+<details>
+
+<summary>For other Intel dGPU Series</summary>
+
+There is no need to set further environment variables.
+
+</details>
+
+> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
+### 4. Running examples
 ```bash
 python ./generate.py --prompt 'def print_hello_world():'
 ```
@@ -42,7 +108,7 @@ In the example, several arguments can be passed to satisfy your requirements:
 - `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `def print_hello_world():'`.
 - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`.
 
-#### 2.3 Sample Output
+#### 4.1 Sample Output
 #### [bigcode/starcoder](https://huggingface.co/bigcode/starcoder)
 ```log
 Inference time: xxxx s
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/starcoder/generate.py b/python/llm/example/GPU/PyTorch-Models/Model/starcoder/generate.py
index 8caf6761..a4e0a08a 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/starcoder/generate.py
+++ b/python/llm/example/GPU/PyTorch-Models/Model/starcoder/generate.py
@@ -44,6 +44,8 @@ if __name__ == '__main__':
                                                  low_cpu_mem_usage=True)
 
     # With only one line to enable BigDL-LLM optimization on model
+    # When running LLMs on Intel iGPUs on Windows, we recommend setting `cpu_embedding=True` in the `optimize_model` function.
+    # This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
     model = optimize_model(model)
 
     model = model.to('xpu')
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/yi/README.md b/python/llm/example/GPU/PyTorch-Models/Model/yi/README.md
index e4927ae1..32f85134 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/yi/README.md
+++ b/python/llm/example/GPU/PyTorch-Models/Model/yi/README.md
@@ -7,6 +7,7 @@ To run these examples with BigDL-LLM on Intel GPUs, we have some recommended req
 ## Example: Predict Tokens using `generate()` API
 In the example [generate.py](./generate.py), we show a basic use case for a Yi model to predict the next N tokens using `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
 ### 1. Install
+#### 1.1 Installation on Linux
 We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
 
 After installing conda, create a Python environment for BigDL-LLM:
@@ -19,20 +20,85 @@ pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-w
 pip install einops # additional package required for Yi-6B to conduct generation
 ```
 
+#### 1.2 Installation on Windows
+We suggest using conda to manage the environment:
+```bash
+conda create -n llm python=3.9 libuv
+conda activate llm
+# below command will install intel_extension_for_pytorch==2.1.10+xpu by default
+pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+pip install einops # additional package required for Yi-6B to conduct generation
+```
+
 ### 2. Configures OneAPI environment variables
+#### 2.1 Configurations for Linux
 ```bash
 source /opt/intel/oneapi/setvars.sh
 ```
 
-### 3. Run
-
-For optimal performance on Arc, it is recommended to set several environment variables.
+#### 2.2 Configurations for Windows
+```cmd
+call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+```
+> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
+### 3. Runtime Configurations
+For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
+#### 3.1 Configurations for Linux
+<details>
+
+<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
 
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
 
+</details>
+
+<details>
+
+<summary>For Intel Data Center GPU Max Series</summary>
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
+</details>
+
+#### 3.2 Configurations for Windows
+<details>
+
+<summary>For Intel iGPU</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+set BIGDL_LLM_XMX_DISABLED=1
+```
+
+</details>
+
+<details>
+
+<summary>For Intel Arc™ A300-Series or Pro A60</summary>
+
+```cmd
+set SYCL_CACHE_PERSISTENT=1
+```
+
+</details>
+
+<details>
+
+<summary>For other Intel dGPU Series</summary>
+
+There is no need to set further environment variables.
+
+</details>
+
+> Note: The first time each model runs on an Intel iGPU or Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
+### 4. Running examples
 ```bash
 python ./generate.py
 ```
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/yi/generate.py b/python/llm/example/GPU/PyTorch-Models/Model/yi/generate.py
index 0bafebc1..d6649004 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/yi/generate.py
+++ b/python/llm/example/GPU/PyTorch-Models/Model/yi/generate.py
@@ -37,11 +37,13 @@ if __name__ == '__main__':
     args = parser.parse_args()
     model_path = args.repo_id_or_model_path
 
-    # Load model in 4 bit,
-    # which convert the relevant layers in the model into INT4 format
     model = AutoModelForCausalLM.from_pretrained(model_path,
                                                  trust_remote_code=True,
                                                  use_cache=True)
+
+    # With only one line to enable BigDL-LLM optimization on model
+    # When running LLMs on Intel iGPUs on Windows, we recommend setting `cpu_embedding=True` in the `optimize_model` function.
+    # This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
     model = optimize_model(model)
 
     model = model.to('xpu')
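
Every README touched above now follows the same install → oneAPI setup → runtime configuration → run sequence. For reference, the Linux path for Intel Arc™ A-Series / Flex graphics can be strung together into a single session; this is only a sketch assembled from the steps these diffs add, and it assumes the default oneAPI install location and the `llm` conda environment used in the examples:

```shell
# Sketch of the Linux (Intel Arc A-Series / Flex) flow from the steps above.
conda create -n llm python=3.9 -y
conda activate llm
# installs intel_extension_for_pytorch==2.1.10+xpu by default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu

# oneAPI environment, then the recommended runtime configuration
source /opt/intel/oneapi/setvars.sh
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1

# run any of the examples, e.g.
python ./generate.py --prompt 'What is AI?'
```

On Windows the equivalent flow uses CMD instead: `call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"`, then `set SYCL_CACHE_PERSISTENT=1` (plus `set BIGDL_LLM_XMX_DISABLED=1` on iGPUs) before running the example.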