ChatGLM Examples Restructure regarding Installation Steps (#11285)
* merge install step in glm examples
* fix section
* fix section
* fix tiktoken

parent 91965b5d05
commit 0e7a31a09c
10 changed files with 98 additions and 635 deletions
@@ -5,9 +5,7 @@ In this directory, you will find examples on how you could apply IPEX-LLM INT4 o
## 0. Requirements
To run these examples with IPEX-LLM, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information.
## Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM2 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations.
### 1. Install
## 1. Install
We suggest using conda to manage environment:
On Linux:

@@ -29,7 +27,11 @@ conda activate llm
pip install --pre --upgrade ipex-llm[all]
```
### 2. Run
## 2. Run
### Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM2 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations.
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
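The script itself is not part of this diff; as a rough sketch of the flow it wraps (assuming the usual `ipex_llm.transformers` loading API and the `THUDM/chatglm2-6b` checkpoint, not the actual file):

```python
# Hypothetical sketch of the INT4 generate() flow described above; not the real generate.py.
import time
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModel  # ipex-llm drop-in loader (assumed API)

model_path = "THUDM/chatglm2-6b"
prompt = "AI是什么?"
n_predict = 32

# load_in_4bit=True applies the IPEX-LLM INT4 optimization while loading
model = AutoModel.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

with torch.inference_mode():
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    st = time.time()
    output = model.generate(input_ids, max_new_tokens=n_predict)
    print(f"Inference time: {time.time() - st} s")
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```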
@@ -63,7 +65,7 @@ numactl -C 0-47 -m 0 python ./generate.py
```
#### 2.3 Sample Output
#### [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b)
##### [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b)
```log
Inference time: xxxx s
-------------------- Prompt --------------------

@@ -88,31 +90,9 @@ Inference time: xxxx s
答: Artificial Intelligence (AI) refers to the ability of a computer or machine to perform tasks that typically require human-like intelligence, such as understanding language, recognizing patterns
```
## Example 2: Stream Chat using `stream_chat()` API
### Example 2: Stream Chat using `stream_chat()` API
In the example [streamchat.py](./streamchat.py), we show a basic use case for a ChatGLM2 model to stream chat, with IPEX-LLM INT4 optimizations.
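`stream_chat()` is ChatGLM's own chat API; a minimal sketch of the streaming loop such a script runs (hypothetical, assuming the same INT4 loading as in the sketch above):

```python
# Hypothetical streaming loop; not the real streamchat.py.
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModel  # assumed ipex-llm loader, as above

model_path = "THUDM/chatglm2-6b"
question = "晚上睡不着应该怎么办"  # the example's default question

model = AutoModel.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# stream_chat() yields the partial response on every step; print only the new characters.
printed = 0
for response, history in model.stream_chat(tokenizer, question, history=[]):
    print(response[printed:], end="", flush=True)
    printed = len(response)
print()
```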
### 1. Install
We suggest using conda to manage environment:
On Linux:
```bash
conda create -n llm python=3.11 # recommend to use Python 3.11
conda activate llm

# install the latest ipex-llm nightly build with 'all' option
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
```
On Windows:
```cmd
conda create -n llm python=3.11
conda activate llm

pip install --pre --upgrade ipex-llm[all]
```
### 2. Run
**Stream Chat using `stream_chat()` API**:
```
python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION
@@ -5,9 +5,7 @@ In this directory, you will find examples on how you could apply IPEX-LLM INT4 o
## 0. Requirements
To run these examples with IPEX-LLM, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information.
## Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM3 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations.
### 1. Install
## 1. Install
We suggest using conda to manage environment:
On Linux:

@@ -29,7 +27,11 @@ conda activate llm
pip install --pre --upgrade ipex-llm[all]
```
### 2. Run
## 2. Run
### Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM3 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations.
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```

@@ -63,7 +65,7 @@ numactl -C 0-47 -m 0 python ./generate.py
```
#### 2.3 Sample Output
#### [THUDM/chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b)
##### [THUDM/chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b)
```log
Inference time: xxxx s
-------------------- Prompt --------------------

@@ -89,31 +91,9 @@ What is AI?
AI stands for Artificial Intelligence. It refers to the development of computer systems that can perform tasks that would normally require human intelligence, such as recognizing speech or making
```
## Example 2: Stream Chat using `stream_chat()` API
### Example 2: Stream Chat using `stream_chat()` API
In the example [streamchat.py](./streamchat.py), we show a basic use case for a ChatGLM3 model to stream chat, with IPEX-LLM INT4 optimizations.
### 1. Install
We suggest using conda to manage environment:
On Linux:
```bash
conda create -n llm python=3.11 # recommend to use Python 3.11
conda activate llm

# install the latest ipex-llm nightly build with 'all' option
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
```
On Windows:
```cmd
conda create -n llm python=3.11
conda activate llm

pip install --pre --upgrade ipex-llm[all]
```
### 2. Run
**Stream Chat using `stream_chat()` API**:
```
python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION
@@ -5,9 +5,7 @@ In this directory, you will find examples on how you could apply IPEX-LLM INT4 o
## 0. Requirements
To run these examples with IPEX-LLM, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information.
## Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a GLM-4 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations.
### 1. Install
## 1. Install
We suggest using conda to manage environment:
On Linux:

@@ -20,7 +18,7 @@ conda activate llm
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu

# install tiktoken required for GLM-4
pip install tiktoken
pip install "tiktoken>=0.7.0"
```
On Windows:

@@ -31,10 +29,14 @@ conda activate llm
pip install --pre --upgrade ipex-llm[all]

pip install tiktoken
pip install "tiktoken>=0.7.0"
```
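The only change in these install hunks is pinning the GLM-4 tokenizer dependency to `tiktoken>=0.7.0`. If an environment was set up with the older unpinned command, one way to confirm it satisfies the new pin is a quick check like this (hypothetical snippet, not part of the examples; `packaging` is assumed to be available, as it normally ships with `transformers`):

```python
# Hypothetical sanity check for the GLM-4 tokenizer dependency.
from importlib.metadata import version
from packaging.version import Version

installed = version("tiktoken")
assert Version(installed) >= Version("0.7.0"), "GLM-4 examples expect tiktoken>=0.7.0"
print("tiktoken", installed, "OK")
```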
### 2. Run
## 2. Run
### Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a GLM-4 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations.
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
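GLM-4 differs from the ChatGLM examples mainly in the checkpoint and its tiktoken-based tokenizer; a rough sketch of what such a script does under those assumptions (hypothetical, not the actual file, and assuming the causal-LM entry point of the ipex-llm loader):

```python
# Hypothetical GLM-4 INT4 sketch; not the real generate.py.
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM  # assumed ipex-llm loader

model_path = "THUDM/glm-4-9b-chat"
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True,
                                             trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)  # needs tiktoken>=0.7.0

messages = [{"role": "user", "content": "What is AI?"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                          return_tensors="pt")
with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```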
@@ -95,36 +97,9 @@ What is AI?
Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. The term "art
```
## Example 2: Stream Chat using `stream_chat()` API
### Example 2: Stream Chat using `stream_chat()` API
In the example [streamchat.py](./streamchat.py), we show a basic use case for a GLM-4 model to stream chat, with IPEX-LLM INT4 optimizations.
### 1. Install
We suggest using conda to manage environment:
On Linux:
```bash
conda create -n llm python=3.11 # recommend to use Python 3.11
conda activate llm

# install the latest ipex-llm nightly build with 'all' option
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu

# install tiktoken required for GLM-4
pip install tiktoken
```
On Windows:
```cmd
conda create -n llm python=3.11
conda activate llm

pip install --pre --upgrade ipex-llm[all]

pip install tiktoken
```
### 2. Run
**Stream Chat using `stream_chat()` API**:
```
python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION
@@ -21,7 +21,7 @@ conda activate llm
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu

# install tiktoken required for GLM-4
pip install tiktoken
pip install "tiktoken>=0.7.0"
```
On Windows:

@@ -32,7 +32,7 @@ conda activate llm
pip install --pre --upgrade ipex-llm[all]

pip install tiktoken
pip install "tiktoken>=0.7.0"
```
### 2. Run
@@ -5,11 +5,8 @@ In this directory, you will find examples on how you could apply IPEX-LLM INT4 o
## 0. Requirements
To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information.
## Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM2 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
## 1. Install
### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11

@@ -18,7 +15,7 @@ conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
#### 1.2 Installation on Windows
### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11 libuv

@@ -28,7 +25,7 @@ conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
### 2. Configures OneAPI environment variables for Linux
## 2. Configures OneAPI environment variables for Linux
> [!NOTE]
> Skip this step if you are running on Windows.

@@ -39,9 +36,9 @@ This is a required step on Linux for APT or offline installed oneAPI. Skip this
source /opt/intel/oneapi/setvars.sh
```
### 3. Runtime Configurations
## 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

@@ -78,7 +75,7 @@ export BIGDL_LLM_XMX_DISABLED=1
</details>
#### 3.2 Configurations for Windows
### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>

@@ -103,7 +100,11 @@ set SYCL_CACHE_PERSISTENT=1
> [!NOTE]
> For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
## 4. Running examples
### Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM2 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel GPUs.
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
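Compared with the CPU variant, the GPU script's main difference is moving the optimized model and its inputs to the `xpu` device; a rough sketch of that flow (hypothetical, not the actual file):

```python
# Hypothetical sketch of the Intel GPU (XPU) flow; not the real generate.py.
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModel  # assumed ipex-llm loader

model_path = "THUDM/chatglm2-6b"
model = AutoModel.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True)
model = model.to("xpu")  # requires ipex-llm[xpu] / intel_extension_for_pytorch
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

with torch.inference_mode():
    input_ids = tokenizer.encode("AI是什么?", return_tensors="pt").to("xpu")
    output = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output[0].cpu(), skip_special_tokens=True))
```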
@@ -139,103 +140,9 @@ Inference time: xxxx s
答: Artificial Intelligence (AI) refers to the ability of a computer or machine to perform tasks that typically require human-like intelligence, such as understanding language, recognizing patterns
```
## Example 2: Stream Chat using `stream_chat()` API
### Example 2: Stream Chat using `stream_chat()` API
In the example [streamchat.py](./streamchat.py), we show a basic use case for a ChatGLM2 model to stream chat, with IPEX-LLM INT4 optimizations.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```

### 2. Configures OneAPI environment variables for Linux
> [!NOTE]
> Skip this step if you are running on Windows.
This is a required step on Linux for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI.
```bash
source /opt/intel/oneapi/setvars.sh
```
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export ENABLE_SDP_FUSION=1
```
> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
<details>
<summary>For Intel iGPU</summary>
```bash
export SYCL_CACHE_PERSISTENT=1
export BIGDL_LLM_XMX_DISABLED=1
```
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A-Series Graphics</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
> [!NOTE]
> For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
**Stream Chat using `stream_chat()` API**:
```
python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION
@@ -5,10 +5,8 @@ In this directory, you will find examples on how you could apply IPEX-LLM INT4 o
## 0. Requirements
To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information.
## Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM3 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
## 1. Install
### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11

@@ -17,7 +15,7 @@ conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
#### 1.2 Installation on Windows
### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11 libuv

@@ -27,7 +25,7 @@ conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
### 2. Configures OneAPI environment variables for Linux
## 2. Configures OneAPI environment variables for Linux
> [!NOTE]
> Skip this step if you are running on Windows.

@@ -38,9 +36,9 @@ This is a required step on Linux for APT or offline installed oneAPI. Skip this
source /opt/intel/oneapi/setvars.sh
```
### 3. Runtime Configurations
## 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

@@ -77,7 +75,7 @@ export BIGDL_LLM_XMX_DISABLED=1
</details>
#### 3.2 Configurations for Windows
### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>

@@ -101,7 +99,10 @@ set SYCL_CACHE_PERSISTENT=1
> [!NOTE]
> For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
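Because of that one-time compilation, timing runs usually do a throw-away warm-up generation before the measured one; a sketch of the pattern (hypothetical helper, not part of the example scripts):

```python
# Hypothetical warm-up-then-measure helper for XPU runs.
import time
import torch

def timed_generate(model, input_ids, n_predict=32):
    _ = model.generate(input_ids, max_new_tokens=n_predict)  # warm-up: triggers kernel compilation
    torch.xpu.synchronize()  # assumes the XPU backend from intel_extension_for_pytorch is loaded
    st = time.time()
    output = model.generate(input_ids, max_new_tokens=n_predict)
    torch.xpu.synchronize()
    print(f"Inference time: {time.time() - st} s")
    return output
```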
### 4. Running examples
## 4. Running examples
### Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM3 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel GPUs.
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT

@@ -139,103 +140,8 @@ What is AI?
AI stands for Artificial Intelligence. It refers to the development of computer systems or machines that can perform tasks that would normally require human intelligence, such as recognizing patterns
```
## Example 2: Stream Chat using `stream_chat()` API
### Example 2: Stream Chat using `stream_chat()` API
In the example [streamchat.py](./streamchat.py), we show a basic use case for a ChatGLM3 model to stream chat, with IPEX-LLM INT4 optimizations.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11 libuv
conda activate llm

# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
### 2. Configures OneAPI environment variables for Linux
> [!NOTE]
> Skip this step if you are running on Windows.
This is a required step on Linux for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI.
```bash
source /opt/intel/oneapi/setvars.sh
```
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export ENABLE_SDP_FUSION=1
```
> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
<details>
<summary>For Intel iGPU</summary>
```bash
export SYCL_CACHE_PERSISTENT=1
export BIGDL_LLM_XMX_DISABLED=1
```
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A-Series Graphics</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
> [!NOTE]
> For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
**Stream Chat using `stream_chat()` API**:
```
@@ -4,10 +4,8 @@ In this directory, you will find examples on how you could apply IPEX-LLM INT4 o
## 0. Requirements
To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information.
## Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a GLM-4 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
## 1. Install
### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11

@@ -16,10 +14,10 @@ conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# install tiktoken required for GLM-4
pip install tiktoken
pip install "tiktoken>=0.7.0"
```
#### 1.2 Installation on Windows
### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11 libuv

@@ -29,10 +27,10 @@ conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# install tiktoken required for GLM-4
pip install tiktoken
pip install "tiktoken>=0.7.0"
```
### 2. Configures OneAPI environment variables for Linux
## 2. Configures OneAPI environment variables for Linux
> [!NOTE]
> Skip this step if you are running on Windows.

@@ -43,9 +41,9 @@ This is a required step on Linux for APT or offline installed oneAPI. Skip this
source /opt/intel/oneapi/setvars.sh
```
### 3. Runtime Configurations
## 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

@@ -82,7 +80,7 @@ export BIGDL_LLM_XMX_DISABLED=1
</details>
#### 3.2 Configurations for Windows
### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>

@@ -106,7 +104,10 @@ set SYCL_CACHE_PERSISTENT=1
> [!NOTE]
> For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
## 4. Running examples
### Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a GLM-4 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel GPUs.
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT

@@ -118,7 +119,7 @@ Arguments info:
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`.
#### Sample Output
##### [THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat)
#### [THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat)
```log
Inference time: xxxx s
-------------------- Prompt --------------------

@@ -145,109 +146,8 @@ What is AI?
Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. The term "art
```
## Example 2: Stream Chat using `stream_chat()` API
### Example 2: Stream Chat using `stream_chat()` API
In the example [streamchat.py](./streamchat.py), we show a basic use case for a GLM-4 model to stream chat, with IPEX-LLM INT4 optimizations.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# install tiktoken required for GLM-4
pip install tiktoken
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11 libuv
conda activate llm

# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# install tiktoken required for GLM-4
pip install tiktoken
```
### 2. Configures OneAPI environment variables for Linux
> [!NOTE]
> Skip this step if you are running on Windows.
This is a required step on Linux for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI.
```bash
source /opt/intel/oneapi/setvars.sh
```
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export ENABLE_SDP_FUSION=1
```
> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
<details>
<summary>For Intel iGPU</summary>
```bash
export SYCL_CACHE_PERSISTENT=1
export BIGDL_LLM_XMX_DISABLED=1
```
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A-Series Graphics</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
> [!NOTE]
> For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
**Stream Chat using `stream_chat()` API**:
```
@@ -4,10 +4,8 @@ In this directory, you will find examples on how you could use IPEX-LLM `optimiz
## Requirements
To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information.
## Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM2 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
## 1. Install
### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11

@@ -16,7 +14,7 @@ conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
#### 1.2 Installation on Windows
### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11 libuv

@@ -26,7 +24,7 @@ conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
### 2. Configures OneAPI environment variables for Linux
## 2. Configures OneAPI environment variables for Linux
> [!NOTE]
> Skip this step if you are running on Windows.

@@ -37,9 +35,9 @@ This is a required step on Linux for APT or offline installed oneAPI. Skip this
source /opt/intel/oneapi/setvars.sh
```
### 3. Runtime Configurations
## 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

@@ -76,7 +74,7 @@ export BIGDL_LLM_XMX_DISABLED=1
</details>
#### 3.2 Configurations for Windows
### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>

@@ -100,7 +98,11 @@ set SYCL_CACHE_PERSISTENT=1
> [!NOTE]
> For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
## 4. Running examples
### Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM2 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel GPUs.
```bash
python ./generate.py --prompt 'AI是什么?'

@@ -112,8 +114,9 @@ In the example, several arguments can be passed to satisfy your requirements:
- `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'AI是什么?'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`.
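Unlike the HuggingFace-Transformers examples above, these PyTorch-API examples load the model normally and then apply the low-bit optimization with `optimize_model()`; a rough sketch of that flow (hypothetical, assuming the `ipex_llm.optimize_model` entry point, not the actual script):

```python
# Hypothetical sketch of the optimize_model() flow; not the real generate.py.
import torch
from transformers import AutoModel, AutoTokenizer
from ipex_llm import optimize_model  # assumed entry point used by the PyTorch-API examples

model_path = "THUDM/chatglm2-6b"
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
model = optimize_model(model)        # apply the low-bit (INT4 by default) optimization
model = model.to("xpu")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

with torch.inference_mode():
    input_ids = tokenizer.encode("AI是什么?", return_tensors="pt").to("xpu")
    output = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output[0].cpu(), skip_special_tokens=True))
```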
#### 4.1 Sample Output
#### Sample Output
#### [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b)
```log
Inference time: xxxx s
-------------------- Output --------------------

@@ -132,103 +135,8 @@ Inference time: xxxx s
答: Artificial Intelligence (AI) refers to the ability of a computer or machine to perform tasks that typically require human-like intelligence, such as understanding language, recognizing patterns
```
## Example 2: Stream Chat using `stream_chat()` API
### Example 2: Stream Chat using `stream_chat()` API
In the example [streamchat.py](./streamchat.py), we show a basic use case for a ChatGLM2 model to stream chat, with IPEX-LLM INT4 optimizations.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11 libuv
conda activate llm

# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
### 2. Configures OneAPI environment variables for Linux
> [!NOTE]
> Skip this step if you are running on Windows.
This is a required step on Linux for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI.
```bash
source /opt/intel/oneapi/setvars.sh
```
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export ENABLE_SDP_FUSION=1
```
> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
<details>
<summary>For Intel iGPU</summary>
```bash
export SYCL_CACHE_PERSISTENT=1
export BIGDL_LLM_XMX_DISABLED=1
```
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A-Series Graphics</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
> [!NOTE]
> For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
**Stream Chat using `stream_chat()` API**:
```

@@ -244,4 +152,4 @@ In the example, several arguments can be passed to satisfy your requirements:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the ChatGLM2 model to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'THUDM/chatglm2-6b'`.
- `--question QUESTION`: argument defining the question to ask. It is default to be `"晚上睡不着应该怎么办"`.
- `--disable-stream`: argument defining whether to stream chat. If include `--disable-stream` when running the script, the stream chat is disabled and `chat()` API is used.
- `--disable-stream`: argument defining whether to stream chat. If include `--disable-stream` when running the script, the stream chat is disabled and `chat()` API is used.
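A sketch of how such a script can honor `--disable-stream` by falling back from `stream_chat()` to the plain `chat()` API (hypothetical, not the actual streamchat.py; the loading lines assume the `optimize_model()` flow sketched earlier):

```python
# Hypothetical handling of --disable-stream; not the real streamchat.py.
import argparse
from transformers import AutoModel, AutoTokenizer
from ipex_llm import optimize_model  # assumed entry point, as above

parser = argparse.ArgumentParser()
parser.add_argument("--question", default="晚上睡不着应该怎么办")
parser.add_argument("--disable-stream", action="store_true")
args = parser.parse_args()

model = optimize_model(AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)).to("xpu")
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)

if args.disable_stream:
    # non-streaming: ChatGLM's chat() returns the full answer at once
    response, _history = model.chat(tokenizer, args.question, history=[])
    print(response)
else:
    printed = 0
    for response, _history in model.stream_chat(tokenizer, args.question, history=[]):
        print(response[printed:], end="", flush=True)
        printed = len(response)
    print()
```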
@@ -4,10 +4,8 @@ In this directory, you will find examples on how you could use IPEX-LLM `optimiz
## Requirements
To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information.
## Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM3 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
## 1. Install
### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11

@@ -16,7 +14,7 @@ conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
#### 1.2 Installation on Windows
### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11 libuv

@@ -26,7 +24,7 @@ conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
### 2. Configures OneAPI environment variables for Linux
## 2. Configures OneAPI environment variables for Linux
> [!NOTE]
> Skip this step if you are running on Windows.

@@ -37,9 +35,9 @@ This is a required step on Linux for APT or offline installed oneAPI. Skip this
source /opt/intel/oneapi/setvars.sh
```
### 3. Runtime Configurations
## 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

@@ -76,7 +74,7 @@ export BIGDL_LLM_XMX_DISABLED=1
</details>
#### 3.2 Configurations for Windows
### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>

@@ -100,7 +98,10 @@ set SYCL_CACHE_PERSISTENT=1
> [!NOTE]
> For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
## 4. Running examples
### Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM3 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel GPUs.
```bash
python ./generate.py --prompt 'AI是什么?'

@@ -112,7 +113,7 @@ In the example, several arguments can be passed to satisfy your requirements:
- `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'AI是什么?'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`.
#### 4.1 Sample Output
#### Sample Output
#### [THUDM/chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b)
```log
Inference time: xxxx s

@@ -131,103 +132,9 @@ What is AI?
AI stands for Artificial Intelligence. It refers to the development of computer systems or machines that can perform tasks that would normally require human intelligence, such as recognizing patterns
```
## Example 2: Stream Chat using `stream_chat()` API
### Example 2: Stream Chat using `stream_chat()` API
In the example [streamchat.py](./streamchat.py), we show a basic use case for a ChatGLM3 model to stream chat, with IPEX-LLM INT4 optimizations.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11 libuv
conda activate llm

# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
### 2. Configures OneAPI environment variables for Linux
> [!NOTE]
> Skip this step if you are running on Windows.
This is a required step on Linux for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI.
```bash
source /opt/intel/oneapi/setvars.sh
```
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export ENABLE_SDP_FUSION=1
```
> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
<details>
<summary>For Intel iGPU</summary>
```bash
export SYCL_CACHE_PERSISTENT=1
export BIGDL_LLM_XMX_DISABLED=1
```
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A-Series Graphics</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
> [!NOTE]
> For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Running examples
**Stream Chat using `stream_chat()` API**:
```
python ./streamchat.py
@@ -16,7 +16,7 @@ conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# install tiktoken required for GLM-4
pip install tiktoken
pip install "tiktoken>=0.7.0"
```
#### 1.2 Installation on Windows

@@ -29,7 +29,7 @@ conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# install tiktoken required for GLM-4
pip install tiktoken
pip install "tiktoken>=0.7.0"
```
### 2. Configures OneAPI environment variables for Linux