ChatGLM Examples Restructure regarding Installation Steps (#11285)

* merge install step in glm examples
* fix section
* fix section
* fix tiktoken

parent 91965b5d05
commit 0e7a31a09c

10 changed files with 98 additions and 635 deletions
````diff
@@ -5,9 +5,7 @@ In this directory, you will find examples on how you could apply IPEX-LLM INT4 o
 ## 0. Requirements
 To run these examples with IPEX-LLM, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information.
 
-## Example 1: Predict Tokens using `generate()` API
-In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM2 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations.
-### 1. Install
+## 1. Install
 We suggest using conda to manage environment:
 
 On Linux:
@@ -29,7 +27,11 @@ conda activate llm
 pip install --pre --upgrade ipex-llm[all]
 ```
 
-### 2. Run
+## 2. Run
+
+### Example 1: Predict Tokens using `generate()` API
+In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM2 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations.
+
 ```
 python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
 ```
@@ -63,7 +65,7 @@ numactl -C 0-47 -m 0 python ./generate.py
 ```
 
 #### 2.3 Sample Output
-#### [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b)
+##### [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b)
 ```log
 Inference time: xxxx s
 -------------------- Prompt --------------------
@@ -88,31 +90,9 @@ Inference time: xxxx s
 答: Artificial Intelligence (AI) refers to the ability of a computer or machine to perform tasks that typically require human-like intelligence, such as understanding language, recognizing patterns
 ```
 
-## Example 2: Stream Chat using `stream_chat()` API
+### Example 2: Stream Chat using `stream_chat()` API
 In the example [streamchat.py](./streamchat.py), we show a basic use case for a ChatGLM2 model to stream chat, with IPEX-LLM INT4 optimizations.
-### 1. Install
-We suggest using conda to manage environment:
 
-On Linux:
-
-```bash
-conda create -n llm python=3.11 # recommend to use Python 3.11
-conda activate llm
-
-# install the latest ipex-llm nightly build with 'all' option
-pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
-```
-
-On Windows:
-
-```cmd
-conda create -n llm python=3.11
-conda activate llm
-
-pip install --pre --upgrade ipex-llm[all]
-```
-
-### 2. Run
 **Stream Chat using `stream_chat()` API**:
 ```
 python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION
````
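For orientation, the `generate()` flow that the README being edited here documents boils down to loading the checkpoint through IPEX-LLM's Transformers-style wrapper with INT4 enabled. A minimal sketch, assuming the public `ipex_llm.transformers.AutoModel` API and the `THUDM/chatglm2-6b` checkpoint named in the README (the shipped `generate.py` may differ in its details):

```python
# Minimal sketch of the INT4 generate() flow the README describes.
# Assumes ipex-llm installed as above; the real generate.py may differ.
from ipex_llm.transformers import AutoModel
from transformers import AutoTokenizer

model_path = "THUDM/chatglm2-6b"
# load_in_4bit=True applies the IPEX-LLM INT4 optimizations
model = AutoModel.from_pretrained(model_path, load_in_4bit=True,
                                  trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

inputs = tokenizer("What is AI?", return_tensors="pt")
output = model.generate(inputs.input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```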
````diff
@@ -5,9 +5,7 @@ In this directory, you will find examples on how you could apply IPEX-LLM INT4 o
 ## 0. Requirements
 To run these examples with IPEX-LLM, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information.
 
-## Example 1: Predict Tokens using `generate()` API
-In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM3 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations.
-### 1. Install
+## 1. Install
 We suggest using conda to manage environment:
 
 On Linux:
@@ -29,7 +27,11 @@ conda activate llm
 pip install --pre --upgrade ipex-llm[all]
 ```
 
-### 2. Run
+## 2. Run
+
+### Example 1: Predict Tokens using `generate()` API
+In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM3 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations.
+
 ```
 python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
 ```
@@ -63,7 +65,7 @@ numactl -C 0-47 -m 0 python ./generate.py
 ```
 
 #### 2.3 Sample Output
-#### [THUDM/chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b)
+##### [THUDM/chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b)
 ```log
 Inference time: xxxx s
 -------------------- Prompt --------------------
@@ -89,31 +91,9 @@ What is AI?
 AI stands for Artificial Intelligence. It refers to the development of computer systems that can perform tasks that would normally require human intelligence, such as recognizing speech or making
 ```
 
-## Example 2: Stream Chat using `stream_chat()` API
+### Example 2: Stream Chat using `stream_chat()` API
 In the example [streamchat.py](./streamchat.py), we show a basic use case for a ChatGLM3 model to stream chat, with IPEX-LLM INT4 optimizations.
-### 1. Install
-We suggest using conda to manage environment:
 
-On Linux:
-
-```bash
-conda create -n llm python=3.11 # recommend to use Python 3.11
-conda activate llm
-
-# install the latest ipex-llm nightly build with 'all' option
-pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
-```
-
-On Windows:
-
-```cmd
-conda create -n llm python=3.11
-conda activate llm
-
-pip install --pre --upgrade ipex-llm[all]
-```
-
-### 2. Run
 **Stream Chat using `stream_chat()` API**:
 ```
 python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION
````
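The `stream_chat()` example that both CPU READMEs keep as Example 2 relies on the chat method that ChatGLM checkpoints expose through their remote code. A hedged sketch, reusing `model` and `tokenizer` loaded as in the earlier snippet (the actual `streamchat.py` may differ):

```python
# Sketch of stream_chat() usage: ChatGLM-family remote code yields the
# accumulated response on each step, so print only the new suffix.
printed = ""
for response, history in model.stream_chat(tokenizer, "What is AI?", history=[]):
    print(response[len(printed):], end="", flush=True)
    printed = response
print()
```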
````diff
@@ -5,9 +5,7 @@ In this directory, you will find examples on how you could apply IPEX-LLM INT4 o
 ## 0. Requirements
 To run these examples with IPEX-LLM, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information.
 
-## Example 1: Predict Tokens using `generate()` API
-In the example [generate.py](./generate.py), we show a basic use case for a GLM-4 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations.
-### 1. Install
+## 1. Install
 We suggest using conda to manage environment:
 
 On Linux:
@@ -20,7 +18,7 @@ conda activate llm
 pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
 
 # install tiktoken required for GLM-4
-pip install tiktoken
+pip install "tiktoken>=0.7.0"
 ```
 
 On Windows:
@@ -31,10 +29,14 @@ conda activate llm
 
 pip install --pre --upgrade ipex-llm[all]
 
-pip install tiktoken
+pip install "tiktoken>=0.7.0"
 ```
 
-### 2. Run
+## 2. Run
+
+### Example 1: Predict Tokens using `generate()` API
+In the example [generate.py](./generate.py), we show a basic use case for a GLM-4 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations.
+
 ```
 python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
 ```
@@ -95,36 +97,9 @@ What is AI?
 Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. The term "art
 ```
 
-## Example 2: Stream Chat using `stream_chat()` API
+### Example 2: Stream Chat using `stream_chat()` API
 In the example [streamchat.py](./streamchat.py), we show a basic use case for a GLM-4 model to stream chat, with IPEX-LLM INT4 optimizations.
-### 1. Install
-We suggest using conda to manage environment:
 
-On Linux:
-
-```bash
-conda create -n llm python=3.11 # recommend to use Python 3.11
-conda activate llm
-
-# install the latest ipex-llm nightly build with 'all' option
-pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
-
-# install tiktoken required for GLM-4
-pip install tiktoken
-```
-
-On Windows:
-
-```cmd
-conda create -n llm python=3.11
-conda activate llm
-
-pip install --pre --upgrade ipex-llm[all]
-
-pip install tiktoken
-```
-
-### 2. Run
 **Stream Chat using `stream_chat()` API**:
 ```
 python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION
````
````diff
@@ -21,7 +21,7 @@ conda activate llm
 pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
 
 # install tiktoken required for GLM-4
-pip install tiktoken
+pip install "tiktoken>=0.7.0"
 ```
 
 On Windows:
@@ -32,7 +32,7 @@ conda activate llm
 
 pip install --pre --upgrade ipex-llm[all]
 
-pip install tiktoken
+pip install "tiktoken>=0.7.0"
 ```
 
 ### 2. Run
````
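The hunks above (and the matching GPU ones later in this commit) tighten the GLM-4 dependency from any `tiktoken` to `tiktoken>=0.7.0`. A quick, generic way to confirm an existing environment meets the new floor (my own check, not part of the examples):

```python
# Generic sanity check that the pinned tiktoken requirement is satisfied.
from importlib.metadata import version

installed = version("tiktoken")
assert tuple(map(int, installed.split(".")[:3])) >= (0, 7, 0), \
    f"GLM-4 examples expect tiktoken>=0.7.0, found {installed}"
print(f"tiktoken {installed} OK")
```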
````diff
@@ -5,11 +5,8 @@ In this directory, you will find examples on how you could apply IPEX-LLM INT4 o
 ## 0. Requirements
 To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information.
 
-## Example 1: Predict Tokens using `generate()` API
-In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM2 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel GPUs.
-### 1. Install
-#### 1.1 Installation on Linux
+## 1. Install
+### 1.1 Installation on Linux
 We suggest using conda to manage environment:
 ```bash
 conda create -n llm python=3.11
@@ -18,7 +15,7 @@ conda activate llm
 pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
 ```
 
-#### 1.2 Installation on Windows
+### 1.2 Installation on Windows
 We suggest using conda to manage environment:
 ```bash
 conda create -n llm python=3.11 libuv
@@ -28,7 +25,7 @@ conda activate llm
 pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
 ```
 
-### 2. Configures OneAPI environment variables for Linux
+## 2. Configures OneAPI environment variables for Linux
 
 > [!NOTE]
 > Skip this step if you are running on Windows.
@@ -39,9 +36,9 @@ This is a required step on Linux for APT or offline installed oneAPI. Skip this
 source /opt/intel/oneapi/setvars.sh
 ```
 
-### 3. Runtime Configurations
+## 3. Runtime Configurations
 For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
-#### 3.1 Configurations for Linux
+### 3.1 Configurations for Linux
 <details>
 
 <summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
@@ -78,7 +75,7 @@ export BIGDL_LLM_XMX_DISABLED=1
 
 </details>
 
-#### 3.2 Configurations for Windows
+### 3.2 Configurations for Windows
 <details>
 
 <summary>For Intel iGPU</summary>
@@ -103,7 +100,11 @@ set SYCL_CACHE_PERSISTENT=1
 > [!NOTE]
 > For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
 
-### 4. Running examples
+## 4. Running examples
+
+### Example 1: Predict Tokens using `generate()` API
+In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM2 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel GPUs.
+
 ```
 python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
 ```
@@ -139,103 +140,9 @@ Inference time: xxxx s
 答: Artificial Intelligence (AI) refers to the ability of a computer or machine to perform tasks that typically require human-like intelligence, such as understanding language, recognizing patterns
 ```
 
-## Example 2: Stream Chat using `stream_chat()` API
+### Example 2: Stream Chat using `stream_chat()` API
 In the example [streamchat.py](./streamchat.py), we show a basic use case for a ChatGLM2 model to stream chat, with IPEX-LLM INT4 optimizations.
-### 1. Install
-#### 1.1 Installation on Linux
-We suggest using conda to manage environment:
-```bash
-conda create -n llm python=3.11
-conda activate llm
-# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
-pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-```
-#### 1.2 Installation on Windows
-We suggest using conda to manage environment:
-```bash
-conda create -n llm python=3.11 libuv
-conda activate llm
-# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
-pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-```
 
-
-### 2. Configures OneAPI environment variables for Linux
-
-> [!NOTE]
-> Skip this step if you are running on Windows.
-
-This is a required step on Linux for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI.
-
-```bash
-source /opt/intel/oneapi/setvars.sh
-```
-
-### 3. Runtime Configurations
-For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
-#### 3.1 Configurations for Linux
-<details>
-
-<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
-
-```bash
-export USE_XETLA=OFF
-export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
-export SYCL_CACHE_PERSISTENT=1
-```
-
-</details>
-
-<details>
-
-<summary>For Intel Data Center GPU Max Series</summary>
-
-```bash
-export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
-export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
-export SYCL_CACHE_PERSISTENT=1
-export ENABLE_SDP_FUSION=1
-```
-> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
-</details>
-
-<details>
-
-<summary>For Intel iGPU</summary>
-
-```bash
-export SYCL_CACHE_PERSISTENT=1
-export BIGDL_LLM_XMX_DISABLED=1
-```
-
-</details>
-
-#### 3.2 Configurations for Windows
-<details>
-
-<summary>For Intel iGPU</summary>
-
-```cmd
-set SYCL_CACHE_PERSISTENT=1
-set BIGDL_LLM_XMX_DISABLED=1
-```
-
-</details>
-
-<details>
-
-<summary>For Intel Arc™ A-Series Graphics</summary>
-
-```cmd
-set SYCL_CACHE_PERSISTENT=1
-```
-
-</details>
-
-> [!NOTE]
-> For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
-
-### 4. Running examples
 **Stream Chat using `stream_chat()` API**:
 ```
 python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION
````
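On the GPU side, the renumbered "Running examples" section drives the same `generate()` flow on an Intel GPU. A rough sketch of the device handoff, assuming the standard IPEX-LLM XPU usage (`.to("xpu")`) rather than the exact contents of the GPU `generate.py`:

```python
# Sketch: the GPU variants move the INT4 model and inputs to "xpu"
# (Intel GPU) before generation; assumes ipex-llm[xpu] installed as above.
from ipex_llm.transformers import AutoModel
from transformers import AutoTokenizer

model_path = "THUDM/chatglm2-6b"
model = AutoModel.from_pretrained(model_path, load_in_4bit=True,
                                  trust_remote_code=True).to("xpu")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

inputs = tokenizer("What is AI?", return_tensors="pt").to("xpu")
output = model.generate(inputs.input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```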
````diff
@@ -5,10 +5,8 @@ In this directory, you will find examples on how you could apply IPEX-LLM INT4 o
 ## 0. Requirements
 To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information.
 
-## Example 1: Predict Tokens using `generate()` API
-In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM3 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel GPUs.
-### 1. Install
-#### 1.1 Installation on Linux
+## 1. Install
+### 1.1 Installation on Linux
 We suggest using conda to manage environment:
 ```bash
 conda create -n llm python=3.11
@@ -17,7 +15,7 @@ conda activate llm
 pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
 ```
 
-#### 1.2 Installation on Windows
+### 1.2 Installation on Windows
 We suggest using conda to manage environment:
 ```bash
 conda create -n llm python=3.11 libuv
@@ -27,7 +25,7 @@ conda activate llm
 pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
 ```
 
-### 2. Configures OneAPI environment variables for Linux
+## 2. Configures OneAPI environment variables for Linux
 
 > [!NOTE]
 > Skip this step if you are running on Windows.
@@ -38,9 +36,9 @@ This is a required step on Linux for APT or offline installed oneAPI. Skip this
 source /opt/intel/oneapi/setvars.sh
 ```
 
-### 3. Runtime Configurations
+## 3. Runtime Configurations
 For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
-#### 3.1 Configurations for Linux
+### 3.1 Configurations for Linux
 <details>
 
 <summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
@@ -77,7 +75,7 @@ export BIGDL_LLM_XMX_DISABLED=1
 
 </details>
 
-#### 3.2 Configurations for Windows
+### 3.2 Configurations for Windows
 <details>
 
 <summary>For Intel iGPU</summary>
@@ -101,7 +99,10 @@ set SYCL_CACHE_PERSISTENT=1
 
 > [!NOTE]
 > For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
-### 4. Running examples
+## 4. Running examples
+
+### Example 1: Predict Tokens using `generate()` API
+In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM3 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel GPUs.
+
 ```
 python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
@@ -139,103 +140,8 @@ What is AI?
 AI stands for Artificial Intelligence. It refers to the development of computer systems or machines that can perform tasks that would normally require human intelligence, such as recognizing patterns
 ```
 
-## Example 2: Stream Chat using `stream_chat()` API
+### Example 2: Stream Chat using `stream_chat()` API
 In the example [streamchat.py](./streamchat.py), we show a basic use case for a ChatGLM3 model to stream chat, with IPEX-LLM INT4 optimizations.
-### 1. Install
-#### 1.1 Installation on Linux
-We suggest using conda to manage environment:
-```bash
-conda create -n llm python=3.11
-conda activate llm
-# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
-pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-```
-
-#### 1.2 Installation on Windows
-We suggest using conda to manage environment:
-```bash
-conda create -n llm python=3.11 libuv
-conda activate llm
-
-# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
-pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-```
-
-### 2. Configures OneAPI environment variables for Linux
-
-> [!NOTE]
-> Skip this step if you are running on Windows.
-
-This is a required step on Linux for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI.
-
-```bash
-source /opt/intel/oneapi/setvars.sh
-```
-
-### 3. Runtime Configurations
-For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
-#### 3.1 Configurations for Linux
-<details>
-
-<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
-
-```bash
-export USE_XETLA=OFF
-export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
-export SYCL_CACHE_PERSISTENT=1
-```
-
-</details>
-
-<details>
-
-<summary>For Intel Data Center GPU Max Series</summary>
-
-```bash
-export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
-export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
-export SYCL_CACHE_PERSISTENT=1
-export ENABLE_SDP_FUSION=1
-```
-> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
-</details>
-
-<details>
-
-<summary>For Intel iGPU</summary>
-
-```bash
-export SYCL_CACHE_PERSISTENT=1
-export BIGDL_LLM_XMX_DISABLED=1
-```
-
-</details>
-
-#### 3.2 Configurations for Windows
-<details>
-
-<summary>For Intel iGPU</summary>
-
-```cmd
-set SYCL_CACHE_PERSISTENT=1
-set BIGDL_LLM_XMX_DISABLED=1
-```
-
-</details>
-
-<details>
-
-<summary>For Intel Arc™ A-Series Graphics</summary>
-
-```cmd
-set SYCL_CACHE_PERSISTENT=1
-```
-
-</details>
-
-> [!NOTE]
-> For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
-### 4. Running examples
 
 **Stream Chat using `stream_chat()` API**:
 ```
````
````diff
@@ -4,10 +4,8 @@ In this directory, you will find examples on how you could apply IPEX-LLM INT4 o
 ## 0. Requirements
 To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information.
 
-## Example 1: Predict Tokens using `generate()` API
-In the example [generate.py](./generate.py), we show a basic use case for a GLM-4 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel GPUs.
-### 1. Install
-#### 1.1 Installation on Linux
+## 1. Install
+### 1.1 Installation on Linux
 We suggest using conda to manage environment:
 ```bash
 conda create -n llm python=3.11
@@ -16,10 +14,10 @@ conda activate llm
 pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
 
 # install tiktoken required for GLM-4
-pip install tiktoken
+pip install "tiktoken>=0.7.0"
 ```
 
-#### 1.2 Installation on Windows
+### 1.2 Installation on Windows
 We suggest using conda to manage environment:
 ```bash
 conda create -n llm python=3.11 libuv
@@ -29,10 +27,10 @@ conda activate llm
 pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
 
 # install tiktoken required for GLM-4
-pip install tiktoken
+pip install "tiktoken>=0.7.0"
 ```
 
-### 2. Configures OneAPI environment variables for Linux
+## 2. Configures OneAPI environment variables for Linux
 
 > [!NOTE]
 > Skip this step if you are running on Windows.
@@ -43,9 +41,9 @@ This is a required step on Linux for APT or offline installed oneAPI. Skip this
 source /opt/intel/oneapi/setvars.sh
 ```
 
-### 3. Runtime Configurations
+## 3. Runtime Configurations
 For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
-#### 3.1 Configurations for Linux
+### 3.1 Configurations for Linux
 <details>
 
 <summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
@@ -82,7 +80,7 @@ export BIGDL_LLM_XMX_DISABLED=1
 
 </details>
 
-#### 3.2 Configurations for Windows
+### 3.2 Configurations for Windows
 <details>
 
 <summary>For Intel iGPU</summary>
@@ -106,7 +104,10 @@ set SYCL_CACHE_PERSISTENT=1
 
 > [!NOTE]
 > For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
-### 4. Running examples
+## 4. Running examples
+
+### Example 1: Predict Tokens using `generate()` API
+In the example [generate.py](./generate.py), we show a basic use case for a GLM-4 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel GPUs.
+
 ```
 python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
@@ -118,7 +119,7 @@ Arguments info:
 - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`.
 
 #### Sample Output
-##### [THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat)
+#### [THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat)
 ```log
 Inference time: xxxx s
 -------------------- Prompt --------------------
@@ -145,109 +146,8 @@ What is AI?
 Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. The term "art
 ```
 
-## Example 2: Stream Chat using `stream_chat()` API
+### Example 2: Stream Chat using `stream_chat()` API
 In the example [streamchat.py](./streamchat.py), we show a basic use case for a GLM-4 model to stream chat, with IPEX-LLM INT4 optimizations.
-### 1. Install
-#### 1.1 Installation on Linux
-We suggest using conda to manage environment:
-```bash
-conda create -n llm python=3.11
-conda activate llm
-# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
-pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-
-# install tiktoken required for GLM-4
-pip install tiktoken
-```
-
-#### 1.2 Installation on Windows
-We suggest using conda to manage environment:
-```bash
-conda create -n llm python=3.11 libuv
-conda activate llm
-
-# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
-pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-
-# install tiktoken required for GLM-4
-pip install tiktoken
-```
-
-### 2. Configures OneAPI environment variables for Linux
-
-> [!NOTE]
-> Skip this step if you are running on Windows.
-
-This is a required step on Linux for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI.
-
-```bash
-source /opt/intel/oneapi/setvars.sh
-```
-
-### 3. Runtime Configurations
-For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
-#### 3.1 Configurations for Linux
-<details>
-
-<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
-
-```bash
-export USE_XETLA=OFF
-export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
-export SYCL_CACHE_PERSISTENT=1
-```
-
-</details>
-
-<details>
-
-<summary>For Intel Data Center GPU Max Series</summary>
-
-```bash
-export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
-export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
-export SYCL_CACHE_PERSISTENT=1
-export ENABLE_SDP_FUSION=1
-```
-> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
-</details>
-
-<details>
-
-<summary>For Intel iGPU</summary>
-
-```bash
-export SYCL_CACHE_PERSISTENT=1
-export BIGDL_LLM_XMX_DISABLED=1
-```
-
-</details>
-
-#### 3.2 Configurations for Windows
-<details>
-
-<summary>For Intel iGPU</summary>
-
-```cmd
-set SYCL_CACHE_PERSISTENT=1
-set BIGDL_LLM_XMX_DISABLED=1
-```
-
-</details>
-
-<details>
-
-<summary>For Intel Arc™ A-Series Graphics</summary>
-
-```cmd
-set SYCL_CACHE_PERSISTENT=1
-```
-
-</details>
-
-> [!NOTE]
-> For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
-### 4. Running examples
 
 **Stream Chat using `stream_chat()` API**:
 ```
````
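This file combines the GPU restructure with the same `tiktoken>=0.7.0` pin. For GLM-4 it is the checkpoint's bundled tokenizer code that imports tiktoken, so an unsatisfied pin should surface as soon as the tokenizer is constructed; a small illustration (assuming the `THUDM/glm-4-9b-chat` checkpoint from the sample output is reachable):

```python
# Illustration: constructing the GLM-4 tokenizer exercises the tiktoken
# dependency pinned above; an incompatible version fails here, before
# any generation happens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat",
                                          trust_remote_code=True)
print(type(tokenizer).__name__)  # tokenizer class provided by the checkpoint
```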
| 
						 | 
					@ -4,10 +4,8 @@ In this directory, you will find examples on how you could use IPEX-LLM `optimiz
 | 
				
			||||||

## Requirements
To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine; please refer to [here](../../../README.md#requirements) for more information.

## 1. Install
### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11
conda activate llm

# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```

### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11 libuv
conda activate llm

# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
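Before moving on, it can be worth confirming that the expected wheels actually landed in the environment. This is a minimal sanity check, not part of the examples; run it inside the `llm` env, and note that the `2.1.10+xpu` expectation comes from the comment in the install command above:

```python
# Post-install sanity check: confirm the installed distributions and versions.
from importlib.metadata import version

print(version("ipex-llm"))
print(version("intel-extension-for-pytorch"))  # expected to report 2.1.10+xpu
```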

## 2. Configure OneAPI environment variables for Linux

> [!NOTE]
> Skip this step if you are running on Windows.

This is a required step on Linux for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI.

```bash
source /opt/intel/oneapi/setvars.sh
```
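After sourcing `setvars.sh`, a quick way to confirm the runtime can actually see the GPU is to query the XPU backend from Python. A sketch for sanity checking only, assuming the environment from step 1 is active:

```python
# Check that the XPU build of PyTorch detects an Intel GPU.
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401 -- registers the 'xpu' device

print(torch.xpu.is_available())          # True once the driver and oneAPI runtime are set up
if torch.xpu.is_available():
    print(torch.xpu.get_device_name(0))  # e.g. an Arc A-Series or Data Center GPU
```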

## 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.

### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export ENABLE_SDP_FUSION=1
```

> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.

</details>

<details>

<summary>For Intel iGPU</summary>

```bash
export SYCL_CACHE_PERSISTENT=1
export BIGDL_LLM_XMX_DISABLED=1
```

</details>

### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A-Series Graphics</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

> [!NOTE]
> For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
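If you would rather keep these settings with a script than with the shell, the same variables can also be exported from Python before `torch` is imported, so the SYCL runtime picks them up. A sketch using the iGPU values from the lists above (pick the set that matches your device):

```python
# Set runtime variables in-process, before any torch/IPEX import.
import os

os.environ["SYCL_CACHE_PERSISTENT"] = "1"
os.environ["BIGDL_LLM_XMX_DISABLED"] = "1"  # iGPU only -- omit on Arc/Flex/Max

import torch  # noqa: E402 -- imported only after the environment is prepared
```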

## 4. Running examples

### Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM2 model to predict the next N tokens using the `generate()` API, with IPEX-LLM INT4 optimizations on Intel GPUs.

```bash
python ./generate.py --prompt 'AI是什么?'
```

In the example, several arguments can be passed to satisfy your requirements:
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'AI是什么?'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
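For orientation, the flow inside [generate.py](./generate.py) follows the usual IPEX-LLM PyTorch-API pattern: load the model, optimize it to INT4 with `optimize_model`, move it to `xpu`, then call `generate()`. The following is a condensed sketch rather than the script itself (argument parsing and the chat prompt template are omitted):

```python
import torch
from transformers import AutoModel, AutoTokenizer
from ipex_llm import optimize_model

model_path = "THUDM/chatglm2-6b"  # or a local checkpoint path
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
model = optimize_model(model)  # apply IPEX-LLM INT4 optimizations
model = model.to("xpu")

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
input_ids = tokenizer.encode("AI是什么?", return_tensors="pt").to("xpu")

with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```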

#### Sample Output
##### [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b)
```log
Inference time: xxxx s
-------------------- Output --------------------
答: Artificial Intelligence (AI) refers to the ability of a computer or machine to perform tasks that typically require human-like intelligence, such as understanding language, recognizing patterns
```

### Example 2: Stream Chat using `stream_chat()` API
In the example [streamchat.py](./streamchat.py), we show a basic use case for a ChatGLM2 model to stream chat, with IPEX-LLM INT4 optimizations.
**Stream Chat using `stream_chat()` API**:
```
python ./streamchat.py
```
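The streaming itself is driven by the `stream_chat()` generator that ChatGLM2's remote code attaches to the model (it is not an IPEX-LLM API). A minimal sketch of what [streamchat.py](./streamchat.py) does, assuming the same INT4-optimized setup as Example 1:

```python
from transformers import AutoModel, AutoTokenizer
from ipex_llm import optimize_model

model_path = "THUDM/chatglm2-6b"
model = optimize_model(AutoModel.from_pretrained(model_path, trust_remote_code=True)).to("xpu")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# stream_chat() yields (partial_response, history) pairs as tokens arrive
for response, history in model.stream_chat(tokenizer, "AI是什么?", history=[]):
    print(response)
```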

@ -4,10 +4,8 @@ In this directory, you will find examples on how you could use IPEX-LLM `optimiz

## Requirements
To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine; please refer to [here](../../../README.md#requirements) for more information.

## 1. Install
### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11
conda activate llm

# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```

### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11 libuv
conda activate llm

# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```

## 2. Configure OneAPI environment variables for Linux

> [!NOTE]
> Skip this step if you are running on Windows.

This is a required step on Linux for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI.

```bash
source /opt/intel/oneapi/setvars.sh
```

## 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.

### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export ENABLE_SDP_FUSION=1
```

> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.

</details>

<details>

<summary>For Intel iGPU</summary>

```bash
export SYCL_CACHE_PERSISTENT=1
export BIGDL_LLM_XMX_DISABLED=1
```

</details>

### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A-Series Graphics</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

> [!NOTE]
> For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.

## 4. Running examples

### Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM3 model to predict the next N tokens using the `generate()` API, with IPEX-LLM INT4 optimizations on Intel GPUs.

```bash
python ./generate.py --prompt 'AI是什么?'
```

In the example, several arguments can be passed to satisfy your requirements:
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'AI是什么?'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
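The "integrated prompt format" referenced by `--prompt` means the script wraps the raw prompt in ChatGLM3's chat template before tokenization. A sketch of that wrapping, assuming the role tags from the upstream `chatglm3-6b` repository (the constant name here is illustrative; see [generate.py](./generate.py) for the exact template):

```python
# Illustrative constant -- the example script is the source of truth for the template.
CHATGLM_V3_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"

prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt="AI是什么?")
print(prompt)
```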

#### Sample Output
##### [THUDM/chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b)
```log
Inference time: xxxx s
-------------------- Output --------------------
What is AI?
AI stands for Artificial Intelligence. It refers to the development of computer systems or machines that can perform tasks that would normally require human intelligence, such as recognizing patterns
```

### Example 2: Stream Chat using `stream_chat()` API
In the example [streamchat.py](./streamchat.py), we show a basic use case for a ChatGLM3 model to stream chat, with IPEX-LLM INT4 optimizations.
**Stream Chat using `stream_chat()` API**:
```
python ./streamchat.py
```

@ -16,7 +16,7 @@ conda activate llm

```bash
conda activate llm

pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# install tiktoken required for GLM-4
pip install "tiktoken>=0.7.0"
```

#### 1.2 Installation on Windows

```bash
conda activate llm

pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# install tiktoken required for GLM-4
pip install "tiktoken>=0.7.0"
```
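The GLM-4 tokenizer depends on `tiktoken`, which is why the pinned install above matters. A quick check that the installed version satisfies the `>=0.7.0` bound:

```python
# Confirm the installed tiktoken meets the GLM-4 requirement pinned above.
from importlib.metadata import version

print(version("tiktoken"))  # should print 0.7.0 or newer
```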

### 2. Configure OneAPI environment variables for Linux