Update NPU example readme (#11931)
This commit is contained in:
parent
6c3eb1e1e8
commit
14dddfc0d6
1 changed file with 16 additions and 51 deletions
@@ -9,7 +9,7 @@ In this directory, you will find examples on how you could apply IPEX-LLM INT4 o
 | Llama3 | [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) |
 | Chatglm3 | [THUDM/chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b) |
 | Chatglm2 | [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b) |
-| Qwen2 | [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) |
+| Qwen2 | [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct), [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) |
 | MiniCPM | [openbmb/MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) |
 | Phi-3 | [microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) |
 | Stablelm | [stabilityai/stablelm-zephyr-3b](https://huggingface.co/stabilityai/stablelm-zephyr-3b) |

@@ -23,10 +23,8 @@ Go to https://www.intel.com/content/www/us/en/download/794734/intel-npu-driver-w
 Then go to **Device Manager**, find **Neural Processors** -> **Intel(R) AI Boost**.
 Right-click and select **Update Driver**, and then manually select the folder unzipped from the driver.

-## Example 1: Predict Tokens using `generate()` API
-In the example [generate.py](./generate.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel NPUs.
-### 1. Install
-#### 1.1 Installation on Windows
+## 1. Install
+### 1.1 Installation on Windows
 We suggest using conda to manage the environment:
 ```bash
 conda create -n llm python=3.10
@@ -36,9 +34,9 @@ conda activate llm
 pip install --pre --upgrade ipex-llm[npu]
 ```

-### 2. Runtime Configurations
+## 2. Runtime Configurations
 For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
-#### 2.1 Configurations for Windows
+### 2.1 Configurations for Windows

 > [!NOTE]
 > For optimal performance, we recommend running code in `conhost` rather than Windows Terminal:

@@ -54,19 +52,20 @@ For optimal performance, it is recommended to set several environment variables.
 set BIGDL_USE_NPU=1
 ```
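Putting the Windows runtime configuration together, a `conhost` session for this section might look like the sketch below. It is illustrative only; it assumes the `llm` conda environment created in the install step, and the conda location placeholder should be replaced with your own path.

```cmd
REM launch conhost via Win+R -> conhost, then activate conda
call <your conda install location>\Scripts\activate
conda activate llm

REM required environment variable for IPEX-LLM on Intel NPU
set BIGDL_USE_NPU=1
```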

-### 3. Running examples
+## 3. Run models
+In the example [generate.py](./generate.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel NPUs.

 ```
 python ./generate.py
 ```

 Arguments info:
-- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Llama2 model (e.g. `meta-llama/Llama-2-7b-chat-hf` and `meta-llama/Llama-2-13b-chat-hf`) to be downloaded, or the path to the huggingface checkpoint folder. It defaults to `'meta-llama/Llama-2-7b-chat-hf'`; for more verified models, please see the list in [Verified Models](#verified-models).
+- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Llama2 model (e.g. `meta-llama/Llama-2-7b-chat-hf`) to be downloaded, or the path to the huggingface checkpoint folder. It defaults to `'meta-llama/Llama-2-7b-chat-hf'`; for more verified models, please see the list in [Verified Models](#verified-models).
 - `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun'`.
 - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
 - `--load_in_low_bit`: argument defining the `load_in_low_bit` format used. It defaults to `sym_int8`; `sym_int4` can also be used.
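As an illustration of combining these arguments (the prompt text and token count below are arbitrary example values, not defaults from the script):

```cmd
REM example values only; all flags are described in the argument list above
python ./generate.py --repo-id-or-model-path meta-llama/Llama-2-7b-chat-hf --prompt "What is AI?" --n-predict 64 --load_in_low_bit sym_int4
```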

-#### Sample Output
+### Sample Output
 #### [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)

 ```log

@@ -77,44 +76,10 @@ Inference time: xxxx s
 done
 ```

-## Example 2: Predict Tokens using `generate()` API using multi processes
-In the example [llama2.py](./llama2.py) and [qwen2.py](./qwen2.py), we show an experimental support for a Llama2 / Qwen2 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimization and fused decoderlayer optimization on Intel NPUs.
+## 4. Run Optimized Models (Experimental)
+The example below shows how to run the **_optimized model implementations_** on Intel NPU, including
+- [Llama2-7B](./llama2.py)
+- [Qwen2-1.5B](./qwen2.py)
-
-> [!IMPORTANT]
-> To run Qwen2 and Llama2 with IPEX-LLM on Intel NPUs, we recommend using version **32.0.100.2540** for the Intel NPU.
->
-> Go to https://www.intel.com/content/www/us/en/download/794734/825735/intel-npu-driver-windows.html to download and unzip the driver. Then follow the same steps on [Requirements](#0-requirements).
-
-### 1. Install
-#### 1.1 Installation on Windows
-We suggest using conda to manage environment:
-```bash
-conda create -n llm python=3.10
-conda activate llm
-
-# install ipex-llm with 'npu' option
-pip install --pre --upgrade ipex-llm[npu]
-```
-
-### 2. Runtime Configurations
-For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
-#### 2.1 Configurations for Windows
-
-> [!NOTE]
-> For optimal performance, we recommend running code in `conhost` rather than Windows Terminal:
-> - Press <kbd>Win</kbd>+<kbd>R</kbd> and input `conhost`, then press Enter to launch `conhost`.
-> - Run following command to use conda in `conhost`. Replace `<your conda install location>` with your conda install location.
-> ```
-> call <your conda install location>\Scripts\activate
-> ```
-
-**Following envrionment variables are required**:
-
-```cmd
-set BIGDL_USE_NPU=1
-```
-
-### 3. Running examples

 ```
 # to run Llama-2-7b-chat-hf

@@ -132,7 +97,7 @@ Arguments info:
 - `--max-prompt-len MAX_PROMPT_LEN`: Defines the maximum number of tokens that the input prompt can contain. It defaults to `512`.
 - `--disable-transpose-value-cache`: Disable the optimization of transposing value cache.
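For illustration only, a run that adjusts these options might look like the following; the script names come from the links above, while the flag values here are hypothetical examples:

```cmd
REM hypothetical example values; see the argument descriptions above
python llama2.py --max-prompt-len 1024
python qwen2.py --max-prompt-len 1024 --disable-transpose-value-cache
```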

-### 4. Troubleshooting
+### Troubleshooting

 If you encounter an output problem, please try to disable the optimization of transposing the value cache with the following command:
 ```bash

@@ -144,7 +109,7 @@ python qwen2.py --disable-transpose-value-cache
 ```


-#### Sample Output
+### Sample Output
 #### [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)

 ```log
