Add Qwen2.5 NPU Example (#12110)

* Add Qwen2.5 NPU Example

* fix

* Merge qwen2.py and qwen2.5.py into qwen.py

* Fix description
This commit is contained in:
Jin, Qiao 2024-09-25 15:20:03 +08:00 committed by GitHub
parent 657889e3e4
commit 2bedb17be7
2 changed files with 24 additions and 13 deletions
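
Merging `qwen2.py` and `qwen2.5.py` into a single `qwen.py` works because Qwen2 and Qwen2.5 checkpoints share the same underlying architecture, so one loading path can serve any of the repo ids listed in the README below. A minimal sketch of that idea, assuming the `ipex_llm.transformers.npu_model` interface used by this repo's other NPU examples (exact keyword arguments may differ across ipex-llm versions):

```python
# Minimal sketch: one loader covers Qwen2 and Qwen2.5 (shared architecture).
# The npu_model import follows ipex-llm's NPU examples; the kwargs shown
# are assumptions and may differ across ipex-llm versions.
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers.npu_model import AutoModelForCausalLM

def load_qwen(repo_id: str = "Qwen/Qwen2-1.5B-Instruct"):
    # Any Qwen2 or Qwen2.5 instruct checkpoint can be passed here,
    # e.g. "Qwen/Qwen2-7B-Instruct" or "Qwen/Qwen2.5-7B-Instruct".
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        torch_dtype=torch.float16,
        load_in_low_bit="sym_int4",   # low-bit weights for the NPU
        optimize_model=True,
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
    return model, tokenizer
```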

README.md

@@ -10,6 +10,7 @@ In this directory, you will find examples on how to directly run HuggingFace `tr
| Chatglm3 | [THUDM/chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b) |
| Chatglm2 | [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b) |
| Qwen2 | [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct), [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) |
+| Qwen2.5 | [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) |
| MiniCPM | [openbmb/MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) |
| Phi-3 | [microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) |
| Stablelm | [stabilityai/stablelm-zephyr-3b](https://huggingface.co/stabilityai/stablelm-zephyr-3b) |
@@ -81,8 +82,9 @@ done
The examples below show how to run the **_optimized HuggingFace model implementations_** on Intel NPU, including
- [Llama2-7B](./llama.py)
- [Llama3-8B](./llama.py)
-- [Qwen2-1.5B](./qwen2.py)
-- [Qwen2-7B](./qwen2.py)
+- [Qwen2-1.5B](./qwen.py)
+- [Qwen2-7B](./qwen.py)
+- [Qwen2.5-7B](./qwen.py)
- [MiniCPM-1B](./minicpm.py)
- [MiniCPM-2B](./minicpm.py)
- [Baichuan2-7B](./baichuan2.py)
@@ -95,7 +97,7 @@ Supported models: Llama2-7B, Llama3-8B, Qwen2-1.5B, Qwen2-7B, MiniCPM-1B, MiniCP
#### 32.0.100.2625
Supported models: Llama2-7B, MiniCPM-1B, Baichuan2-7B
#### 32.0.101.2715
-Supported models: Llama3-8B, MiniCPM-2B, Qwen2-7B, Qwen2-1.5B
+Supported models: Llama3-8B, MiniCPM-2B, Qwen2-7B, Qwen2-1.5B, Qwen2.5-7B
### Run
```cmd
@@ -105,11 +107,14 @@ python llama.py
:: to run Meta-Llama-3-8B-Instruct (LNL driver version: 32.0.101.2715)
python llama.py --repo-id-or-model-path meta-llama/Meta-Llama-3-8B-Instruct
-:: to run Qwen2-1.5B-Instruct LNL driver version: 32.0.101.2715)
-python qwen2.py
+:: to run Qwen2-1.5B-Instruct (LNL driver version: 32.0.101.2715)
+python qwen.py
-:: to run Qwen2-7B-Instruct LNL driver version: 32.0.101.2715)
-python qwen2.py --repo-id-or-model-path Qwen/Qwen2-7B-Instruct
+:: to run Qwen2-7B-Instruct (LNL driver version: 32.0.101.2715)
+python qwen.py --repo-id-or-model-path Qwen/Qwen2-7B-Instruct
+:: to run Qwen2.5-7B-Instruct (LNL driver version: 32.0.101.2715)
+python qwen.py --repo-id-or-model-path Qwen/Qwen2.5-7B-Instruct
:: to run MiniCPM-1B-sft-bf16
python minicpm.py
@@ -133,7 +138,7 @@ Arguments info:
### Troubleshooting
#### `TypeError: can't convert meta device type tensor to numpy.` Error
-If you encounter `TypeError: can't convert meta device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.` error when loading lowbit model, please try re-saving the lowbit model with the example script you are currently using. Please note that lowbit models saved by `qwen2.py`, `llama.py`, etc. cannot be loaded by `generate.py`.
+If you encounter `TypeError: can't convert meta device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.` error when loading lowbit model, please try re-saving the lowbit model with the example script you are currently using. Please note that lowbit models saved by `qwen.py`, `llama.py`, etc. cannot be loaded by `generate.py`.
#### Output Problem
If you encounter output problem, please try to disable the optimization of transposing value cache with following command:
@@ -145,10 +150,13 @@ python llama.py --disable-transpose-value-cache
python llama.py --repo-id-or-model-path meta-llama/Meta-Llama-3-8B-Instruct --disable-transpose-value-cache
:: to run Qwen2-1.5B-Instruct (LNL driver version: 32.0.101.2715)
-python qwen2.py --disable-transpose-value-cache
+python qwen.py --disable-transpose-value-cache
:: to run Qwen2-7B-Instruct (LNL driver version: 32.0.101.2715)
-python qwen2.py --repo-id-or-model-path Qwen/Qwen2-7B-Instruct --disable-transpose-value-cache
+python qwen.py --repo-id-or-model-path Qwen/Qwen2-7B-Instruct --disable-transpose-value-cache
+:: to run Qwen2.5-7B-Instruct (LNL driver version: 32.0.101.2715)
+python qwen.py --repo-id-or-model-path Qwen/Qwen2.5-7B-Instruct --disable-transpose-value-cache
:: to run MiniCPM-1B-sft-bf16
python minicpm.py --disable-transpose-value-cache
@@ -160,10 +168,13 @@ python minicpm.py --repo-id-or-model-path openbmb/MiniCPM-2B-sft-bf16 --disable-
python baichuan2.py --disable-transpose-value-cache
```
-For [Qwen2-7B](./qwen2.py), you could also try to enable mixed precision optimization when encountering output problems:
+For [Qwen2-7B](./qwen.py) and [Qwen2.5-7B](./qwen.py), you could also try to enable mixed precision optimization when encountering output problems:
```cmd
-python qwen2.py --repo-id-or-model-path Qwen/Qwen2-7B-Instruct --mixed-precision
+python qwen.py --repo-id-or-model-path Qwen/Qwen2-7B-Instruct --mixed-precision
```
+```cmd
+python qwen.py --repo-id-or-model-path Qwen/Qwen2.5-7B-Instruct --mixed-precision
+```
#### Better Performance with High CPU Utilization
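
The lowbit troubleshooting note in the hunk above boils down to a save/load round trip that must stay within the same script family. A hedged sketch of that round trip, assuming ipex-llm's documented `save_low_bit`/`load_low_bit` pattern (the save path and repo id below are illustrative, not from this commit):

```python
# Sketch of the lowbit round trip behind the TypeError note: re-save the
# model with the same example script's loader that you later load it with.
# save_low_bit/load_low_bit follow ipex-llm's low-bit pattern; details
# may vary by version.
from ipex_llm.transformers.npu_model import AutoModelForCausalLM

lowbit_path = "./qwen2-7b-npu-lowbit"  # hypothetical save location

# First run: convert from the HuggingFace checkpoint and save low-bit weights.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B-Instruct",
    load_in_low_bit="sym_int4",
    optimize_model=True,
    trust_remote_code=True,
)
model.save_low_bit(lowbit_path)

# Later runs: load the converted weights directly and skip conversion.
# Loading weights saved by a different script (e.g. generate.py) can hit
# the meta-tensor TypeError described above.
model = AutoModelForCausalLM.load_low_bit(lowbit_path, trust_remote_code=True)
```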

qwen.py

@@ -34,7 +34,7 @@ if __name__ == "__main__":
"--repo-id-or-model-path",
type=str,
default="Qwen/Qwen2-1.5B-Instruct",
help="The huggingface repo id for the Qwen2 model to be downloaded"
help="The huggingface repo id for the Qwen2 or Qwen2.5 model to be downloaded"
", or the path to the huggingface checkpoint folder",
)
parser.add_argument("--lowbit-path", type=str,