diff --git a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md
index bd8bc6e8..dadd8ce6 100644
--- a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md
+++ b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md
@@ -10,6 +10,7 @@ In this directory, you will find examples on how to directly run HuggingFace `tr
 | Chatglm3 | [THUDM/chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b) |
 | Chatglm2 | [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b) |
 | Qwen2 | [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct), [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) |
+| Qwen2.5 | [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) |
 | MiniCPM | [openbmb/MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) |
 | Phi-3 | [microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) |
 | Stablelm | [stabilityai/stablelm-zephyr-3b](https://huggingface.co/stabilityai/stablelm-zephyr-3b) |
@@ -81,8 +82,9 @@ done
 The examples below show how to run the **_optimized HuggingFace model implementations_** on Intel NPU, including
 - [Llama2-7B](./llama.py)
 - [Llama3-8B](./llama.py)
-- [Qwen2-1.5B](./qwen2.py)
-- [Qwen2-7B](./qwen2.py)
+- [Qwen2-1.5B](./qwen.py)
+- [Qwen2-7B](./qwen.py)
+- [Qwen2.5-7B](./qwen.py)
 - [MiniCPM-1B](./minicpm.py)
 - [MiniCPM-2B](./minicpm.py)
 - [Baichuan2-7B](./baichuan2.py)
@@ -95,7 +97,7 @@ Supported models: Llama2-7B, Llama3-8B, Qwen2-1.5B, Qwen2-7B, MiniCPM-1B, MiniCP
 #### 32.0.100.2625
 Supported models: Llama2-7B, MiniCPM-1B, Baichuan2-7B
 #### 32.0.101.2715
-Supported models: Llama3-8B, MiniCPM-2B, Qwen2-7B, Qwen2-1.5B
+Supported models: Llama3-8B, MiniCPM-2B, Qwen2-7B, Qwen2-1.5B, Qwen2.5-7B
 
 ### Run
 ```cmd
@@ -105,11 +107,14 @@ python llama.py
 :: to run Meta-Llama-3-8B-Instruct (LNL driver version: 32.0.101.2715)
 python llama.py --repo-id-or-model-path meta-llama/Meta-Llama-3-8B-Instruct
 
-:: to run Qwen2-1.5B-Instruct LNL driver version: 32.0.101.2715)
-python qwen2.py
+:: to run Qwen2-1.5B-Instruct (LNL driver version: 32.0.101.2715)
+python qwen.py
 
-:: to run Qwen2-7B-Instruct LNL driver version: 32.0.101.2715)
-python qwen2.py --repo-id-or-model-path Qwen/Qwen2-7B-Instruct
+:: to run Qwen2-7B-Instruct (LNL driver version: 32.0.101.2715)
+python qwen.py --repo-id-or-model-path Qwen/Qwen2-7B-Instruct
+
+:: to run Qwen2.5-7B-Instruct (LNL driver version: 32.0.101.2715)
+python qwen.py --repo-id-or-model-path Qwen/Qwen2.5-7B-Instruct
 
 :: to run MiniCPM-1B-sft-bf16
 python minicpm.py
@@ -133,7 +138,7 @@ Arguments info:
 
 ### Troubleshooting
 #### `TypeError: can't convert meta device type tensor to numpy.` Error
-If you encounter `TypeError: can't convert meta device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.` error when loading lowbit model, please try re-saving the lowbit model with the example script you are currently using. Please note that lowbit models saved by `qwen2.py`, `llama.py`, etc. cannot be loaded by `generate.py`.
+If you encounter `TypeError: can't convert meta device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.` error when loading lowbit model, please try re-saving the lowbit model with the example script you are currently using. Please note that lowbit models saved by `qwen.py`, `llama.py`, etc. cannot be loaded by `generate.py`.
 
 #### Output Problem
 If you encounter output problem, please try to disable the optimization of transposing value cache with following command:
@@ -145,10 +150,13 @@ python llama.py --disable-transpose-value-cache
 python llama.py --repo-id-or-model-path meta-llama/Meta-Llama-3-8B-Instruct --disable-transpose-value-cache
 
 :: to run Qwen2-1.5B-Instruct (LNL driver version: 32.0.101.2715)
-python qwen2.py --disable-transpose-value-cache
+python qwen.py --disable-transpose-value-cache
 
 :: to run Qwen2-7B-Instruct LNL driver version: 32.0.101.2715)
-python qwen2.py --repo-id-or-model-path Qwen/Qwen2-7B-Instruct --disable-transpose-value-cache
+python qwen.py --repo-id-or-model-path Qwen/Qwen2-7B-Instruct --disable-transpose-value-cache
+
+:: to run Qwen2.5-7B-Instruct (LNL driver version: 32.0.101.2715)
+python qwen.py --repo-id-or-model-path Qwen/Qwen2.5-7B-Instruct --disable-transpose-value-cache
 
 :: to run MiniCPM-1B-sft-bf16
 python minicpm.py --disable-transpose-value-cache
@@ -160,10 +168,13 @@ python minicpm.py --repo-id-or-model-path openbmb/MiniCPM-2B-sft-bf16 --disable-
 python baichuan2.py --disable-transpose-value-cache
 ```
 
-For [Qwen2-7B](./qwen2.py), you could also try to enable mixed precision optimization when encountering output problems:
+For [Qwen2-7B](./qwen.py) and [Qwen2.5-7B](./qwen.py), you could also try to enable mixed precision optimization when encountering output problems:
 
 ```cmd
-python qwen2.py --repo-id-or-model-path Qwen/Qwen2-7B-Instruct --mixed-precision
+python qwen.py --repo-id-or-model-path Qwen/Qwen2-7B-Instruct --mixed-precision
+```
+```cmd
+python qwen.py --repo-id-or-model-path Qwen/Qwen2.5-7B-Instruct --mixed-precision
 ```
 
 #### Better Performance with High CPU Utilization
diff --git a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/qwen2.py b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/qwen.py
similarity index 98%
rename from python/llm/example/NPU/HF-Transformers-AutoModels/LLM/qwen2.py
rename to python/llm/example/NPU/HF-Transformers-AutoModels/LLM/qwen.py
index 9eec34a0..835a72de 100644
--- a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/qwen2.py
+++ b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/qwen.py
@@ -34,7 +34,7 @@ if __name__ == "__main__":
         "--repo-id-or-model-path",
         type=str,
         default="Qwen/Qwen2-1.5B-Instruct",
-        help="The huggingface repo id for the Qwen2 model to be downloaded"
+        help="The huggingface repo id for the Qwen2 or Qwen2.5 model to be downloaded"
         ", or the path to the huggingface checkpoint folder",
     )
     parser.add_argument("--lowbit-path", type=str,
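
As a quick end-to-end check of the renamed script, a minimal sketch follows in the same cmd style as the README's examples. It assumes, based on the `--lowbit-path` argument visible in `qwen.py` above, that pointing the flag at a folder saves the converted low-bit model there on the first run and loads it from there on later runs; the folder name `.\qwen2.5-7b-lowbit` is purely illustrative, so verify the actual behavior against the argument's help text in the script.

```cmd
:: assumed behavior of --lowbit-path (see the argument in qwen.py above):
:: the first run converts Qwen2.5-7B-Instruct and saves the low-bit model to
:: the given folder; the second run reloads it instead of re-converting
python qwen.py --repo-id-or-model-path Qwen/Qwen2.5-7B-Instruct --lowbit-path .\qwen2.5-7b-lowbit
python qwen.py --repo-id-or-model-path Qwen/Qwen2.5-7B-Instruct --lowbit-path .\qwen2.5-7b-lowbit
```

If the reload step hits the meta-tensor `TypeError` covered in the troubleshooting section, re-save the low-bit model with `qwen.py` itself; as that section notes, low-bit models saved by one example script cannot be loaded by `generate.py`.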