# C++ Example of running LLM on Intel NPU using IPEX-LLM
In this directory, you will find a C++ example on how to run LLM models on Intel NPUs using IPEX-LLM (leveraging Intel NPU Acceleration Library). See the table below for verified models.
## Verified Models
| Model | Model Link |
|---|---|
| Llama2 | meta-llama/Llama-2-7b-chat-hf |
| Llama3 | meta-llama/Meta-Llama-3-8B-Instruct |
| Llama3.2 | meta-llama/Llama-3.2-1B-Instruct, meta-llama/Llama-3.2-3B-Instruct |
| Qwen2 | Qwen/Qwen2-1.5B-Instruct, Qwen/Qwen2-7B-Instruct |
| Qwen2.5 | Qwen/Qwen2.5-3B-Instruct, Qwen/Qwen2.5-7B-Instruct |
| MiniCPM | openbmb/MiniCPM-1B-sft-bf16, openbmb/MiniCPM-2B-sft-bf16 |
Please refer to Quickstart for details about verified platforms.
## 0. Prerequisites
For ipex-llm NPU support, please refer to Quickstart for details about the required preparations.
## 1. Install & Runtime Configurations
### 1.1 Installation on Windows
We suggest using conda to manage the environment:
```cmd
conda create -n llm python=3.11
conda activate llm

:: for building the example
pip install cmake

:: install ipex-llm with 'npu' option
pip install --pre --upgrade ipex-llm[npu]

:: [optional] for Llama-3.2-1B-Instruct & Llama-3.2-3B-Instruct
pip install transformers==4.45.0 accelerate==0.33.0
```
Please refer to Quickstart for more details about ipex-llm installation on Intel NPU.
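As a quick sanity check (a minimal sketch; it only verifies that the Python package resolves inside the activated conda environment, not that the NPU driver or runtime is correctly configured), you could try:

```cmd
:: optional: confirm that ipex-llm is importable in the 'llm' environment
python -c "import ipex_llm; print('ipex-llm imported successfully')"
```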
### 1.2 Runtime Configurations
Please refer to Quickstart for environment variables setting based on your device.
## 2. Convert Model
We provide a convert script under the current directory. By running it, you can obtain the whole weights and configuration files required to run the C++ example.
```cmd
:: to convert Llama-2-7b-chat-hf
python convert.py --repo-id-or-model-path meta-llama/Llama-2-7b-chat-hf --save-directory <converted_model_path>

:: to convert Meta-Llama-3-8B-Instruct
python convert.py --repo-id-or-model-path meta-llama/Meta-Llama-3-8B-Instruct --save-directory <converted_model_path>

:: to convert Llama-3.2-1B-Instruct
python convert.py --repo-id-or-model-path meta-llama/Llama-3.2-1B-Instruct --save-directory <converted_model_path>

:: to convert Llama-3.2-3B-Instruct
python convert.py --repo-id-or-model-path meta-llama/Llama-3.2-3B-Instruct --save-directory <converted_model_path>

:: to convert Qwen2-1.5B-Instruct
python convert.py --repo-id-or-model-path Qwen/Qwen2-1.5B-Instruct --save-directory <converted_model_path>

:: to convert Qwen2-7B-Instruct
python convert.py --repo-id-or-model-path Qwen/Qwen2-7B-Instruct --save-directory <converted_model_path>

:: to convert Qwen2.5-3B-Instruct
python convert.py --repo-id-or-model-path Qwen/Qwen2.5-3B-Instruct --save-directory <converted_model_path> --low-bit "asym_int4"

:: to convert Qwen2.5-7B-Instruct
python convert.py --repo-id-or-model-path Qwen/Qwen2.5-7B-Instruct --save-directory <converted_model_path>

:: to convert MiniCPM-1B-sft-bf16
python convert.py --repo-id-or-model-path openbmb/MiniCPM-1B-sft-bf16 --save-directory <converted_model_path>

:: to convert MiniCPM-2B-sft-bf16
python convert.py --repo-id-or-model-path openbmb/MiniCPM-2B-sft-bf16 --save-directory <converted_model_path>
```
Arguments info:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the Hugging Face repo id for the model (e.g. `meta-llama/Llama-2-7b-chat-hf` for Llama2-7B) to be downloaded, or the path to the Hugging Face checkpoint folder.
- `--save-directory SAVE_DIRECTORY`: argument defining the path to save the converted model. If it is a non-existing path, the original pretrained model specified by `REPO_ID_OR_MODEL_PATH` will be loaded, and the converted model will be saved into `SAVE_DIRECTORY`.
- `--max-context-len MAX_CONTEXT_LEN`: argument defining the maximum sequence length for both input and output tokens. The default is `1024`.
- `--max-prompt-len MAX_PROMPT_LEN`: argument defining the maximum number of tokens that the input prompt can contain. The default is `512`.
- `--low-bit LOW_BIT`: argument defining the low-bit optimizations that will be applied to the model. Currently available options are `"sym_int4"`, `"asym_int4"` and `"sym_int8"`, with `"sym_int4"` as the default.
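For example, the flags above can be combined in a single command; the illustration below simply spells out the documented default values explicitly for `Llama-2-7b-chat-hf`:

```cmd
:: illustrative conversion with all documented flags written out explicitly (values shown are the documented defaults)
python convert.py --repo-id-or-model-path meta-llama/Llama-2-7b-chat-hf --save-directory <converted_model_path> --max-context-len 1024 --max-prompt-len 512 --low-bit "sym_int4"
```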
## 3. Build C++ Example `llm-npu-cli`
You can run the cmake script below in cmd to build `llm-npu-cli`; don't forget to replace the conda environment directory with your own path.
```cmd
:: under current directory
:: please replace below conda env dir with your own path
set CONDA_ENV_DIR=C:\Users\arda\miniforge3\envs\llm\Lib\site-packages
mkdir build
cd build

cmake ..
cmake --build . --config Release -j
cd Release
```
## 4. Run `llm-npu-cli`
With the built `llm-npu-cli`, you can run the example with specified parameters. For example:
```cmd
:: Run simple text completion
llm-npu-cli.exe -m <converted_model_path> -n 64 "AI是什么?"

:: Run in conversation mode
llm-npu-cli.exe -m <converted_model_path> -cnv
```
Arguments info:
- `-m`: argument defining the path of the saved converted model.
- `-cnv`: argument to enable conversation mode.
- `-n`: argument defining how many tokens will be generated.
- The last argument is your input prompt.
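As a further, hypothetical combination of the options above (assuming `-n` is also honored in conversation mode), a run capping each reply at 128 generated tokens could look like:

```cmd
:: hypothetical: conversation mode with a cap of 128 generated tokens per reply
llm-npu-cli.exe -m <converted_model_path> -cnv -n 128
```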
## 5. Sample Output
### meta-llama/Llama-2-7b-chat-hf
#### Text Completion
```
Input:
<s>[INST] <<SYS>>
<</SYS>>
What is AI? [/INST]

Prefill 26 tokens cost xxxx ms.
Decode 63 tokens cost xxxx ms (avg xxxx ms each token).

Output:
AI stands for Artificial Intelligence, which is the field of study focused on creating and developing intelligent machines that can perform tasks that typically require human intelligence, such as visual and auditory recognition, speech recognition, and decision-making. AI is a broad and diverse field that includes a wide range
```
#### Conversation
```
User: Hi
Assistant: Hello! It's nice to meet you. How can I help you today?
User: What is AI in one sentence?
Assistant: Sure, here's a one-sentence definition of AI:
Artificial Intelligence (AI) refers to the development and use of computer systems and algorithms that can perform tasks that typically require human intelligence, such as visual and speech recognition, decision-making and problem-solving, and natural language processing.
User: exit
```
## Troubleshooting
### Program crash with Chinese prompt
If you run the CPP examples on Windows and find that your program raises the error below when accepting Chinese prompts, you can search for region in the Windows search bar, go to Region -> Administrative -> Change system locale.., tick the Beta: Use Unicode UTF-8 for worldwide language support option, and then restart your computer.
```
thread '<unnamed>' panicked at src\lib.rs:151:91:
called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 77, error_len: Some(1) }
```
For detailed instructions on how to do this, see this issue.
## Accuracy Tuning
If you encounter output issues when running the CPP examples, you could try the following methods when converting the model to tune the accuracy:
- Before converting the model, consider setting an additional environment variable `IPEX_LLM_NPU_QUANTIZATION_OPT=1` to enhance output quality.
- If you are using the default `LOW_BIT` value (i.e. `sym_int4` optimizations), you could try to use `--low-bit "asym_int4"` instead to tune the output quality.
- You could refer to the Quickstart for more accuracy tuning strategies.
> [!IMPORTANT]
> Please note that to make the above methods take effect, you must specify a new folder for `SAVE_DIRECTORY`. Reusing the same `SAVE_DIRECTORY` will load the previously saved low-bit model, making the above accuracy tuning strategies ineffective.
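For example, a re-conversion that applies both tuning methods above while writing to a fresh folder (the output path here is just an illustrative placeholder) could look like:

```cmd
:: set the optional quantization tuning variable before converting
set IPEX_LLM_NPU_QUANTIZATION_OPT=1

:: re-convert with asym_int4 into a new, previously unused folder
python convert.py --repo-id-or-model-path meta-llama/Llama-2-7b-chat-hf --low-bit "asym_int4" --save-directory <new_converted_model_path>
```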