
# Save/Load Low-Bit Models with IPEX-LLM Optimizations

In this directory, you will find an example of how to save and load models with IPEX-LLM low-bit optimizations on Intel NPUs.

## Example: Save/Load Optimized Models

In the example [generate.py](./generate.py), we show a basic use case of saving and loading a model with low-bit optimizations, then predicting the next N tokens with the `generate()` API.
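At its core, the example does something like the following minimal sketch (assuming the `ipex_llm.transformers.npu_model` API; the actual script wires these values to the command-line arguments described below and may use slightly different parameters):

```python
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers.npu_model import AutoModelForCausalLM

model_path = "meta-llama/Llama-2-7b-chat-hf"
save_dir = "path/to/save/model"  # placeholder, corresponds to --save-directory

# First run: convert the model to low-bit for the NPU and persist the
# converted weights, so later runs can skip the conversion step.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    optimize_model=True,
    load_in_low_bit="sym_int4",   # or "asym_int4" / "sym_int8"
    max_context_len=1024,
    max_prompt_len=512,
    torch_dtype=torch.float16,
    attn_implementation="eager",
    save_directory=save_dir,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.save_pretrained(save_dir)

# Later runs: load the already-converted low-bit model directly.
model = AutoModelForCausalLM.load_low_bit(
    save_dir,
    max_context_len=1024,
    max_prompt_len=512,
    torch_dtype=torch.float16,
    attn_implementation="eager",
)
tokenizer = AutoTokenizer.from_pretrained(save_dir)

# Predict the next N tokens with the standard `generate()` API.
inputs = tokenizer("What is AI?", return_tensors="pt")
output = model.generate(inputs.input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The key point is that the low-bit conversion happens once at save time; loading the converted checkpoint avoids re-quantizing the original weights on every run.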

### 0. Prerequisites

For `ipex-llm` NPU support, please refer to the Quickstart for details about the required preparations.

### 1. Install & Runtime Configurations

#### 1.1 Installation on Windows

We suggest using conda to manage the environment:

```cmd
conda create -n llm python=3.11
conda activate llm

:: install ipex-llm with 'npu' option
pip install --pre --upgrade ipex-llm[npu]

:: [optional] for Llama-3.2-1B-Instruct & Llama-3.2-3B-Instruct
pip install transformers==4.45.0 accelerate==0.33.0
```

Please refer to the Quickstart for more details about `ipex-llm` installation on Intel NPU.

#### 1.2 Runtime Configurations

Please refer to the Quickstart for setting environment variables based on your device.

### 2. Running examples

If you want to save the optimized model, run:

```cmd
python ./generate.py --repo-id-or-model-path "meta-llama/Llama-2-7b-chat-hf" --save-directory path/to/save/model
```

If you want to load the optimized low-bit model, run:

```cmd
python ./generate.py --load-directory path/to/load/model
```

In the example, several arguments can be passed to satisfy your requirements:

- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Llama2 model to be downloaded, or the path to the ModelScope checkpoint folder. It defaults to `'meta-llama/Llama-2-7b-chat-hf'`.
- `--save-directory`: argument defining the path to save the low-bit model. Afterwards, you can load the low-bit model directly from this path.
- `--load-directory`: argument defining the path to load the low-bit model from.
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'What is AI?'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
- `--max-context-len MAX_CONTEXT_LEN`: argument defining the maximum sequence length for both input and output tokens. It defaults to `1024`.
- `--max-prompt-len MAX_PROMPT_LEN`: argument defining the maximum number of tokens that the input prompt can contain. It defaults to `512`.
- `--low-bit LOW_BIT`: argument defining the low-bit optimizations that will be applied to the model. Currently available options are `"sym_int4"`, `"asym_int4"` and `"sym_int8"`, with `"sym_int4"` as the default.
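For example, to save the model with `sym_int8` optimizations instead of the default `sym_int4`, and later load it back with a longer generation length, you could run (the paths below are placeholders):

```cmd
python ./generate.py --repo-id-or-model-path "meta-llama/Llama-2-7b-chat-hf" --low-bit sym_int8 --save-directory path/to/save/model

python ./generate.py --load-directory path/to/load/model --prompt "What is AI?" --n-predict 64
```

Note that the chosen low-bit format is baked into the saved checkpoint, so `--low-bit` only matters when saving, not when loading.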

#### Sample Output

##### [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)

```log
Inference time: xxxx s
-------------------- Input --------------------
<s><s>  [INST] <<SYS>>

<</SYS>>

What is AI? [/INST]

-------------------- Output --------------------
<s><s>  [INST] <<SYS>>

<</SYS>>

What is AI? [/INST]

Artificial Intelligence (AI) is a field of computer science and technology that focuses on the development of intelligent machines that can perform tasks that
```