From bde8acc303bcbb9248aa283b398d92894b32b6cb Mon Sep 17 00:00:00 2001
From: binbin Deng <108676127+plusbang@users.noreply.github.com>
Date: Wed, 19 Feb 2025 10:46:35 +0800
Subject: [PATCH] [NPU] Update doc of gguf support (#12837)

---
 docs/mddocs/Quickstart/npu_quickstart.md | 45 ++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/docs/mddocs/Quickstart/npu_quickstart.md b/docs/mddocs/Quickstart/npu_quickstart.md
index 16e6d125..293f50af 100644
--- a/docs/mddocs/Quickstart/npu_quickstart.md
+++ b/docs/mddocs/Quickstart/npu_quickstart.md
@@ -15,6 +15,7 @@ This guide demonstrates:
 - [Runtime Configurations](#runtime-configurations)
 - [Python API](#python-api)
 - [C++ API](#c-api)
+- [llama.cpp Support](#experimental-llamacpp-support)
 - [Accuracy Tuning](#accuracy-tuning)
 
 ## Install Prerequisites
@@ -181,6 +182,50 @@ Refer to the following table for examples of verified models:
 > [!TIP]
 > You could refer to [here](../../../python/llm/example/NPU/HF-Transformers-AutoModels) for full IPEX-LLM examples on Intel NPU.
 
+## (Experimental) llama.cpp Support
+
+IPEX-LLM provides a `llama.cpp`-compatible API for running GGUF models on Intel NPU.
+
+Refer to the following table for verified models:
+
+| Model | Model link | Verified Platforms |
+|:--|:--|:--|
+| LLaMA 3.2 | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | Meteor Lake, Lunar Lake, Arrow Lake |
+| DeepSeek-R1 | [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B), [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) | Meteor Lake, Lunar Lake, Arrow Lake |
+
+### Setup for running llama.cpp
+
+First, create a directory in which to use `llama.cpp`. For instance, use the following commands to create a `llama-cpp-npu` directory and enter it:
+
+```cmd
+mkdir llama-cpp-npu
+cd llama-cpp-npu
+```
+
+Then, run the following command with **administrator privileges in Miniforge Prompt** to initialize `llama.cpp` for NPU:
+
+```cmd
+init-llama-cpp.bat
+```
+
+### Model Download
+
+Before running, download or copy a community GGUF model to your current directory, for instance, `DeepSeek-R1-Distill-Qwen-7B-Q6_K.gguf` from [DeepSeek-R1-Distill-Qwen-7B-GGUF](https://huggingface.co/lmstudio-community/DeepSeek-R1-Distill-Qwen-7B-GGUF/tree/main).
+
+### Run the quantized model
+
+Please refer to [Runtime Configurations](#runtime-configurations) before running the following command in Miniforge Prompt:
+
+```cmd
+llama-cli-npu.exe -m DeepSeek-R1-Distill-Qwen-7B-Q6_K.gguf -n 32 --prompt "What is AI?"
+```
+
+> **Note**:
+>
+> - **Warmup on first run**: When running specific GGUF models on the NPU for the first time, you might notice delays of up to several minutes before the first token is generated. This delay occurs because of blob compilation.
+> - For more details about the meaning of each parameter, you can run `llama-cli-npu.exe -h`.
+
+
 ## Accuracy Tuning
 
 IPEX-LLM provides several optimization methods for enhancing the accuracy of model outputs on Intel NPU. You can select and combine these techniques to achieve better outputs based on your specific use case.
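
A note on the "Model Download" step in the patch above: the doc leaves the actual download method open. Below is a minimal sketch of one way to fetch the example GGUF file from the command line; it assumes the `huggingface-cli` tool (installed via the `huggingface_hub` Python package, which is not mentioned in the patch) is available in the active Miniforge environment, and it reuses the repository and file names from the doc's own example.

```cmd
REM Hypothetical download sketch -- assumes you are inside the llama-cpp-npu
REM directory created during setup and that pip points at the active environment.
pip install huggingface_hub
REM Download only the Q6_K GGUF file from the community repo into the current directory.
huggingface-cli download lmstudio-community/DeepSeek-R1-Distill-Qwen-7B-GGUF DeepSeek-R1-Distill-Qwen-7B-Q6_K.gguf --local-dir .
```

Downloading the file from the Hugging Face page in a browser and copying it into the `llama-cpp-npu` directory works just as well; the only requirement stated in the doc is that the GGUF file ends up in the current directory before `llama-cli-npu.exe` is run.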