update llama.cpp related quickstart with rebased llama.cpp (#12996)

* update doc with rebased llama.cpp

* revert table of contents

* update demo output log
Ruonan Wang 2025-03-25 09:49:39 +08:00 committed by GitHub
parent 7a86dd0569
commit 0e0786a63c
4 changed files with 312 additions and 264 deletions


@@ -12,9 +12,9 @@
> For installation on Intel Arc B-Series GPU (such as **B580**), please refer to this [guide](./bmg_quickstart.md).

> [!NOTE]
> Our latest version is consistent with [d7cfe1f](https://github.com/ggml-org/llama.cpp/commit/d7cfe1ffe0f435d0048a6058d529daf76e072d9c) of llama.cpp.
>
> `ipex-llm[cpp]==2.2.0b20250320` is consistent with [ba1cb19](https://github.com/ggml-org/llama.cpp/commit/ba1cb19cdd0d92e012e0f6e009e0620f854b6afd) of llama.cpp.

See the demo of running LLaMA2-7B on Intel Arc GPU below.
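For reference, pinning `ipex-llm[cpp]` to the build called out above is normally a one-line pip step; the sketch below assumes a plain pip environment, so follow the linked install guide for the exact, platform-specific command.

```bash
# Sketch only: install the ipex-llm[cpp] build that matches the pinned llama.cpp commit above.
# The version tag comes from the note; --pre is needed because these are pre-release builds.
pip install --pre --upgrade "ipex-llm[cpp]==2.2.0b20250320"
```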
@@ -158,7 +158,7 @@ Before running, you should download or copy community GGUF model to your current
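If the GGUF file is not on disk yet, one way to fetch it is with `huggingface-cli`; this is only a sketch, and the repository id below is an assumption — any source of `mistral-7b-instruct-v0.1.Q4_K_M.gguf` works.

```bash
# Assumption: pulling the community Q4_K_M quantization from the TheBloke mirror on Hugging Face.
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.1-GGUF \
  mistral-7b-instruct-v0.1.Q4_K_M.gguf --local-dir .
```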
- For **Linux users**:

```bash
./llama-cli -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -c 1024 -t 8 -e -ngl 99 --color -no-cnv
```

> **Note**:
@@ -170,7 +170,7 @@ Before running, you should download or copy community GGUF model to your current
Please run the following command in Miniforge Prompt.

```cmd
llama-cli -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -c 1024 -t 8 -e -ngl 99 --color -no-cnv
```

> **Note**:
@@ -179,11 +179,10 @@ Before running, you should download or copy community GGUF model to your current
#### Sample Output
```
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Arc(TM) A770 Graphics) - 15473 MiB free
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /home/arda/ruonan/mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-instruct-v0.1
@@ -208,108 +207,123 @@ llama_model_loader: - kv 19: general.quantization_version u32
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
print_info: file format = GGUF V2
print_info: file type = Q4_K - Medium
print_info: file size = 4.07 GiB (4.83 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 3
load: token to piece cache size = 0.1637 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 4096
print_info: n_layer = 32
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 14336
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 7B
print_info: model params = 7.24 B
print_info: general.name = mistralai_mistral-7b-instruct-v0.1
print_info: vocab type = SPM
print_info: n_vocab = 32000
print_info: n_merges = 0
print_info: BOS token = 1 '<s>'
print_info: EOS token = 2 '</s>'
print_info: UNK token = 0 '<unk>'
print_info: LF token = 13 '<0x0A>'
print_info: EOG token = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors: CPU_Mapped model buffer size = 70.31 MiB
load_tensors: SYCL0 model buffer size = 4095.05 MiB
.................................................................................................
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 1024
llama_init_from_model: n_ctx_per_seq = 1024
llama_init_from_model: n_batch = 1024
llama_init_from_model: n_ubatch = 1024
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 10000.0
llama_init_from_model: freq_scale = 1
llama_init_from_model: n_ctx_per_seq (1024) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
Running with Environment Variables:
  GGML_SYCL_DEBUG: 0
  GGML_SYCL_DISABLE_OPT: 1
Build with Macros:
  GGML_SYCL_FORCE_MMQ: no
  GGML_SYCL_F16: no
Found 1 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc A770 Graphics| 12.55| 512| 1024| 32| 16225M| 1.6.31294.120000|
SYCL Optimization Feature:
|ID| Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]| Y|
llama_kv_cache_init: kv_size = 1024, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1
llama_kv_cache_init: SYCL0 KV buffer size = 128.00 MiB
llama_init_from_model: KV self size = 128.00 MiB, K (f16): 64.00 MiB, V (f16): 64.00 MiB
llama_init_from_model: SYCL_Host output buffer size = 0.12 MiB
llama_init_from_model: SYCL0 compute buffer size = 164.01 MiB
llama_init_from_model: SYCL_Host compute buffer size = 20.01 MiB
llama_init_from_model: graph nodes = 902
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 1024
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
system_info: n_threads = 8 (n_threads_batch = 8) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
sampler seed: 403565315
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 1024
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 1024, n_batch = 4096, n_predict = 32, n_keep = 1

Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun exploring the world. But sometimes, she found it hard to find friends who shared her interests. One day, she decided to take matters into her own

llama_perf_sampler_print: sampling time = x.xx ms / 63 runs ( x.xx ms per token, xx.xx tokens per second)
llama_perf_context_print: load time = xx.xx ms
llama_perf_context_print: prompt eval time = xx.xx ms / 31 tokens ( xx.xx ms per token, xx.xx tokens per second)
llama_perf_context_print: eval time = xx.xx ms / 31 runs ( xx.xx ms per token, xx.xx tokens per second)
llama_perf_context_print: total time = xx.xx ms / 62 tokens
```
### Troubleshooting


@@ -12,9 +12,9 @@
> For installation on Intel Arc B-Series GPU (such as **B580**), please refer to this [guide](./bmg_quickstart.md).

> [!NOTE]
> The latest version of `ipex-llm[cpp]` is consistent with the [d7cfe1f](https://github.com/ggml-org/llama.cpp/commit/d7cfe1ffe0f435d0048a6058d529daf76e072d9c) version of the official llama.cpp.
>
> `ipex-llm[cpp]==2.2.0b20250320` is consistent with the [ba1cb19](https://github.com/ggml-org/llama.cpp/commit/ba1cb19cdd0d92e012e0f6e009e0620f854b6afd) version of the official llama.cpp.

Below is a demo of running LLaMA2-7B on an Intel Arc GPU.
@@ -159,7 +159,7 @@ cd llama-cpp
- For **Linux users**:

```bash
./llama-cli -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -c 1024 -t 8 -e -ngl 99 --color -no-cnv
```

> **Note**:
@@ -171,7 +171,7 @@ cd llama-cpp
Please run the following command in Miniforge Prompt.

```cmd
llama-cli -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -c 1024 -t 8 -e -ngl 99 --color -no-cnv
```

> **Note**:
@@ -180,11 +180,10 @@ cd llama-cpp
#### Sample Output
```
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Arc(TM) A770 Graphics) - 15473 MiB free
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /home/arda/ruonan/mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-instruct-v0.1
@@ -209,108 +208,123 @@ llama_model_loader: - kv 19: general.quantization_version u32
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
print_info: file format = GGUF V2
print_info: file type = Q4_K - Medium
print_info: file size = 4.07 GiB (4.83 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 3
load: token to piece cache size = 0.1637 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 4096
print_info: n_layer = 32
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 14336
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 7B
print_info: model params = 7.24 B
print_info: general.name = mistralai_mistral-7b-instruct-v0.1
print_info: vocab type = SPM
print_info: n_vocab = 32000
print_info: n_merges = 0
print_info: BOS token = 1 '<s>'
print_info: EOS token = 2 '</s>'
print_info: UNK token = 0 '<unk>'
print_info: LF token = 13 '<0x0A>'
print_info: EOG token = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors: CPU_Mapped model buffer size = 70.31 MiB
load_tensors: SYCL0 model buffer size = 4095.05 MiB
.................................................................................................
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 1024
llama_init_from_model: n_ctx_per_seq = 1024
llama_init_from_model: n_batch = 1024
llama_init_from_model: n_ubatch = 1024
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 10000.0
llama_init_from_model: freq_scale = 1
llama_init_from_model: n_ctx_per_seq (1024) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
Running with Environment Variables:
  GGML_SYCL_DEBUG: 0
  GGML_SYCL_DISABLE_OPT: 1
Build with Macros:
  GGML_SYCL_FORCE_MMQ: no
  GGML_SYCL_F16: no
Found 1 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc A770 Graphics| 12.55| 512| 1024| 32| 16225M| 1.6.31294.120000|
SYCL Optimization Feature:
|ID| Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]| Y|
llama_kv_cache_init: kv_size = 1024, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1
llama_kv_cache_init: SYCL0 KV buffer size = 128.00 MiB
llama_init_from_model: KV self size = 128.00 MiB, K (f16): 64.00 MiB, V (f16): 64.00 MiB
llama_init_from_model: SYCL_Host output buffer size = 0.12 MiB
llama_init_from_model: SYCL0 compute buffer size = 164.01 MiB
llama_init_from_model: SYCL_Host compute buffer size = 20.01 MiB
llama_init_from_model: graph nodes = 902
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 1024
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
system_info: n_threads = 8 (n_threads_batch = 8) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
sampler seed: 403565315
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 1024
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 1024, n_batch = 4096, n_predict = 32, n_keep = 1

Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun exploring the world. But sometimes, she found it hard to find friends who shared her interests. One day, she decided to take matters into her own

llama_perf_sampler_print: sampling time = x.xx ms / 63 runs ( x.xx ms per token, xx.xx tokens per second)
llama_perf_context_print: load time = xx.xx ms
llama_perf_context_print: prompt eval time = xx.xx ms / 31 tokens ( xx.xx ms per token, xx.xx tokens per second)
llama_perf_context_print: eval time = xx.xx ms / 31 runs ( xx.xx ms per token, xx.xx tokens per second)
llama_perf_context_print: total time = xx.xx ms / 62 tokens
```
### Troubleshooting


@@ -64,7 +64,7 @@ Before running, you should download or copy community GGUF model to your local d
#### Run GGUF model
Please change `PATH\TO\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf` to your model path before you run the command below.

```cmd
llama-cli.exe -m PATH\TO\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -p "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: Question:The product of the ages of three teenagers is 4590. How old is the oldest? a. 18 b. 19 c. 15 d. 17 Assistant: <think>" -n 2048 -t 8 -e -ngl 99 --color -c 2500 --temp 0 -no-cnv
```
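Since the reasoning prompt is long, it can be easier to keep it in a text file and pass it with llama.cpp's `-f`/`--file` option instead of `-p`; this is only a sketch, assuming the prompt above has been saved as a hypothetical `prompt.txt`.

```cmd
REM Sketch: same run as above, but reading the prompt from prompt.txt (hypothetical file name).
llama-cli.exe -m PATH\TO\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -f prompt.txt -n 2048 -t 8 -e -ngl 99 --color -c 2500 --temp 0 -no-cnv
```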
Part of outputs:
@@ -75,27 +75,32 @@ Found 1 SYCL devices:
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc A770 Graphics| 12.55| 512| 1024| 32| 16225M| 1.6.31294.120000|
SYCL Optimization Feature:
|ID| Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]| Y|
llama_kv_cache_init: kv_size = 2528, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init: SYCL0 KV buffer size = 138.25 MiB
llama_init_from_model: KV self size = 138.25 MiB, K (f16): 69.12 MiB, V (f16): 69.12 MiB
llama_init_from_model: SYCL_Host output buffer size = 0.58 MiB
llama_init_from_model: SYCL0 compute buffer size = 1501.00 MiB
llama_init_from_model: SYCL_Host compute buffer size = 59.28 MiB
llama_init_from_model: graph nodes = 874
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2528
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
system_info: n_threads = 8 (n_threads_batch = 8) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
sampler seed: 1856767110
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 2528
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 2528, n_batch = 4096, n_predict = 2048, n_keep = 1
<think>
@@ -143,7 +148,7 @@ Before running, you should download or copy community GGUF model to your local d
#### Run GGUF model
Please change `/PATH/TO/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf` to your model path before you run the command below.

```bash
./llama-cli -m /PATH/TO/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -p "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: Question:The product of the ages of three teenagers is 4590. How old is the oldest? a. 18 b. 19 c. 15 d. 17 Assistant: <think>" -n 2048 -t 8 -e -ngl 99 --color -c 2500 --temp 0 -no-cnv
```

Part of outputs:
@@ -154,27 +159,32 @@ Found 1 SYCL devices:
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc A770 Graphics| 12.55| 512| 1024| 32| 16225M| 1.6.31294.120000|
SYCL Optimization Feature:
|ID| Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]| Y|
llama_kv_cache_init: kv_size = 2528, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init: SYCL0 KV buffer size = 138.25 MiB
llama_init_from_model: KV self size = 138.25 MiB, K (f16): 69.12 MiB, V (f16): 69.12 MiB
llama_init_from_model: SYCL_Host output buffer size = 0.58 MiB
llama_init_from_model: SYCL0 compute buffer size = 1501.00 MiB
llama_init_from_model: SYCL_Host compute buffer size = 59.28 MiB
llama_init_from_model: graph nodes = 874
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2528
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
system_info: n_threads = 8 (n_threads_batch = 8) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
sampler seed: 1856767110
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 2528
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 2528, n_batch = 4096, n_predict = 2048, n_keep = 1
<think>
@@ -211,7 +221,7 @@ Before running, you should download or copy community GGUF model to your local d
Change `/PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf` to your model path, then run `DeepSeek-R1-Q4_K_M.gguf`

```bash
./flash-moe -m /PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf --prompt "What's AI?" -no-cnv
```
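The model here is a multi-part GGUF: only the first shard is passed on the command line, and the remaining shards are expected to sit in the same directory. A small sketch of what that looks like (paths are placeholders):

```bash
# All nine shards of the split GGUF live in the same directory; flash-moe is pointed at the first one.
ls /PATH/TO/DeepSeek-R1-Q4_K_M-0000*-of-00009.gguf
./flash-moe -m /PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf --prompt "What's AI?" -no-cnv
```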
Part of outputs
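As an aside, for the GPU runs in this document it is common in IPEX-LLM setups to pin the SYCL device and keep the kernel cache persistent between runs; the variables below are an assumption drawn from general Intel GPU setup notes, not something this page mandates.

```bash
# Assumed runtime configuration for Intel GPUs (adjust or omit as the install guide for your platform advises).
export SYCL_CACHE_PERSISTENT=1
export ONEAPI_DEVICE_SELECTOR=level_zero:0
```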


@@ -66,7 +66,7 @@
Before running the command below, change `PATH\TO\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf` to your model path.

```cmd
llama-cli.exe -m PATH\TO\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -p "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: Question:The product of the ages of three teenagers is 4590. How old is the oldest? a. 18 b. 19 c. 15 d. 17 Assistant: <think>" -n 2048 -t 8 -e -ngl 99 --color -c 2500 --temp 0 -no-cnv
```

Part of outputs:
@@ -77,27 +77,32 @@ Found 1 SYCL devices:
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc A770 Graphics| 12.55| 512| 1024| 32| 16225M| 1.6.31294.120000|
SYCL Optimization Feature:
|ID| Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]| Y|
llama_kv_cache_init: kv_size = 2528, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init: SYCL0 KV buffer size = 138.25 MiB
llama_init_from_model: KV self size = 138.25 MiB, K (f16): 69.12 MiB, V (f16): 69.12 MiB
llama_init_from_model: SYCL_Host output buffer size = 0.58 MiB
llama_init_from_model: SYCL0 compute buffer size = 1501.00 MiB
llama_init_from_model: SYCL_Host compute buffer size = 59.28 MiB
llama_init_from_model: graph nodes = 874
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2528
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
system_info: n_threads = 8 (n_threads_batch = 8) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
sampler seed: 1856767110
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 2528
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 2528, n_batch = 4096, n_predict = 2048, n_keep = 1
<think>
@@ -147,7 +152,7 @@ llama_perf_context_print: total time = xxxxx.xx ms / 1385 tokens
Before running the command below, change `/PATH/TO/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf` to your model path.

```bash
./llama-cli -m /PATH/TO/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -p "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: Question:The product of the ages of three teenagers is 4590. How old is the oldest? a. 18 b. 19 c. 15 d. 17 Assistant: <think>" -n 2048 -t 8 -e -ngl 99 --color -c 2500 --temp 0 -no-cnv
```

Part of outputs:
@@ -158,27 +163,32 @@ Found 1 SYCL devices:
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc A770 Graphics| 12.55| 512| 1024| 32| 16225M| 1.6.31294.120000|
SYCL Optimization Feature:
|ID| Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]| Y|
llama_kv_cache_init: kv_size = 2528, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init: SYCL0 KV buffer size = 138.25 MiB
llama_init_from_model: KV self size = 138.25 MiB, K (f16): 69.12 MiB, V (f16): 69.12 MiB
llama_init_from_model: SYCL_Host output buffer size = 0.58 MiB
llama_init_from_model: SYCL0 compute buffer size = 1501.00 MiB
llama_init_from_model: SYCL_Host compute buffer size = 59.28 MiB
llama_init_from_model: graph nodes = 874
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2528
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
system_info: n_threads = 8 (n_threads_batch = 8) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
sampler seed: 1856767110
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 2528
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 2528, n_batch = 4096, n_predict = 2048, n_keep = 1
<think>
@@ -213,7 +223,7 @@ FlashMoE is a command-line tool built on `llama.cpp`, targeted at DeepSeek
Change `/PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf` to your model path, then run `DeepSeek-R1-Q4_K_M.gguf`

```bash
./flash-moe -m /PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf --prompt "What's AI?" -no-cnv
```

Part of outputs: