update llama.cpp related quickstart with rebased llama.cpp (#12996)
* update doc with rebased llama.cpp
* revert table of contents
* update demo output log
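The recurring change below is appending `-no-cnv` to the `llama-cli` examples; presumably the rebased llama.cpp enables conversation (chat) mode by default when the model ships a chat template, and `-no-cnv` restores the one-shot completion behaviour the sample outputs were captured with. A minimal sketch of the updated invocation (model file and prompt are placeholders, not taken from this commit):

```bash
# Sketch only: path and prompt are illustrative placeholders.
# -no-cnv keeps llama-cli in plain completion mode, so --prompt is generated
# once and the process exits instead of dropping into interactive chat.
./llama-cli -m mistral-7b-instruct-v0.1.Q4_K_M.gguf \
  --prompt "Once upon a time" -n 32 -c 1024 -t 8 -e -ngl 99 --color -no-cnv
```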
commit 0e0786a63c (parent 7a86dd0569)
4 changed files with 312 additions and 264 deletions

@@ -12,9 +12,9 @@
 > For installation on Intel Arc B-Series GPU (such as **B580**), please refer to this [guide](./bmg_quickstart.md).
 
 > [!NOTE]
-> Our latest version is consistent with [3f1ae2e](https://github.com/ggerganov/llama.cpp/commit/3f1ae2e32cde00c39b96be6d01c2997c29bae555) of llama.cpp.
+> Our latest version is consistent with [d7cfe1f](https://github.com/ggml-org/llama.cpp/commit/d7cfe1ffe0f435d0048a6058d529daf76e072d9c) of llama.cpp.
 >
-> `ipex-llm[cpp]==2.2.0b20241204` is consistent with [a1631e5](https://github.com/ggerganov/llama.cpp/commit/a1631e53f6763e17da522ba219b030d8932900bd) of llama.cpp.
+> `ipex-llm[cpp]==2.2.0b20250320` is consistent with [ba1cb19](https://github.com/ggml-org/llama.cpp/commit/ba1cb19cdd0d92e012e0f6e009e0620f854b6afd) of llama.cpp.
 
 See the demo of running LLaMA2-7B on Intel Arc GPU below.
 
@@ -158,7 +158,7 @@ Before running, you should download or copy community GGUF model to your current
 - For **Linux users**:
   
   ```bash
-  ./llama-cli -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -c 1024 -t 8 -e -ngl 99 --color
+  ./llama-cli -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -c 1024 -t 8 -e -ngl 99 --color -no-cnv
   ```
 
   > **Note**:
@@ -170,7 +170,7 @@ Before running, you should download or copy community GGUF model to your current
   Please run the following command in Miniforge Prompt.
 
   ```cmd
-  llama-cli -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -c 1024 -t 8 -e -ngl 99 --color
+  llama-cli -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -c 1024 -t 8 -e -ngl 99 --color -no-cnv
   ```
 
   > **Note**:
@@ -179,11 +179,10 @@ Before running, you should download or copy community GGUF model to your current
 
 #### Sample Output
 ```
-Log start
-main: build = 1 (6f4ec98)
-main: built with MSVC 19.39.33519.0 for
-main: seed  = 1724921424
-llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from D:\gguf-models\mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
+main: llama backend init
+main: load the model and apply lora adapter, if any
+llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Arc(TM) A770 Graphics) - 15473 MiB free
+llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /home/arda/ruonan/mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
 llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
 llama_model_loader: - kv   0:                       general.architecture str              = llama
 llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
@@ -208,108 +207,123 @@ llama_model_loader: - kv  19:               general.quantization_version u32
 llama_model_loader: - type  f32:   65 tensors
 llama_model_loader: - type q4_K:  193 tensors
 llama_model_loader: - type q6_K:   33 tensors
-llm_load_vocab: special tokens cache size = 3
-llm_load_vocab: token to piece cache size = 0.1637 MB
-llm_load_print_meta: format           = GGUF V2
-llm_load_print_meta: arch             = llama
-llm_load_print_meta: vocab type       = SPM
-llm_load_print_meta: n_vocab          = 32000
-llm_load_print_meta: n_merges         = 0
-llm_load_print_meta: vocab_only       = 0
-llm_load_print_meta: n_ctx_train      = 32768
-llm_load_print_meta: n_embd           = 4096
-llm_load_print_meta: n_layer          = 32
-llm_load_print_meta: n_head           = 32
-llm_load_print_meta: n_head_kv        = 8
-llm_load_print_meta: n_rot            = 128
-llm_load_print_meta: n_swa            = 0
-llm_load_print_meta: n_embd_head_k    = 128
-llm_load_print_meta: n_embd_head_v    = 128
-llm_load_print_meta: n_gqa            = 4
-llm_load_print_meta: n_embd_k_gqa     = 1024
-llm_load_print_meta: n_embd_v_gqa     = 1024
-llm_load_print_meta: f_norm_eps       = 0.0e+00
-llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
-llm_load_print_meta: f_clamp_kqv      = 0.0e+00
-llm_load_print_meta: f_max_alibi_bias = 0.0e+00
-llm_load_print_meta: f_logit_scale    = 0.0e+00
-llm_load_print_meta: n_ff             = 14336
-llm_load_print_meta: n_expert         = 0
-llm_load_print_meta: n_expert_used    = 0
-llm_load_print_meta: causal attn      = 1
-llm_load_print_meta: pooling type     = 0
-llm_load_print_meta: rope type        = 0
-llm_load_print_meta: rope scaling     = linear
-llm_load_print_meta: freq_base_train  = 10000.0
-llm_load_print_meta: freq_scale_train = 1
-llm_load_print_meta: n_ctx_orig_yarn  = 32768
-llm_load_print_meta: rope_finetuned   = unknown
-llm_load_print_meta: ssm_d_conv       = 0
-llm_load_print_meta: ssm_d_inner      = 0
-llm_load_print_meta: ssm_d_state      = 0
-llm_load_print_meta: ssm_dt_rank      = 0
-llm_load_print_meta: ssm_dt_b_c_rms   = 0
-llm_load_print_meta: model type       = 7B
-llm_load_print_meta: model ftype      = Q4_K - Medium
-llm_load_print_meta: model params     = 7.24 B
-llm_load_print_meta: model size       = 4.07 GiB (4.83 BPW)
-llm_load_print_meta: general.name     = mistralai_mistral-7b-instruct-v0.1
-llm_load_print_meta: BOS token        = 1 '<s>'
-llm_load_print_meta: EOS token        = 2 '</s>'
-llm_load_print_meta: UNK token        = 0 '<unk>'
-llm_load_print_meta: LF token         = 13 '<0x0A>'
-llm_load_print_meta: max token length = 48
-ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
-ggml_sycl_init: SYCL_USE_XMX: yes
-ggml_sycl_init: found 1 SYCL devices:
-llm_load_tensors: ggml ctx size =    0.27 MiB
-llm_load_tensors: offloading 32 repeating layers to GPU
-llm_load_tensors: offloading non-repeating layers to GPU
-llm_load_tensors: offloaded 33/33 layers to GPU
-llm_load_tensors:      SYCL0 buffer size =  4095.05 MiB
-llm_load_tensors:        CPU buffer size =    70.31 MiB
-..............................................................................................
-llama_new_context_with_model: n_ctx      = 512
-llama_new_context_with_model: n_batch    = 512
-llama_new_context_with_model: n_ubatch   = 512
-llama_new_context_with_model: flash_attn = 0
-llama_new_context_with_model: freq_base  = 10000.0
-llama_new_context_with_model: freq_scale = 1
-[SYCL] call ggml_check_sycl
-ggml_check_sycl: GGML_SYCL_DEBUG: 0
-ggml_check_sycl: GGML_SYCL_F16: no
-found 1 SYCL devices:
+print_info: file format = GGUF V2
+print_info: file type   = Q4_K - Medium
+print_info: file size   = 4.07 GiB (4.83 BPW)
+load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+load: special tokens cache size = 3
+load: token to piece cache size = 0.1637 MB
+print_info: arch             = llama
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 32768
+print_info: n_embd           = 4096
+print_info: n_layer          = 32
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-05
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: n_ff             = 14336
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 0
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 10000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 32768
+print_info: rope_finetuned   = unknown
+print_info: ssm_d_conv       = 0
+print_info: ssm_d_inner      = 0
+print_info: ssm_d_state      = 0
+print_info: ssm_dt_rank      = 0
+print_info: ssm_dt_b_c_rms   = 0
+print_info: model type       = 7B
+print_info: model params     = 7.24 B
+print_info: general.name     = mistralai_mistral-7b-instruct-v0.1
+print_info: vocab type       = SPM
+print_info: n_vocab          = 32000
+print_info: n_merges         = 0
+print_info: BOS token        = 1 '<s>'
+print_info: EOS token        = 2 '</s>'
+print_info: UNK token        = 0 '<unk>'
+print_info: LF token         = 13 '<0x0A>'
+print_info: EOG token        = 2 '</s>'
+print_info: max token length = 48
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 32 repeating layers to GPU
+load_tensors: offloading output layer to GPU
+load_tensors: offloaded 33/33 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =    70.31 MiB
+load_tensors:        SYCL0 model buffer size =  4095.05 MiB
+.................................................................................................
+llama_init_from_model: n_seq_max     = 1
+llama_init_from_model: n_ctx         = 1024
+llama_init_from_model: n_ctx_per_seq = 1024
+llama_init_from_model: n_batch       = 1024
+llama_init_from_model: n_ubatch      = 1024
+llama_init_from_model: flash_attn    = 0
+llama_init_from_model: freq_base     = 10000.0
+llama_init_from_model: freq_scale    = 1
+llama_init_from_model: n_ctx_per_seq (1024) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
+Running with Environment Variables:
+  GGML_SYCL_DEBUG: 0
+  GGML_SYCL_DISABLE_OPT: 1
+Build with Macros:
+  GGML_SYCL_FORCE_MMQ: no
+  GGML_SYCL_F16: no
+Found 1 SYCL devices:
 |  |                   |                                       |       |Max    |        |Max  |Global |                     |
 |  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
 |ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
 |--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
-| 0| [level_zero:gpu:0]|                     Intel Arc Graphics|    1.3|    112|    1024|   32| 13578M|            1.3.27504|
-llama_kv_cache_init:      SYCL0 KV buffer size =    64.00 MiB
-llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
-llama_new_context_with_model:  SYCL_Host  output buffer size =     0.12 MiB
-llama_new_context_with_model:      SYCL0 compute buffer size =    81.00 MiB
-llama_new_context_with_model:  SYCL_Host compute buffer size =     9.01 MiB
-llama_new_context_with_model: graph nodes  = 902
-llama_new_context_with_model: graph splits = 2
+| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|  12.55|    512|    1024|   32| 16225M|     1.6.31294.120000|
+SYCL Optimization Feature:
+|ID|        Device Type|Reorder|
+|--|-------------------|-------|
+| 0| [level_zero:gpu:0]|      Y|
+llama_kv_cache_init: kv_size = 1024, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1
+llama_kv_cache_init:      SYCL0 KV buffer size =   128.00 MiB
+llama_init_from_model: KV self size  =  128.00 MiB, K (f16):   64.00 MiB, V (f16):   64.00 MiB
+llama_init_from_model:  SYCL_Host  output buffer size =     0.12 MiB
+llama_init_from_model:      SYCL0 compute buffer size =   164.01 MiB
+llama_init_from_model:  SYCL_Host compute buffer size =    20.01 MiB
+llama_init_from_model: graph nodes  = 902
+llama_init_from_model: graph splits = 2
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 1024
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+main: llama threadpool init, n_threads = 8
 
-system_info: n_threads = 8 / 18 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
-sampling:
+system_info: n_threads = 8 (n_threads_batch = 8) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
+
+sampler seed: 403565315
+sampler params:
         repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
-        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
+        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 1024
+        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
         mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
-sampling order:
-CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
-generate: n_ctx = 512, n_batch = 2048, n_predict = 32, n_keep = 1
+sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
+generate: n_ctx = 1024, n_batch = 4096, n_predict = 32, n_keep = 1
 
+ Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun exploring the world. But sometimes, she found it hard to find friends who shared her interests.
 
- Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun exploring the world. She lived in a small village where there weren't many opportunities for adventures, but that didn't stop her. She would often read
-llama_print_timings:        load time =    xxxx ms
-llama_print_timings:      sample time =     x.xx ms /    32 runs   (   xx.xx ms per token,  xx.xx tokens per second)
-llama_print_timings: prompt eval time =    xx.xx ms /    31 tokens (   xx.xx ms per token,  xx.xx tokens per second)
-llama_print_timings:        eval time =    xx.xx ms /    31 runs   (   xx.xx ms per token,  xx.xx tokens per second)
-llama_print_timings:       total time =    xx.xx ms /    62 tokens
-Log end
+One day, she decided to take matters into her own
 
+llama_perf_sampler_print:    sampling time =       x.xx ms /    63 runs   (    x.xx ms per token, xx.xx tokens per second)
+llama_perf_context_print:        load time =      xx.xx ms
+llama_perf_context_print: prompt eval time =      xx.xx ms /    31 tokens (   xx.xx ms per token,    xx.xx tokens per second)
+llama_perf_context_print:        eval time =      xx.xx ms /    31 runs   (   xx.xx ms per token,    xx.xx tokens per second)
+llama_perf_context_print:       total time =      xx.xx ms /    62 tokens
 ```
 
 ### Troubleshooting

@@ -12,9 +12,9 @@
 > For installation on an Intel Arc B-Series GPU (such as **B580**), please refer to this [guide](./bmg_quickstart.md).
 
 > [!NOTE]
-> The latest version of `ipex-llm[cpp]` is consistent with [3f1ae2e](https://github.com/ggerganov/llama.cpp/commit/3f1ae2e32cde00c39b96be6d01c2997c29bae555) of the official llama.cpp.
+> The latest version of `ipex-llm[cpp]` is consistent with [d7cfe1f](https://github.com/ggml-org/llama.cpp/commit/d7cfe1ffe0f435d0048a6058d529daf76e072d9c) of the official llama.cpp.
 >
-> `ipex-llm[cpp]==2.2.0b20241204` is consistent with [a1631e5](https://github.com/ggerganov/llama.cpp/commit/a1631e53f6763e17da522ba219b030d8932900bd) of the official llama.cpp.
+> `ipex-llm[cpp]==2.2.0b20250320` is consistent with [ba1cb19](https://github.com/ggml-org/llama.cpp/commit/ba1cb19cdd0d92e012e0f6e009e0620f854b6afd) of the official llama.cpp.
 
 Below is a demo of running LLaMA2-7B on an Intel Arc GPU.
 
@@ -159,7 +159,7 @@ cd llama-cpp
 - For **Linux users**:
   
   ```bash
-  ./llama-cli -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -c 1024 -t 8 -e -ngl 99 --color
+  ./llama-cli -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -c 1024 -t 8 -e -ngl 99 --color -no-cnv
   ```
 
   > **Note**:
@@ -171,7 +171,7 @@ cd llama-cpp
   Please run the following command in Miniforge Prompt.
 
   ```cmd
-  llama-cli -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -c 1024 -t 8 -e -ngl 99 --color
+  llama-cli -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -c 1024 -t 8 -e -ngl 99 --color -no-cnv
   ```
 
   > **Note**:
@@ -180,11 +180,10 @@ cd llama-cpp
 
 #### Sample Output
 ```
-Log start
-main: build = 1 (6f4ec98)
-main: built with MSVC 19.39.33519.0 for
-main: seed  = 1724921424
-llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from D:\gguf-models\mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
+main: llama backend init
+main: load the model and apply lora adapter, if any
+llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Arc(TM) A770 Graphics) - 15473 MiB free
+llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /home/arda/ruonan/mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
 llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
 llama_model_loader: - kv   0:                       general.architecture str              = llama
 llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
@@ -209,108 +208,123 @@ llama_model_loader: - kv  19:               general.quantization_version u32
 llama_model_loader: - type  f32:   65 tensors
 llama_model_loader: - type q4_K:  193 tensors
 llama_model_loader: - type q6_K:   33 tensors
-llm_load_vocab: special tokens cache size = 3
-llm_load_vocab: token to piece cache size = 0.1637 MB
-llm_load_print_meta: format           = GGUF V2
-llm_load_print_meta: arch             = llama
-llm_load_print_meta: vocab type       = SPM
-llm_load_print_meta: n_vocab          = 32000
-llm_load_print_meta: n_merges         = 0
-llm_load_print_meta: vocab_only       = 0
-llm_load_print_meta: n_ctx_train      = 32768
-llm_load_print_meta: n_embd           = 4096
-llm_load_print_meta: n_layer          = 32
-llm_load_print_meta: n_head           = 32
-llm_load_print_meta: n_head_kv        = 8
-llm_load_print_meta: n_rot            = 128
-llm_load_print_meta: n_swa            = 0
-llm_load_print_meta: n_embd_head_k    = 128
-llm_load_print_meta: n_embd_head_v    = 128
-llm_load_print_meta: n_gqa            = 4
-llm_load_print_meta: n_embd_k_gqa     = 1024
-llm_load_print_meta: n_embd_v_gqa     = 1024
-llm_load_print_meta: f_norm_eps       = 0.0e+00
-llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
-llm_load_print_meta: f_clamp_kqv      = 0.0e+00
-llm_load_print_meta: f_max_alibi_bias = 0.0e+00
-llm_load_print_meta: f_logit_scale    = 0.0e+00
-llm_load_print_meta: n_ff             = 14336
-llm_load_print_meta: n_expert         = 0
-llm_load_print_meta: n_expert_used    = 0
-llm_load_print_meta: causal attn      = 1
-llm_load_print_meta: pooling type     = 0
-llm_load_print_meta: rope type        = 0
-llm_load_print_meta: rope scaling     = linear
-llm_load_print_meta: freq_base_train  = 10000.0
-llm_load_print_meta: freq_scale_train = 1
-llm_load_print_meta: n_ctx_orig_yarn  = 32768
-llm_load_print_meta: rope_finetuned   = unknown
-llm_load_print_meta: ssm_d_conv       = 0
-llm_load_print_meta: ssm_d_inner      = 0
-llm_load_print_meta: ssm_d_state      = 0
-llm_load_print_meta: ssm_dt_rank      = 0
-llm_load_print_meta: ssm_dt_b_c_rms   = 0
-llm_load_print_meta: model type       = 7B
-llm_load_print_meta: model ftype      = Q4_K - Medium
-llm_load_print_meta: model params     = 7.24 B
-llm_load_print_meta: model size       = 4.07 GiB (4.83 BPW)
-llm_load_print_meta: general.name     = mistralai_mistral-7b-instruct-v0.1
-llm_load_print_meta: BOS token        = 1 '<s>'
-llm_load_print_meta: EOS token        = 2 '</s>'
-llm_load_print_meta: UNK token        = 0 '<unk>'
-llm_load_print_meta: LF token         = 13 '<0x0A>'
-llm_load_print_meta: max token length = 48
-ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
-ggml_sycl_init: SYCL_USE_XMX: yes
-ggml_sycl_init: found 1 SYCL devices:
-llm_load_tensors: ggml ctx size =    0.27 MiB
-llm_load_tensors: offloading 32 repeating layers to GPU
-llm_load_tensors: offloading non-repeating layers to GPU
-llm_load_tensors: offloaded 33/33 layers to GPU
-llm_load_tensors:      SYCL0 buffer size =  4095.05 MiB
-llm_load_tensors:        CPU buffer size =    70.31 MiB
-..............................................................................................
-llama_new_context_with_model: n_ctx      = 512
-llama_new_context_with_model: n_batch    = 512
-llama_new_context_with_model: n_ubatch   = 512
-llama_new_context_with_model: flash_attn = 0
-llama_new_context_with_model: freq_base  = 10000.0
-llama_new_context_with_model: freq_scale = 1
-[SYCL] call ggml_check_sycl
-ggml_check_sycl: GGML_SYCL_DEBUG: 0
-ggml_check_sycl: GGML_SYCL_F16: no
-found 1 SYCL devices:
+print_info: file format = GGUF V2
+print_info: file type   = Q4_K - Medium
+print_info: file size   = 4.07 GiB (4.83 BPW)
+load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+load: special tokens cache size = 3
+load: token to piece cache size = 0.1637 MB
+print_info: arch             = llama
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 32768
+print_info: n_embd           = 4096
+print_info: n_layer          = 32
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-05
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: n_ff             = 14336
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 0
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 10000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 32768
+print_info: rope_finetuned   = unknown
+print_info: ssm_d_conv       = 0
+print_info: ssm_d_inner      = 0
+print_info: ssm_d_state      = 0
+print_info: ssm_dt_rank      = 0
+print_info: ssm_dt_b_c_rms   = 0
+print_info: model type       = 7B
+print_info: model params     = 7.24 B
+print_info: general.name     = mistralai_mistral-7b-instruct-v0.1
+print_info: vocab type       = SPM
+print_info: n_vocab          = 32000
+print_info: n_merges         = 0
+print_info: BOS token        = 1 '<s>'
+print_info: EOS token        = 2 '</s>'
+print_info: UNK token        = 0 '<unk>'
+print_info: LF token         = 13 '<0x0A>'
+print_info: EOG token        = 2 '</s>'
+print_info: max token length = 48
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 32 repeating layers to GPU
+load_tensors: offloading output layer to GPU
+load_tensors: offloaded 33/33 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =    70.31 MiB
+load_tensors:        SYCL0 model buffer size =  4095.05 MiB
+.................................................................................................
+llama_init_from_model: n_seq_max     = 1
+llama_init_from_model: n_ctx         = 1024
+llama_init_from_model: n_ctx_per_seq = 1024
+llama_init_from_model: n_batch       = 1024
+llama_init_from_model: n_ubatch      = 1024
+llama_init_from_model: flash_attn    = 0
+llama_init_from_model: freq_base     = 10000.0
+llama_init_from_model: freq_scale    = 1
+llama_init_from_model: n_ctx_per_seq (1024) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
+Running with Environment Variables:
+  GGML_SYCL_DEBUG: 0
+  GGML_SYCL_DISABLE_OPT: 1
+Build with Macros:
+  GGML_SYCL_FORCE_MMQ: no
+  GGML_SYCL_F16: no
+Found 1 SYCL devices:
 |  |                   |                                       |       |Max    |        |Max  |Global |                     |
 |  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
 |ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
 |--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
-| 0| [level_zero:gpu:0]|                     Intel Arc Graphics|    1.3|    112|    1024|   32| 13578M|            1.3.27504|
-llama_kv_cache_init:      SYCL0 KV buffer size =    64.00 MiB
-llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
-llama_new_context_with_model:  SYCL_Host  output buffer size =     0.12 MiB
-llama_new_context_with_model:      SYCL0 compute buffer size =    81.00 MiB
-llama_new_context_with_model:  SYCL_Host compute buffer size =     9.01 MiB
-llama_new_context_with_model: graph nodes  = 902
-llama_new_context_with_model: graph splits = 2
+| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|  12.55|    512|    1024|   32| 16225M|     1.6.31294.120000|
+SYCL Optimization Feature:
+|ID|        Device Type|Reorder|
+|--|-------------------|-------|
+| 0| [level_zero:gpu:0]|      Y|
+llama_kv_cache_init: kv_size = 1024, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1
+llama_kv_cache_init:      SYCL0 KV buffer size =   128.00 MiB
+llama_init_from_model: KV self size  =  128.00 MiB, K (f16):   64.00 MiB, V (f16):   64.00 MiB
+llama_init_from_model:  SYCL_Host  output buffer size =     0.12 MiB
+llama_init_from_model:      SYCL0 compute buffer size =   164.01 MiB
+llama_init_from_model:  SYCL_Host compute buffer size =    20.01 MiB
+llama_init_from_model: graph nodes  = 902
+llama_init_from_model: graph splits = 2
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 1024
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+main: llama threadpool init, n_threads = 8
 
-system_info: n_threads = 8 / 18 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
-sampling:
+system_info: n_threads = 8 (n_threads_batch = 8) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
+
+sampler seed: 403565315
+sampler params:
         repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
-        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
+        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 1024
+        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
         mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
-sampling order:
-CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
-generate: n_ctx = 512, n_batch = 2048, n_predict = 32, n_keep = 1
+sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
+generate: n_ctx = 1024, n_batch = 4096, n_predict = 32, n_keep = 1
 
+ Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun exploring the world. But sometimes, she found it hard to find friends who shared her interests.
 
- Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun exploring the world. She lived in a small village where there weren't many opportunities for adventures, but that didn't stop her. She would often read
-llama_print_timings:        load time =    xxxx ms
-llama_print_timings:      sample time =     x.xx ms /    32 runs   (   xx.xx ms per token,  xx.xx tokens per second)
-llama_print_timings: prompt eval time =    xx.xx ms /    31 tokens (   xx.xx ms per token,  xx.xx tokens per second)
-llama_print_timings:        eval time =    xx.xx ms /    31 runs   (   xx.xx ms per token,  xx.xx tokens per second)
-llama_print_timings:       total time =    xx.xx ms /    62 tokens
-Log end
+One day, she decided to take matters into her own
 
+llama_perf_sampler_print:    sampling time =       x.xx ms /    63 runs   (    x.xx ms per token, xx.xx tokens per second)
+llama_perf_context_print:        load time =      xx.xx ms
+llama_perf_context_print: prompt eval time =      xx.xx ms /    31 tokens (   xx.xx ms per token,    xx.xx tokens per second)
+llama_perf_context_print:        eval time =      xx.xx ms /    31 runs   (   xx.xx ms per token,    xx.xx tokens per second)
+llama_perf_context_print:       total time =      xx.xx ms /    62 tokens
 ```
 
 ### Troubleshooting

@@ -64,7 +64,7 @@ Before running, you should download or copy community GGUF model to your local d
 #### Run GGUF model
 Please change `PATH\TO\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf` to your model path before your run below command.
 ```cmd
-llama-cli.exe -m PATH\TO\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -p "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: Question:The product of the ages of three teenagers is 4590. How old is the oldest? a. 18 b. 19 c. 15 d. 17 Assistant: <think>" -n 2048  -t 8 -e -ngl 99 --color -c 2500 --temp 0
+llama-cli.exe -m PATH\TO\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -p "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: Question:The product of the ages of three teenagers is 4590. How old is the oldest? a. 18 b. 19 c. 15 d. 17 Assistant: <think>" -n 2048  -t 8 -e -ngl 99 --color -c 2500 --temp 0 -no-cnv
 ```
 
 Part of outputs:
@@ -75,27 +75,32 @@ Found 1 SYCL devices:
 |  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
 |ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
 |--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
-| 0| [level_zero:gpu:0]|                     Intel Arc Graphics|  12.71|    128|    1024|   32| 13578M|            1.3.27504|
+| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|  12.55|    512|    1024|   32| 16225M|     1.6.31294.120000|
+SYCL Optimization Feature:
+|ID|        Device Type|Reorder|
+|--|-------------------|-------|
+| 0| [level_zero:gpu:0]|      Y|
+llama_kv_cache_init: kv_size = 2528, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
 llama_kv_cache_init:      SYCL0 KV buffer size =   138.25 MiB
-llama_new_context_with_model: KV self size  =  138.25 MiB, K (f16):   69.12 MiB, V (f16):   69.12 MiB
-llama_new_context_with_model:  SYCL_Host  output buffer size =     0.58 MiB
-llama_new_context_with_model:      SYCL0 compute buffer size =  1501.00 MiB
-llama_new_context_with_model:  SYCL_Host compute buffer size =    58.97 MiB
-llama_new_context_with_model: graph nodes  = 874
-llama_new_context_with_model: graph splits = 2
+llama_init_from_model: KV self size  =  138.25 MiB, K (f16):   69.12 MiB, V (f16):   69.12 MiB
+llama_init_from_model:  SYCL_Host  output buffer size =     0.58 MiB
+llama_init_from_model:      SYCL0 compute buffer size =  1501.00 MiB
+llama_init_from_model:  SYCL_Host compute buffer size =    59.28 MiB
+llama_init_from_model: graph nodes  = 874
+llama_init_from_model: graph splits = 2
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2528
 common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
 main: llama threadpool init, n_threads = 8
 
-system_info: n_threads = 8 (n_threads_batch = 8) / 22 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
+system_info: n_threads = 8 (n_threads_batch = 8) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
 
-sampler seed: 341519086
+sampler seed: 1856767110
 sampler params:
         repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
-        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
-        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.000
+        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 2528
+        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.000
         mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
 sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
-
 generate: n_ctx = 2528, n_batch = 4096, n_predict = 2048, n_keep = 1
 
 <think>
@@ -143,7 +148,7 @@ Before running, you should download or copy community GGUF model to your local d
 #### Run GGUF model
 Please change `/PATH/TO/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf` to your model path before your run below command.
 ```bash
-./llama-cli -m /PATH/TO/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -p "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: Question:The product of the ages of three teenagers is 4590. How old is the oldest? a. 18 b. 19 c. 15 d. 17 Assistant: <think>" -n 2048  -t 8 -e -ngl 99 --color -c 2500 --temp 0
+./llama-cli -m /PATH/TO/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -p "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: Question:The product of the ages of three teenagers is 4590. How old is the oldest? a. 18 b. 19 c. 15 d. 17 Assistant: <think>" -n 2048  -t 8 -e -ngl 99 --color -c 2500 --temp 0 -no-cnv
 ```
 
 Part of outputs:
@@ -154,27 +159,32 @@ Found 1 SYCL devices:
 |  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
 |ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
 |--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
-| 0| [level_zero:gpu:0]|                     Intel Arc Graphics|  12.71|    128|    1024|   32| 13578M|            1.3.27504|
+| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|  12.55|    512|    1024|   32| 16225M|     1.6.31294.120000|
+SYCL Optimization Feature:
+|ID|        Device Type|Reorder|
+|--|-------------------|-------|
+| 0| [level_zero:gpu:0]|      Y|
+llama_kv_cache_init: kv_size = 2528, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
 llama_kv_cache_init:      SYCL0 KV buffer size =   138.25 MiB
-llama_new_context_with_model: KV self size  =  138.25 MiB, K (f16):   69.12 MiB, V (f16):   69.12 MiB
-llama_new_context_with_model:  SYCL_Host  output buffer size =     0.58 MiB
-llama_new_context_with_model:      SYCL0 compute buffer size =  1501.00 MiB
-llama_new_context_with_model:  SYCL_Host compute buffer size =    58.97 MiB
-llama_new_context_with_model: graph nodes  = 874
-llama_new_context_with_model: graph splits = 2
+llama_init_from_model: KV self size  =  138.25 MiB, K (f16):   69.12 MiB, V (f16):   69.12 MiB
+llama_init_from_model:  SYCL_Host  output buffer size =     0.58 MiB
+llama_init_from_model:      SYCL0 compute buffer size =  1501.00 MiB
+llama_init_from_model:  SYCL_Host compute buffer size =    59.28 MiB
+llama_init_from_model: graph nodes  = 874
+llama_init_from_model: graph splits = 2
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2528
 common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
 main: llama threadpool init, n_threads = 8
 
-system_info: n_threads = 8 (n_threads_batch = 8) / 22 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
+system_info: n_threads = 8 (n_threads_batch = 8) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
 
-sampler seed: 341519086
+sampler seed: 1856767110
 sampler params:
         repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
-        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
-        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.000
+        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 2528
+        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.000
         mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
 sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
-
 generate: n_ctx = 2528, n_batch = 4096, n_predict = 2048, n_keep = 1
 
 <think>

@ -211,7 +221,7 @@ Before running, you should download or copy community GGUF model to your local d
 Change `/PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf` to your model path, then run `DeepSeek-R1-Q4_K_M.gguf`
 
 ```bash
-./flash-moe -m /PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf --prompt "What's AI?"
+./flash-moe -m /PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf --prompt "What's AI?" -no-cnv
 ```
 
 Part of outputs

@ -66,7 +66,7 @@
 Before running the command below, change `PATH\TO\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf` to your model path.
 
 ```cmd
-llama-cli.exe -m PATH\TO\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -p "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: Question:The product of the ages of three teenagers is 4590. How old is the oldest? a. 18 b. 19 c. 15 d. 17 Assistant: <think>" -n 2048  -t 8 -e -ngl 99 --color -c 2500 --temp 0
+llama-cli.exe -m PATH\TO\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -p "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: Question:The product of the ages of three teenagers is 4590. How old is the oldest? a. 18 b. 19 c. 15 d. 17 Assistant: <think>" -n 2048  -t 8 -e -ngl 99 --color -c 2500 --temp 0 -no-cnv
 ```
 
 Part of the output:

@ -77,27 +77,32 @@ Found 1 SYCL devices:
 |  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
 |ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
 |--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
-| 0| [level_zero:gpu:0]|                     Intel Arc Graphics|  12.71|    128|    1024|   32| 13578M|            1.3.27504|
+| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|  12.55|    512|    1024|   32| 16225M|     1.6.31294.120000|
+SYCL Optimization Feature:
+|ID|        Device Type|Reorder|
+|--|-------------------|-------|
+| 0| [level_zero:gpu:0]|      Y|
+llama_kv_cache_init: kv_size = 2528, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
 llama_kv_cache_init:      SYCL0 KV buffer size =   138.25 MiB
-llama_new_context_with_model: KV self size  =  138.25 MiB, K (f16):   69.12 MiB, V (f16):   69.12 MiB
+llama_init_from_model: KV self size  =  138.25 MiB, K (f16):   69.12 MiB, V (f16):   69.12 MiB
-llama_new_context_with_model:  SYCL_Host  output buffer size =     0.58 MiB
+llama_init_from_model:  SYCL_Host  output buffer size =     0.58 MiB
-llama_new_context_with_model:      SYCL0 compute buffer size =  1501.00 MiB
+llama_init_from_model:      SYCL0 compute buffer size =  1501.00 MiB
-llama_new_context_with_model:  SYCL_Host compute buffer size =    58.97 MiB
+llama_init_from_model:  SYCL_Host compute buffer size =    59.28 MiB
-llama_new_context_with_model: graph nodes  = 874
+llama_init_from_model: graph nodes  = 874
-llama_new_context_with_model: graph splits = 2
+llama_init_from_model: graph splits = 2
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2528
 common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
 main: llama threadpool init, n_threads = 8
 
-system_info: n_threads = 8 (n_threads_batch = 8) / 22 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
+system_info: n_threads = 8 (n_threads_batch = 8) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
 
-sampler seed: 341519086
+sampler seed: 1856767110
 sampler params: 
         repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
-        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
+        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 2528
-        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.000
+        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.000
         mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
 sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
 
 generate: n_ctx = 2528, n_batch = 4096, n_predict = 2048, n_keep = 1
 
 <think>

@ -147,7 +152,7 @@ llama_perf_context_print:       total time =   xxxxx.xx ms /  1385 tokens
 Before running the command below, change `PATH\TO\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf` to your model path.
 
 ```bash
-./llama-cli -m /PATH/TO/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -p "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: Question:The product of the ages of three teenagers is 4590. How old is the oldest? a. 18 b. 19 c. 15 d. 17 Assistant: <think>" -n 2048  -t 8 -e -ngl 99 --color -c 2500 --temp 0
+./llama-cli -m /PATH/TO/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -p "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: Question:The product of the ages of three teenagers is 4590. How old is the oldest? a. 18 b. 19 c. 15 d. 17 Assistant: <think>" -n 2048  -t 8 -e -ngl 99 --color -c 2500 --temp 0 -no-cnv
 ```
 
 Part of the output:

@ -158,27 +163,32 @@ Found 1 SYCL devices:
 |  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
 |ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
 |--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
-| 0| [level_zero:gpu:0]|                     Intel Arc Graphics|  12.71|    128|    1024|   32| 13578M|            1.3.27504|
+| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|  12.55|    512|    1024|   32| 16225M|     1.6.31294.120000|
+SYCL Optimization Feature:
+|ID|        Device Type|Reorder|
+|--|-------------------|-------|
+| 0| [level_zero:gpu:0]|      Y|
+llama_kv_cache_init: kv_size = 2528, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
 llama_kv_cache_init:      SYCL0 KV buffer size =   138.25 MiB
-llama_new_context_with_model: KV self size  =  138.25 MiB, K (f16):   69.12 MiB, V (f16):   69.12 MiB
+llama_init_from_model: KV self size  =  138.25 MiB, K (f16):   69.12 MiB, V (f16):   69.12 MiB
-llama_new_context_with_model:  SYCL_Host  output buffer size =     0.58 MiB
+llama_init_from_model:  SYCL_Host  output buffer size =     0.58 MiB
-llama_new_context_with_model:      SYCL0 compute buffer size =  1501.00 MiB
+llama_init_from_model:      SYCL0 compute buffer size =  1501.00 MiB
-llama_new_context_with_model:  SYCL_Host compute buffer size =    58.97 MiB
+llama_init_from_model:  SYCL_Host compute buffer size =    59.28 MiB
-llama_new_context_with_model: graph nodes  = 874
+llama_init_from_model: graph nodes  = 874
-llama_new_context_with_model: graph splits = 2
+llama_init_from_model: graph splits = 2
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2528
 common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
 main: llama threadpool init, n_threads = 8
 
-system_info: n_threads = 8 (n_threads_batch = 8) / 22 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
+system_info: n_threads = 8 (n_threads_batch = 8) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
 
-sampler seed: 341519086
+sampler seed: 1856767110
 sampler params: 
         repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
-        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
+        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 2528
-        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.000
+        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.000
         mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
 sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
 
 generate: n_ctx = 2528, n_batch = 4096, n_predict = 2048, n_keep = 1
 
 <think>

@ -213,7 +223,7 @@ FlashMoE is a command-line tool built on `llama.cpp`, designed for DeepSeek
 Change `/PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf` to your model path, then run `DeepSeek-R1-Q4_K_M.gguf`
 
 ```bash
-./flash-moe -m /PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf --prompt "What's AI?"
+./flash-moe -m /PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf --prompt "What's AI?" -no-cnv
 ```
 
 Part of the output: