Update llama cpp portable zip quickstart (#12928)
* Update llamacpp_portable_zip_gpu_quickstart.md * Update llamacpp_portable_zip_gpu_quickstart.md * Update llamacpp_portable_zip_gpu_quickstart.md * Update llamacpp_portable_zip_gpu_quickstart.md * Update llamacpp_portable_zip_gpu_quickstart.md * Update llamacpp_portable_zip_gpu_quickstart.md * Update llamacpp_portable_zip_gpu_quickstart.md * Update llamacpp_portable_zip_gpu_quickstart.md * Update llamacpp_portable_zip_gpu_quickstart.md * Update llamacpp_portable_zip_gpu_quickstart.md
This commit is contained in:
parent de09590ca3
commit 33da3a3cb7

1 changed file with 26 additions and 16 deletions (llamacpp_portable_zip_gpu_quickstart.md)
@@ -13,14 +13,14 @@ This guide demonstrates how to use [llama.cpp portable zip](https://github.com/i
 - [Windows Quickstart](#windows-quickstart)
   - [Prerequisites](#prerequisites)
   - [Step 1: Download and Unzip](#step-1-download-and-unzip)
-  - [Step 3: Runtime Configuration](#step-2-runtime-configuration)
+  - [Step 2: Runtime Configuration](#step-2-runtime-configuration)
   - [Step 3: Run GGUF models](#step-3-run-gguf-models)
 - [Linux Quickstart](#linux-quickstart)
   - [Prerequisites](#prerequisites-1)
   - [Step 1: Download and Extract](#step-1-download-and-extract)
   - [Step 2: Runtime Configuration](#step-2-runtime-configuration-1)
   - [Step 3: Run GGUF models](#step-3-run-gguf-models-1)
-  - [(New) FlashMoE for MoE Models (e.g., DeeSeek V3/R1) using llama.cpp](#flashmoe-for-deeseek-v3r1)
+  - [(New) FlashMoE for MoE Models (e.g., DeepSeek V3/R1) using llama.cpp](#flashmoe-for-deepseek-v3r1)
 - [Tips & Troubleshooting](#tips--troubleshooting)
   - [Error: Detected different sycl devices](#error-detected-different-sycl-devices)
   - [Multi-GPUs usage](#multi-gpus-usage)
@@ -57,12 +57,12 @@ Then, extract the zip file to a folder.
 Here we provide a simple example to show how to run a community GGUF model with IPEX-LLM.
 
 #### Model Download
-Before running, you should download or copy community GGUF model to your current directory. For instance, `DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf` of [bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF](https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF/blob/main/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf).
+Before running, you should download or copy a community GGUF model to your local directory. For instance, `DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf` of [bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF](https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF/blob/main/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf).
 
 #### Run GGUF model
+Please change `PATH\TO\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf` to your model path before you run the command below.
 ```cmd
-llama-cli.exe -m D:\llm-models\gguf\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -p "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: Question:The product of the ages of three teenagers is 4590. How old is the oldest? a. 18 b. 19 c. 15 d. 17 Assistant: <think>" -n 2048 -t 8 -e -ngl 99 --color -c 2500 --temp 0
+llama-cli.exe -m PATH\TO\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -p "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: Question:The product of the ages of three teenagers is 4590. How old is the oldest? a. 18 b. 19 c. 15 d. 17 Assistant: <think>" -n 2048 -t 8 -e -ngl 99 --color -c 2500 --temp 0
 ```
 
 Part of outputs:
@@ -136,12 +136,12 @@ Then, extract the tgz file to a folder.
 Here we provide a simple example to show how to run a community GGUF model with IPEX-LLM.
 
 #### Model Download
-Before running, you should download or copy community GGUF model to your current directory. For instance, `DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf` of [bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF](https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF/blob/main/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf).
+Before running, you should download or copy a community GGUF model to your local directory. For instance, `DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf` of [bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF](https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF/blob/main/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf).
 
 #### Run GGUF model
+Please change `/PATH/TO/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf` to your model path before you run the command below.
 ```bash
-llama-cli -m D:\llm-models\gguf\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -p "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: Question:The product of the ages of three teenagers is 4590. How old is the oldest? a. 18 b. 19 c. 15 d. 17 Assistant: <think>" -n 2048 -t 8 -e -ngl 99 --color -c 2500 --temp 0
+./llama-cli -m /PATH/TO/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -p "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: Question:The product of the ages of three teenagers is 4590. How old is the oldest? a. 18 b. 19 c. 15 d. 17 Assistant: <think>" -n 2048 -t 8 -e -ngl 99 --color -c 2500 --temp 0
 ```
 
 Part of outputs:
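
As a hedged illustration of the Model Download step above (not part of the changed file), the GGUF file could be fetched into a local directory with the `huggingface_hub` CLI, assuming it is installed and on the PATH:

```bash
# Sketch only: download the quantized model referenced above into the current directory.
# Assumes the huggingface_hub CLI is available (pip install huggingface_hub).
huggingface-cli download bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF \
  DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf --local-dir .
```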
@@ -182,7 +182,7 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 <answer>XXXX</answer> [end of text]
 ```
 
-### FlashMoE for DeeSeek V3/R1
+### FlashMoE for DeepSeek V3/R1
 
 FlashMoE is a command-line tool built on llama.cpp, optimized for mixture-of-experts (MoE) models such as DeepSeek V3/R1. Now, it's available for Linux platforms.
@@ -203,12 +203,12 @@ Note:
 - Larger models and other precisions may require more resources.
 - For 1 ARC A770 platform, please reduce context length (e.g., 1024) to avoid OOM. Add this option `-c 1024` at the end of below command.
 
-Before running, you should download or copy community GGUF model to your current directory. For instance, `DeepSeek-R1-Q4_K_M.gguf` of [DeepSeek-R1-Q4_K_M.gguf](https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-Q4_K_M).
+Before running, you should download or copy a community GGUF model to your local directory. For instance, `DeepSeek-R1-Q4_K_M.gguf` of [DeepSeek-R1-Q4_K_M.gguf](https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-Q4_K_M).
 
-Run `DeepSeek-R1-Q4_K_M.gguf`
+Change `/PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf` to your model path, then run `DeepSeek-R1-Q4_K_M.gguf`:
 
 ```bash
-flash-moe -m DeepSeek-R1-Q4_K_M-00001-of-00009.gguf --prompt "What's AI?"
+./flash-moe -m /PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf --prompt "What's AI?"
 ```
 
 Part of outputs
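
For the single Arc A770 case mentioned in the note above, a hedged sketch of the same invocation with the suggested `-c 1024` appended (the model path is still a placeholder) could look like:

```bash
# Sketch only: flash-moe run with context length reduced to 1024 to avoid OOM
# on a single Arc A770, as the note above suggests. Replace the placeholder path.
./flash-moe -m /PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf --prompt "What's AI?" -c 1024
```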
@@ -278,9 +278,11 @@ If you just want to use one of the devices, please set environment like ONEAPI_D
 If you want to use two or more deivces, please set environment like ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1"
 See https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Overview/KeyFeatures/multi_gpus_selection.md for details. Exiting.
 ```
-Because the GPUs are not the same, the jobs will be allocated according to device's memory. Upon example, the iGPU(Intel UHD Graphics 770) will get 2/3 of the computing tasks. The performance will be quit bad.  
-So disable the iGPU will can get the best performance. Visit [Multi-GPUs usage](#multi-gpus-usage) for details.  
-If you still want to disable this check, you can run `set SYCL_DEVICE_CHECK=0`.  
+Because the GPUs are not the same, the jobs will be allocated according to each device's memory. In the above example, the iGPU (Intel UHD Graphics 770) will get 2/3 of the computing tasks and the performance will be quite bad. So you have the following two choices:
+1. Disable the iGPU to get the best performance. Visit [Multi-GPUs usage](#multi-gpus-usage) for details.
+2. Disable this check and use all of the devices by running one of the commands below:
+   - `set SYCL_DEVICE_CHECK=0` (Windows users)
+   - `export SYCL_DEVICE_CHECK=0` (Linux users)
 
 ### Multi-GPUs usage
 
@@ -293,6 +295,7 @@ Found 3 SYCL devices:
 |--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
 | 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|  12.55|    512|    1024|   32| 16225M|     1.6.31907.700000|
 | 1| [level_zero:gpu:1]|                Intel Arc A770 Graphics|  12.55|    512|    1024|   32| 16225M|     1.6.31907.700000|
+| 2| [level_zero:gpu:2]|                 Intel UHD Graphics 770|   12.2|     32|     512|   32| 63218M|     1.6.31907.700000|
 ```
 
 To specify which Intel GPU you would like llama.cpp to use, you could set environment variable `ONEAPI_DEVICE_SELECTOR` **before starting llama.cpp command**, as follows:
@@ -302,9 +305,16 @@ To specify which Intel GPU you would like llama.cpp to use, you could set enviro
   set ONEAPI_DEVICE_SELECTOR=level_zero:0 (If you want to run on one GPU, llama.cpp will use the first GPU.)
   set ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1" (If you want to run on two GPUs, llama.cpp will use the first and second GPUs.)
   ```
+- For **Linux** users:
+  ```bash
+  export ONEAPI_DEVICE_SELECTOR=level_zero:0 (If you want to run on one GPU, llama.cpp will use the first GPU.)
+  export ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1" (If you want to run on two GPUs, llama.cpp will use the first and second GPUs.)
+  ```
 
 ### Performance Environment
 #### SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS
-To enable SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS, you can run  `set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1`.   
+To enable SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS, you can run one of the commands below:
+- `set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1` (Windows users)
+- `export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1` (Linux users)
 > [!NOTE]
 > The environment variable SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS determines the usage of immediate command lists for task submission to the GPU. While this mode typically enhances performance, exceptions may occur. Please consider experimenting with and without this environment variable for best performance. For more details, you can refer to [this article](https://www.intel.com/content/www/us/en/developer/articles/guide/level-zero-immediate-command-lists.html).
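
As a hedged end-to-end sketch (Linux, not part of the changed file), the device selection and performance variables above can be combined with the quickstart's `llama-cli` invocation; the model path and token count here are illustrative placeholders:

```bash
# Sketch only: select the two Arc GPUs, enable immediate command lists,
# then run llama-cli as in the Linux quickstart. Replace /PATH/TO/... with your model path.
export ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1"
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
./llama-cli -m /PATH/TO/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -p "What's AI?" -n 256 -ngl 99 -c 2500 --temp 0
```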