Create flashmoe quickstart (#13147)
parent da08c9ca60
commit 9da1c56fa8

3 changed files with 112 additions and 0 deletions

@@ -9,6 +9,7 @@
> - ***70+ models** have been optimized/verified on `ipex-llm` (e.g., Llama, Phi, Mistral, Mixtral, DeepSeek, Qwen, ChatGLM, MiniCPM, Qwen-VL, MiniCPM-V and more), with state-of-the-art **LLM optimizations**, **XPU acceleration** and **low-bit (FP8/FP6/FP4/INT4) support**; see the complete list [here](#verified-models).*

## Latest Update 🔥

- [2025/05] You can now run ***DeepSeek V3/R1 671B*** and ***Qwen3MoE 235B*** models with just 1 or 2 Intel Arc GPUs (such as A770 or B580) using [FlashMoE](docs/mddocs/Quickstart/flashmoe_quickstart.md) in `ipex-llm`.
- [2025/04] We released `ipex-llm 2.2.0`, which includes [Ollama Portable Zip](docs/mddocs/Quickstart/ollama_portable_zip_quickstart.md) and [llama.cpp Portable Zip](docs/mddocs/Quickstart/llamacpp_portable_zip_gpu_quickstart.md).
- [2025/04] We added support for [PyTorch 2.6](docs/mddocs/Quickstart/install_pytorch26_gpu.md) on Intel GPUs.
- [2025/03] We added support for the **Gemma3** model in the latest [llama.cpp Portable Zip](https://github.com/intel/ipex-llm/issues/12963#issuecomment-2724032898).

@@ -9,6 +9,7 @@

> - ***70+** models have been optimized and verified on `ipex-llm` (e.g., Llama, Phi, Mistral, Mixtral, DeepSeek, Qwen, ChatGLM, MiniCPM, Qwen-VL, MiniCPM-V, etc.), with state-of-the-art **LLM optimizations**, **XPU acceleration** and **low-bit (FP8/FP6/FP4/INT4) support**; for more model information, see [here](#模型验证).*

## Latest Update 🔥

- [2025/05] Using [FlashMoE](docs/mddocs/Quickstart/flashmoe_quickstart.md) in `ipex-llm`, you can now run ***DeepSeek V3/R1 671B*** and ***Qwen3MoE 235B*** models with just 1 or 2 Intel Arc GPUs (such as A770 or B580).
- [2025/04] Released `ipex-llm 2.2.0`, which includes [Ollama Portable Zip and llama.cpp Portable Zip](https://github.com/ipex-llm/ipex-llm/releases/tag/v2.2.0).
- [2025/04] Added support for [PyTorch 2.6](docs/mddocs/Quickstart/install_pytorch26_gpu.md) on Intel GPUs.
- [2025/03] The **Gemma3** model can now be run with the latest [llama.cpp Portable Zip](https://github.com/intel/ipex-llm/issues/12963#issuecomment-2724032898).

docs/mddocs/Quickstart/flashmoe_quickstart.md (new file, 110 additions)

@@ -0,0 +1,110 @@
# FlashMoE

The `FlashMoE` support in `ipex-llm` allows you to run ***DeepSeek V3/R1 671B*** and ***Qwen3MoE 235B*** models with just 1 or 2 Intel Arc GPUs.

## Install

### Prerequisites

Check your GPU driver version and update it if needed; we recommend following the [Intel client GPU driver installation guide](https://dgpu-docs.intel.com/driver/client/overview.html) to install your GPU driver.
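
As a quick sanity check before proceeding, you can confirm that the GPU and its driver are visible to the system. This is a minimal sketch using standard Linux tools; the exact package and tool names depend on your distribution and oneAPI setup:

```bash
# List Intel GPUs visible to the system
lspci | grep -i -E 'vga|display'

# If the oneAPI runtime is installed, sycl-ls should list your Arc GPU(s)
# as Level Zero / OpenCL devices
sycl-ls

# On Ubuntu, check the installed Intel GPU driver packages
dpkg -l | grep -i -E 'intel-level-zero-gpu|intel-opencl-icd'
```
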
### Download and Extract

1. Download the IPEX-LLM llama.cpp portable tgz for Linux from this [link](https://github.com/ipex-llm/ipex-llm/releases/tag/v2.3.0-nightly).

2. Extract the tgz file to a folder.

3. Open a "Terminal", and enter the extracted folder through `cd /PATH/TO/EXTRACTED/FOLDER`. A sketch of these steps is shown below.
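
For reference, the three steps above might look like the following in a terminal. This is only a sketch: the tgz file name below is illustrative, and the real asset name should be taken from the release page linked above:

```bash
# 1. Download the portable tgz from the release page (file name is illustrative)
wget https://github.com/ipex-llm/ipex-llm/releases/download/v2.3.0-nightly/llama-cpp-ipex-llm-xxx-linux.tgz

# 2. Extract it to a folder
tar -xzf llama-cpp-ipex-llm-xxx-linux.tgz

# 3. Enter the extracted folder
cd llama-cpp-ipex-llm-xxx-linux
```
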
> [!NOTE]
> Hardware Requirements:
> - 380GB CPU memory for ***DeepSeek V3/R1 671B*** INT4 model
> - 128GB CPU memory for ***Qwen3MoE 235B*** INT4 model
> - 1-8 Arc A770 or B580
> - 500GB disk space
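
Before downloading a model, you may want to confirm that the machine meets the requirements above. A minimal sketch using standard Linux tools (it assumes oneAPI's `sycl-ls` is available for listing GPUs):

```bash
# Available CPU memory
free -h

# Free disk space on the current filesystem
df -h .

# Rough count of GPUs exposed through Level Zero (assumes oneAPI is set up)
sycl-ls | grep -c "level_zero"
```
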
## Run

Before running, you should download or copy a community GGUF model to your local directory, for instance [DeepSeek-R1-Q4_K_M.gguf](https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-Q4_K_M).
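
One way to fetch the model linked above is with the Hugging Face CLI. This is a sketch under the assumption that `huggingface_hub` is installed; it downloads only the multi-part Q4_K_M GGUF files into a local folder:

```bash
pip install -U "huggingface_hub[cli]"

# Download only the Q4_K_M split GGUF files from the unsloth/DeepSeek-R1-GGUF repo
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
  --include "DeepSeek-R1-Q4_K_M/*" \
  --local-dir ./DeepSeek-R1-GGUF
```
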
Run `DeepSeek-R1-Q4_K_M.gguf` as shown below (change `/PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf` to your model path).

#### CLI

The CLI version of `flashmoe` is built on top of `llama.cpp`'s `llama-cli`:

```bash
./flash-moe -m /PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf --prompt "What's AI?" -no-cnv
```

Part of the output:
```bash
llama_kv_cache_init:      SYCL0 KV buffer size =  1280.00 MiB
llama_kv_cache_init:      SYCL1 KV buffer size =  1280.00 MiB
llama_kv_cache_init:      SYCL2 KV buffer size =  1280.00 MiB
llama_kv_cache_init:      SYCL3 KV buffer size =  1280.00 MiB
llama_kv_cache_init:      SYCL4 KV buffer size =  1120.00 MiB
llama_kv_cache_init:      SYCL5 KV buffer size =  1280.00 MiB
llama_kv_cache_init:      SYCL6 KV buffer size =  1280.00 MiB
llama_kv_cache_init:      SYCL7 KV buffer size =   960.00 MiB
llama_new_context_with_model: KV self size  = 9760.00 MiB, K (i8): 5856.00 MiB, V (i8): 3904.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.49 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
llama_new_context_with_model:      SYCL0 compute buffer size =  2076.02 MiB
llama_new_context_with_model:      SYCL1 compute buffer size =  2076.02 MiB
llama_new_context_with_model:      SYCL2 compute buffer size =  2076.02 MiB
llama_new_context_with_model:      SYCL3 compute buffer size =  2076.02 MiB
llama_new_context_with_model:      SYCL4 compute buffer size =  2076.02 MiB
llama_new_context_with_model:      SYCL5 compute buffer size =  2076.02 MiB
llama_new_context_with_model:      SYCL6 compute buffer size =  2076.02 MiB
llama_new_context_with_model:      SYCL7 compute buffer size =  3264.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =  1332.05 MiB
llama_new_context_with_model: graph nodes  = 5184 (with bs=4096), 4720 (with bs=1)
llama_new_context_with_model: graph splits = 125
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 48

system_info: n_threads = 48 (n_threads_batch = 48) / 192 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

sampler seed: 2052631435
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 4096, n_predict = -1, n_keep = 1

<think>
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
</think>

<answer>XXXX</answer> [end of text]
```
#### Serving

The serving version of `flashmoe` is built on top of the `llama.cpp` server:

```bash
./flash-moe -m /PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf --serve -n 512 -np 2 -c 4096
```

> `-n` is the number of tokens to predict, `-np` is the number of parallel sequences to decode, and `-c` is the size of the whole context; you can adjust these values based on your requirements.

Part of the output:
```bash
...
llama_init_from_model: graph nodes  = 3560
llama_init_from_model: graph splits = 121
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 2
slot         init: id  0 | task -1 | new slot n_ctx_slot = 2048
slot         init: id  1 | task -1 | new slot n_ctx_slot = 2048
main: model loaded
main: chat template, chat_template: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='', is_first_sp=true) %}{%- for message in messages %}{%- if message['role'] == 'system' %}{%- if ns.is_first_sp %}{% set ns.system_prompt = ns.system_prompt + message['content'] %}{% set ns.is_first_sp = false %}{%- else %}{% set ns.system_prompt = ns.system_prompt + '\n\n' + message['content'] %}{%- endif %}{%- endif %}{%- endfor %}{{ bos_token }}{{ ns.system_prompt }}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and 'tool_calls' in message %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls'] %}{%- if not ns.is_first %}{%- if message['content'] is none %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{%- else %}{{'<|Assistant|>' + message['content'] + '<|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{%- endif %}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{%- endif %}{%- endfor %}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- if message['role'] == 'assistant' and 'tool_calls' not in message %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|>'}}{% endif %}, example_format: 'You are a helpful assistant

<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle
```
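
Once the server reports that it is listening (as in the log above), you can send it requests. A minimal sketch using the llama.cpp server's OpenAI-compatible chat endpoint, assuming the `flash-moe` serving mode exposes the same HTTP API as the upstream `llama.cpp` server and listens on the default address `http://127.0.0.1:8080`:

```bash
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is AI?"}
    ],
    "max_tokens": 128
  }'
```
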
## Notes

- Larger models and higher precisions may require more resources.
- On a single Arc A770, please reduce the context length (e.g., to 1024) to avoid OOM by adding the option `-c 1024` to the CLI command.
- On a dual-socket Xeon system, consider enabling SNC (Sub-NUMA Clustering) in the BIOS and adding `numactl --interleave=all` before the launch command for *better decoding performance* (see the sketch after this list).
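
For example, the two launch-time adjustments above might look like this. This is only a sketch; apart from `-c 1024` and the `numactl --interleave=all` prefix, the paths and flags simply follow the CLI and serving examples earlier in this guide:

```bash
# Single Arc A770: cap the context length at 1024 to avoid OOM
./flash-moe -m /PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf --prompt "What's AI?" -no-cnv -c 1024

# Dual-socket Xeon with SNC enabled: interleave memory allocations across NUMA nodes
numactl --interleave=all ./flash-moe -m /PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf --serve -n 512 -np 2 -c 4096
```
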