Create flashmoe quickstart (#13147)

parent da08c9ca60
commit 9da1c56fa8

3 changed files with 112 additions and 0 deletions

@@ -9,6 +9,7 @@

> - ***70+ models** have been optimized/verified on `ipex-llm` (e.g., Llama, Phi, Mistral, Mixtral, DeepSeek, Qwen, ChatGLM, MiniCPM, Qwen-VL, MiniCPM-V and more), with state-of-the-art **LLM optimizations**, **XPU acceleration** and **low-bit (FP8/FP6/FP4/INT4) support**; see the complete list [here](#verified-models).*

## Latest Update 🔥

- [2025/05] You can now run ***DeepSeek V3/R1 671B*** and ***Qwen3MoE 235B*** models with just 1 or 2 Intel Arc GPUs (such as A770 or B580) using [FlashMoE](docs/mddocs/Quickstart/flashmoe_quickstart.md) in `ipex-llm`.
- [2025/04] We released `ipex-llm 2.2.0`, which includes [Ollama Portable Zip](docs/mddocs/Quickstart/ollama_portable_zip_quickstart.md) and [llama.cpp Portable Zip](docs/mddocs/Quickstart/llamacpp_portable_zip_gpu_quickstart.md).
- [2025/04] We added support for [PyTorch 2.6](docs/mddocs/Quickstart/install_pytorch26_gpu.md) on Intel GPU.
- [2025/03] We added support for the **Gemma3** model in the latest [llama.cpp Portable Zip](https://github.com/intel/ipex-llm/issues/12963#issuecomment-2724032898).

@@ -9,6 +9,7 @@

> - ***70+ models** have been optimized and verified on `ipex-llm` (e.g., Llama, Phi, Mistral, Mixtral, DeepSeek, Qwen, ChatGLM, MiniCPM, Qwen-VL, MiniCPM-V and more), with state-of-the-art **LLM optimizations**, **XPU acceleration** and **low-bit (FP8/FP6/FP4/INT4) support**; see [here](#模型验证) for more model information.*

## Latest Update 🔥

- [2025/05] You can now run ***DeepSeek V3/R1 671B*** and ***Qwen3MoE 235B*** models with 1 or 2 Intel Arc GPUs (such as A770 or B580) using [FlashMoE](docs/mddocs/Quickstart/flashmoe_quickstart.md) in `ipex-llm`.
- [2025/04] We released `ipex-llm 2.2.0`, which includes [Ollama Portable Zip and llama.cpp Portable Zip](https://github.com/ipex-llm/ipex-llm/releases/tag/v2.2.0).
- [2025/04] We added support for [PyTorch 2.6](docs/mddocs/Quickstart/install_pytorch26_gpu.md) on Intel GPU.
- [2025/03] The **Gemma3** model can now be run with the latest [llama.cpp Portable Zip](https://github.com/intel/ipex-llm/issues/12963#issuecomment-2724032898).

110 docs/mddocs/Quickstart/flashmoe_quickstart.md Normal file

@@ -0,0 +1,110 @@

# FlashMoE

The `FlashMoE` support in `ipex-llm` allows you to run ***DeepSeek V3/R1 671B*** and ***Qwen3MoE 235B*** models with just 1 or 2 Intel Arc GPUs.

## Install

### Prerequisites

Check your GPU driver version and update it if needed; we recommend following the [Intel client GPU driver installation guide](https://dgpu-docs.intel.com/driver/client/overview.html) to install your GPU driver.
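
If you are not sure which driver version is currently in use, one quick way to check on Linux is through the OpenCL runtime info (a minimal sketch, assuming `clinfo` is installed; `sycl-ls` or `xpu-smi` can serve the same purpose):

```bash
# Print the driver version reported for the installed GPUs by the OpenCL runtime
clinfo | grep -i "driver version"
```
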
### Download and Extract

1. Download the IPEX-LLM llama.cpp portable tgz for Linux from this [link](https://github.com/ipex-llm/ipex-llm/releases/tag/v2.3.0-nightly).

2. Extract the tgz file to a folder.

3. Open a terminal and enter the extracted folder through `cd /PATH/TO/EXTRACTED/FOLDER` (see the sketch below).
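
A minimal sketch of steps 2 and 3 (the tgz file name below is a placeholder; substitute the actual asset name you downloaded):

```bash
mkdir -p ~/flashmoe                                            # target folder (any path works)
tar -xzf ipex-llm-llamacpp-portable-linux.tgz -C ~/flashmoe    # placeholder file name
cd ~/flashmoe
```
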
> [!NOTE]
> Hardware requirements:
> - 380GB of CPU memory for the ***DeepSeek V3/R1 671B*** INT4 model
> - 128GB of CPU memory for the ***Qwen3MoE 235B*** INT4 model
> - 1-8 Arc A770 or B580 GPUs
> - 500GB of disk space
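
To sanity-check these requirements before launching, the standard Linux utilities below are sufficient (they only report what is available):

```bash
free -h                           # total/available CPU memory
df -h .                           # free disk space on the current filesystem
lspci | grep -iE 'vga|display'    # installed GPUs
```
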
## Run

Before running, download or copy a community GGUF model to your local directory, for instance [DeepSeek-R1-Q4_K_M.gguf](https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-Q4_K_M).
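
One possible way to fetch the sharded GGUF files is the Hugging Face CLI (a sketch, assuming `huggingface_hub` is installed; the local directory is illustrative):

```bash
pip install -U "huggingface_hub[cli]"
# Download only the Q4_K_M shards from the unsloth repository
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
  --include "DeepSeek-R1-Q4_K_M/*" \
  --local-dir /PATH/TO/MODELS
```
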
Run `DeepSeek-R1-Q4_K_M.gguf` as shown below (change `/PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf` to your model path).

#### CLI

The CLI version of `flashmoe` is built on top of `llama.cpp`'s `llama-cli`:

```bash
./flash-moe -m /PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf --prompt "What's AI?" -no-cnv
```

Part of the output:

```bash
llama_kv_cache_init: SYCL0 KV buffer size = 1280.00 MiB
llama_kv_cache_init: SYCL1 KV buffer size = 1280.00 MiB
llama_kv_cache_init: SYCL2 KV buffer size = 1280.00 MiB
llama_kv_cache_init: SYCL3 KV buffer size = 1280.00 MiB
llama_kv_cache_init: SYCL4 KV buffer size = 1120.00 MiB
llama_kv_cache_init: SYCL5 KV buffer size = 1280.00 MiB
llama_kv_cache_init: SYCL6 KV buffer size = 1280.00 MiB
llama_kv_cache_init: SYCL7 KV buffer size = 960.00 MiB
llama_new_context_with_model: KV self size = 9760.00 MiB, K (i8): 5856.00 MiB, V (i8): 3904.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0.49 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
llama_new_context_with_model: SYCL0 compute buffer size = 2076.02 MiB
llama_new_context_with_model: SYCL1 compute buffer size = 2076.02 MiB
llama_new_context_with_model: SYCL2 compute buffer size = 2076.02 MiB
llama_new_context_with_model: SYCL3 compute buffer size = 2076.02 MiB
llama_new_context_with_model: SYCL4 compute buffer size = 2076.02 MiB
llama_new_context_with_model: SYCL5 compute buffer size = 2076.02 MiB
llama_new_context_with_model: SYCL6 compute buffer size = 2076.02 MiB
llama_new_context_with_model: SYCL7 compute buffer size = 3264.00 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 1332.05 MiB
llama_new_context_with_model: graph nodes = 5184 (with bs=4096), 4720 (with bs=1)
llama_new_context_with_model: graph splits = 125
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 48

system_info: n_threads = 48 (n_threads_batch = 48) / 192 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

sampler seed: 2052631435
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 4096, n_predict = -1, n_keep = 1

<think>
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
</think>

<answer>XXXX</answer> [end of text]
```

#### Serving

The serving version of `flashmoe` is built on top of the `llama.cpp` server:

```bash
./flash-moe -m /PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf --serve -n 512 -np 2 -c 4096
```

> `-n` sets the number of tokens to predict, `-np` the number of parallel sequences to decode, and `-c` the size of the whole context; adjust these values based on your requirements.

Part of the output:

```bash
...
llama_init_from_model: graph nodes = 3560
llama_init_from_model: graph splits = 121
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 2
slot init: id 0 | task -1 | new slot n_ctx_slot = 2048
slot init: id 1 | task -1 | new slot n_ctx_slot = 2048
main: model loaded
main: chat template, chat_template: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='', is_first_sp=true) %}{%- for message in messages %}{%- if message['role'] == 'system' %}{%- if ns.is_first_sp %}{% set ns.system_prompt = ns.system_prompt + message['content'] %}{% set ns.is_first_sp = false %}{%- else %}{% set ns.system_prompt = ns.system_prompt + '\n\n' + message['content'] %}{%- endif %}{%- endif %}{%- endfor %}{{ bos_token }}{{ ns.system_prompt }}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and 'tool_calls' in message %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls'] %}{%- if not ns.is_first %}{%- if message['content'] is none %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{%- else %}{{'<|Assistant|>' + message['content'] + '<|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{%- endif %}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{%- endif %}{%- endfor %}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- if message['role'] == 'assistant' and 'tool_calls' not in message %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|>'}}{% endif %}, example_format: 'You are a helpful assistant

<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
```

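Once the server reports that it is listening, you can send it requests over HTTP. A minimal sketch using `curl`, assuming the standard llama.cpp server endpoints are exposed unchanged on the default port:

```bash
# Ask the served model a question via the native llama.cpp completion endpoint
curl http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is AI?", "n_predict": 128}'
```
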
## Notes

- Larger models and higher precisions may require more resources.
- On a single Arc A770, reduce the context length (e.g., to 1024) to avoid OOM by adding the option `-c 1024` to the CLI command.
- On a dual-socket Xeon system, consider enabling SNC (Sub-NUMA Clustering) in the BIOS and adding `numactl --interleave=all` before the launch command for *better decoding performance*, as shown below.
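
For example, a sketch of the dual-socket launch with interleaved memory allocation (same model path placeholder as above):

```bash
# Interleave memory allocations across NUMA nodes before starting flash-moe
numactl --interleave=all ./flash-moe -m /PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf --serve -n 512 -np 2 -c 4096
```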