Update Ollama portable zip QuickStart regarding saving VRAM (#13155)
* Update Ollama portable zip quickstart regarding saving VRAM
* Small fix
parent 086a8b3ab9
commit aa12f69bbf

2 changed files with 76 additions and 2 deletions
@@ -28,6 +28,7 @@ This guide demonstrates how to use [Ollama portable zip](https://github.com/ipex
  - [Increase context length in Ollama](#increase-context-length-in-ollama)
  - [Select specific GPU(s) to run Ollama when multiple ones are available](#select-specific-gpus-to-run-ollama-when-multiple-ones-are-available)
  - [Tune performance](#tune-performance)
  - [Save VRAM](#save-vram)
  - [Additional models supported after Ollama v0.6.2](#additional-models-supported-after-ollama-v062)
  - [Signature Verification](#signature-verification)
- [More details](ollama_quickstart.md)
@@ -138,6 +139,9 @@ For example, if you would like to run `deepseek-r1:7b` but the download speed fr
> ```
> Except for `ollama run` and `ollama pull`, the model should be identified by its actual id, e.g. `ollama rm modelscope.cn/unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF:Q4_K_M`

> [!NOTE]
> For versions earlier than `2.3.0b20250429`, please use `IPEX_LLM_MODEL_SOURCE` instead.

### Increase context length in Ollama

By default, Ollama runs models with a context window of 2048 tokens. That is, the model can "remember" at most 2048 tokens of context.
@@ -160,7 +164,7 @@ To increase the context length, you could set environment variable `OLLAMA_NUM_C
> `OLLAMA_NUM_CTX` has a higher priority than the `num_ctx` setting in a model's `Modelfile`.

> [!NOTE]
> For versions earlier than 2.7.0b20250429, please use the `IPEX_LLM_NUM_CTX` instead.
> For versions earlier than `2.3.0b20250429`, please use `IPEX_LLM_NUM_CTX` instead.

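For instance, a minimal Linux sketch of this step (the value `8192` is only an illustrative assumption; export the variable in the same shell before launching the serve script, mirroring the steps in the sections below):

```bash
# Illustrative value only: pick a context length that fits your GPU memory
export OLLAMA_NUM_CTX=8192
./start-ollama.sh
```
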
### Select specific GPU(s) to run Ollama when multiple ones are available

@@ -211,6 +215,39 @@ To enable `SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS`, set it **before start
> [!TIP]
> You could refer to [here](https://www.intel.com/content/www/us/en/developer/articles/guide/level-zero-immediate-command-lists.html) for more information about Level Zero Immediate Command Lists.

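As a minimal sketch on Linux (assuming, as is typical for SYCL environment variables, that a value of `1` enables the feature), export it before launching the serve script:

```bash
# Enable Level Zero immediate command lists before starting Ollama serve
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
./start-ollama.sh
```
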
### Save VRAM

To save VRAM, you could set the environment variable `OLLAMA_NUM_PARALLEL` to `1` **before starting Ollama serve**, as follows (if Ollama serve is already running, please make sure to stop it first):

- For **Windows** users:

  - Open "Command Prompt", and navigate to the extracted folder through `cd /d PATH\TO\EXTRACTED\FOLDER`
  - Run `set OLLAMA_NUM_PARALLEL=1` in "Command Prompt"
  - Start Ollama serve through `start-ollama.bat`

- For **Linux** users:

  - In a terminal, navigate to the extracted folder through `cd PATH/TO/EXTRACTED/FOLDER`
  - Run `export OLLAMA_NUM_PARALLEL=1` in the terminal
  - Start Ollama serve through `./start-ollama.sh`

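For example, the Linux steps above amount to the following shell session (the path is a placeholder for your actual extraction folder):

```bash
cd PATH/TO/EXTRACTED/FOLDER   # placeholder: your extracted portable zip folder
export OLLAMA_NUM_PARALLEL=1  # limit parallel requests to save VRAM
./start-ollama.sh             # start Ollama serve
```
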
For **MoE models** (such as `qwen3:30b`), you could save VRAM by moving the experts to CPU via the environment variable `OLLAMA_SET_OT`, as follows (if Ollama serve is already running, please make sure to stop it first):

- For **Windows** users:

  - Open "Command Prompt", and navigate to the extracted folder through `cd /d PATH\TO\EXTRACTED\FOLDER`
  - Run `set OLLAMA_SET_OT="exps=CPU"` in "Command Prompt" to put all experts on CPU; `OLLAMA_SET_OT` can also be set with a regular expression, such as `set OLLAMA_SET_OT="(2[4-9]|[3-9][0-9])\.ffn_.*_exps\.=CPU"` to put the experts of layers `24` to `99` on CPU
  - Start Ollama serve through `start-ollama.bat`

- For **Linux** users:

  - In a terminal, navigate to the extracted folder through `cd PATH/TO/EXTRACTED/FOLDER`
  - Run `export OLLAMA_SET_OT="exps=CPU"` in the terminal; `OLLAMA_SET_OT` can also be set with a regular expression, such as `export OLLAMA_SET_OT="(2[4-9]|[3-9][0-9])\.ffn_.*_exps\.=CPU"` to put the experts of layers `24` to `99` on CPU
  - Start Ollama serve through `./start-ollama.sh`

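Putting the Linux steps together, a minimal sketch (both variants come from the steps above; use one or the other):

```bash
cd PATH/TO/EXTRACTED/FOLDER                                      # placeholder: your extracted portable zip folder
export OLLAMA_SET_OT="exps=CPU"                                  # put all experts on CPU
# export OLLAMA_SET_OT="(2[4-9]|[3-9][0-9])\.ffn_.*_exps\.=CPU"  # or: only the experts of layers 24-99
./start-ollama.sh                                                # start Ollama serve
```
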
> [!NOTE]
> `OLLAMA_SET_OT` is only effective for version `2.3.0b20250429` and later.

### Additional models supported after Ollama v0.6.2

The current Ollama Portable Zip is based on Ollama v0.6.2; in addition, the following new models are also supported in the Ollama Portable Zip:

@@ -29,6 +29,7 @@
  - [Increase context length in Ollama](#在-ollama-中增加上下文长度)
  - [Select specific GPU(s) to run Ollama when multiple ones are available](#在多块-gpu-可用时选择特定的-gpu-来运行-ollama)
  - [Tune performance](#性能调优)
  - [Save VRAM](#节省-vram)
  - [Additional models supported after Ollama v0.6.2](#ollama-v062-之后新增模型支持)
  - [Signature Verification](#签名验证)
- [More information](ollama_quickstart.zh-CN.md)
@@ -136,6 +137,9 @@ By default, Ollama downloads models from the Ollama library. By setting, **before running Ollama**, the
> ```
> Except for `ollama run` and `ollama pull`, other operations should identify the model by its actual ID, e.g. `ollama rm modelscope.cn/unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF:Q4_K_M`

> [!NOTE]
> For versions earlier than `2.3.0b20250429`, please use the `IPEX_LLM_MODEL_SOURCE` variable instead.

### Increase context length in Ollama

By default, Ollama runs models with a context window of 2048 tokens. That is, the model can "remember" at most 2048 tokens of context.
@@ -158,7 +162,7 @@ By default, Ollama downloads models from the Ollama library. By setting, **before running Ollama**, the
> `OLLAMA_NUM_CTX` takes priority over the `num_ctx` setting in a model's `Modelfile`.

> [!NOTE]
> For versions earlier than 2.7.0b20250429, please use the IPEX_LLM_NUM_CTX variable instead.
> For versions earlier than `2.3.0b20250429`, please use the `IPEX_LLM_NUM_CTX` variable instead.

### Select specific GPU(s) to run Ollama when multiple ones are available

@@ -209,6 +213,39 @@ By default, Ollama downloads models from the Ollama library. By setting, **before running Ollama**, the
> [!TIP]
> Refer to [this document](https://www.intel.com/content/www/us/en/developer/articles/guide/level-zero-immediate-command-lists.html) for more information about Level Zero Immediate Command Lists.

### Save VRAM

You can save VRAM by setting the environment variable `OLLAMA_NUM_PARALLEL` to `1` **before starting Ollama serve**, as follows (if Ollama serve is already running, please make sure to stop it first):

- For **Windows** users:

  - Open "Command Prompt", and navigate to the extracted folder through `cd /d PATH\TO\EXTRACTED\FOLDER`
  - Run `set OLLAMA_NUM_PARALLEL=1` in "Command Prompt"
  - Start Ollama serve through `start-ollama.bat`

- For **Linux** users:

  - In a terminal, navigate to the extracted folder through `cd PATH/TO/EXTRACTED/FOLDER`
  - Run `export OLLAMA_NUM_PARALLEL=1` in the terminal
  - Start Ollama serve through `./start-ollama.sh`

For **MoE models** (such as `qwen3:30b`), you can save VRAM by moving the experts to CPU via the environment variable `OLLAMA_SET_OT`, set **before starting Ollama serve** (if Ollama serve is already running, please make sure to stop it first):

- For **Windows** users:

  - Open "Command Prompt", and navigate to the extracted folder through `cd /d PATH\TO\EXTRACTED\FOLDER`
  - Run `set OLLAMA_SET_OT="exps=CPU"` in "Command Prompt" to put all experts on CPU; `OLLAMA_SET_OT` can also be set with a regular expression, such as `set OLLAMA_SET_OT="(2[4-9]|[3-9][0-9])\.ffn_.*_exps\.=CPU"` to put the experts of layers `24` to `99` on CPU
  - Start Ollama serve through `start-ollama.bat`

- For **Linux** users:

  - In a terminal, navigate to the extracted folder through `cd PATH/TO/EXTRACTED/FOLDER`
  - Run `export OLLAMA_SET_OT="exps=CPU"` in the terminal to put all experts on CPU; `OLLAMA_SET_OT` can also be set with a regular expression, such as `export OLLAMA_SET_OT="(2[4-9]|[3-9][0-9])\.ffn_.*_exps\.=CPU"` to put the experts of layers `24` to `99` on CPU
  - Start Ollama serve through `./start-ollama.sh`

> [!NOTE]
> `OLLAMA_SET_OT` only takes effect for version `2.3.0b20250429` and later.

### Additional models supported after Ollama v0.6.2

The current Ollama Portable Zip is based on Ollama v0.6.2; in addition, the following new models are also supported in the Ollama Portable Zip: