Update Ollama portable zip QuickStart regarding saving VRAM (#13155)

* Update Ollama portable zip quickstart regarding saving VRAM

* Small fix
Yuwen Hu 2025-05-13 13:25:22 +08:00 committed by GitHub
parent 086a8b3ab9
commit aa12f69bbf
2 changed files with 76 additions and 2 deletions


@@ -28,6 +28,7 @@ This guide demonstrates how to use [Ollama portable zip](https://github.com/ipex
- [Increase context length in Ollama](#increase-context-length-in-ollama)
- [Select specific GPU(s) to run Ollama when multiple ones are available](#select-specific-gpus-to-run-ollama-when-multiple-ones-are-available)
- [Tune performance](#tune-performance)
- [Save VRAM](#save-vram)
- [Additional models supported after Ollama v0.6.2](#additional-models-supported-after-ollama-v062)
- [Signature Verification](#signature-verification)
- [More details](ollama_quickstart.md)
@@ -138,6 +139,9 @@ For example, if you would like to run `deepseek-r1:7b` but the download speed fr
> ```
> For commands other than `ollama run` and `ollama pull`, the model should be identified by its actual ID, e.g. `ollama rm modelscope.cn/unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF:Q4_K_M`
> [!NOTE]
> For versions earlier than `2.3.0b20250429`, please use `IPEX_LLM_MODEL_SOURCE` instead.
### Increase context length in Ollama
By default, Ollama runs models with a context window of 2048 tokens. That is, the model can "remember" at most 2048 tokens of context.
@@ -160,7 +164,7 @@ To increase the context length, you could set environment variable `OLLAMA_NUM_C
> `OLLAMA_NUM_CTX` has a higher priority than the `num_ctx` setting in a model's `Modelfile`.
> [!NOTE]
> For versions earlier than 2.7.0b20250429, please use the `IPEX_LLM_NUM_CTX` instead.
> For versions earlier than `2.3.0b20250429`, please use `IPEX_LLM_NUM_CTX` instead.
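For reference, here is a minimal Linux sketch of applying the context-length setting before starting Ollama serve, assuming the same start-up flow as the other settings in this guide; the value `8192` is only illustrative:

```bash
# Navigate to the extracted Ollama portable zip folder (placeholder path).
cd PATH/TO/EXTRACTED/FOLDER

# Illustrative value: raise the context window from the default 2048 tokens.
export OLLAMA_NUM_CTX=8192

# Start Ollama serve with the setting applied.
./start-ollama.sh
```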
### Select specific GPU(s) to run Ollama when multiple ones are available
@@ -211,6 +215,39 @@ To enable `SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS`, set it **before start
> [!TIP]
> You could refer to [here](https://www.intel.com/content/www/us/en/developer/articles/guide/level-zero-immediate-command-lists.html) regarding more information about Level Zero Immediate Command Lists.
### Save VRAM
To save VRAM, you could set the environment variable `OLLAMA_NUM_PARALLEL` to `1` **before starting Ollama serve**, as follows (if Ollama serve is already running, please make sure to stop it first); a consolidated Linux sketch follows the steps below:
- For **Windows** users:
  - Open "Command Prompt", and navigate to the extracted folder through `cd /d PATH\TO\EXTRACTED\FOLDER`
  - Run `set OLLAMA_NUM_PARALLEL=1` in "Command Prompt"
  - Start Ollama serve through `start-ollama.bat`
- For **Linux** users:
  - In a terminal, navigate to the extracted folder through `cd PATH/TO/EXTRACTED/FOLDER`
  - Run `export OLLAMA_NUM_PARALLEL=1` in the terminal
  - Start Ollama serve through `./start-ollama.sh`
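Putting the Linux steps above together, a minimal end-to-end sketch looks like the following (the folder path is a placeholder for wherever you extracted the portable zip):

```bash
# Navigate to the extracted Ollama portable zip folder (placeholder path).
cd PATH/TO/EXTRACTED/FOLDER

# Limit Ollama to a single parallel request to reduce VRAM usage.
export OLLAMA_NUM_PARALLEL=1

# Start Ollama serve with the setting applied.
./start-ollama.sh
```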
For **MoE models** (such as `qwen3:30b`), you could save VRAM by moving the experts to the CPU through the environment variable `OLLAMA_SET_OT`, set as follows (if Ollama serve is already running, please make sure to stop it first); a worked example follows the steps below:
- For **Windows** users:
  - Open "Command Prompt", and navigate to the extracted folder through `cd /d PATH\TO\EXTRACTED\FOLDER`
  - Run `set OLLAMA_SET_OT="exps=CPU"` in "Command Prompt" to put all experts on the CPU; `OLLAMA_SET_OT` can also be set with a regular expression, e.g. `set OLLAMA_SET_OT="(2[4-9]|[3-9][0-9])\.ffn_.*_exps\.=CPU"` to put the experts in layers `24` to `99` on the CPU
  - Start Ollama serve through `start-ollama.bat`
- For **Linux** users:
  - In a terminal, navigate to the extracted folder through `cd PATH/TO/EXTRACTED/FOLDER`
  - Run `export OLLAMA_SET_OT="exps=CPU"` in the terminal to put all experts on the CPU; `OLLAMA_SET_OT` can also be set with a regular expression, e.g. `export OLLAMA_SET_OT="(2[4-9]|[3-9][0-9])\.ffn_.*_exps\.=CPU"` to put the experts in layers `24` to `99` on the CPU
  - Start Ollama serve through `./start-ollama.sh`
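As a worked Linux example of the MoE setting above, the sketch below offloads either all experts, or only those in layers 24 to 99, to the CPU; use one of the two exports, not both:

```bash
# Navigate to the extracted Ollama portable zip folder (placeholder path).
cd PATH/TO/EXTRACTED/FOLDER

# Offload every expert tensor to the CPU.
export OLLAMA_SET_OT="exps=CPU"

# Alternatively, offload only the experts in layers 24 to 99
# (the regular expression matches layer numbers 24-29 and 30-99).
# export OLLAMA_SET_OT="(2[4-9]|[3-9][0-9])\.ffn_.*_exps\.=CPU"

# Start Ollama serve with the setting applied.
./start-ollama.sh
```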
> [!NOTE]
> `OLLAMA_SET_OT` is only effective for version `2.3.0b20250429` and later.
### Additional models supported after Ollama v0.6.2
The current Ollama Portable Zip is based on Ollama v0.6.2; in addition, the following new models are also supported in the Ollama Portable Zip:


@@ -29,6 +29,7 @@
- [Increase context length in Ollama](#在-ollama-中增加上下文长度)
- [Select specific GPU(s) to run Ollama when multiple ones are available](#在多块-gpu-可用时选择特定的-gpu-来运行-ollama)
- [Tune performance](#性能调优)
- [Save VRAM](#节省-vram)
- [Additional models supported after Ollama v0.6.2](#ollama-v062-之后新增模型支持)
- [Signature Verification](#签名验证)
- [More details](ollama_quickstart.zh-CN.md)
@@ -136,6 +137,9 @@ By default, Ollama downloads models from the Ollama library. By setting **before running Ollama** the
> ```
> For commands other than `ollama run` and `ollama pull`, the model should be identified by its actual ID, e.g. `ollama rm modelscope.cn/unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF:Q4_K_M`
> [!NOTE]
> For versions earlier than `2.3.0b20250429`, please use the `IPEX_LLM_MODEL_SOURCE` variable instead.
### Increase context length in Ollama
By default, Ollama runs models with a context window of 2048 tokens. That is, the model can "remember" at most 2048 tokens of context.
@@ -158,7 +162,7 @@ By default, Ollama downloads models from the Ollama library. By setting **before running Ollama** the
> `OLLAMA_NUM_CTX` has a higher priority than the `num_ctx` setting in a model's `Modelfile`.
> [!NOTE]
> For versions earlier than 2.7.0b20250429, please use the IPEX_LLM_NUM_CTX variable instead.
> For versions earlier than `2.3.0b20250429`, please use the `IPEX_LLM_NUM_CTX` variable instead.
### Select specific GPU(s) to run Ollama when multiple ones are available
@@ -209,6 +213,39 @@ By default, Ollama downloads models from the Ollama library. By setting **before running Ollama** the
> [!TIP]
> You could refer to [here](https://www.intel.com/content/www/us/en/developer/articles/guide/level-zero-immediate-command-lists.html) for more information about Level Zero Immediate Command Lists.
### Save VRAM
To save VRAM, you could set the environment variable `OLLAMA_NUM_PARALLEL` to `1` **before starting Ollama serve**, as follows (if Ollama serve is already running, please make sure to stop it first):
- For **Windows** users:
  - Open "Command Prompt", and navigate to the extracted folder through `cd /d PATH\TO\EXTRACTED\FOLDER`
  - Run `set OLLAMA_NUM_PARALLEL=1` in "Command Prompt"
  - Start Ollama serve through `start-ollama.bat`
- For **Linux** users:
  - In a terminal, navigate to the extracted folder through `cd PATH/TO/EXTRACTED/FOLDER`
  - Run `export OLLAMA_NUM_PARALLEL=1` in the terminal
  - Start Ollama serve through `./start-ollama.sh`
For **MoE models** (such as `qwen3:30b`), you could save VRAM by moving the experts to the CPU through the environment variable `OLLAMA_SET_OT`, set **before starting Ollama serve** as follows (if Ollama serve is already running, please make sure to stop it first); a combined sketch follows the note after these steps:
- For **Windows** users:
  - Open "Command Prompt", and navigate to the extracted folder through `cd /d PATH\TO\EXTRACTED\FOLDER`
  - Run `set OLLAMA_SET_OT="exps=CPU"` in "Command Prompt" to put all experts on the CPU; `OLLAMA_SET_OT` can also be set with a regular expression, e.g. `set OLLAMA_SET_OT="(2[4-9]|[3-9][0-9])\.ffn_.*_exps\.=CPU"` to put the experts in layers `24` to `99` on the CPU
  - Start Ollama serve through `start-ollama.bat`
- For **Linux** users:
  - In a terminal, navigate to the extracted folder through `cd PATH/TO/EXTRACTED/FOLDER`
  - Run `export OLLAMA_SET_OT="exps=CPU"` in the terminal to put all experts on the CPU; `OLLAMA_SET_OT` can also be set with a regular expression, e.g. `export OLLAMA_SET_OT="(2[4-9]|[3-9][0-9])\.ffn_.*_exps\.=CPU"` to put the experts in layers `24` to `99` on the CPU
  - Start Ollama serve through `./start-ollama.sh`
> [!NOTE]
> `OLLAMA_SET_OT` only takes effect for versions `2.3.0b20250429` and later.
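If you want to combine the two VRAM-saving settings above, a hedged Linux sketch might look as follows; this assumes the two variables can be set together, so adjust or drop either export as needed:

```bash
# Navigate to the extracted Ollama portable zip folder (placeholder path).
cd PATH/TO/EXTRACTED/FOLDER

# Serve one request at a time to reduce VRAM usage.
export OLLAMA_NUM_PARALLEL=1

# For MoE models, additionally offload all experts to the CPU
# (assumption: this can be combined with OLLAMA_NUM_PARALLEL).
export OLLAMA_SET_OT="exps=CPU"

# Start Ollama serve with both settings applied.
./start-ollama.sh
```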
### Additional models supported after Ollama v0.6.2
The current Ollama Portable Zip is based on Ollama v0.6.2; in addition, the following new models are also supported in the Ollama Portable Zip: