From aa12f69bbf232a58a9fb4c51fd46beb86ae20f9d Mon Sep 17 00:00:00 2001 From: Yuwen Hu <54161268+Oscilloscope98@users.noreply.github.com> Date: Tue, 13 May 2025 13:25:22 +0800 Subject: [PATCH] Update Ollama portable zip QuickStart regarding saving VRAM (#13155) * Update Ollama portable zip quickstart regarding saving VRAM * Small fix --- .../ollama_portable_zip_quickstart.md | 39 ++++++++++++++++++- .../ollama_portable_zip_quickstart.zh-CN.md | 39 ++++++++++++++++++- 2 files changed, 76 insertions(+), 2 deletions(-) diff --git a/docs/mddocs/Quickstart/ollama_portable_zip_quickstart.md b/docs/mddocs/Quickstart/ollama_portable_zip_quickstart.md index 5de4607e..f6803276 100644 --- a/docs/mddocs/Quickstart/ollama_portable_zip_quickstart.md +++ b/docs/mddocs/Quickstart/ollama_portable_zip_quickstart.md @@ -28,6 +28,7 @@ This guide demonstrates how to use [Ollama portable zip](https://github.com/ipex - [Increase context length in Ollama](#increase-context-length-in-ollama) - [Select specific GPU(s) to run Ollama when multiple ones are available](#select-specific-gpus-to-run-ollama-when-multiple-ones-are-available) - [Tune performance](#tune-performance) + - [Save VRAM](#save-vram) - [Additional models supported after Ollama v0.6.2](#additional-models-supported-after-ollama-v062) - [Signature Verification](#signature-verification) - [More details](ollama_quickstart.md) @@ -138,6 +139,9 @@ For example, if you would like to run `deepseek-r1:7b` but the download speed fr > ``` > Except for `ollama run` and `ollama pull`, the model should be identified through its actual id, e.g. `ollama rm modelscope.cn/unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF:Q4_K_M` +> [!NOTE] +> For versions earlier than `2.3.0b20250429`, please use `IPEX_LLM_MODEL_SOURCE` instead. + ### Increase context length in Ollama By default, Ollama runs model with a context window of 2048 tokens. That is, the model can "remember" at most 2048 tokens of context. @@ -160,7 +164,7 @@ To increase the context length, you could set environment variable `OLLAMA_NUM_C > `OLLAMA_NUM_CTX` has a higher priority than the `num_ctx` settings in a models' `Modelfile`. > [!NOTE] -> For versions earlier than 2.7.0b20250429, please use the `IPEX_LLM_NUM_CTX` instead. +> For versions earlier than `2.3.0b20250429`, please use `IPEX_LLM_NUM_CTX` instead. ### Select specific GPU(s) to run Ollama when multiple ones are available @@ -211,6 +215,39 @@ To enable `SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS`, set it **before start > [!TIP] > You could refer to [here](https://www.intel.com/content/www/us/en/developer/articles/guide/level-zero-immediate-command-lists.html) regarding more information about Level Zero Immediate Command Lists. 
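+As a condensed sketch for **Linux** users (assuming the same extracted-folder path as in the steps above, and that a value of `1` enables the flag, as is the usual convention for it), the setting could be applied like this:
+
+```bash
+# Sketch only: enable Level Zero Immediate Command Lists before starting Ollama serve.
+# Adjust the path to wherever the portable zip was extracted.
+cd PATH/TO/EXTRACTED/FOLDER
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+./start-ollama.sh
+```
+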
+### Save VRAM
+
+To save VRAM, you could set the environment variable `OLLAMA_NUM_PARALLEL` to `1` **before starting Ollama Serve**, as follows (if Ollama serve is already running, please make sure to stop it first):
+
+- For **Windows** users:
+
+  - Open "Command Prompt", and navigate to the extracted folder through `cd /d PATH\TO\EXTRACTED\FOLDER`
+  - Run `set OLLAMA_NUM_PARALLEL=1` in "Command Prompt"
+  - Start Ollama serve through `start-ollama.bat`
+
+- For **Linux** users:
+
+  - In a terminal, navigate to the extracted folder through `cd PATH/TO/EXTRACTED/FOLDER`
+  - Run `export OLLAMA_NUM_PARALLEL=1` in the terminal
+  - Start Ollama serve through `./start-ollama.sh`
+
+For **MoE models** (such as `qwen3:30b`), you could save VRAM by offloading experts to the CPU through the environment variable `OLLAMA_SET_OT`, as follows (if Ollama serve is already running, please make sure to stop it first):
+
+- For **Windows** users:
+
+  - Open "Command Prompt", and navigate to the extracted folder through `cd /d PATH\TO\EXTRACTED\FOLDER`
+  - Run `set OLLAMA_SET_OT="exps=CPU"` in "Command Prompt" to put all experts on the CPU; `OLLAMA_SET_OT` also accepts a regular expression, e.g. `set OLLAMA_SET_OT="(2[4-9]|[3-9][0-9])\.ffn_.*_exps\.=CPU"` to put the experts of layers `24` to `99` on the CPU
+  - Start Ollama serve through `start-ollama.bat`
+
+- For **Linux** users:
+
+  - In a terminal, navigate to the extracted folder through `cd PATH/TO/EXTRACTED/FOLDER`
+  - Run `export OLLAMA_SET_OT="exps=CPU"` in the terminal to put all experts on the CPU; `OLLAMA_SET_OT` also accepts a regular expression, e.g. `export OLLAMA_SET_OT="(2[4-9]|[3-9][0-9])\.ffn_.*_exps\.=CPU"` to put the experts of layers `24` to `99` on the CPU
+  - Start Ollama serve through `./start-ollama.sh`
+
+> [!NOTE]
+> `OLLAMA_SET_OT` is only effective for version `2.3.0b20250429` and later.
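+
+A combined sketch for **Linux** users (assuming the same extracted-folder path as above, and version `2.3.0b20250429` or later so that `OLLAMA_SET_OT` takes effect), applying both settings in one terminal session before starting Ollama serve:
+
+```bash
+# Sketch only: reduce VRAM usage before starting Ollama serve; adjust the path as needed.
+cd PATH/TO/EXTRACTED/FOLDER
+export OLLAMA_NUM_PARALLEL=1        # handle one request at a time to lower VRAM usage
+export OLLAMA_SET_OT="exps=CPU"     # MoE models only: keep all experts on the CPU
+./start-ollama.sh
+```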
+ ### Additional models supported after Ollama v0.6.2 The currently Ollama Portable Zip is based on Ollama v0.6.2; in addition, the following new models have also been supported in the Ollama Portable Zip: diff --git a/docs/mddocs/Quickstart/ollama_portable_zip_quickstart.zh-CN.md b/docs/mddocs/Quickstart/ollama_portable_zip_quickstart.zh-CN.md index 334946f6..a46cd566 100644 --- a/docs/mddocs/Quickstart/ollama_portable_zip_quickstart.zh-CN.md +++ b/docs/mddocs/Quickstart/ollama_portable_zip_quickstart.zh-CN.md @@ -29,6 +29,7 @@ - [在 Ollama 中增加上下文长度](#在-ollama-中增加上下文长度) - [在多块 GPU 可用时选择特定的 GPU 来运行 Ollama](#在多块-gpu-可用时选择特定的-gpu-来运行-ollama) - [性能调优](#性能调优) + - [节省 VRAM](#节省-vram) - [Ollama v0.6.2 之后新增模型支持](#ollama-v062-之后新增模型支持) - [签名验证](#签名验证) - [更多信息](ollama_quickstart.zh-CN.md) @@ -136,6 +137,9 @@ Ollama 默认从 Ollama 库下载模型。通过在**运行 Ollama 之前**设 > ``` > 除了 `ollama run` 和 `ollama pull`,其他操作中模型应通过其实际 ID 进行识别,例如: `ollama rm modelscope.cn/unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF:Q4_K_M` +> [!NOTE] +> 对早于 `2.3.0b20250429` 的版本,请改用 `IPEX_LLM_MODEL_SOURCE` 变量。 + ### 在 Ollama 中增加上下文长度 默认情况下,Ollama 使用 2048 个 token 的上下文窗口运行模型。也就是说,模型最多能 “记住” 2048 个 token 的上下文。 @@ -158,7 +162,7 @@ Ollama 默认从 Ollama 库下载模型。通过在**运行 Ollama 之前**设 > `OLLAMA_NUM_CTX` 的优先级高于模型 `Modelfile` 中设置的 `num_ctx`。 > [!NOTE] -> 对早于 2.7.0b20250429 的版本,请改用 IPEX_LLM_NUM_CTX 变量。 +> 对早于 `2.3.0b20250429` 的版本,请改用 `IPEX_LLM_NUM_CTX` 变量。 ### 在多块 GPU 可用时选择特定的 GPU 来运行 Ollama @@ -209,6 +213,39 @@ Ollama 默认从 Ollama 库下载模型。通过在**运行 Ollama 之前**设 > [!TIP] > 参考[此处文档](https://www.intel.com/content/www/us/en/developer/articles/guide/level-zero-immediate-command-lists.html)以获取更多 Level Zero Immediate Command Lists 相关信息。 +### 节省 VRAM + +你可以通过在**启动 Ollama serve 之前**设置环境变量 `OLLAMA_NUM_PARALLEL` 为 `1` 来节约显存,步骤如下(如果 Ollama serve 已经在运行,请确保先将其停止): + +- 对于 **Windows** 用户: + + - 打开命令提示符,并通过 `cd /d PATH\TO\EXTRACTED\FOLDER` 命令进入解压后的文件夹 + - 在命令提示符中设置 `set OLLAMA_NUM_PARALLEL=1` + - 通过运行 `start-ollama.bat` 启动 Ollama serve + +- 对于 **Linux** 用户: + + - 在终端中输入指令 `cd PATH/TO/EXTRACTED/FOLDER` 进入解压后的文件夹 + - 在终端中设置 `export OLLAMA_NUM_PARALLEL=1` + - 通过运行 `./start-ollama.sh` 启动 Ollama serve + +对于 **MoE 模型**(比如 `qwen3:30b`),你可以通过在**启动 Ollama serve 之前**设置环境变量 `OLLAMA_SET_OT` 把 experts 移到 CPU 运行上来节约显存(如果 Ollama serve 已经在运行,请确保先将其停止): + +- 对于 **Windows** 用户: + + - 打开命令提示符,并通过 `cd /d PATH\TO\EXTRACTED\FOLDER` 命令进入解压后的文件夹 + - 在命令提示符中设置 `set OLLAMA_SET_OT="exps=CPU"` 把所有的 experts 放在 CPU 上;也可以通过设置正则表达式,如 `set OLLAMA_SET_OT="(2[4-9]|[3-9][0-9])\.ffn_.*_exps\.=CPU"` 把 `24` 到 `99` 层的 experts 放到 CPU 上 + - 通过运行 `start-ollama.bat` 启动 Ollama serve + +- 对于 **Linux** 用户: + + - 在终端中输入指令 `cd PATH/TO/EXTRACTED/FOLDER` 进入解压后的文件夹 + - 在终端中设置 `export OLLAMA_SET_OT="exps=CPU"` 把所有的 experts 放在 CPU 上;也可以通过设置正则表达式,如 `export OLLAMA_SET_OT="(2[4-9]|[3-9][0-9])\.ffn_.*_exps\.=CPU"` 把 `24` 到 `99` 层的 experts 放到 CPU 上 + - 通过运行 `./start-ollama.sh` 启动 Ollama serve + +> [!NOTE] +> `OLLAMA_SET_OT` 仅对于 `2.3.0b20250429` 及以后的版本生效。 + ### Ollama v0.6.2 之后新增模型支持 当前的 Ollama Portable Zip 基于 Ollama v0.6.2;此外,以下新模型也已在 Ollama Portable Zip 中得到支持: