Update Ollama portable zip QuickStart regarding saving VRAM (#13155)

* Update Ollama portable zip quickstart regarding saving VRAM
* Small fix

2 changed files with 76 additions and 2 deletions

@@ -28,6 +28,7 @@

- [Increase context length in Ollama](#increase-context-length-in-ollama)
- [Select specific GPU(s) to run Ollama when multiple ones are available](#select-specific-gpus-to-run-ollama-when-multiple-ones-are-available)
- [Tune performance](#tune-performance)
- [Save VRAM](#save-vram)
- [Additional models supported after Ollama v0.6.2](#additional-models-supported-after-ollama-v062)
- [Signature Verification](#signature-verification)
- [More details](ollama_quickstart.md)

@@ -138,6 +139,9 @@

> ```
> Except for `ollama run` and `ollama pull`, the model should be identified through its actual ID, e.g. `ollama rm modelscope.cn/unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF:Q4_K_M`

> [!NOTE]
> For versions earlier than `2.3.0b20250429`, please use `IPEX_LLM_MODEL_SOURCE` instead.
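
For reference, a minimal sketch of how the full model ID is used once a model has been pulled from ModelScope (the `ollama show` call here is just one example of a command that needs the full ID):

```bash
# `ollama run` / `ollama pull` accept the short name, e.g. deepseek-r1:7b;
# every other command identifies the model by its full ID as listed by `ollama list`
ollama list

# Show details of the ModelScope-sourced model
ollama show modelscope.cn/unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF:Q4_K_M

# Remove it when it is no longer needed
ollama rm modelscope.cn/unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF:Q4_K_M
```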

### Increase context length in Ollama

By default, Ollama runs models with a context window of 2048 tokens. That is, the model can "remember" at most 2048 tokens of context.

@@ -160,7 +164,7 @@

> `OLLAMA_NUM_CTX` has a higher priority than the `num_ctx` setting in a model's `Modelfile`.

> [!NOTE]
> For versions earlier than `2.3.0b20250429`, please use `IPEX_LLM_NUM_CTX` instead.
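
As a minimal sketch of the Linux flow (the value `8192` and the folder path are only placeholders):

```bash
cd PATH/TO/EXTRACTED/FOLDER

# Ask Ollama to use a context window of 8192 tokens for the models it serves
export OLLAMA_NUM_CTX=8192

# Start Ollama serve from the portable zip; the variable must be set in the same shell
./start-ollama.sh
```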

### Select specific GPU(s) to run Ollama when multiple ones are available

@@ -211,6 +215,39 @@

> [!TIP]
> You could refer to [here](https://www.intel.com/content/www/us/en/developer/articles/guide/level-zero-immediate-command-lists.html) for more information about Level Zero Immediate Command Lists.

### Save VRAM

To save VRAM, you could set the environment variable `OLLAMA_NUM_PARALLEL` to `1` **before starting Ollama serve**, as follows (if Ollama serve is already running, please make sure to stop it first); a combined command sketch follows these steps:

- For **Windows** users:

  - Open "Command Prompt", and navigate to the extracted folder through `cd /d PATH\TO\EXTRACTED\FOLDER`
  - Run `set OLLAMA_NUM_PARALLEL=1` in "Command Prompt"
  - Start Ollama serve through `start-ollama.bat`

- For **Linux** users:

  - In a terminal, navigate to the extracted folder through `cd PATH/TO/EXTRACTED/FOLDER`
  - Run `export OLLAMA_NUM_PARALLEL=1` in the terminal
  - Start Ollama serve through `./start-ollama.sh`
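
Put together, a minimal sketch of the Windows flow (the folder path is a placeholder):

```cmd
cd /d PATH\TO\EXTRACTED\FOLDER

:: With a single parallel slot, VRAM is reserved for only one concurrent request
set OLLAMA_NUM_PARALLEL=1

:: Start Ollama serve from the portable zip
start-ollama.bat
```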

For **MoE models** (such as `qwen3:30b`), you could save VRAM by moving the experts to CPU through the environment variable `OLLAMA_SET_OT`, as follows (if Ollama serve is already running, please make sure to stop it first); a Linux sketch of the expert-offload regular expression follows these steps:

- For **Windows** users:

  - Open "Command Prompt", and navigate to the extracted folder through `cd /d PATH\TO\EXTRACTED\FOLDER`
  - Run `set OLLAMA_SET_OT="exps=CPU"` in "Command Prompt" to put all experts on CPU; `OLLAMA_SET_OT` can also be set using a regular expression, such as `set OLLAMA_SET_OT="(2[4-9]|[3-9][0-9])\.ffn_.*_exps\.=CPU"` to put the experts of layers `24` to `99` on CPU
  - Start Ollama serve through `start-ollama.bat`

- For **Linux** users:

  - In a terminal, navigate to the extracted folder through `cd PATH/TO/EXTRACTED/FOLDER`
  - Run `export OLLAMA_SET_OT="exps=CPU"` in the terminal; `OLLAMA_SET_OT` can also be set using a regular expression, such as `export OLLAMA_SET_OT="(2[4-9]|[3-9][0-9])\.ffn_.*_exps\.=CPU"` to put the experts of layers `24` to `99` on CPU
  - Start Ollama serve through `./start-ollama.sh`
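
For example, a minimal Linux sketch that offloads only the experts of the later layers (the layer range `24` to `99` is illustrative and should be adapted to the model being run):

```bash
cd PATH/TO/EXTRACTED/FOLDER

# "(2[4-9]|[3-9][0-9])" matches layer numbers 24 to 99, so the expert tensors
# (ffn_*_exps) of those layers go to CPU while everything else stays on the GPU
export OLLAMA_SET_OT="(2[4-9]|[3-9][0-9])\.ffn_.*_exps\.=CPU"

./start-ollama.sh
```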

> [!NOTE]
> `OLLAMA_SET_OT` is only effective for version `2.3.0b20250429` and later.

### Additional models supported after Ollama v0.6.2

The current Ollama Portable Zip is based on Ollama v0.6.2; in addition, the following new models are also supported in the Ollama Portable Zip:

@@ -29,6 +29,7 @@

- [Increase context length in Ollama](#在-ollama-中增加上下文长度)
- [Select specific GPU(s) to run Ollama when multiple ones are available](#在多块-gpu-可用时选择特定的-gpu-来运行-ollama)
- [Tune performance](#性能调优)
- [Save VRAM](#节省-vram)
- [Additional models supported after Ollama v0.6.2](#ollama-v062-之后新增模型支持)
- [Signature Verification](#签名验证)
- [More details](ollama_quickstart.zh-CN.md)

@@ -136,6 +137,9 @@

> ```
> Except for `ollama run` and `ollama pull`, the model should be identified through its actual ID, e.g. `ollama rm modelscope.cn/unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF:Q4_K_M`

> [!NOTE]
> For versions earlier than `2.3.0b20250429`, please use `IPEX_LLM_MODEL_SOURCE` instead.

### Increase context length in Ollama

By default, Ollama runs models with a context window of 2048 tokens. That is, the model can "remember" at most 2048 tokens of context.

@@ -158,7 +162,7 @@

> `OLLAMA_NUM_CTX` has a higher priority than the `num_ctx` setting in a model's `Modelfile`.

> [!NOTE]
> For versions earlier than `2.3.0b20250429`, please use `IPEX_LLM_NUM_CTX` instead.
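
A minimal sketch of the Windows flow (the value `8192` and the folder path are only placeholders):

```cmd
cd /d PATH\TO\EXTRACTED\FOLDER

:: Ask Ollama to use a context window of 8192 tokens for the models it serves;
:: the variable must be set in the same session before starting Ollama serve
set OLLAMA_NUM_CTX=8192

start-ollama.bat
```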

### Select specific GPU(s) to run Ollama when multiple ones are available

@@ -209,6 +213,39 @@

> [!TIP]
> You could refer to [this document](https://www.intel.com/content/www/us/en/developer/articles/guide/level-zero-immediate-command-lists.html) for more information about Level Zero Immediate Command Lists.

### Save VRAM

To save VRAM, you could set the environment variable `OLLAMA_NUM_PARALLEL` to `1` **before starting Ollama serve**, as follows (if Ollama serve is already running, please make sure to stop it first); a Linux command sketch follows these steps:

- For **Windows** users:

  - Open "Command Prompt", and navigate to the extracted folder through `cd /d PATH\TO\EXTRACTED\FOLDER`
  - Run `set OLLAMA_NUM_PARALLEL=1` in "Command Prompt"
  - Start Ollama serve through `start-ollama.bat`

- For **Linux** users:

  - In a terminal, navigate to the extracted folder through `cd PATH/TO/EXTRACTED/FOLDER`
  - Run `export OLLAMA_NUM_PARALLEL=1` in the terminal
  - Start Ollama serve through `./start-ollama.sh`
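
Put together, a minimal sketch of the Linux flow (the folder path is a placeholder):

```bash
cd PATH/TO/EXTRACTED/FOLDER

# With a single parallel slot, VRAM is reserved for only one concurrent request
export OLLAMA_NUM_PARALLEL=1

# Start Ollama serve from the portable zip
./start-ollama.sh
```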

For **MoE models** (such as `qwen3:30b`), you could save VRAM by moving the experts to CPU through the environment variable `OLLAMA_SET_OT` **before starting Ollama serve**, as follows (if Ollama serve is already running, please make sure to stop it first); a Windows sketch follows these steps:

- For **Windows** users:

  - Open "Command Prompt", and navigate to the extracted folder through `cd /d PATH\TO\EXTRACTED\FOLDER`
  - Run `set OLLAMA_SET_OT="exps=CPU"` in "Command Prompt" to put all experts on CPU; `OLLAMA_SET_OT` can also be set using a regular expression, such as `set OLLAMA_SET_OT="(2[4-9]|[3-9][0-9])\.ffn_.*_exps\.=CPU"` to put the experts of layers `24` to `99` on CPU
  - Start Ollama serve through `start-ollama.bat`

- For **Linux** users:

  - In a terminal, navigate to the extracted folder through `cd PATH/TO/EXTRACTED/FOLDER`
  - Run `export OLLAMA_SET_OT="exps=CPU"` in the terminal to put all experts on CPU; `OLLAMA_SET_OT` can also be set using a regular expression, such as `export OLLAMA_SET_OT="(2[4-9]|[3-9][0-9])\.ffn_.*_exps\.=CPU"` to put the experts of layers `24` to `99` on CPU
  - Start Ollama serve through `./start-ollama.sh`
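
For example, a minimal Windows sketch that puts all experts on CPU (the folder path is a placeholder):

```cmd
cd /d PATH\TO\EXTRACTED\FOLDER

:: "exps=CPU" matches every expert tensor, so all experts go to CPU
:: while the rest of the model stays on the GPU
set OLLAMA_SET_OT="exps=CPU"

start-ollama.bat
```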

> [!NOTE]
> `OLLAMA_SET_OT` is only effective for version `2.3.0b20250429` and later.

### Additional models supported after Ollama v0.6.2

The current Ollama Portable Zip is based on Ollama v0.6.2; in addition, the following new models are also supported in the Ollama Portable Zip: