Run Ollama Portable Zip on Intel GPU with IPEX-LLM
< English | 中文 >
This guide demonstrates how to use Ollama portable zip to directly run Ollama on Intel GPU with ipex-llm (without the need of manual installations).
Note
Ollama portable zip has been verified on:
- Intel Core Ultra processors
- Intel Core 11th - 14th gen processors
- Intel Arc A-Series GPU
- Intel Arc B-Series GPU
Windows Quickstart
Note
We recommend using Windows 11 for Windows users.
Prerequisites
We recommend updating your GPU driver to the latest.
Step 1: Download and Unzip
Download IPEX-LLM Ollama portable zip for Windows users from the link.
Then, extract the zip file to a folder.
Step 2: Start Ollama Serve
Start Ollama serve as follows:
- Open "Command Prompt" (cmd), and enter the extracted folder through
cd /d PATH\TO\EXTRACTED\FOLDER - Run
start-ollama.batin the "Command Prompt. A window will then pop up as shown below:
Step 3: Run Ollama
You could then use Ollama to run LLMs on Intel GPUs by running `ollama run deepseek-r1:7b` in the same "Command Prompt" (not the pop-up window). You may use any other model.
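Putting the two steps together, a typical "Command Prompt" session looks like the following sketch (the folder path is a placeholder for your actual extracted location):

```cmd
REM enter the extracted portable zip folder
cd /d PATH\TO\EXTRACTED\FOLDER

REM start the Ollama service (a separate service window will pop up)
start-ollama.bat

REM run a model in this same "Command Prompt", not in the pop-up window
ollama run deepseek-r1:7b
```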
Linux Quickstart
Prerequisites
Check your GPU driver version, and update it if needed; we recommend following the Intel client GPU driver installation guide to install your GPU driver.
Step 1: Download and Extract
Download IPEX-LLM Ollama portable tgz for Ubuntu users from the link.
Then open a terminal and extract the tgz file to a folder:
tar -xvf [Downloaded tgz file path]
Step 2: Start Ollama Serve
Enter the extracted folder, and run `start-ollama.sh` to start the Ollama service.
cd PATH/TO/EXTRACTED/FOLDER
./start-ollama.sh
Step 3: Run Ollama
You could then use Ollama to run LLMs on Intel GPUs as follows:
- Open another terminal, and enter the extracted folder through `cd PATH/TO/EXTRACTED/FOLDER`
- Run `./ollama run deepseek-r1:7b` (you may use any other model)
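Once the model is loaded, you can also talk to the Ollama service over its HTTP API instead of the interactive CLI. The sketch below assumes the service is listening on Ollama's default port 11434 on the same machine; adjust the address if your setup differs:

```bash
# send a single non-streaming generation request to the local Ollama service
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:7b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```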
Tips & Troubleshooting
Speed up model download using alternative sources
Ollama by default downloads models from the Ollama library. By setting the environment variable `OLLAMA_MODEL_SOURCE` to `modelscope` or `ollama` before running Ollama, you could switch the source from which the model is downloaded.
For example, if you would like to run `deepseek-r1:7b` but the download speed from the Ollama library is slow, you could download the model from ModelScope as follows:
- For Windows users:
  - In the "Command Prompt", navigate to the extracted folder by `cd /d PATH\TO\EXTRACTED\FOLDER`
  - Run `set OLLAMA_MODEL_SOURCE=modelscope` in the "Command Prompt"
  - Run `ollama run deepseek-r1:7b`
- For Linux users:
  - In a terminal other than the one for Ollama serve, navigate to the extracted folder by `cd PATH/TO/EXTRACTED/FOLDER`
  - Run `export OLLAMA_MODEL_SOURCE=modelscope` in the terminal
  - Run `./ollama run deepseek-r1:7b`
Tip
A model downloaded with `set OLLAMA_MODEL_SOURCE=modelscope` will still show its actual model id in `ollama list`, e.g.

NAME                                                            ID              SIZE      MODIFIED
modelscope.cn/unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF:Q4_K_M   f482d5af6aec    4.7 GB    About a minute ago

Except for `ollama run` and `ollama pull`, the model should be identified through its actual id, e.g. `ollama rm modelscope.cn/unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF:Q4_K_M`
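As a concrete Linux example that combines the setting above with this tip, the sketch below downloads the model from ModelScope and later removes it by its actual id (the id shown is taken from the example output above; yours may differ):

```bash
cd PATH/TO/EXTRACTED/FOLDER

# download and run the model from ModelScope instead of the Ollama library
export OLLAMA_MODEL_SOURCE=modelscope
./ollama run deepseek-r1:7b

# the model is listed and removed under its actual ModelScope id
./ollama list
./ollama rm modelscope.cn/unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF:Q4_K_M
```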
Increase context length in Ollama
By default, Ollama runs models with a context window of 2048 tokens. That is, the model can "remember" at most 2048 tokens of context.
To increase the context length, you could set the environment variable `OLLAMA_NUM_CTX` before starting Ollama serve, as shown below (if Ollama serve is already running, please make sure to stop it first):
- For Windows users:
  - Open "Command Prompt", and navigate to the extracted folder through `cd /d PATH\TO\EXTRACTED\FOLDER`
  - Set `OLLAMA_NUM_CTX` to the desired length in the "Command Prompt", e.g. `set OLLAMA_NUM_CTX=16384`
  - Start Ollama serve through `start-ollama.bat`
- For Linux users:
  - In a terminal, navigate to the extracted folder through `cd PATH/TO/EXTRACTED/FOLDER`
  - Set `OLLAMA_NUM_CTX` to the desired length in the terminal, e.g. `export OLLAMA_NUM_CTX=16384`
  - Start Ollama serve through `./start-ollama.sh`
Tip
`OLLAMA_NUM_CTX` has a higher priority than the `num_ctx` setting in a model's `Modelfile`.
Note
For versions earlier than 2.7.0b20250429, please use `IPEX_LLM_NUM_CTX` instead.
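For example, a minimal Linux sketch that restarts Ollama serve with a 16384-token context window (stop any running Ollama serve first):

```bash
cd PATH/TO/EXTRACTED/FOLDER

# set the desired context length before starting the service
export OLLAMA_NUM_CTX=16384   # use IPEX_LLM_NUM_CTX instead for versions earlier than 2.7.0b20250429
./start-ollama.sh
```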
Select specific GPU(s) to run Ollama when multiple ones are available
If your machine has multiple Intel GPUs, Ollama will by default run on all of them.
To specify which Intel GPU(s) you would like Ollama to use, you could set the environment variable `ONEAPI_DEVICE_SELECTOR` before starting Ollama serve, as follows (if Ollama serve is already running, please make sure to stop it first):
- Identify the id (e.g. 0, 1, etc.) for each of your GPUs. You could find them in the logs of Ollama serve when loading any model.
- For Windows users:
  - Open "Command Prompt", and navigate to the extracted folder by `cd /d PATH\TO\EXTRACTED\FOLDER`
  - In the "Command Prompt", set `ONEAPI_DEVICE_SELECTOR` to define the Intel GPU(s) you want to use, e.g. `set ONEAPI_DEVICE_SELECTOR=level_zero:0` (on single Intel GPU), or `set ONEAPI_DEVICE_SELECTOR=level_zero:0;level_zero:1` (on multiple Intel GPUs), in which `0`, `1` should be changed to your desired GPU id
  - Start Ollama serve through `start-ollama.bat`
- For Linux users:
  - In a terminal, navigate to the extracted folder by `cd PATH/TO/EXTRACTED/FOLDER`
  - Set `ONEAPI_DEVICE_SELECTOR` to define the Intel GPU(s) you want to use, e.g. `export ONEAPI_DEVICE_SELECTOR=level_zero:0` (on single Intel GPU), or `export ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1"` (on multiple Intel GPUs), in which `0`, `1` should be changed to your desired GPU id
  - Start Ollama serve through `./start-ollama.sh`
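For example, a minimal Linux sketch that restricts Ollama to the GPU with id 0 (replace `0` with your desired GPU id):

```bash
cd PATH/TO/EXTRACTED/FOLDER

# expose only the selected Level Zero device(s) to Ollama
export ONEAPI_DEVICE_SELECTOR="level_zero:0"
./start-ollama.sh
```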
Tune performance
Here are some settings you could try to tune the performance:
Environment variable SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS
The environment variable `SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS` determines the usage of immediate command lists for task submission to the GPU. You could experiment with `SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1` or `0` for best performance.
To enable `SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS`, set it before starting Ollama serve, as shown below (if Ollama serve is already running, please make sure to stop it first):
- For Windows users:
  - Open "Command Prompt", and navigate to the extracted folder through `cd /d PATH\TO\EXTRACTED\FOLDER`
  - Run `set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1` in "Command Prompt"
  - Start Ollama serve through `start-ollama.bat`
- For Linux users:
  - In a terminal, navigate to the extracted folder through `cd PATH/TO/EXTRACTED/FOLDER`
  - Run `export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1` in the terminal
  - Start Ollama serve through `./start-ollama.sh`
Tip
You could refer to the Level Zero documentation for more information about Level Zero Immediate Command Lists.
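For example, a minimal Linux sketch that enables immediate command lists before starting the service (set the variable to `0` instead to compare performance on your GPU):

```bash
cd PATH/TO/EXTRACTED/FOLDER

# try both 1 and 0 to find the better-performing setting for your hardware
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
./start-ollama.sh
```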
Additional models supported after Ollama v0.6.2
The current Ollama Portable Zip is based on Ollama v0.6.2; in addition, the following new models have also been supported in the Ollama Portable Zip:
| Model | Download (Windows) | Download (Linux) | Model Link |
|---|---|---|---|
| DeepSeek-R1 | `ollama run deepseek-r1` | `./ollama run deepseek-r1` | deepseek-r1 |
| Openthinker | `ollama run openthinker` | `./ollama run openthinker` | openthinker |
| DeepScaleR | `ollama run deepscaler` | `./ollama run deepscaler` | deepscaler |
| Phi-4 | `ollama run phi4` | `./ollama run phi4` | phi4 |
| Dolphin 3.0 | `ollama run dolphin3` | `./ollama run dolphin3` | dolphin3 |
| Smallthinker | `ollama run smallthinker` | `./ollama run smallthinker` | smallthinker |
| Granite3.1-Dense | `ollama run granite3-dense` | `./ollama run granite3-dense` | granite3.1-dense |
| Granite3.1-Moe-3B | `ollama run granite3-moe` | `./ollama run granite3-moe` | granite3.1-moe |
| Gemma 3 1B | `set IPEX_LLM_MODEL_SOURCE=modelscope`<br>`ollama run gemma3:1b` | `export IPEX_LLM_MODEL_SOURCE=modelscope`<br>`./ollama run gemma3:1b` | gemma3:1b |
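For instance, the Gemma 3 1B row expands to the following commands on Linux (run in the extracted folder, in a terminal other than the one running Ollama serve):

```bash
# Gemma 3 1B is fetched from ModelScope, so the model source must be set first
export IPEX_LLM_MODEL_SOURCE=modelscope
./ollama run gemma3:1b
```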
Signature Verification
For portable zip/tgz version 2.2.0, you could verify its signature with the following command:
openssl cms -verify -in <portable-zip-or-tgz-file-name>.pkcs1.sig -inform DER -content <portable-zip-or-tgz-file-name> -out nul -noverify
Note
Please ensure that `openssl` is installed on your system before verifying the signature.
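For example, on Linux the command could look like the following (the archive name is a placeholder for the file you actually downloaded; `/dev/null` is used in place of `nul`):

```bash
# verify the detached signature file shipped alongside the portable archive
# (file names below are placeholders for illustration)
openssl cms -verify -in ollama-portable.tgz.pkcs1.sig -inform DER \
  -content ollama-portable.tgz -out /dev/null -noverify
```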