Run Ollama on Linux with Intel GPU
ollama/ollama is a popular framework for building and running language models on a local machine. You can now use the C++ interface of ipex-llm as an accelerated backend for Ollama running on Intel GPU (e.g., a local PC with an iGPU, or a discrete GPU such as Arc, Flex or Max).
.. note::
   Only Linux is currently supported.
See the demo of running LLaMA2-7B on Intel Arc GPU below.
Quickstart
1. Install IPEX-LLM with Ollama Binaries
Visit the Run llama.cpp with IPEX-LLM on Intel GPU guide, and follow the instructions in the sections Install Prerequisites on Linux and Install IPEX-LLM cpp to install IPEX-LLM with Ollama binaries.
After the installation, you should have a conda environment (named llm-cpp, for instance) for running llama.cpp commands with IPEX-LLM.
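For orientation, the install flow from that guide typically boils down to something like the sketch below; treat the exact package spec and prerequisites (e.g. GPU drivers and the oneAPI runtime) as covered by the linked guide rather than by this sketch.

```bash
# Minimal sketch of the install flow described in the linked guide.
# The package spec below is an assumption; follow the guide for the
# authoritative prerequisites, drivers and versions.
conda create -n llm-cpp python=3.11 -y
conda activate llm-cpp
pip install --pre --upgrade ipex-llm[cpp]
```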
2. Initialize Ollama
Activate the llm-cpp conda environment and initialize Ollama by executing the commands below. A symbolic link to ollama will appear in your current directory.
```bash
conda activate llm-cpp
init-ollama
```
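If you want a quick sanity check that initialization worked, list the newly created link (not part of the official steps, just a convenience):

```bash
# The ollama symlink should point into the installed ipex-llm package;
# the exact target path varies by installation.
ls -l ollama
```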
3. Run Ollama Serve
Launch the Ollama service:
```bash
conda activate llm-cpp
# Keep requests to the local service from going through a proxy
export no_proxy=localhost,127.0.0.1
# Enable Level Zero Sysman so GPU devices and memory are reported correctly
export ZES_ENABLE_SYSMAN=1
# Set up the oneAPI environment (runtime libraries, compilers)
source /opt/intel/oneapi/setvars.sh
./ollama serve
```
.. note::
   To allow the service to accept connections from all IP addresses, use `OLLAMA_HOST=0.0.0.0 ./ollama serve` instead of just `./ollama serve`.
The console will display messages similar to the following:
4. Pull Model
Keep the Ollama service running, open another terminal, and run ./ollama pull <model_name> to pull a model automatically, e.g. dolphin-phi:latest, as shown below.
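For example, pulling the dolphin-phi model mentioned above looks like this (any model available in the Ollama library can be pulled the same way):

```bash
# Run in a second terminal, from the directory containing the ollama symlink,
# while the serve process keeps running in the first terminal.
./ollama pull dolphin-phi:latest
```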
5. Using Ollama
Using Curl
Using curl is the easiest way to verify the API service and model. Execute the following command in a terminal, replacing <model_name> with the name of your pulled model, e.g. dolphin-phi.
```bash
curl http://localhost:11434/api/generate -d '
{
  "model": "<model_name>",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```
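If you only care about the generated text rather than the full JSON reply, one optional variation is to pipe the output through jq (assuming jq is installed):

```bash
# Print only the "response" field of the JSON returned by /api/generate (requires jq)
curl -s http://localhost:11434/api/generate -d '
{
  "model": "dolphin-phi",
  "prompt": "Why is the sky blue?",
  "stream": false
}' | jq -r '.response'
```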
An example output when using the dolphin-phi model looks like the following:
Using Ollama Run
You can also use ollama run to run the model directly in the console. Replace <model_name> with the name of your pulled model, e.g. dolphin-phi. This command will seamlessly download and load the model, and let you interact with it through a streaming conversation.
```bash
conda activate llm-cpp
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
source /opt/intel/oneapi/setvars.sh
./ollama run <model_name>
```
An example interaction with the model via ollama run looks like the following:

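As a non-interactive variant (standard Ollama behavior, not specific to IPEX-LLM), you can also pass a one-off prompt on the command line and get a single reply:

```bash
# One-shot prompt: prints the model's reply and exits instead of opening a chat session
./ollama run dolphin-phi "Why is the sky blue?"
```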