From fbd1743b5ec26edf86d349a9f8d165696eea27b0 Mon Sep 17 00:00:00 2001
From: SONG Ge <38711238+sgwhat@users.noreply.github.com>
Date: Fri, 19 Apr 2024 15:00:25 +0800
Subject: [PATCH] Ollama quickstart update (#10806)

* add ollama doc for OLLAMA_NUM_GPU

* remove useless params

* revert unexpected changes back

* move env setting to server part

* update
---
 .../doc/LLM/Quickstart/ollama_quickstart.md   | 29 +++++++------------
 1 file changed, 10 insertions(+), 19 deletions(-)

diff --git a/docs/readthedocs/source/doc/LLM/Quickstart/ollama_quickstart.md b/docs/readthedocs/source/doc/LLM/Quickstart/ollama_quickstart.md
index 4c893a93..a150c0df 100644
--- a/docs/readthedocs/source/doc/LLM/Quickstart/ollama_quickstart.md
+++ b/docs/readthedocs/source/doc/LLM/Quickstart/ollama_quickstart.md
@@ -44,8 +44,7 @@ Activate the `llm-cpp` conda environment and initialize Ollama by executing the

 ### 3 Run Ollama Serve

-
-Launch the Ollama service:
+You may launch the Ollama service as below:

 ```eval_rst
 .. tabs::
@@ -53,6 +52,7 @@ Launch the Ollama service:

       .. code-block:: bash

+         export OLLAMA_NUM_GPU=999
          export no_proxy=localhost,127.0.0.1
          export ZES_ENABLE_SYSMAN=1
          source /opt/intel/oneapi/setvars.sh
@@ -65,6 +65,7 @@ Launch the Ollama service:

       .. code-block:: bash

+         set OLLAMA_NUM_GPU=999
          set no_proxy=localhost,127.0.0.1
          set ZES_ENABLE_SYSMAN=1
          call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
@@ -74,6 +75,11 @@ Launch the Ollama service:
 ```

 ```eval_rst
+.. note::
+
+  Please set environment variable ``OLLAMA_NUM_GPU`` to ``999`` to make sure all layers of your model are running on Intel GPU, otherwise, some layers may run on CPU.
+```
+
 .. note::

   To allow the service to accept connections from all IP addresses, use `OLLAMA_HOST=0.0.0.0 ./ollama serve` instead of just `./ollama serve`.
@@ -111,8 +117,7 @@ model**, e.g. `dolphin-phi`.
          {
            "model": "<model_name>",
            "prompt": "Why is the sky blue?",
-           "stream": false,
-           "options":{"num_gpu": 999}
+           "stream": false
          }'

   .. tab:: Windows
@@ -125,19 +130,12 @@ model**, e.g. `dolphin-phi`.
          {
            \"model\": \"<model_name>\",
            \"prompt\": \"Why is the sky blue?\",
-           \"stream\": false,
-           \"options\":{\"num_gpu\": 999}
+           \"stream\": false
          }"

 ```


-```eval_rst
-.. note::
-
-  Please don't forget to set ``"options":{"num_gpu": 999}`` to make sure all layers of your model are running on Intel GPU, otherwise, some layers may run on CPU.
-```
-
 #### Using Ollama Run GGUF models

 Ollama supports importing GGUF models in the Modelfile, for example, suppose you have downloaded a `mistral-7b-instruct-v0.1.Q4_K_M.gguf` from [Mistral-7B-Instruct-v0.1-GGUF](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/tree/main), then you can create a file named `Modelfile`:
@@ -145,16 +143,9 @@ Ollama supports importing GGUF models in the Modelfile, for example, suppose you
 ```bash
 FROM ./mistral-7b-instruct-v0.1.Q4_K_M.gguf
 TEMPLATE [INST] {{ .Prompt }} [/INST]
-PARAMETER num_gpu 999
 PARAMETER num_predict 64
 ```

-```eval_rst
-.. note::
-
-  Please don't forget to set ``PARAMETER num_gpu 999`` to make sure all layers of your model are running on Intel GPU, otherwise, some layers may run on CPU.
-```
-
 Then you can create the model in Ollama by `ollama create example -f Modelfile` and use `ollama run` to run the model directly on console.

 ```eval_rst
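
For reference, the Linux flow that the updated quickstart describes looks roughly like the sketch below. This is only a minimal illustration under assumptions: it assumes Ollama has already been initialized with `init-ollama` inside the `llm-cpp` conda environment, and it uses `dolphin-phi` (the example model named in the doc) as a stand-in model name.

```bash
# Terminal 1: start the Ollama service with all model layers offloaded to the Intel GPU
export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
source /opt/intel/oneapi/setvars.sh
./ollama serve

# Terminal 2: pull the example model, then send a non-streaming generate request
export no_proxy=localhost,127.0.0.1
./ollama pull dolphin-phi
curl http://localhost:11434/api/generate -d '
{
  "model": "dolphin-phi",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```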