From 4da93709b12215008eeed56ab37f430024f7e06e Mon Sep 17 00:00:00 2001
From: Ruonan Wang
Date: Thu, 18 Jul 2024 08:04:38 +0300
Subject: [PATCH] update doc/setup to use onednn gemm for cpp (#11598)

* update doc/setup to use onednn gemm

* small fix

* Change TOC of graphrag quickstart back
---
 docs/mddocs/Quickstart/graphrag_quickstart.md  |  7 -------
 .../llama3_llamacpp_ollama_quickstart.md       | 16 ++++------------
 docs/mddocs/Quickstart/llama_cpp_quickstart.md |  9 ++-------
 docs/mddocs/Quickstart/ollama_quickstart.md    |  9 ++-------
 docs/mddocs/Quickstart/ragflow_quickstart.md   |  7 -------
 python/llm/setup.py                            |  3 ++-
 6 files changed, 10 insertions(+), 41 deletions(-)

diff --git a/docs/mddocs/Quickstart/graphrag_quickstart.md b/docs/mddocs/Quickstart/graphrag_quickstart.md
index c5517847..3903b090 100644
--- a/docs/mddocs/Quickstart/graphrag_quickstart.md
+++ b/docs/mddocs/Quickstart/graphrag_quickstart.md
@@ -16,13 +16,6 @@ The [GraphRAG project](https://github.com/microsoft/graphrag) is designed to lev
 
 Follow the steps in [Run Ollama with IPEX-LLM on Intel GPU Guide](./ollama_quickstart.md) to install and run Ollama on Intel GPU. Ensure that `ollama serve` is running correctly and can be accessed through a local URL (e.g., `https://127.0.0.1:11434`).
 
-> [!TIP]
-> If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance before executing `ollama serve`:
->
-> ```bash
-> export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
-> ```
-
 ### 2. Prepare LLM and Embedding Model
 
 In another terminal window, separate from where you executed `ollama serve`, download the LLM and embedding model using the following commands:
diff --git a/docs/mddocs/Quickstart/llama3_llamacpp_ollama_quickstart.md b/docs/mddocs/Quickstart/llama3_llamacpp_ollama_quickstart.md
index 5f2dabe7..f42ffe46 100644
--- a/docs/mddocs/Quickstart/llama3_llamacpp_ollama_quickstart.md
+++ b/docs/mddocs/Quickstart/llama3_llamacpp_ollama_quickstart.md
@@ -51,6 +51,7 @@ To use GPU acceleration, several environment variables are required or recommend
   ```bash
   source /opt/intel/oneapi/setvars.sh
   export SYCL_CACHE_PERSISTENT=1
+  export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
   ```
 
 - For **Windows users**:
@@ -59,14 +60,9 @@ To use GPU acceleration, several environment variables are required or recommend
 
   ```cmd
   set SYCL_CACHE_PERSISTENT=1
+  set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
   ```
 
-> [!TIP]
-> If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance:
->
-> ```bash
-> export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
-> ```
 
 ##### Run llama3
 
@@ -131,6 +127,7 @@ Launch the Ollama service:
   export OLLAMA_NUM_GPU=999
   source /opt/intel/oneapi/setvars.sh
   export SYCL_CACHE_PERSISTENT=1
+  export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
   ./ollama serve
   ```
 
@@ -144,16 +141,11 @@ Launch the Ollama service:
   set ZES_ENABLE_SYSMAN=1
   set OLLAMA_NUM_GPU=999
   set SYCL_CACHE_PERSISTENT=1
+  set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 
   ollama serve
   ```
 
-> [!TIP]
-> If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance before executing `ollama serve`:
->
-> ```bash
-> export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
-> ```
 
 > [!NOTE]
 >
diff --git a/docs/mddocs/Quickstart/llama_cpp_quickstart.md b/docs/mddocs/Quickstart/llama_cpp_quickstart.md
index 1faa93e0..afac3bf1 100644
--- a/docs/mddocs/Quickstart/llama_cpp_quickstart.md
+++ b/docs/mddocs/Quickstart/llama_cpp_quickstart.md
@@ -117,6 +117,7 @@ To use GPU acceleration, several environment variables are required or recommend
   ```bash
   source /opt/intel/oneapi/setvars.sh
   export SYCL_CACHE_PERSISTENT=1
+  export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
   ```
 
 - For **Windows users**:
@@ -125,15 +126,9 @@ To use GPU acceleration, several environment variables are required or recommend
 
   ```cmd
   set SYCL_CACHE_PERSISTENT=1
+  set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
   ```
 
-> [!TIP]
-> If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance:
->
-> ```bash
-> export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
-> ```
-
 ### 3. Example: Running community GGUF models with IPEX-LLM
 
 Here we provide a simple example to show how to run a community GGUF model with IPEX-LLM.
diff --git a/docs/mddocs/Quickstart/ollama_quickstart.md b/docs/mddocs/Quickstart/ollama_quickstart.md
index 4846f82c..8f940c43 100644
--- a/docs/mddocs/Quickstart/ollama_quickstart.md
+++ b/docs/mddocs/Quickstart/ollama_quickstart.md
@@ -72,6 +72,7 @@ You may launch the Ollama service as below:
   export ZES_ENABLE_SYSMAN=1
   source /opt/intel/oneapi/setvars.sh
   export SYCL_CACHE_PERSISTENT=1
+  export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
   ./ollama serve
   ```
 
@@ -85,6 +86,7 @@ You may launch the Ollama service as below:
   set no_proxy=localhost,127.0.0.1
   set ZES_ENABLE_SYSMAN=1
   set SYCL_CACHE_PERSISTENT=1
+  set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 
   ollama serve
   ```
@@ -92,13 +94,6 @@ You may launch the Ollama service as below:
 > [!NOTE]
 > Please set environment variable `OLLAMA_NUM_GPU` to `999` to make sure all layers of your model are running on Intel GPU, otherwise, some layers may run on CPU.
 
-> [!TIP]
-> If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance before executing `ollama serve`:
->
-> ```bash
-> export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
-> ```
-
 > [!NOTE]
 > To allow the service to accept connections from all IP addresses, use `OLLAMA_HOST=0.0.0.0 ./ollama serve` instead of just `./ollama serve`.
 
diff --git a/docs/mddocs/Quickstart/ragflow_quickstart.md b/docs/mddocs/Quickstart/ragflow_quickstart.md
index 22251831..a41cb663 100644
--- a/docs/mddocs/Quickstart/ragflow_quickstart.md
+++ b/docs/mddocs/Quickstart/ragflow_quickstart.md
@@ -38,13 +38,6 @@ Follow the steps in [Run Ollama with IPEX-LLM on Intel GPU Guide](./ollama_quick
 > [!IMPORTANT]
 > If the `RAGFlow` is not deployed on the same machine where Ollama is running (which means `RAGFlow` needs to connect to a remote Ollama service), you must configure the Ollama service to accept connections from any IP address. To achieve this, set or export the environment variable `OLLAMA_HOST=0.0.0.0` before executing the command `ollama serve`.
 
-> [!TIP]
-> If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance before executing `ollama serve`:
->
-> ```bash
-> export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
-> ```
-
 ### 2. Pull Model
 
 Now we need to pull a model for RAG using Ollama. Here we use [Qwen/Qwen2-7B](https://huggingface.co/Qwen/Qwen2-7B) model as an example. Open a new terminal window, run the following command to pull [`qwen2:latest`](https://ollama.com/library/qwen2).
diff --git a/python/llm/setup.py b/python/llm/setup.py
index 082492cc..ecb7aea8 100644
--- a/python/llm/setup.py
+++ b/python/llm/setup.py
@@ -293,7 +293,8 @@ def setup_package():
     xpu_requires = copy.deepcopy(xpu_21_requires)
 
-    cpp_requires = ["bigdl-core-cpp==" + CORE_XE_VERSION]
+    cpp_requires = ["bigdl-core-cpp==" + CORE_XE_VERSION,
+                    "onednn-devel==2024.0.0;platform_system=='Windows'"]
     cpp_requires += oneapi_2024_0_requires
 
     serving_requires = ['py-cpuinfo']
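
A brief note on the `setup.py` hunk above: the new entry in `cpp_requires` carries a PEP 508 environment marker, `platform_system=='Windows'`, so `onednn-devel==2024.0.0` is only installed when the `cpp` extra is resolved on Windows; on other platforms the marker evaluates to false and the dependency is skipped. The sketch below is illustrative only, not part of this patch or of `setup.py`; it assumes the third-party `packaging` library and shows how such a marker string is parsed and evaluated.

```python
# Illustrative sketch only (not part of this patch): how pip-style tooling
# interprets the environment marker added to cpp_requires.
# Requires the third-party `packaging` library: pip install packaging
from packaging.requirements import Requirement

req = Requirement("onednn-devel==2024.0.0;platform_system=='Windows'")

print(req.name)       # onednn-devel
print(req.specifier)  # ==2024.0.0

# Evaluated against the current interpreter's environment: True only when
# platform.system() reports "Windows", so the package is skipped elsewhere.
print(req.marker.evaluate())

# The marker can also be evaluated against an explicit environment, e.g. to
# see what an installer running on Windows would decide.
print(req.marker.evaluate({"platform_system": "Windows"}))  # True
```

Expressing the platform restriction in the requirement string itself keeps `cpp_requires` a plain list of strings and lets pip decide at install time whether to pull the package, rather than branching on the platform inside `setup.py`.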