update doc/setup to use onednn gemm for cpp (#11598)
* update doc/setup to use onednn gemm
* small fix
* Change TOC of graphrag quickstart back
This commit is contained in:
parent f4077fa905
commit 4da93709b1

6 changed files with 10 additions and 41 deletions

@@ -16,13 +16,6 @@ The [GraphRAG project](https://github.com/microsoft/graphrag) is designed to lev
 Follow the steps in [Run Ollama with IPEX-LLM on Intel GPU Guide](./ollama_quickstart.md) to install and run Ollama on Intel GPU. Ensure that `ollama serve` is running correctly and can be accessed through a local URL (e.g., `https://127.0.0.1:11434`).

-> [!TIP]
-> If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance before executing `ollama serve`:
->
-> ```bash
-> export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
-> ```
-
 ### 2. Prepare LLM and Embedding Model

 In another terminal window, separate from where you executed `ollama serve`, download the LLM and embedding model using the following commands:

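The hunk above still asks readers to confirm that `ollama serve` is reachable before moving on. A quick sanity check could look like the sketch below, assuming the default Ollama port `11434` and its standard `/api/tags` REST endpoint (neither is part of this diff):

```bash
# Hypothetical sanity check, not part of the changed docs:
# list locally available models; a JSON response means `ollama serve` is up.
curl http://127.0.0.1:11434/api/tags
```
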
@@ -51,6 +51,7 @@ To use GPU acceleration, several environment variables are required or recommend
   ```bash
   source /opt/intel/oneapi/setvars.sh
   export SYCL_CACHE_PERSISTENT=1
+  export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
   ```

 - For **Windows users**:

@@ -59,14 +60,9 @@ To use GPU acceleration, several environment variables are required or recommend

   ```cmd
   set SYCL_CACHE_PERSISTENT=1
+  set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
   ```

-> [!TIP]
-> If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance:
->
-> ```bash
-> export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
-> ```

 ##### Run llama3

@@ -131,6 +127,7 @@ Launch the Ollama service:
   export OLLAMA_NUM_GPU=999
   source /opt/intel/oneapi/setvars.sh
   export SYCL_CACHE_PERSISTENT=1
+  export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1

   ./ollama serve
   ```

@@ -144,16 +141,11 @@ Launch the Ollama service:
   set ZES_ENABLE_SYSMAN=1
   set OLLAMA_NUM_GPU=999
   set SYCL_CACHE_PERSISTENT=1
+  set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1

   ollama serve
   ```

-> [!TIP]
-> If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance before executing `ollama serve`:
->
-> ```bash
-> export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
-> ```

 > [!NOTE]
 >

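Taken together, the hunks above fold the former Arc A-Series tip into the regular launch instructions. After the change, the Linux launch block reads roughly as follows (a sketch assembled from the diff context, not copied verbatim from the rendered doc):

```bash
# Resulting Linux launch sequence as suggested by the updated doc (sketch):
export OLLAMA_NUM_GPU=999                                # run all model layers on the Intel GPU
source /opt/intel/oneapi/setvars.sh                      # load the oneAPI environment
export SYCL_CACHE_PERSISTENT=1                           # persist the SYCL kernel cache
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1   # now set unconditionally instead of only in the removed tip

./ollama serve
```
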
@@ -117,6 +117,7 @@ To use GPU acceleration, several environment variables are required or recommend
   ```bash
   source /opt/intel/oneapi/setvars.sh
   export SYCL_CACHE_PERSISTENT=1
+  export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
   ```

 - For **Windows users**:

@@ -125,15 +126,9 @@ To use GPU acceleration, several environment variables are required or recommend

   ```cmd
   set SYCL_CACHE_PERSISTENT=1
+  set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
   ```

-> [!TIP]
-> If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance:
->
-> ```bash
-> export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
-> ```
-
 ### 3. Example: Running community GGUF models with IPEX-LLM

 Here we provide a simple example to show how to run a community GGUF model with IPEX-LLM.

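The context above references the "Running community GGUF models" section, whose body is not shown in this diff. For orientation only, the usual Ollama workflow for a local GGUF file looks roughly like the sketch below; the file name and model alias are placeholders, not taken from the doc:

```bash
# Hypothetical example of serving a community GGUF file through Ollama:
# 1) describe the model in a Modelfile pointing at the downloaded GGUF file
cat > Modelfile <<'EOF'
FROM ./mistral-7b-instruct-v0.2.Q4_K_M.gguf
EOF

# 2) register it under a local name and run it
ollama create example-gguf -f Modelfile
ollama run example-gguf
```
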
@@ -72,6 +72,7 @@ You may launch the Ollama service as below:
   export ZES_ENABLE_SYSMAN=1
   source /opt/intel/oneapi/setvars.sh
   export SYCL_CACHE_PERSISTENT=1
+  export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1

   ./ollama serve
   ```

@@ -85,6 +86,7 @@ You may launch the Ollama service as below:
   set no_proxy=localhost,127.0.0.1
   set ZES_ENABLE_SYSMAN=1
   set SYCL_CACHE_PERSISTENT=1
+  set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1

   ollama serve
   ```

@@ -92,13 +94,6 @@ You may launch the Ollama service as below:
 > [!NOTE]
 > Please set environment variable `OLLAMA_NUM_GPU` to `999` to make sure all layers of your model are running on Intel GPU, otherwise, some layers may run on CPU.

-> [!TIP]
-> If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance before executing `ollama serve`:
->
-> ```bash
-> export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
-> ```
-
 > [!NOTE]
 > To allow the service to accept connections from all IP addresses, use `OLLAMA_HOST=0.0.0.0 ./ollama serve` instead of just `./ollama serve`.

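The two notes kept by this hunk cover GPU offload and remote access. A compact illustration of both, assuming a client machine that can reach the server (the IP address below is a placeholder):

```bash
# On the server: keep every layer on the GPU and expose Ollama on all interfaces.
export OLLAMA_NUM_GPU=999
OLLAMA_HOST=0.0.0.0 ./ollama serve

# On a client machine (placeholder IP): a short "Ollama is running" style
# response indicates the service is reachable.
curl http://192.168.1.10:11434
```
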
@@ -38,13 +38,6 @@ Follow the steps in [Run Ollama with IPEX-LLM on Intel GPU Guide](./ollama_quick
 > [!IMPORTANT]
 > If the `RAGFlow` is not deployed on the same machine where Ollama is running (which means `RAGFlow` needs to connect to a remote Ollama service), you must configure the Ollama service to accept connections from any IP address. To achieve this, set or export the environment variable `OLLAMA_HOST=0.0.0.0` before executing the command `ollama serve`.

-> [!TIP]
-> If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance before executing `ollama serve`:
->
-> ```bash
-> export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
-> ```
-
 ### 2. Pull Model

 Now we need to pull a model for RAG using Ollama. Here we use [Qwen/Qwen2-7B](https://huggingface.co/Qwen/Qwen2-7B) model as an example. Open a new terminal window, run the following command to pull [`qwen2:latest`](https://ollama.com/library/qwen2).

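The context above ends just before the pull command itself; for readability, the step it introduces is the standard Ollama pull for the model tag named in the text:

```bash
# Pull the model referenced by the quickstart.
ollama pull qwen2:latest
```
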
@@ -293,7 +293,8 @@ def setup_package():
     xpu_requires = copy.deepcopy(xpu_21_requires)


-    cpp_requires = ["bigdl-core-cpp==" + CORE_XE_VERSION]
+    cpp_requires = ["bigdl-core-cpp==" + CORE_XE_VERSION,
+                    "onednn-devel==2024.0.0;platform_system=='Windows'"]
     cpp_requires += oneapi_2024_0_requires

     serving_requires = ['py-cpuinfo']

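This is the change the commit title refers to: `cpp_requires` now carries `onednn-devel==2024.0.0` guarded by a `platform_system=='Windows'` environment marker, so the oneDNN development package is only resolved on Windows installs. Assuming these requirements are exposed through a `cpp` extra (the usual pattern for a `*_requires` list in `setup.py`; the exact extra name is not visible in this hunk), installation would look roughly like:

```bash
# Sketch: install the cpp extra; the environment marker means onednn-devel
# is pulled in on Windows and skipped on Linux/macOS.
pip install --pre --upgrade "ipex-llm[cpp]"
```
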