updated llama.cpp and ollama quickstart (#11732)
* updated llama.cpp and ollama quickstart.md
* added qwen2-1.5B sample output
* revision on quickstart updates
* revision on quickstart updates
* revision on qwen2 readme
* added 2 troubleshoots
* troubleshoot revision
This commit is contained in:
parent 54cc9353db
commit d0c89fb715

3 changed files with 52 additions and 4 deletions
llama.cpp quickstart:

@@ -40,7 +40,7 @@ Visit the [Install IPEX-LLM on Linux with Intel GPU](./install_linux_gpu.md), fo
 
 #### Windows (Optional)
 
-Please make sure your GPU driver version is equal or newer than `31.0.101.5333`. If it is not, follow the instructions in [this section](./install_windows_gpu.md#optional-update-gpu-driver) to update your GPU driver; otherwise, you might encounter gibberish output.
+Please make sure your GPU driver version is equal to or newer than `31.0.101.5522`. If it is not, follow the instructions in [this section](./install_windows_gpu.md#optional-update-gpu-driver) to update your GPU driver; otherwise, you might encounter gibberish output.
 
 ### 1. Install IPEX-LLM for llama.cpp
 
@@ -146,7 +146,7 @@ Before running, you should download or copy community GGUF model to your current
 - For **Linux users**:
   
   ```bash
-  ./main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 33 --color
+  ./main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 99 --color
   ```
 
   > **Note**:
@@ -158,7 +158,7 @@ Before running, you should download or copy community GGUF model to your current
   Please run the following command in Miniforge Prompt.
 
   ```cmd
-  main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 33 --color
+  main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 99 --color
   ```
 
   > **Note**:
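The two hunks above change `-ngl` from 33 to 99 in both the Linux and Windows commands. `-ngl` controls how many model layers are offloaded to the GPU, and any value at or above the model's layer count offloads the whole model, so 99 acts as an "offload everything" setting for models of any common size. A minimal sketch, reusing the quickstart's flags:

```bash
# A 7B model has roughly 33 offloadable layers; -ngl 99 exceeds any common
# model's layer count, so every layer is placed on the GPU.
./main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 -t 8 -e -ngl 99 --color \
  --prompt "Once upon a time, there existed a little girl who liked to have adventures."
```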
@@ -316,3 +316,13 @@ Also, you can use `ONEAPI_DEVICE_SELECTOR=level_zero:[gpu_id]` to select device
 If you run the llama.cpp program on Windows and find that your program crashes or outputs abnormally when accepting Chinese prompts, you can open `Region->Administrative->Change System locale..`, check `Beta: Use Unicode UTF-8 for worldwide language support` option and then restart your computer.
 
 For detailed instructions on how to do this, see [this issue](https://github.com/intel-analytics/ipex-llm/issues/10989#issuecomment-2105600469).
+
+#### `sycl7.dll` not found error
+If you meet `System Error: sycl7.dll not found` on Windows, or a similar error on Linux, please check:
+
+1. on Windows: whether you have installed conda and are in the right conda environment, i.e. the one in which the oneAPI dependencies were installed via pip
+2. on Linux: whether you have executed `source /opt/intel/oneapi/setvars.sh`
+
+#### Check the driver first when you see garbage output
+If you see garbage output, please check whether your GPU driver version is >= [31.0.101.5522](https://www.intel.cn/content/www/cn/zh/download/785597/823163/intel-arc-iris-xe-graphics-windows.html). If not, follow the instructions in [this section](./install_linux_gpu.md#install-gpu-driver) to update your GPU driver.
+
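For the `sycl7.dll` / missing-library checks just above, a minimal sanity-check sequence might look like the following; the conda environment name `llm-cpp` is an assumption, and `sycl-ls` is the oneAPI utility for listing visible SYCL devices:

```bash
# Hypothetical sanity check for missing SYCL/oneAPI runtime libraries.
conda activate llm-cpp                      # assumed env where the oneAPI pip deps live
source /opt/intel/oneapi/setvars.sh         # Linux: put oneAPI libraries on the loader path
ldd ./main | grep "not found"               # Linux: surface any unresolved shared libraries
sycl-ls                                     # confirm SYCL devices are actually visible
export ONEAPI_DEVICE_SELECTOR=level_zero:0  # optional: pin to GPU 0 (syntax from the doc above)
```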
ollama quickstart:

@@ -183,3 +183,24 @@ An example process of interacting with model with `ollama run example` looks lik
 <a href="https://llm-assets.readthedocs.io/en/latest/_images/ollama_gguf_demo_image.png" target="_blank">
   <img src="https://llm-assets.readthedocs.io/en/latest/_images/ollama_gguf_demo_image.png" width=100%; />
 </a>
+
+### Troubleshooting
+
+#### Why the model is loaded again after several minutes
+By default, Ollama unloads a model from GPU memory after 5 minutes. With the latest version of Ollama, you can set `OLLAMA_KEEP_ALIVE=-1` to keep the model loaded in memory. Reference issue: https://github.com/intel-analytics/ipex-llm/issues/11608
+
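A sketch of applying the keep-alive setting above when starting the server (`-1` disables the idle unload entirely):

```bash
# Keep loaded models resident instead of unloading them after the
# default 5-minute idle window.
export OLLAMA_KEEP_ALIVE=-1
./ollama serve
```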
+#### `exit status 0xc0000135` error when executing `ollama serve`
+When executing `ollama serve`, if you meet `llama runner process has terminated: exit status 0xc0000135` on Windows, or `ollama_llama_server: error while loading shared libraries: libmkl_core.so.2: cannot open shared object file` on Linux, this is most likely caused by a missing SYCL dependency. Please check:
+
+1. on Windows: whether you have installed conda and are in the right conda environment, i.e. the one in which the oneAPI dependencies were installed via pip
+2. on Linux: whether you have executed `source /opt/intel/oneapi/setvars.sh`
+
+#### Program hangs during the initial model loading stage
+When launching `ollama serve` for the first time on Windows, it may get stuck during the model loading phase. If you notice that the program hangs for a long time during the first run, you can manually input a space or another character on the server side to make sure the program is still running.
+
+#### How to distinguish the community version of Ollama from the ipex-llm version of Ollama
+In the server log of the community version of Ollama, you may see `source=payload_common.go:139 msg="Dynamic LLM libraries [rocm_v60000 cpu_avx2 cuda_v11 cpu cpu_avx]"`.
+In the server log of the ipex-llm version of Ollama, you should only see `source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"`.
+
+#### Ollama hangs when multiple different questions are asked or the context is long
+If you find that Ollama hangs when multiple different questions are asked or the context is long, and you see `update_slots : failed to free spaces in the KV cache` in the server log, this may be because the LLM context is larger than the default `n_ctx` value; you can increase `n_ctx` and try again.
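For the `n_ctx` item above, one way to raise the context size with Ollama is through a Modelfile; this sketch assumes the GGUF file name used earlier in the quickstart, and `num_ctx` is the Modelfile parameter that maps to llama.cpp's `n_ctx`:

```bash
# Hypothetical Modelfile that raises the context window to 4096 tokens.
cat > Modelfile <<'EOF'
FROM ./mistral-7b-instruct-v0.1.Q4_K_M.gguf
PARAMETER num_ctx 4096
EOF
./ollama create example -f Modelfile
./ollama run example
```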
Qwen2 README:

@@ -1,5 +1,5 @@
 # Qwen2
-In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on Qwen2 models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) as a reference InternLM model.
+In this directory, you will find examples of how you could apply IPEX-LLM INT4 optimizations on Qwen2 models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) and [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) as reference Qwen2 models.
 
 ## 0. Requirements
 To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information.
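As a usage sketch for the newly referenced 1.5B model, an invocation might look like the following; the script name `generate.py` and its flags follow the usual pattern of ipex-llm's GPU examples and are assumptions here, so check this directory's README for the exact command:

```bash
# Hypothetical invocation; verify the actual script name and flags.
python ./generate.py --repo-id-or-model-path Qwen/Qwen2-1.5B-Instruct \
  --prompt "What is AI?" --n-predict 32
```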
@@ -131,4 +131,21 @@ Inference time: xxxx s
 What is AI?
 -------------------- Output --------------------
 AI, or Artificial Intelligence, refers to the simulation of human intelligence in machines that are programmed to think and learn like humans and mimic their actions. The term may
+```
+
+##### [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct)
+```log
+Inference time: 0.33887791633605957 s
+-------------------- Prompt --------------------
+AI是什么?
+-------------------- Output --------------------
+AI是人工智能的简称,是一种计算机科学和技术领域,旨在使机器能够完成通常需要人类智能的任务。这包括识别和理解语言、图像处理
+```
+
+```log
+Inference time: 0.340407133102417 s
+-------------------- Prompt --------------------
+What is AI?
+-------------------- Output --------------------
+Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and work like humans. It involves creating computer programs, algorithms
 ```