updated llama.cpp and ollama quickstart (#11732)

* updated llama.cpp and ollama quickstart.md
* added qwen2-1.5B sample output
* revision on quickstart updates
* revision on quickstart updates
* revision on qwen2 readme
* added 2 troubleshoots
* troubleshoot revision
2024-08-08 11:04:01 +08:00 · d0c89fb715
commit d0c89fb715
parent 54cc9353db
3 changed files with 52 additions and 4 deletions
--- a/docs/mddocs/Quickstart/llama_cpp_quickstart.md
+++ b/docs/mddocs/Quickstart/llama_cpp_quickstart.md
@@ -40,7 +40,7 @@ Visit the [Install IPEX-LLM on Linux with Intel GPU](./install_linux_gpu.md), fo

 #### Windows (Optional)

-Please make sure your GPU driver version is equal or newer than `31.0.101.5333`. If it is not, follow the instructions in [this section](./install_windows_gpu.md#optional-update-gpu-driver) to update your GPU driver; otherwise, you might encounter gibberish output. 
+Please make sure your GPU driver version is equal or newer than `31.0.101.5522`. If it is not, follow the instructions in [this section](./install_windows_gpu.md#optional-update-gpu-driver) to update your GPU driver; otherwise, you might encounter gibberish output. 

 ### 1. Install IPEX-LLM for llama.cpp

@@ -146,7 +146,7 @@ Before running, you should download or copy community GGUF model to your current
 - For **Linux users**:
  
  ```bash
-  ./main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 33 --color
+  ./main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 99 --color
  ```

  > **Note**:
@@ -158,7 +158,7 @@ Before running, you should download or copy community GGUF model to your current
  Please run the following command in Miniforge Prompt.

  ```cmd
-  main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 33 --color
+  main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 99 --color
  ```

  > **Note**:
@@ -316,3 +316,13 @@ Also, you can use `ONEAPI_DEVICE_SELECTOR=level_zero:[gpu_id]` to select device
 If you run the llama.cpp program on Windows and find that your program crashes or outputs abnormally when accepting Chinese prompts, you can open `Region->Administrative->Change System locale..`, check `Beta: Use Unicode UTF-8 for worldwide language support` option and then restart your computer.

 For detailed instructions on how to do this, see [this issue](https://github.com/intel-analytics/ipex-llm/issues/10989#issuecomment-2105600469).
+
+#### sycl7.dll not found error
+If you encounter `System Error: sycl7.dll not found` on Windows, or a similar error on Linux, check the following:

+1. On Windows: conda is installed, and you are in the correct conda environment, the one where the oneAPI dependencies were installed with pip.
+2. On Linux: you have run `source /opt/intel/oneapi/setvars.sh`.
+
+#### Check the GPU driver first if you get garbage output
+If you get garbage output, check whether your GPU driver version is >= [31.0.101.5522](https://www.intel.cn/content/www/cn/zh/download/785597/823163/intel-arc-iris-xe-graphics-windows.html). If it is not, follow the instructions in [this section](./install_linux_gpu.md#install-gpu-driver) to update your GPU driver.
+
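The driver check described in the added troubleshooting entry comes down to comparing two dotted version strings. A minimal sketch in bash, assuming GNU `sort -V` is available: `version_ge` is a hypothetical helper (not part of IPEX-LLM), and `CURRENT_DRIVER` is a placeholder you would replace with the version reported by your own system.

```bash
# Hypothetical helper (not part of IPEX-LLM): succeeds when $1 >= $2,
# comparing dotted version strings with GNU sort's version ordering.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

MIN_DRIVER="31.0.101.5522"      # minimum version named in this quickstart
CURRENT_DRIVER="31.0.101.5333"  # placeholder: replace with your own driver version

if version_ge "$CURRENT_DRIVER" "$MIN_DRIVER"; then
  echo "driver OK"
else
  echo "driver too old, please update"
fi
```

A plain string comparison would give the wrong ordering for versions such as `31.0.101.999` versus `31.0.101.5522`, which is why the sketch relies on version-aware sorting.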
--- a/docs/mddocs/Quickstart/ollama_quickstart.md
+++ b/docs/mddocs/Quickstart/ollama_quickstart.md
@@ -183,3 +183,24 @@ An example process of interacting with model with `ollama run example` looks lik
 <a href="https://llm-assets.readthedocs.io/en/latest/_images/ollama_gguf_demo_image.png" target="_blank">
  <img src="https://llm-assets.readthedocs.io/en/latest/_images/ollama_gguf_demo_image.png" width=100%; />
 </a>
+
+### Troubleshooting
+
+#### Why is the model reloaded after several minutes?
+By default, Ollama unloads the model from GPU memory after 5 minutes. With recent versions of Ollama, you can set `OLLAMA_KEEP_ALIVE=-1` to keep the model loaded in memory. Reference issue: https://github.com/intel-analytics/ipex-llm/issues/11608
+
+#### `exit status 0xc0000135` error when executing `ollama serve`
+If `ollama serve` fails with `llama runner process has terminated: exit status 0xc0000135` on Windows, or with `ollama_llama_server: error while loading shared libraries: libmkl_core.so.2: cannot open shared object file` on Linux, the most likely cause is missing SYCL dependencies. Check the following:

+1. On Windows: conda is installed, and you are in the correct conda environment, the one where the oneAPI dependencies were installed with pip.
+2. On Linux: you have run `source /opt/intel/oneapi/setvars.sh`.
+
+#### Program hangs during the initial model loading stage
+When you launch `ollama serve` for the first time on Windows, it may get stuck during the model loading phase. If the program hangs for a long time during the first run, you can type a space or any other character in the server terminal to ensure the program is running.
+
+#### How to distinguish the community version of Ollama from the ipex-llm version of Ollama
+In the server log of the community version of Ollama, you may see `source=payload_common.go:139 msg="Dynamic LLM libraries [rocm_v60000 cpu_avx2 cuda_v11 cpu cpu_avx]"`.
+In the server log of the ipex-llm version of Ollama, you should see only `source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"`.
+
+#### Ollama hangs when multiple different questions are asked or the context is long
+If Ollama hangs when multiple different questions are asked or the context is long, and you see `update_slots : failed to free spaces in the KV cache` in the server log, the LLM context may have grown larger than the default `n_ctx` value. Increase `n_ctx` and try again.
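For the `OLLAMA_KEEP_ALIVE` entry in the troubleshooting section above, a minimal sketch of how the variable might be set on Linux. It assumes the variable is exported in the same shell that later starts the server; the launch command is left commented out because the exact binary path depends on your setup.

```bash
# Sketch, assuming a bash shell on Linux: export the variable before
# starting the server so the server process inherits it.
export OLLAMA_KEEP_ALIVE=-1   # -1 keeps the model loaded in memory

# Confirm the value that the server process will inherit.
echo "OLLAMA_KEEP_ALIVE=${OLLAMA_KEEP_ALIVE}"

# Then start the server from this same shell, for example:
# ollama serve
```

Setting the variable in a different terminal from the one that runs the server would have no effect, since environment variables are inherited per process.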
--- a/python/llm/example/GPU/HuggingFace/LLM/qwen2/README.md
+++ b/python/llm/example/GPU/HuggingFace/LLM/qwen2/README.md
@@ -1,5 +1,5 @@
 # Qwen2
-In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on Qwen2 models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) as a reference InternLM model.
+In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on Qwen2 models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) and [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) as reference Qwen2 models.

 ## 0. Requirements
 To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information.
@@ -131,4 +131,21 @@ Inference time: xxxx s
 What is AI?
 -------------------- Output --------------------
 AI, or Artificial Intelligence, refers to the simulation of human intelligence in machines that are programmed to think and learn like humans and mimic their actions. The term may
+```
+
+##### [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct)
+```log
+Inference time: 0.33887791633605957 s
+-------------------- Prompt --------------------
+AI是什么？
+-------------------- Output --------------------
+AI是人工智能的简称，是一种计算机科学和技术领域，旨在使机器能够完成通常需要人类智能的任务。这包括识别和理解语言、图像处理
+```
+
+```log
+Inference time: 0.340407133102417 s
+-------------------- Prompt --------------------
+What is AI?
+-------------------- Output --------------------
+Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and work like humans. It involves creating computer programs, algorithms
 ```