updated llama.cpp and ollama quickstart (#11732)

* updated llama.cpp and ollama quickstart.md

* added qwen2-1.5B sample output

* revision on quickstart updates

* revision on quickstart updates

* revision on qwen2 readme

* added 2 troubleshoots

* troubleshoot revision
Jinhe 2024-08-08 11:04:01 +08:00 committed by GitHub
parent 54cc9353db
commit d0c89fb715
3 changed files with 52 additions and 4 deletions


@ -40,7 +40,7 @@ Visit the [Install IPEX-LLM on Linux with Intel GPU](./install_linux_gpu.md), fo
#### Windows (Optional)
Please make sure your GPU driver version is equal or newer than `31.0.101.5333`. If it is not, follow the instructions in [this section](./install_windows_gpu.md#optional-update-gpu-driver) to update your GPU driver; otherwise, you might encounter gibberish output.
Please make sure your GPU driver version is equal to or newer than `31.0.101.5522`. If it is not, follow the instructions in [this section](./install_windows_gpu.md#optional-update-gpu-driver) to update your GPU driver; otherwise, you might encounter gibberish output.
### 1. Install IPEX-LLM for llama.cpp
@ -146,7 +146,7 @@ Before running, you should download or copy community GGUF model to your current
- For **Linux users**:
```bash
./main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 33 --color
./main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 99 --color
```
> **Note**:
@ -158,7 +158,7 @@ Before running, you should download or copy community GGUF model to your current
Please run the following command in Miniforge Prompt.
```cmd
main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 33 --color
main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 99 --color
```
> **Note**:
@ -316,3 +316,13 @@ Also, you can use `ONEAPI_DEVICE_SELECTOR=level_zero:[gpu_id]` to select device
If you run the llama.cpp program on Windows and find that your program crashes or outputs abnormally when accepting Chinese prompts, you can open `Region->Administrative->Change System locale..`, check `Beta: Use Unicode UTF-8 for worldwide language support` option and then restart your computer.
For detailed instructions on how to do this, see [this issue](https://github.com/intel-analytics/ipex-llm/issues/10989#issuecomment-2105600469).
#### sycl7.dll not found error
If you encounter `System Error: sycl7.dll not found` on Windows, or a similar error on Linux, please check:
1. On Windows: that conda is installed and that you are in the correct conda environment, i.e. the one in which the oneAPI dependencies were installed with pip.
2. On Linux: that you have executed `source /opt/intel/oneapi/setvars.sh`.
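The Linux check above can be sketched as a small guard before launching llama.cpp. This is only an illustration: the path is the one quoted above, and the `SETVARS` and `ONEAPI_ENV_STATUS` variable names are hypothetical, introduced here for the example.

```bash
# Path taken from the check above; adjust it if oneAPI is installed elsewhere.
SETVARS=/opt/intel/oneapi/setvars.sh

if [ -f "$SETVARS" ]; then
  # Load the oneAPI environment into the current shell (equivalent to `source`).
  . "$SETVARS"
  ONEAPI_ENV_STATUS="loaded"
else
  # Without this script the SYCL runtime libraries will not be on the library path.
  echo "oneAPI setvars.sh not found at $SETVARS" >&2
  ONEAPI_ENV_STATUS="missing"
fi
```

Run it in the same shell session that will start the program, since the environment set by `setvars.sh` does not persist across terminals.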
#### Check the GPU driver first if you see garbage output
If you see garbage output, please check whether your GPU driver version is >= [31.0.101.5522](https://www.intel.cn/content/www/cn/zh/download/785597/823163/intel-arc-iris-xe-graphics-windows.html). If it is not, please follow the instructions in [this section](./install_linux_gpu.md#install-gpu-driver) to update your GPU driver.


@ -183,3 +183,24 @@ An example process of interacting with model with `ollama run example` looks lik
<a href="https://llm-assets.readthedocs.io/en/latest/_images/ollama_gguf_demo_image.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/ollama_gguf_demo_image.png" width=100%; />
</a>
### Troubleshooting
#### Why the model is always reloaded after several minutes
By default, Ollama unloads the model from GPU memory after 5 minutes of inactivity. With recent versions of Ollama, you can set `OLLAMA_KEEP_ALIVE=-1` to keep the model loaded in memory. Reference issue: https://github.com/intel-analytics/ipex-llm/issues/11608
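As a minimal sketch for Linux, the variable can be exported in the shell that starts the server. The `ollama serve` line is left commented out so the snippet only prepares the environment; on Windows, `set OLLAMA_KEEP_ALIVE=-1` in Miniforge Prompt should play the same role.

```bash
# Keep loaded models in memory indefinitely instead of unloading them after the default timeout.
export OLLAMA_KEEP_ALIVE=-1

# Start the server from this same shell so it inherits the variable:
# ollama serve
```

The setting only takes effect for a server process started after the variable is exported, so restart `ollama serve` if it is already running.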
#### `exit status 0xc0000135` error when executing `ollama serve`
If `ollama serve` fails with `llama runner process has terminated: exit status 0xc0000135` on Windows, or with `ollama_llama_server: error while loading shared libraries: libmkl_core.so.2: cannot open shared object file` on Linux, the most likely cause is missing SYCL dependencies. Please check:
1. On Windows: that conda is installed and that you are in the correct conda environment, i.e. the one in which the oneAPI dependencies were installed with pip.
2. On Linux: that you have executed `source /opt/intel/oneapi/setvars.sh`.
#### Program hangs during the initial model loading stage
When launching `ollama serve` for the first time on Windows, it may get stuck during the model loading phase. If you notice that the program hangs for a long time during the first run, you can manually type a space or any other character in the server terminal to ensure the program is running.
#### How to distinguish the community version of Ollama from the ipex-llm version of Ollama
In the server log of the community version of Ollama, you may see `source=payload_common.go:139 msg="Dynamic LLM libraries [rocm_v60000 cpu_avx2 cuda_v11 cpu cpu_avx]"`.
In the server log of the ipex-llm version of Ollama, however, you should only see `source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"`.
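If the server output has been saved to a file, the comparison above can be automated with a rough heuristic. The sketch below is an assumption-laden illustration, not an official check: `classify_ollama_log` is a hypothetical helper, and it relies only on the two example log lines quoted above, treating any `rocm` or `cuda` entry in the library list as a sign of the community build.

```bash
# Hypothetical helper: guess the Ollama build from a saved server log file.
classify_ollama_log() {
  # Take the first line that lists the dynamic LLM libraries.
  line=$(grep -m1 'Dynamic LLM libraries' "$1") || { echo "unknown"; return; }
  case "$line" in
    # Heuristic: only the community build lists ROCm/CUDA runners in the examples above.
    *rocm*|*cuda*) echo "community" ;;
    *) echo "ipex-llm" ;;
  esac
}
```

For example, `classify_ollama_log server.log` prints `community`, `ipex-llm`, or `unknown` when no matching line is found; `server.log` is a placeholder for wherever the output of `ollama serve` was redirected.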
#### Ollama hangs when multiple different questions are asked or the context is long
If Ollama hangs when multiple different questions are asked or the context is long, and you see `update_slots : failed to free spaces in the KV cache` in the server log, the LLM context may be larger than the default `n_ctx` value. You can increase `n_ctx` and try again.
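One possible way to raise the context size is through a Modelfile. This sketch assumes that Ollama's `num_ctx` Modelfile parameter is what controls the `n_ctx` value mentioned above; the GGUF path and the value `4096` are placeholders chosen for illustration, not recommendations from this guide.

```bash
# Write a Modelfile that requests a larger context window.
# The model path and the context size below are placeholders.
cat > Modelfile <<'EOF'
FROM ./mistral-7b-instruct-v0.1.Q4_K_M.gguf
PARAMETER num_ctx 4096
EOF

# Then rebuild and run the model with the new setting:
# ollama create example -f Modelfile
# ollama run example
```

A larger context window uses more GPU memory for the KV cache, so increase the value gradually.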


@ -1,5 +1,5 @@
# Qwen2
In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on Qwen2 models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) as a reference InternLM model.
In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on Qwen2 models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) and [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) as reference Qwen2 models.
## 0. Requirements
To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information.
@ -131,4 +131,21 @@ Inference time: xxxx s
What is AI?
-------------------- Output --------------------
AI, or Artificial Intelligence, refers to the simulation of human intelligence in machines that are programmed to think and learn like humans and mimic their actions. The term may
```
##### [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct)
```log
Inference time: 0.33887791633605957 s
-------------------- Prompt --------------------
AI是什么
-------------------- Output --------------------
AI是人工智能的简称是一种计算机科学和技术领域旨在使机器能够完成通常需要人类智能的任务。这包括识别和理解语言、图像处理
```
```log
Inference time: 0.340407133102417 s
-------------------- Prompt --------------------
What is AI?
-------------------- Output --------------------
Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and work like humans. It involves creating computer programs, algorithms
```