diff --git a/docs/mddocs/Quickstart/llama_cpp_quickstart.md b/docs/mddocs/Quickstart/llama_cpp_quickstart.md
index 2bd16289..9b27c481 100644
--- a/docs/mddocs/Quickstart/llama_cpp_quickstart.md
+++ b/docs/mddocs/Quickstart/llama_cpp_quickstart.md
@@ -40,7 +40,7 @@ Visit the [Install IPEX-LLM on Linux with Intel GPU](./install_linux_gpu.md), fo
 
 #### Windows (Optional)
 
-Please make sure your GPU driver version is equal or newer than `31.0.101.5333`. If it is not, follow the instructions in [this section](./install_windows_gpu.md#optional-update-gpu-driver) to update your GPU driver; otherwise, you might encounter gibberish output.
+Please make sure your GPU driver version is equal to or newer than `31.0.101.5522`. If it is not, follow the instructions in [this section](./install_windows_gpu.md#optional-update-gpu-driver) to update your GPU driver; otherwise, you might encounter gibberish output.
 
 ### 1. Install IPEX-LLM for llama.cpp
 
@@ -146,7 +146,7 @@ Before running, you should download or copy community GGUF model to your current
 - For **Linux users**:
 
   ```bash
-  ./main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 33 --color
+  ./main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 99 --color
   ```
 
  > **Note**:
@@ -158,7 +158,7 @@ Before running, you should download or copy community GGUF model to your current
   Please run the following command in Miniforge Prompt.
 
   ```cmd
-  main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 33 --color
+  main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 99 --color
   ```
 
  > **Note**:
@@ -316,3 +316,13 @@ Also, you can use `ONEAPI_DEVICE_SELECTOR=level_zero:[gpu_id]` to select device
 
 If you run the llama.cpp program on Windows and find that your program crashes or outputs abnormally when accepting Chinese prompts, you can open `Region->Administrative->Change System locale..`, check `Beta: Use Unicode UTF-8 for worldwide language support` option and then restart your computer.
 For detailed instructions on how to do this, see [this issue](https://github.com/intel-analytics/ipex-llm/issues/10989#issuecomment-2105600469).
+
+#### `sycl7.dll` not found error
+If you encounter `System Error: sycl7.dll not found` on Windows, or a similar error on Linux, please check:
+
+1. whether you have installed conda and are in the correct conda environment, i.e. the one in which the oneAPI dependencies were installed via pip (Windows);
+2. whether you have executed `source /opt/intel/oneapi/setvars.sh` (Linux); a quick way to verify this is sketched below.
+
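+For the Linux check above, the following is a minimal sanity check before launching llama.cpp, assuming oneAPI is installed under its default `/opt/intel/oneapi` path and that the oneAPI `sycl-ls` utility is available:
+
+```bash
+# Load the oneAPI environment so the SYCL runtime libraries can be found
+source /opt/intel/oneapi/setvars.sh
+# List the devices visible to SYCL; your Intel GPU should appear in the output
+sycl-ls
+```
+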
+#### Check the driver first when you get garbage output
+If you get garbage output, please check whether your GPU driver version is equal to or newer than [31.0.101.5522](https://www.intel.cn/content/www/cn/zh/download/785597/823163/intel-arc-iris-xe-graphics-windows.html). If it is not, please follow the instructions in [this section](./install_windows_gpu.md#optional-update-gpu-driver) to update your GPU driver.
diff --git a/docs/mddocs/Quickstart/ollama_quickstart.md b/docs/mddocs/Quickstart/ollama_quickstart.md
index d4519be8..b44ae9e4 100644
--- a/docs/mddocs/Quickstart/ollama_quickstart.md
+++ b/docs/mddocs/Quickstart/ollama_quickstart.md
@@ -183,3 +183,24 @@ An example process of interacting with model with `ollama run example` looks like
+
+### Troubleshooting
+
+#### Why the model is loaded again after several minutes
+By default, Ollama unloads the model from GPU memory after 5 minutes of inactivity. With the latest version of Ollama, you can set `OLLAMA_KEEP_ALIVE=-1` to keep the model loaded in memory; a sketch of this is shown below. Reference issue: https://github.com/intel-analytics/ipex-llm/issues/11608
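+
+For example, a minimal sketch on Linux, assuming the ipex-llm `ollama` binary set up earlier in this guide is in your current directory (on Windows, set the variable with `set OLLAMA_KEEP_ALIVE=-1` in Miniforge Prompt instead):
+
+```bash
+# Keep models resident in GPU memory instead of unloading them
+# after the default 5-minute idle timeout
+export OLLAMA_KEEP_ALIVE=-1
+./ollama serve
+```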
+
+#### `exit status 0xc0000135` error when executing `ollama serve`
+When executing `ollama serve`, if you encounter `llama runner process has terminated: exit status 0xc0000135` on Windows, or `ollama_llama_server: error while loading shared libraries: libmkl_core.so.2: cannot open shared object file` on Linux, this is most likely caused by a missing SYCL dependency. Please check:
+
+1. whether you have installed conda and are in the correct conda environment, i.e. the one in which the oneAPI dependencies were installed via pip (Windows);
+2. whether you have executed `source /opt/intel/oneapi/setvars.sh` (Linux).
+
+#### Program hangs during the initial model loading stage
+When launching `ollama serve` for the first time on Windows, it may get stuck during the model loading phase. If you notice that the program hangs for a long time during the first run, you can manually input a space or another character on the server side to make sure the program is still running.
+
+#### How to distinguish the community version of Ollama from the ipex-llm version of Ollama
+In the server log of the community version of Ollama, you may see `source=payload_common.go:139 msg="Dynamic LLM libraries [rocm_v60000 cpu_avx2 cuda_v11 cpu cpu_avx]"`.
+In the server log of the ipex-llm version of Ollama, you should only see `source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"`.
+
+#### Ollama hangs when multiple different questions are asked or the context is long
+If you find that Ollama hangs when multiple different questions are asked or the context is long, and you see `update_slots : failed to free spaces in the KV cache` in the server log, this may be because the LLM context is larger than the default `n_ctx` value; you can increase `n_ctx` and try again.
\ No newline at end of file
diff --git a/python/llm/example/GPU/HuggingFace/LLM/qwen2/README.md b/python/llm/example/GPU/HuggingFace/LLM/qwen2/README.md
index dda7ba18..f0ae5e3e 100644
--- a/python/llm/example/GPU/HuggingFace/LLM/qwen2/README.md
+++ b/python/llm/example/GPU/HuggingFace/LLM/qwen2/README.md
@@ -1,5 +1,5 @@
 # Qwen2
-In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on Qwen2 models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) as a reference InternLM model.
+In this directory, you will find examples of how you could apply IPEX-LLM INT4 optimizations on Qwen2 models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) and [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) as reference Qwen2 models.
 
 ## 0. Requirements
 To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information.
@@ -131,4 +131,21 @@ Inference time: xxxx s
 What is AI?
 -------------------- Output --------------------
 AI, or Artificial Intelligence, refers to the simulation of human intelligence in machines that are programmed to think and learn like humans and mimic their actions. The term may
+```
+
+##### [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct)
+```log
+Inference time: 0.33887791633605957 s
+-------------------- Prompt --------------------
+AI是什么?
+-------------------- Output --------------------
+AI是人工智能的简称,是一种计算机科学和技术领域,旨在使机器能够完成通常需要人类智能的任务。这包括识别和理解语言、图像处理
+```
+
+```log
+Inference time: 0.340407133102417 s
+-------------------- Prompt --------------------
+What is AI?
+-------------------- Output --------------------
+Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and work like humans. It involves creating computer programs, algorithms
+```
 ```
\ No newline at end of file