diff --git a/docs/mddocs/DockerGuides/docker_cpp_xpu_quickstart.md b/docs/mddocs/DockerGuides/docker_cpp_xpu_quickstart.md new file mode 100644 index 00000000..92156d25 --- /dev/null +++ b/docs/mddocs/DockerGuides/docker_cpp_xpu_quickstart.md @@ -0,0 +1,221 @@ +## Run llama.cpp/Ollama/Open-WebUI on an Intel GPU via Docker + +## Quick Start + +### Install Docker + +1. Linux Installation + + Follow the instructions in this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/docker_windows_gpu.html#linux) to install Docker on Linux. + +2. Windows Installation + + For Windows installation, refer to this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/docker_windows_gpu.html#install-docker-desktop-for-windows). + +#### Setting Docker on windows + +Need to enable `--net=host`,follow [this guide](https://docs.docker.com/network/drivers/host/#docker-desktop) so that you can easily access the service running on the docker. The [v6.1x kernel version wsl]( https://learn.microsoft.com/en-us/community/content/wsl-user-msft-kernel-v6#1---building-the-microsoft-linux-kernel-v61x) is recommended to use.Otherwise, you may encounter the blocking issue before loading the model to GPU. + +### Pull the latest image +```bash +# This image will be updated every day +docker pull intelanalytics/ipex-llm-inference-cpp-xpu:latest +``` + +### Start Docker Container + +```eval_rst +.. tabs:: + .. tab:: Linux + + To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container. Select the device you are running(device type:(Max, Flex, Arc, iGPU)). And change the `/path/to/models` to mount the models. `bench_model` is used to benchmark quickly. If want to benchmark, make sure it on the `/path/to/models` + + .. code-block:: bash + + #/bin/bash + export DOCKER_IMAGE=intelanalytics/ipex-llm-inference-cpp-xpu:latest + export CONTAINER_NAME=ipex-llm-inference-cpp-xpu-container + sudo docker run -itd \ + --net=host \ + --device=/dev/dri \ + -v /path/to/models:/models \ + -e no_proxy=localhost,127.0.0.1 \ + --memory="32G" \ + --name=$CONTAINER_NAME \ + -e bench_model="mistral-7b-v0.1.Q4_0.gguf" \ + -e DEVICE=Arc \ + --shm-size="16g" \ + $DOCKER_IMAGE + + .. tab:: Windows + + To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container. And change the `/path/to/models` to mount the models. Then add `--privileged` and map the `/usr/lib/wsl` to the docker. + + .. code-block:: bash + + #/bin/bash + export DOCKER_IMAGE=intelanalytics/ipex-llm-inference-cpp-xpu:latest + export CONTAINER_NAME=ipex-llm-inference-cpp-xpu-container + sudo docker run -itd \ + --net=host \ + --device=/dev/dri \ + --privileged \ + -v /path/to/models:/models \ + -v /usr/lib/wsl:/usr/lib/wsl \ + -e no_proxy=localhost,127.0.0.1 \ + --memory="32G" \ + --name=$CONTAINER_NAME \ + -e bench_model="mistral-7b-v0.1.Q4_0.gguf" \ + -e DEVICE=Arc \ + --shm-size="16g" \ + $DOCKER_IMAGE + +``` + + +After the container is booted, you could get into the container through `docker exec`. + +```bash +docker exec -it ipex-llm-inference-cpp-xpu-container /bin/bash +``` + +To verify the device is successfully mapped into the container, run `sycl-ls` to check the result. 
In a machine with Arc A770, the sampled output is:
+
+```bash
+root@arda-arc12:/# sycl-ls
+[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
+[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000]
+[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
+[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
+```
+
+
+### Quick benchmark for llama.cpp
+
+Note that performance in a Windows WSL Docker container is slightly slower than on the Windows host; this is caused by the WSL kernel implementation.
+
+```bash
+bash /llm/scripts/benchmark_llama-cpp.sh
+```
+
+The benchmark runs three times to warm up and obtain accurate results. Example output:
+```bash
+llama_print_timings:        load time =    xxx ms
+llama_print_timings:      sample time =    xxx ms /   128 runs   (  xxx ms per token,  xxx tokens per second)
+llama_print_timings: prompt eval time =    xxx ms /   xxx tokens (  xxx ms per token,  xxx tokens per second)
+llama_print_timings:        eval time =    xxx ms /   127 runs   (  xxx ms per token,  xxx tokens per second)
+llama_print_timings:       total time =    xxx ms /   xxx tokens
+```
+
+### Running llama.cpp inference with IPEX-LLM on Intel GPU
+
+```bash
+cd /llm/scripts/
+# set the recommended Env
+source ipex-llm-init --gpu --device $DEVICE
+# mount models and change the model_path in `start-llama-cpp.sh`
+bash start-llama-cpp.sh
+```
+
+Example output:
+```bash
+llama_print_timings:        load time =    xxx ms
+llama_print_timings:      sample time =    xxx ms /    32 runs   (  xxx ms per token,  xxx tokens per second)
+llama_print_timings: prompt eval time =    xxx ms /   xxx tokens (  xxx ms per token,  xxx tokens per second)
+llama_print_timings:        eval time =    xxx ms /    31 runs   (  xxx ms per token,  xxx tokens per second)
+llama_print_timings:       total time =    xxx ms /   xxx tokens
+```
+
+Please refer to this [documentation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html) for more details.
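+
+If you prefer to call the llama.cpp binary directly instead of going through `start-llama-cpp.sh`, a minimal sketch is shown below. The working directory and binary name (`main` under `/llm/llama-cpp`) are assumptions — check the contents of `start-llama-cpp.sh` inside the container for the exact command it runs — while the flags themselves (`-m`, `-p`, `-n`, `-c`, `-ngl`) are standard llama.cpp options:
+
+```bash
+# minimal sketch -- the directory and binary name are assumptions; see start-llama-cpp.sh for the real command
+cd /llm/llama-cpp
+source ipex-llm-init --gpu --device $DEVICE
+# -m: GGUF model file, -p: prompt, -n: tokens to generate, -c: context size, -ngl: layers offloaded to the GPU
+./main -m /models/mistral-7b-v0.1.Q4_0.gguf -p "Once upon a time," -n 32 -c 1024 -ngl 33
+```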
+ + +### Running Ollama serving with IPEX-LLM on Intel GPU + +Running the ollama on the background, you can see the ollama.log in `/root/ollama/ollama.log` +```bash +cd /llm/scripts/ +# set the recommended Env +source ipex-llm-init --gpu --device $DEVICE +bash start-ollama.sh # ctrl+c to exit, and the ollama serve will run on the background +``` + +Sample output: +```bash +time=2024-05-16T10:45:33.536+08:00 level=INFO source=images.go:697 msg="total blobs: 0" +time=2024-05-16T10:45:33.536+08:00 level=INFO source=images.go:704 msg="total unused blobs removed: 0" +time=2024-05-16T10:45:33.536+08:00 level=INFO source=routes.go:1044 msg="Listening on 127.0.0.1:11434 (version 0.0.0)" +time=2024-05-16T10:45:33.537+08:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama751325299/runners +time=2024-05-16T10:45:33.565+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]" +time=2024-05-16T10:45:33.565+08:00 level=INFO source=gpu.go:122 msg="Detecting GPUs" +time=2024-05-16T10:45:33.566+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" +``` + +#### Run Ollama models (interactive) + +```bash +cd /llm/ollama +# create a file named Modelfile +FROM /models/mistral-7b-v0.1.Q4_0.gguf +TEMPLATE [INST] {{ .Prompt }} [/INST] +PARAMETER num_predict 64 + +# create example and run it on console +./ollama create example -f Modelfile +./ollama run example +``` + +An example process of interacting with model with `ollama run example` looks like the following: + + + + + + +#### Pull models from ollama to serve + +```bash +cd /llm/ollama +./ollama pull llama2 +``` + +Use the Curl to Test: +```bash +curl http://localhost:11434/api/generate -d ' +{ + "model": "llama2", + "prompt": "What is AI?", + "stream": false +}' +``` + +Sample output: +```bash +{"model":"llama2","created_at":"2024-05-16T02:52:18.972296097Z","response":"\nArtificial intelligence (AI) refers to the development of computer systems that can perform tasks that typically require human intelligence, such as learning, problem-solving, and decision-making. AI systems use algorithms and data to mimic human behavior and perform tasks such as:\n\n1. Image recognition: AI can identify objects in images and classify them into different categories.\n2. Natural Language Processing (NLP): AI can understand and generate human language, allowing it to interact with humans through voice assistants or chatbots.\n3. Predictive analytics: AI can analyze data to make predictions about future events, such as stock prices or weather patterns.\n4. Robotics: AI can control robots that perform tasks such as assembly, maintenance, and logistics.\n5. Recommendation systems: AI can suggest products or services based on a user's past behavior or preferences.\n6. Autonomous vehicles: AI can control self-driving cars that can navigate through roads and traffic without human intervention.\n7. Fraud detection: AI can identify and flag fraudulent transactions, such as credit card purchases or insurance claims.\n8. Personalized medicine: AI can analyze genetic data to provide personalized medical recommendations, such as drug dosages or treatment plans.\n9. Virtual assistants: AI can interact with users through voice or text interfaces, providing information or completing tasks.\n10. Sentiment analysis: AI can analyze text or speech to determine the sentiment or emotional tone of a message.\n\nThese are just a few examples of what AI can do. 
As the technology continues to evolve, we can expect to see even more innovative applications of AI in various industries and aspects of our lives.","done":true,"context":[xxx,xxx],"total_duration":12831317190,"load_duration":6453932096,"prompt_eval_count":25,"prompt_eval_duration":254970000,"eval_count":390,"eval_duration":6079077000} +``` + + +Please refer to this [documentation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/ollama_quickstart.html#pull-model) for more details. + + +### Running Open WebUI with Intel GPU + +Start the ollama and load the model first, then use the open-webui to chat. +If you have difficulty accessing the huggingface repositories, you may use a mirror, e.g. add export HF_ENDPOINT=https://hf-mirror.com before running bash start.sh. +```bash +cd /llm/scripts/ +bash start-open-webui.sh +``` + +Sample output: +```bash +INFO: Started server process [1055] +INFO: Waiting for application startup. +INFO: Application startup complete. +INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit) +``` + + + + + +For how to log-in or other guide, Please refer to this [documentation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/open_webui_with_ollama_quickstart.html) for more details. diff --git a/docs/mddocs/DockerGuides/docker_pytorch_inference_gpu.md b/docs/mddocs/DockerGuides/docker_pytorch_inference_gpu.md new file mode 100644 index 00000000..76409384 --- /dev/null +++ b/docs/mddocs/DockerGuides/docker_pytorch_inference_gpu.md @@ -0,0 +1,171 @@ +# Python Inference using IPEX-LLM on Intel GPU + +We can run PyTorch Inference Benchmark, Chat Service and PyTorch Examples on Intel GPUs within Docker (on Linux or WSL). + +```eval_rst +.. note:: + + The current Windows + WSL + Docker solution only supports Arc series dGPU. For Windows users with MTL iGPU, it is recommended to install directly via pip install in Miniforge Prompt. Refer to `this guide `_. + +``` + +## Install Docker + +Follow the [Docker installation Guide](./docker_windows_gpu.html#install-docker) to install docker on either Linux or Windows. + +## Launch Docker + +Prepare ipex-llm-xpu Docker Image: +```bash +docker pull intelanalytics/ipex-llm-xpu:latest +``` + +Start ipex-llm-xpu Docker Container: + +```eval_rst +.. tabs:: + .. tab:: Linux + + .. code-block:: bash + + export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest + export CONTAINER_NAME=my_container + export MODEL_PATH=/llm/models[change to your model path] + + docker run -itd \ + --net=host \ + --device=/dev/dri \ + --memory="32G" \ + --name=$CONTAINER_NAME \ + --shm-size="16g" \ + -v $MODEL_PATH:/llm/models \ + $DOCKER_IMAGE + + .. tab:: Windows WSL + + .. code-block:: bash + + #/bin/bash + export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest + export CONTAINER_NAME=my_container + export MODEL_PATH=/llm/models[change to your model path] + + sudo docker run -itd \ + --net=host \ + --privileged \ + --device /dev/dri \ + --memory="32G" \ + --name=$CONTAINER_NAME \ + --shm-size="16g" \ + -v $MODEL_PATH:/llm/llm-models \ + -v /usr/lib/wsl:/usr/lib/wsl \ + $DOCKER_IMAGE +``` + + +Access the container: +``` +docker exec -it $CONTAINER_NAME bash +``` +To verify the device is successfully mapped into the container, run `sycl-ls` to check the result. 
In a machine with Arc A770, the sampled output is:
+
+```bash
+root@arda-arc12:/# sycl-ls
+[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
+[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000]
+[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
+[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
+```
+
+```eval_rst
+.. tip::
+
+   You can run the Env-Check script to verify your ipex-llm installation and runtime environment.
+
+   .. code-block:: bash
+
+      cd /ipex-llm/python/llm/scripts
+      bash env-check.sh
+
+
+```
+
+## Run Inference Benchmark
+
+Navigate to the benchmark directory, and modify the `config.yaml` under the `all-in-one` folder for benchmark configurations.
+```bash
+cd /benchmark/all-in-one
+vim config.yaml
+```
+
+In `config.yaml`, change `repo_id` to the model you want to test and `local_model_hub` to point to your model hub path.
+
+```yaml
+...
+repo_id:
+  - 'meta-llama/Llama-2-7b-chat-hf'
+local_model_hub: '/path/to/your/model/folder'
+...
+```
+
+After modifying `config.yaml`, run the following commands to start benchmarking:
+```bash
+source ipex-llm-init --gpu --device
+python run.py
+```
+
+
+**Result Interpretation**
+
+After the benchmarking is completed, a CSV result file is generated under the current folder. Focus mainly on the columns `1st token avg latency (ms)` and `2+ avg latency (ms/token)`. You can also check whether the column `actual input/output tokens` is consistent with the column `input/output tokens`, and whether the parameters you specified in `config.yaml` have been successfully applied in the benchmarking.
+
+
+## Run Chat Service
+
+We provide `chat.py` for conversational AI.
+
+For example, if your model is Llama-2-7b-chat-hf and mounted on /llm/models, you can execute the following command to initiate a conversation:
+  ```bash
+  cd /llm
+  python chat.py --model-path /llm/models/Llama-2-7b-chat-hf
+  ```
+
+Here is a demonstration:
+
+
+
+ +## Run PyTorch Examples + +We provide several PyTorch examples that you could apply IPEX-LLM INT4 optimizations on models on Intel GPUs + +For example, if your model is Llama-2-7b-chat-hf and mounted on /llm/models, you can navigate to /examples/llama2 directory, excute the following command to run example: + ```bash + cd /examples/ + python ./generate.py --repo-id-or-model-path /llm/models/Llama-2-7b-chat-hf --prompt PROMPT --n-predict N_PREDICT + ``` + + +Arguments info: +- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Llama2 model (e.g. `meta-llama/Llama-2-7b-chat-hf` and `meta-llama/Llama-2-13b-chat-hf`) to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'meta-llama/Llama-2-7b-chat-hf'`. +- `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'What is AI?'`. +- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`. + +**Sample Output** +```log +Inference time: xxxx s +-------------------- Prompt -------------------- +[INST] <> + +<> + +What is AI? [/INST] +-------------------- Output -------------------- +[INST] <> + +<> + +What is AI? [/INST] Artificial intelligence (AI) is the broader field of research and development aimed at creating machines that can perform tasks that typically require human intelligence, +``` \ No newline at end of file diff --git a/docs/mddocs/DockerGuides/docker_run_pytorch_inference_in_vscode.md b/docs/mddocs/DockerGuides/docker_run_pytorch_inference_in_vscode.md new file mode 100644 index 00000000..9a07609d --- /dev/null +++ b/docs/mddocs/DockerGuides/docker_run_pytorch_inference_in_vscode.md @@ -0,0 +1,139 @@ +# Run/Develop PyTorch in VSCode with Docker on Intel GPU + +An IPEX-LLM container is a pre-configured environment that includes all necessary dependencies for running LLMs on Intel GPUs. + +This guide provides steps to run/develop PyTorch examples in VSCode with Docker on Intel GPUs. + +```eval_rst +.. note:: + + This guide assumes you have already installed VSCode in your environment. + + To run/develop on Windows, install VSCode and then follow the steps below. + + To run/develop on Linux, you might open VSCode first and SSH to a remote Linux machine, then proceed with the following steps. + +``` + + +## Install Docker + +Follow the [Docker installation Guide](./docker_windows_gpu.html#install-docker) to install docker on either Linux or Windows. + +## Install Extensions for VSCcode + +#### Install Dev Containers Extension +For both Linux/Windows, you will need to Install Dev Containers extension. + +Open the Extensions view in VSCode (you can use the shortcut `Ctrl+Shift+X`), then search for and install the `Dev Containers` extension. + + + + + + +#### Install WSL Extension for Windows + +For Windows, you will need to install wsl extension to to the WSL environment. Open the Extensions view in VSCode (you can use the shortcut `Ctrl+Shift+X`), then search for and install the `WSL` extension. + +Press F1 to bring up the Command Palette and type in `WSL: Connect to WSL Using Distro...` and select it and then select a specific WSL distro `Ubuntu` + + + + + + + +## Launch Container + +Open the Terminal in VSCode (you can use the shortcut `` Ctrl+Shift+` ``), then pull ipex-llm-xpu Docker Image: + +```bash +docker pull intelanalytics/ipex-llm-xpu:latest +``` + +Start ipex-llm-xpu Docker Container: + +```eval_rst +.. tabs:: + .. 
tab:: Linux + + .. code-block:: bash + + export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest + export CONTAINER_NAME=my_container + export MODEL_PATH=/llm/models[change to your model path] + + docker run -itd \ + --net=host \ + --device=/dev/dri \ + --memory="32G" \ + --name=$CONTAINER_NAME \ + --shm-size="16g" \ + -v $MODEL_PATH:/llm/models \ + $DOCKER_IMAGE + + .. tab:: Windows WSL + + .. code-block:: bash + + #/bin/bash + export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest + export CONTAINER_NAME=my_container + export MODEL_PATH=/llm/models[change to your model path] + + sudo docker run -itd \ + --net=host \ + --privileged \ + --device /dev/dri \ + --memory="32G" \ + --name=$CONTAINER_NAME \ + --shm-size="16g" \ + -v $MODEL_PATH:/llm/llm-models \ + -v /usr/lib/wsl:/usr/lib/wsl \ + $DOCKER_IMAGE +``` + + +## Run/Develop Pytorch Examples + +Press F1 to bring up the Command Palette and type in `Dev Containers: Attach to Running Container...` and select it and then select `my_container` + +Now you are in a running Docker Container, Open folder `/ipex-llm/python/llm/example/GPU/HF-Transformers-AutoModels/Model/`. + + + + + +In this folder, we provide several PyTorch examples that you could apply IPEX-LLM INT4 optimizations on models on Intel GPUs. + +For example, if your model is Llama-2-7b-chat-hf and mounted on /llm/models, you can navigate to llama2 directory, excute the following command to run example: + ```bash + cd + python ./generate.py --repo-id-or-model-path /llm/models/Llama-2-7b-chat-hf --prompt PROMPT --n-predict N_PREDICT + ``` + + +Arguments info: +- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Llama2 model (e.g. `meta-llama/Llama-2-7b-chat-hf` and `meta-llama/Llama-2-13b-chat-hf`) to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'meta-llama/Llama-2-7b-chat-hf'`. +- `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'What is AI?'`. +- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`. + +**Sample Output** +```log +Inference time: xxxx s +-------------------- Prompt -------------------- +[INST] <> + +<> + +What is AI? [/INST] +-------------------- Output -------------------- +[INST] <> + +<> + +What is AI? [/INST] Artificial intelligence (AI) is the broader field of research and development aimed at creating machines that can perform tasks that typically require human intelligence, +``` + +You can develop your own PyTorch example based on these examples. diff --git a/docs/mddocs/DockerGuides/docker_windows_gpu.md b/docs/mddocs/DockerGuides/docker_windows_gpu.md new file mode 100644 index 00000000..ce536f9b --- /dev/null +++ b/docs/mddocs/DockerGuides/docker_windows_gpu.md @@ -0,0 +1,111 @@ +# Overview of IPEX-LLM Containers for Intel GPU + + +An IPEX-LLM container is a pre-configured environment that includes all necessary dependencies for running LLMs on Intel GPUs. + +This guide provides general instructions for setting up the IPEX-LLM Docker containers with Intel GPU. It begins with instructions and tips for Docker installation, and then introduce the available IPEX-LLM containers and their uses. + +## Install Docker + +### Linux + +Follow the instructions in the [Offcial Docker Guide](https://www.docker.com/get-started/) to install Docker on Linux. + + +### Windows + +```eval_rst +.. 
tip:: + + The installation requires at least 35GB of free disk space on C drive. + +``` +```eval_rst +.. note:: + + Detailed installation instructions for Windows, including steps for enabling WSL2, can be found on the [Docker Desktop for Windows installation page](https://docs.docker.com/desktop/install/windows-install/). + +``` + +#### Install Docker Desktop for Windows +Follow the instructions in [this guide](https://docs.docker.com/desktop/install/windows-install/) to install **Docker Desktop for Windows**. Restart you machine after the installation is complete. + +#### Install WSL2 + +Follow the instructions in [this guide](https://docs.microsoft.com/en-us/windows/wsl/install) to install **Windows Subsystem for Linux 2 (WSL2)**. + +```eval_rst +.. tip:: + + You may verify WSL2 installation by running the command `wsl --list` in PowerShell or Command Prompt. If WSL2 is installed, you will see a list of installed Linux distributions. +``` + +#### Enable Docker integration with WSL2 + +Open **Docker desktop**, and select `Settings`->`Resources`->`WSL integration`->turn on `Ubuntu` button->`Apply & restart`. + + + + +```eval_rst +.. tip:: + + If you encounter **Docker Engine stopped** when opening Docker Desktop, you can reopen it in administrator mode. +``` + + #### Verify Docker is enabled in WSL2 + + Execute the following commands in PowerShell or Command Prompt to verify that Docker is enabled in WSL2: + ```bash + wsl -d Ubuntu # Run Ubuntu WSL distribution + docker version # Check if Docker is enabled in WSL + ``` + +You can see the output similar to the following: + + + + + +```eval_rst +.. tip:: + + During the use of Docker in WSL, Docker Desktop needs to be kept open all the time. +``` + + +## IPEX-LLM Docker Containers + +We have several docker images available for running LLMs on Intel GPUs. The following table lists the available images and their uses: + +| Image Name | Description | Use Case | +|------------|-------------|----------| +| intelanalytics/ipex-llm-cpu:latest | CPU Inference |For development and running LLMs using llama.cpp, Ollama and Python| +| intelanalytics/ipex-llm-xpu:latest | GPU Inference |For development and running LLMs using llama.cpp, Ollama and Python| +| intelanalytics/ipex-llm-serving-cpu:latest | CPU Serving|For serving multiple users/requests through REST APIs using vLLM/FastChat| +| intelanalytics/ipex-llm-serving-xpu:latest | GPU Serving|For serving multiple users/requests through REST APIs using vLLM/FastChat| +| intelanalytics/ipex-llm-finetune-qlora-cpu-standalone:latest | CPU Finetuning via Docker|For fine-tuning LLMs using QLora/Lora, etc. | +|intelanalytics/ipex-llm-finetune-qlora-cpu-k8s:latest|CPU Finetuning via Kubernetes|For fine-tuning LLMs using QLora/Lora, etc. | +| intelanalytics/ipex-llm-finetune-qlora-xpu:latest| GPU Finetuning|For fine-tuning LLMs using QLora/Lora, etc.| + +We have also provided several quickstarts for various usage scenarios: +- [Run and develop LLM applications in PyTorch](./docker_pytorch_inference_gpu.html) + +... to be added soon. 
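+
+For example, to try out the GPU inference image from the table above, pull it and confirm that it is available locally (substitute any other image name from the table to match your use case):
+
+```bash
+# pull the GPU inference image listed in the table above
+docker pull intelanalytics/ipex-llm-xpu:latest
+
+# confirm the image is now available locally
+docker images | grep ipex-llm
+```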
+ +## Troubleshooting + + +If your machine has both an integrated GPU (iGPU) and a dedicated GPU (dGPU) like ARC, you may encounter the following issue: + +```bash +Abort was called at 62 line in file: +./shared/source/os_interface/os_interface.h +LIBXSMM_VERSION: main_stable-1.17-3651 (25693763) +LIBXSMM_TARGET: adl [Intel(R) Core(TM) i7-14700K] +Registry and code: 13 MB +Command: python chat.py --model-path /llm/llm-models/chatglm2-6b/ +Uptime: 29.349235 s +Aborted +``` +To resolve this problem, you can disable the iGPU in Device Manager on Windows. For details, refer to [this guide](https://www.elevenforum.com/t/enable-or-disable-integrated-graphics-igpu-in-windows-11.18616/) diff --git a/docs/mddocs/DockerGuides/fastchat_docker_quickstart.md b/docs/mddocs/DockerGuides/fastchat_docker_quickstart.md new file mode 100644 index 00000000..786316fd --- /dev/null +++ b/docs/mddocs/DockerGuides/fastchat_docker_quickstart.md @@ -0,0 +1,117 @@ +# FastChat Serving with IPEX-LLM on Intel GPUs via docker + +This guide demonstrates how to run `FastChat` serving with `IPEX-LLM` on Intel GPUs via Docker. + +## Install docker + +Follow the instructions in this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/docker_windows_gpu.html#linux) to install Docker on Linux. + +## Pull the latest image + +```bash +# This image will be updated every day +docker pull intelanalytics/ipex-llm-serving-xpu:latest +``` + +## Start Docker Container + + To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container. Change the `/path/to/models` to mount the models. + +``` +#/bin/bash +export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest +export CONTAINER_NAME=ipex-llm-serving-xpu-container +sudo docker run -itd \ + --net=host \ + --device=/dev/dri \ + -v /path/to/models:/llm/models \ + -e no_proxy=localhost,127.0.0.1 \ + --memory="32G" \ + --name=$CONTAINER_NAME \ + --shm-size="16g" \ + $DOCKER_IMAGE +``` + +After the container is booted, you could get into the container through `docker exec`. + +```bash +docker exec -it ipex-llm-serving-xpu-container /bin/bash +``` + + +To verify the device is successfully mapped into the container, run `sycl-ls` to check the result. In a machine with Arc A770, the sampled output is: + +```bash +root@arda-arc12:/# sycl-ls +[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000] +[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000] +[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33] +[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241] +``` + + +## Running FastChat serving with IPEX-LLM on Intel GPU in Docker + +For convenience, we have provided a script named `/llm/start-fastchat-service.sh` for you to start the service. + +However, the script only provide instructions for the most common scenarios. If this script doesn't meet your needs, you can always find the complete guidance for FastChat at [Serving using IPEX-LLM and FastChat](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/fastchat_quickstart.html#start-the-service). + +Before starting the service, you can refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#runtime-configurations) to setup our recommended runtime configurations. 
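+
+As a rough sketch, the runtime configuration for a Linux Arc A-series setup usually amounts to a few environment variables inside the container; the exact values depend on your GPU and oneAPI version, so treat the lines below as assumptions and follow the linked section for your hardware:
+
+```bash
+# typical runtime settings for Intel Arc A-series on Linux -- verify against the linked section
+source /opt/intel/oneapi/setvars.sh
+export USE_XETLA=OFF
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+```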
+
+Now we can start the FastChat service using the provided script `/llm/start-fastchat-service.sh`, as follows:
+
+```bash
+# Only the MODEL_PATH needs to be set, other parameters have default values
+export MODEL_PATH=YOUR_SELECTED_MODEL_PATH
+export LOW_BIT_FORMAT=sym_int4
+export CONTROLLER_HOST=localhost
+export CONTROLLER_PORT=21001
+export WORKER_HOST=localhost
+export WORKER_PORT=21002
+export API_HOST=localhost
+export API_PORT=8000
+
+# Use the default model_worker
+bash /llm/start-fastchat-service.sh -w model_worker
+```
+
+If everything goes smoothly, the result should be similar to the following figure:
+
+
+
+
+By default, we are using the `ipex_llm_worker` as the backend engine. You can also use `vLLM` as the backend engine. Try the following example:
+
+```bash
+# Only the MODEL_PATH needs to be set, other parameters have default values
+export MODEL_PATH=YOUR_SELECTED_MODEL_PATH
+export LOW_BIT_FORMAT=sym_int4
+export CONTROLLER_HOST=localhost
+export CONTROLLER_PORT=21001
+export WORKER_HOST=localhost
+export WORKER_PORT=21002
+export API_HOST=localhost
+export API_PORT=8000
+
+# Use the vllm_worker instead of the default model_worker
+bash /llm/start-fastchat-service.sh -w vllm_worker
+```
+
+The `vllm_worker` may start more slowly than the normal `ipex_llm_worker`. The booted service should be similar to the following figure:
+
+
+
+
+
+```eval_rst
+.. note::
+    To verify/use the service booted by the script, follow the instructions in `this guide `_.
+```
+
+After a request has been sent to the `openai_api_server`, the corresponding inference time latency can be found in the worker log as shown below:
+
+
+
diff --git a/docs/mddocs/DockerGuides/index.rst b/docs/mddocs/DockerGuides/index.rst
new file mode 100644
index 00000000..29781e52
--- /dev/null
+++ b/docs/mddocs/DockerGuides/index.rst
@@ -0,0 +1,15 @@
+IPEX-LLM Docker Container User Guides
+=====================================
+
+In this section, you will find guides related to using IPEX-LLM with Docker, covering how to:
+
+* `Overview of IPEX-LLM Containers <./docker_windows_gpu.html>`_
+
+* Inference in Python/C++
+  * `GPU Inference in Python with IPEX-LLM <./docker_pytorch_inference_gpu.html>`_
+  * `VSCode LLM Development with IPEX-LLM on Intel GPU <./docker_run_pytorch_inference_in_vscode.html>`_
+  * `llama.cpp/Ollama/Open-WebUI with IPEX-LLM on Intel GPU <./docker_cpp_xpu_quickstart.html>`_
+* Serving
+  * `FastChat with IPEX-LLM on Intel GPU <./fastchat_docker_quickstart.html>`_
+  * `vLLM with IPEX-LLM on Intel GPU <./vllm_docker_quickstart.html>`_
+  * `vLLM with IPEX-LLM on Intel CPU <./vllm_cpu_docker_quickstart.html>`_
diff --git a/docs/mddocs/DockerGuides/vllm_cpu_docker_quickstart.md b/docs/mddocs/DockerGuides/vllm_cpu_docker_quickstart.md
new file mode 100644
index 00000000..36b39ed5
--- /dev/null
+++ b/docs/mddocs/DockerGuides/vllm_cpu_docker_quickstart.md
@@ -0,0 +1,118 @@
+# vLLM Serving with IPEX-LLM on Intel CPU via Docker
+
+This guide demonstrates how to run `vLLM` serving with `ipex-llm` on Intel CPU via Docker.
+
+## Install docker
+
+Follow the instructions in this [guide](https://www.docker.com/get-started/) to install Docker on Linux.
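+
+Once Docker is installed, a quick sanity check confirms that the daemon is running before you continue:
+
+```bash
+# verify the Docker installation
+docker --version
+sudo docker run hello-world
+```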
+
+## Pull the latest image
+
+*Note: For running vLLM serving on Intel CPUs, you can currently use either the `intelanalytics/ipex-llm-serving-cpu:latest` or `intelanalytics/ipex-llm-serving-vllm-cpu:latest` Docker image.*
+
+```bash
+# This image will be updated every day
+docker pull intelanalytics/ipex-llm-serving-cpu:latest
+```
+
+## Start Docker Container
+
+To make full use of your Intel CPU for vLLM inference and serving, start the container as follows:
+```
+#/bin/bash
+export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-cpu:latest
+export CONTAINER_NAME=ipex-llm-serving-cpu-container
+sudo docker run -itd \
+        --net=host \
+        --cpuset-cpus="0-47" \
+        --cpuset-mems="0" \
+        -v /path/to/models:/llm/models \
+        -e no_proxy=localhost,127.0.0.1 \
+        --memory="64G" \
+        --name=$CONTAINER_NAME \
+        --shm-size="16g" \
+        $DOCKER_IMAGE
+```
+
+After the container is booted, you could get into the container through `docker exec`.
+
+```bash
+docker exec -it ipex-llm-serving-cpu-container /bin/bash
+```
+
+## Running vLLM serving with IPEX-LLM on Intel CPU in Docker
+
+We have included multiple vLLM-related files in `/llm/`:
+1. `vllm_offline_inference.py`: Used for the vLLM offline inference example
+2. `benchmark_vllm_throughput.py`: Used for benchmarking throughput
+3. `payload-1024.lua`: Used for testing requests per second using 1k-128 requests
+4. `start-vllm-service.sh`: Template for starting the vLLM service
+
+Before performing benchmarks or starting the service, you can refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_cpu.html#environment-setup) to set up our recommended runtime configurations.
+
+### Service
+
+A script named `/llm/start-vllm-service.sh` has been included in the image for starting the service conveniently.
+
+Modify the `model` and `served_model_name` in the script so that they fit your requirements. The `served_model_name` indicates the model name used in the API.
+
+Then start the service using `bash /llm/start-vllm-service.sh`.
+
+If the service has booted successfully, you should see output similar to the following figure:
+
+
+
+
+
+#### Verify
+After the service has been booted successfully, you can send a test request using `curl`. Here, `YOUR_MODEL` should be set equal to `served_model_name` in your booting script, e.g. `Qwen1.5`.
+
+```bash
+curl http://localhost:8000/v1/completions \
+-H "Content-Type: application/json" \
+-d '{
+  "model": "YOUR_MODEL",
+  "prompt": "San Francisco is a",
+  "max_tokens": 128,
+  "temperature": 0
+}' | jq '.choices[0].text'
+```
+
+Below shows an example output using `Qwen1.5-7B-Chat` with low-bit format `sym_int4`:
+
+
+
+
+#### Tuning
+
+You can tune the service using the following arguments:
+- `--max-model-len`
+- `--max-num-batched-tokens`
+- `--max-num-seqs`
+
+You can refer to this [doc](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#service) for a detailed explanation of these parameters.
+
+### Benchmark
+
+#### Online benchmark through api_server
+
+We can benchmark the api_server to get an estimate of TPS (transactions per second). To do so, you need to start the service first according to the instructions mentioned above.
+
+Then in the container, do the following:
+1. Modify `/llm/payload-1024.lua` so that the "model" attribute is correct. By default, we use a prompt that is roughly 1024 tokens long; you can change it if needed.
+2. 
Start the benchmark using `wrk` using the script below: + +```bash +cd /llm +# warmup +wrk -t4 -c4 -d3m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h +# You can change -t and -c to control the concurrency. +# By default, we use 8 connections to benchmark the service. +wrk -t8 -c8 -d15m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h +``` + +#### Offline benchmark through benchmark_vllm_throughput.py + +Please refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#performing-benchmark) on how to use `benchmark_vllm_throughput.py` for benchmarking. diff --git a/docs/mddocs/DockerGuides/vllm_docker_quickstart.md b/docs/mddocs/DockerGuides/vllm_docker_quickstart.md new file mode 100644 index 00000000..eb7fff3e --- /dev/null +++ b/docs/mddocs/DockerGuides/vllm_docker_quickstart.md @@ -0,0 +1,146 @@ +# vLLM Serving with IPEX-LLM on Intel GPUs via Docker + +This guide demonstrates how to run `vLLM` serving with `IPEX-LLM` on Intel GPUs via Docker. + +## Install docker + +Follow the instructions in this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/docker_windows_gpu.html#linux) to install Docker on Linux. + +## Pull the latest image + +*Note: For running vLLM serving on Intel GPUs, you can currently use either the `intelanalytics/ipex-llm-serving-xpu:latest` or `intelanalytics/ipex-llm-serving-vllm-xpu:latest` Docker image.* +```bash +# This image will be updated every day +docker pull intelanalytics/ipex-llm-serving-xpu:latest +``` + +## Start Docker Container + + To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container. Change the `/path/to/models` to mount the models. + +``` +#/bin/bash +export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest +export CONTAINER_NAME=ipex-llm-serving-xpu-container +sudo docker run -itd \ + --net=host \ + --device=/dev/dri \ + -v /path/to/models:/llm/models \ + -e no_proxy=localhost,127.0.0.1 \ + --memory="32G" \ + --name=$CONTAINER_NAME \ + --shm-size="16g" \ + $DOCKER_IMAGE +``` + +After the container is booted, you could get into the container through `docker exec`. + +```bash +docker exec -it ipex-llm-serving-xpu-container /bin/bash +``` + + +To verify the device is successfully mapped into the container, run `sycl-ls` to check the result. In a machine with Arc A770, the sampled output is: + +```bash +root@arda-arc12:/# sycl-ls +[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000] +[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000] +[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33] +[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241] +``` + +## Running vLLM serving with IPEX-LLM on Intel GPU in Docker + +We have included multiple vLLM-related files in `/llm/`: +1. `vllm_offline_inference.py`: Used for vLLM offline inference example +2. `benchmark_vllm_throughput.py`: Used for benchmarking throughput +3. `payload-1024.lua`: Used for testing request per second using 1k-128 request +4. 
`start-vllm-service.sh`: Used for template for starting vLLM service + +Before performing benchmark or starting the service, you can refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#runtime-configurations) to setup our recommended runtime configurations. + + +### Service + +#### Single card serving + +A script named `/llm/start-vllm-service.sh` have been included in the image for starting the service conveniently. + +Modify the `model` and `served_model_name` in the script so that it fits your requirement. The `served_model_name` indicates the model name used in the API. + +Then start the service using `bash /llm/start-vllm-service.sh`, the following message should be print if the service started successfully. + +If the service have booted successfully, you should see the output similar to the following figure: + + + + + + +#### Multi-card serving + +vLLM supports to utilize multiple cards through tensor parallel. + +You can refer to this [documentation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#about-tensor-parallel) on how to utilize the `tensor-parallel` feature and start the service. + +#### Verify +After the service has been booted successfully, you can send a test request using `curl`. Here, `YOUR_MODEL` should be set equal to `served_model_name` in your booting script, e.g. `Qwen1.5`. + + +```bash +curl http://localhost:8000/v1/completions \ +-H "Content-Type: application/json" \ +-d '{ + "model": "YOUR_MODEL", + "prompt": "San Francisco is a", + "max_tokens": 128, + "temperature": 0 +}' | jq '.choices[0].text' +``` + +Below shows an example output using `Qwen1.5-7B-Chat` with low-bit format `sym_int4`: + + + + + +#### Tuning + +You can tune the service using these four arguments: +- `--gpu-memory-utilization` +- `--max-model-len` +- `--max-num-batched-token` +- `--max-num-seq` + +You can refer to this [doc](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#service) for a detailed explaination on these parameters. + +### Benchmark + +#### Online benchmark throurgh api_server + +We can benchmark the api_server to get an estimation about TPS (transactions per second). To do so, you need to start the service first according to the instructions mentioned above. + +Then in the container, do the following: +1. modify the `/llm/payload-1024.lua` so that the "model" attribute is correct. By default, we use a prompt that is roughly 1024 token long, you can change it if needed. +2. Start the benchmark using `wrk` using the script below: + +```bash +cd /llm +# warmup due to JIT compliation +wrk -t4 -c4 -d3m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h +# You can change -t and -c to control the concurrency. +# By default, we use 12 connections to benchmark the service. +wrk -t12 -c12 -d15m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h +``` + +The following figure shows performing benchmark on `Llama-2-7b-chat-hf` using the above script: + + + + + + +#### Offline benchmark through benchmark_vllm_throughput.py + +Please refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#performing-benchmark) on how to use `benchmark_vllm_throughput.py` for benchmarking. 
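+
+If the bundled script follows the upstream vLLM throughput-benchmark interface, an invocation might look like the sketch below; list the supported arguments first, and treat the model path, dataset file and flags as assumptions:
+
+```bash
+cd /llm
+# list the arguments actually supported by the bundled script
+python benchmark_vllm_throughput.py --help
+
+# hypothetical run -- model path and dataset file are placeholders
+python benchmark_vllm_throughput.py \
+  --backend vllm \
+  --model /llm/models/Llama-2-7b-chat-hf \
+  --dataset /llm/ShareGPT_V3_unfiltered_cleaned_split.json \
+  --num-prompts 1000
+```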
diff --git a/docs/mddocs/Inference/Self_Speculative_Decoding.md b/docs/mddocs/Inference/Self_Speculative_Decoding.md new file mode 100644 index 00000000..99179194 --- /dev/null +++ b/docs/mddocs/Inference/Self_Speculative_Decoding.md @@ -0,0 +1,23 @@ +# Self-Speculative Decoding + +### Speculative Decoding in Practice +In [speculative](https://arxiv.org/abs/2302.01318) [decoding](https://arxiv.org/abs/2211.17192), a small (draft) model quickly generates multiple draft tokens, which are then verified in parallel by the large (target) model. While speculative decoding can effectively speed up the target model, ***in practice it is difficult to maintain or even obtain a proper draft model***, especially when the target model is finetuned with customized data. + +### Self-Speculative Decoding +Built on top of the concept of “[self-speculative decoding](https://arxiv.org/abs/2309.08168)”, IPEX-LLM can now accelerate the original FP16 or BF16 model ***without the need of a separate draft model or model finetuning***; instead, it automatically converts the original model to INT4, and uses the INT4 model as the draft model behind the scene. In practice, this brings ***~30% speedup*** for FP16 and BF16 LLM inference latency on Intel GPU and CPU respectively. + +### Using IPEX-LLM Self-Speculative Decoding +Please refer to IPEX-LLM self-speculative decoding code snippets below, and the detailed [GPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/Speculative-Decoding) and [CPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/Speculative-Decoding) examples in the project repo. + +```python +model = AutoModelForCausalLM.from_pretrained(model_path, + optimize_model=True, + torch_dtype=torch.float16, #use bfloat16 on cpu + load_in_low_bit="fp16", #use bf16 on cpu + speculative=True, #set speculative to true + trust_remote_code=True, + use_cache=True) +output = model.generate(input_ids, + max_new_tokens=args.n_predict, + do_sample=False) +``` diff --git a/docs/mddocs/Overview/FAQ/faq.md b/docs/mddocs/Overview/FAQ/faq.md new file mode 100644 index 00000000..caf8bd51 --- /dev/null +++ b/docs/mddocs/Overview/FAQ/faq.md @@ -0,0 +1,79 @@ +# Frequently Asked Questions (FAQ) + +## General Info & Concepts + +### GGUF format usage with IPEX-LLM? + +IPEX-LLM supports running GGUF/AWQ/GPTQ models on both [CPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Advanced-Quantizations) and [GPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations). +Please also refer to [here](https://github.com/intel-analytics/ipex-llm?tab=readme-ov-file#latest-update-) for our latest support. + +## How to Resolve Errors + +### Fail to install `ipex-llm` through `pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/` or `pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/` + +You could try to install IPEX-LLM dependencies for Intel XPU from source archives: +- For Windows system, refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#install-ipex-llm-from-wheel) for the steps. +- For Linux system, refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#id3) for the steps. + +### PyTorch is not linked with support for xpu devices + +1. 
Before running on Intel GPUs, please make sure you've prepared your environment by following the [installation instructions](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html).
+2. If you are using an older version of `ipex-llm` (specifically, older than 2.5.0b20240104), you need to manually add `import intel_extension_for_pytorch as ipex` at the beginning of your code.
+3. After optimizing the model with IPEX-LLM, you need to move the model to GPU through `model = model.to('xpu')`.
+4. If you have multiple GPUs, you could refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/KeyFeatures/multi_gpus_selection.html) for details about GPU selection.
+5. When running inference with the optimized model on Intel GPUs, you also need to move the input tensors with `to('xpu')`.
+
+### Import `intel_extension_for_pytorch` error on Windows GPU
+
+Please refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#error-loading-intel-extension-for-pytorch) for a detailed guide, which lists the missing environment requirements that could lead to this error.
+
+### XPU device count is zero
+
+It's recommended to reinstall the GPU driver:
+- For Windows systems, refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#prerequisites) for the steps.
+- For Linux systems, refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#id1) for the steps.
+
+### Error such as `The size of tensor a (33) must match the size of tensor b (17) at non-singleton dimension 2` during the attention forward function
+
+If you are using the IPEX-LLM PyTorch API, please try setting `optimize_llm=False` manually when calling the `optimize_model` function as a workaround. For the IPEX-LLM `transformers`-style API, try setting `optimize_model=False` manually when calling the `from_pretrained` function.
+
+### ValueError: Unrecognized configuration class
+
+This error is not directly related to IPEX-LLM. It could be that you're using the incorrect AutoClass, or the transformers version is not up to date, or transformers does not support loading this model with AutoClasses. Refer to the model card on Hugging Face to confirm this information. Besides, if you load the model from a local path, please also make sure you have downloaded the complete model files.
+
+### `mixed dtype (CPU): expect input to have scalar type of BFloat16` during inference
+
+You could solve this error by converting the optimized model to `bf16` through `model.to(torch.bfloat16)` before inference.
+
+### Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)
+
+This error is caused by running out of GPU memory. Some possible ways to decrease GPU memory usage:
+1. If you run several models in succession, make sure you release the GPU memory of the previous model promptly through `del model`.
+2. You could try `model = model.half()` (FP16) or `model = model.bfloat16()` before moving the model to GPU to use less GPU memory.
+3. You could try setting `cpu_embedding=True` when calling `from_pretrained` of an AutoClass or the `optimize_model` function.
+
+### Failed to enable AMX
+
+You could use `export BIGDL_LLM_AMX_DISABLED=1` to disable AMX manually and work around this error.
+
+### oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
+
+You may encounter this error during finetuning on multiple GPUs. Please try `sudo apt install level-zero-dev` to fix it.
+ +### Random and unreadable output of Gemma-7b-it on Arc770 ubuntu 22.04 due to driver and OneAPI missmatching. + +If driver and OneAPI missmatching, it will lead to some error when IPEX-LLM uses XMX(short prompts) for speeding up. +The output of `What's AI?` may like below: +``` +wiedzy Artificial Intelligence meliti: Artificial Intelligence undenti beng beng beng beng beng beng beng beng beng beng beng beng beng beng beng beng beng beng beng beng beng beng +``` +If you meet this error. Please check your driver version and OneAPI version. Commnad is `sudo apt list --installed | egrep "intel-basekit|intel-level-zero-gpu"`. +Make sure intel-basekit>=2024.0.1-43 and intel-level-zero-gpu>=1.3.27191.42-775~22.04. + +### Too many open files + +You may encounter this error during finetuning, expecially when run 70B model. Please raise the system open file limit using `ulimit -n 1048576`. + +### `RuntimeError: could not create a primitive` on Windows + +This error may happen when multiple GPUs exists for Windows Users. To solve this error, you can open Device Manager (search "Device Manager" in your start menu). Then click the "Display adapter" option, and disable all the GPU device you do not want to use. Restart your computer and try again. IPEX-LLM should work fine this time. \ No newline at end of file diff --git a/docs/mddocs/Overview/KeyFeatures/cli.md b/docs/mddocs/Overview/KeyFeatures/cli.md new file mode 100644 index 00000000..ab162594 --- /dev/null +++ b/docs/mddocs/Overview/KeyFeatures/cli.md @@ -0,0 +1,40 @@ +# CLI (Command Line Interface) Tool + +```eval_rst + +.. note:: + + Currently ``ipex-llm`` CLI supports *LLaMA* (e.g., vicuna), *GPT-NeoX* (e.g., redpajama), *BLOOM* (e.g., pheonix) and *GPT2* (e.g., starcoder) model architecture; for other models, you may use the ``transformers``-style or LangChain APIs. +``` + +## Convert Model + +You may convert the downloaded model into native INT4 format using `llm-convert`. + +```bash +# convert PyTorch (fp16 or fp32) model; +# llama/bloom/gptneox/starcoder model family is currently supported +llm-convert "/path/to/model/" --model-format pth --model-family "bloom" --outfile "/path/to/output/" + +# convert GPTQ-4bit model +# only llama model family is currently supported +llm-convert "/path/to/model/" --model-format gptq --model-family "llama" --outfile "/path/to/output/" +``` + +## Run Model + +You may run the converted model using `llm-cli` or `llm-chat` (built on top of `main.cpp` in [`llama.cpp`](https://github.com/ggerganov/llama.cpp)) + +```bash +# help +# llama/bloom/gptneox/starcoder model family is currently supported +llm-cli -x gptneox -h + +# text completion +# llama/bloom/gptneox/starcoder model family is currently supported +llm-cli -t 16 -x gptneox -m "/path/to/output/model.bin" -p 'Once upon a time,' + +# chat mode +# llama/gptneox model family is currently supported +llm-chat -m "/path/to/output/model.bin" -x llama +``` \ No newline at end of file diff --git a/docs/mddocs/Overview/KeyFeatures/finetune.md b/docs/mddocs/Overview/KeyFeatures/finetune.md new file mode 100644 index 00000000..b895b89f --- /dev/null +++ b/docs/mddocs/Overview/KeyFeatures/finetune.md @@ -0,0 +1,64 @@ +# Finetune (QLoRA) + +We also support finetuning LLMs (large language models) using QLoRA with IPEX-LLM 4bit optimizations on Intel GPUs. + +```eval_rst +.. note:: + + Currently, only Hugging Face Transformers models are supported running QLoRA finetuning. 
+``` + +To help you better understand the finetuning process, here we use model [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) as an example. + +**Make sure you have prepared environment following instructions [here](../install_gpu.html).** + +```eval_rst +.. note:: + + If you are using an older version of ``ipex-llm`` (specifically, older than 2.5.0b20240104), you need to manually add ``import intel_extension_for_pytorch as ipex`` at the beginning of your code. +``` + +First, load model using `transformers`-style API and **set it to `to('xpu')`**. We specify `load_in_low_bit="nf4"` here to apply 4-bit NormalFloat optimization. According to the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf), using `"nf4"` could yield better model quality than `"int4"`. + +```python +from ipex_llm.transformers import AutoModelForCausalLM + +model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", + load_in_low_bit="nf4", + optimize_model=False, + torch_dtype=torch.float16, + modules_to_not_convert=["lm_head"],) +model = model.to('xpu') +``` + +Then, we have to apply some preprocessing to the model to prepare it for training. +```python +from ipex_llm.transformers.qlora import prepare_model_for_kbit_training +model.gradient_checkpointing_enable() +model = prepare_model_for_kbit_training(model) +``` + +Next, we can obtain a Peft model from the optimized model and a configuration object containing the parameters as follows: +```python +from ipex_llm.transformers.qlora import get_peft_model +from peft import LoraConfig +config = LoraConfig(r=8, + lora_alpha=32, + target_modules=["q_proj", "k_proj", "v_proj"], + lora_dropout=0.05, + bias="none", + task_type="CAUSAL_LM") +model = get_peft_model(model, config) +``` + +```eval_rst +.. important:: + + Instead of ``from peft import prepare_model_for_kbit_training, get_peft_model`` as we did for regular QLoRA using bitandbytes and cuda, we import them from ``ipex_llm.transformers.qlora`` here to get a IPEX-LLM compatible Peft model. And the rest is just the same as regular LoRA finetuning process using ``peft``. +``` + +```eval_rst +.. seealso:: + + See the complete examples `here `_ +``` diff --git a/docs/mddocs/Overview/KeyFeatures/gpu_supports.rst b/docs/mddocs/Overview/KeyFeatures/gpu_supports.rst new file mode 100644 index 00000000..6828cb05 --- /dev/null +++ b/docs/mddocs/Overview/KeyFeatures/gpu_supports.rst @@ -0,0 +1,14 @@ +GPU Supports +================================ + +IPEX-LLM not only supports running large language models for inference, but also supports QLoRA finetuning on Intel GPUs. + +* |inference_on_gpu|_ +* `Finetune (QLoRA) <./finetune.html>`_ +* `Multi GPUs selection <./multi_gpus_selection.html>`_ + +.. |inference_on_gpu| replace:: Inference on GPU +.. _inference_on_gpu: ./inference_on_gpu.html + +.. |multi_gpus_selection| replace:: Multi GPUs selection +.. 
_multi_gpus_selection: ./multi_gpus_selection.html diff --git a/docs/mddocs/Overview/KeyFeatures/hugging_face_format.md b/docs/mddocs/Overview/KeyFeatures/hugging_face_format.md new file mode 100644 index 00000000..0eee498f --- /dev/null +++ b/docs/mddocs/Overview/KeyFeatures/hugging_face_format.md @@ -0,0 +1,54 @@ +# Hugging Face ``transformers`` Format + +## Load in Low Precision +You may apply INT4 optimizations to any Hugging Face *Transformers* models as follows: + +```python +# load Hugging Face Transformers model with INT4 optimizations +from ipex_llm.transformers import AutoModelForCausalLM + +model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True) +``` + +After loading the Hugging Face *Transformers* model, you may easily run the optimized model as follows: + +```python +# run the optimized model +from transformers import AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained(model_path) +input_ids = tokenizer.encode(input_str, ...) +output_ids = model.generate(input_ids, ...) +output = tokenizer.batch_decode(output_ids) +``` + +```eval_rst +.. seealso:: + + See the complete CPU examples `here `_ and GPU examples `here `_. + +.. note:: + + You may apply more low bit optimizations (including INT8, INT5 and INT4) as follows: + + .. code-block:: python + + model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int5") + + See the CPU example `here `_ and GPU example `here `_. +``` + +## Save & Load +After the model is optimized using INT4 (or INT8/INT5), you may save and load the optimized model as follows: + +```python +model.save_low_bit(model_path) + +new_model = AutoModelForCausalLM.load_low_bit(model_path) +``` + +```eval_rst +.. seealso:: + + See the CPU example `here `_ and GPU example `here `_ +``` \ No newline at end of file diff --git a/docs/mddocs/Overview/KeyFeatures/index.rst b/docs/mddocs/Overview/KeyFeatures/index.rst new file mode 100644 index 00000000..8611f9bd --- /dev/null +++ b/docs/mddocs/Overview/KeyFeatures/index.rst @@ -0,0 +1,33 @@ +IPEX-LLM Key Features +================================ + +You may run the LLMs using ``ipex-llm`` through one of the following APIs: + +* `PyTorch API <./optimize_model.html>`_ +* |transformers_style_api|_ + + * |hugging_face_transformers_format|_ + * `Native Format <./native_format.html>`_ + +* `LangChain API <./langchain_api.html>`_ +* |gpu_supports|_ + + * |inference_on_gpu|_ + * `Finetune (QLoRA) <./finetune.html>`_ + * `Multi GPUs selection <./multi_gpus_selection.html>`_ + + +.. |transformers_style_api| replace:: ``transformers``-style API +.. _transformers_style_api: ./transformers_style_api.html + +.. |hugging_face_transformers_format| replace:: Hugging Face ``transformers`` Format +.. _hugging_face_transformers_format: ./hugging_face_format.html + +.. |gpu_supports| replace:: GPU Supports +.. _gpu_supports: ./gpu_supports.html + +.. |inference_on_gpu| replace:: Inference on GPU +.. _inference_on_gpu: ./inference_on_gpu.html + +.. |multi_gpus_selection| replace:: Multi GPUs selection +.. 
_multi_gpus_selection: ./multi_gpus_selection.html diff --git a/docs/mddocs/Overview/KeyFeatures/inference_on_gpu.md b/docs/mddocs/Overview/KeyFeatures/inference_on_gpu.md new file mode 100644 index 00000000..1a9638e9 --- /dev/null +++ b/docs/mddocs/Overview/KeyFeatures/inference_on_gpu.md @@ -0,0 +1,128 @@ +# Inference on GPU + +Apart from the significant acceleration capabilites on Intel CPUs, IPEX-LLM also supports optimizations and acceleration for running LLMs (large language models) on Intel GPUs. With IPEX-LLM, PyTorch models (in FP16/BF16/FP32) can be optimized with low-bit quantizations (supported precisions include INT4, INT5, INT8, etc). + +Compared with running on Intel CPUs, some additional operations are required on Intel GPUs. To help you better understand the process, here we use a popular model [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) as an example. + +**Make sure you have prepared environment following instructions [here](../install_gpu.html).** + +```eval_rst +.. note:: + + If you are using an older version of ``ipex-llm`` (specifically, older than 2.5.0b20240104), you need to manually add ``import intel_extension_for_pytorch as ipex`` at the beginning of your code. +``` + +## Load and Optimize Model + +You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-style API](./transformers_style_api.html) on Intel GPUs according to your preference. + +**Once you have the model with IPEX-LLM low bit optimization, set it to `to('xpu')`**. + +```eval_rst +.. tabs:: + + .. tab:: PyTorch API + + You could optimize any PyTorch model with "one-line code change", and the loading and optimizing process on Intel GPUs maybe as follows: + + .. code-block:: python + + # Take Llama-2-7b-chat-hf as an example + from transformers import LlamaForCausalLM + from ipex_llm import optimize_model + + model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', torch_dtype='auto', low_cpu_mem_usage=True) + model = optimize_model(model) # With only one line to enable IPEX-LLM INT4 optimization + + model = model.to('xpu') # Important after obtaining the optimized model + + .. tip:: + + When running LLMs on Intel iGPUs for Windows users, we recommend setting ``cpu_embedding=True`` in the ``optimize_model`` function. This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. + + See the `API doc <../../../PythonAPI/LLM/optimize.html#ipex_llm.optimize_model>`_ for ``optimize_model`` to find more information. + + Especially, if you have saved the optimized model following setps `here <./optimize_model.html#save>`_, the loading process on Intel GPUs maybe as follows: + + .. code-block:: python + + from transformers import LlamaForCausalLM + from ipex_llm.optimize import low_memory_init, load_low_bit + + saved_dir='./llama-2-ipex-llm-4-bit' + with low_memory_init(): # Fast and low cost by loading model on meta device + model = LlamaForCausalLM.from_pretrained(saved_dir, + torch_dtype="auto", + trust_remote_code=True) + model = load_low_bit(model, saved_dir) # Load the optimized model + + model = model.to('xpu') # Important after obtaining the optimized model + + .. tab:: ``transformers``-style API + + You could run any Hugging Face Transformers model with ``transformers``-style API, and the loading and optimizing process on Intel GPUs maybe as follows: + + .. 
code-block:: python + + # Take Llama-2-7b-chat-hf as an example + from ipex_llm.transformers import AutoModelForCausalLM + + # Load model in 4 bit, which convert the relevant layers in the model into INT4 format + model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', load_in_4bit=True) + + model = model.to('xpu') # Important after obtaining the optimized model + + .. tip:: + + When running LLMs on Intel iGPUs for Windows users, we recommend setting ``cpu_embedding=True`` in the ``from_pretrained`` function. This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. + + See the `API doc <../../../PythonAPI/LLM/transformers.html#hugging-face-transformers-automodel>`_ to find more information. + + Especially, if you have saved the optimized model following setps `here <./hugging_face_format.html#save-load>`_, the loading process on Intel GPUs maybe as follows: + + .. code-block:: python + + from ipex_llm.transformers import AutoModelForCausalLM + + saved_dir='./llama-2-ipex-llm-4-bit' + model = AutoModelForCausalLM.load_low_bit(saved_dir) # Load the optimized model + + model = model.to('xpu') # Important after obtaining the optimized model + + .. tip:: + + When running saved optimized models on Intel iGPUs for Windows users, we also recommend setting ``cpu_embedding=True`` in the ``load_low_bit`` function. +``` + +## Run Optimized Model + +You could then do inference using the optimized model on Intel GPUs almostly the same as on CPUs. **The only difference is to set `to('xpu')` for input tensors.** + +Continuing with the [example of Llama-2-7b-chat-hf](#load-and-optimize-model), running as follows: +```python +import torch + +with torch.inference_mode(): + prompt = 'Q: What is CPU?\nA:' + input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu') # With .to('xpu') specifically for inference on Intel GPUs + output = model.generate(input_ids, max_new_tokens=32) + output_str = tokenizer.decode(output[0], skip_special_tokens=True) +``` + +```eval_rst +.. note:: + + The initial generation of optimized LLMs on Intel GPUs could be slow. Therefore, it's recommended to perform a **warm-up** run before the actual generation. +``` + +```eval_rst +.. note:: + + If you are a Windows user, please also note that for **the first time** that **each model** runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +``` + +```eval_rst +.. seealso:: + + See the complete examples `here `_ +``` diff --git a/docs/mddocs/Overview/KeyFeatures/langchain_api.md b/docs/mddocs/Overview/KeyFeatures/langchain_api.md new file mode 100644 index 00000000..46a7adb3 --- /dev/null +++ b/docs/mddocs/Overview/KeyFeatures/langchain_api.md @@ -0,0 +1,57 @@ +# LangChain API + +You may run the models using the LangChain API in `ipex-llm`. + +## Using Hugging Face `transformers` INT4 Format + +You may run any Hugging Face *Transformers* model (with INT4 optimiztions applied) using the LangChain API as follows: + +```python +from ipex_llm.langchain.llms import TransformersLLM +from ipex_llm.langchain.embeddings import TransformersEmbeddings +from langchain.chains.question_answering import load_qa_chain + +embeddings = TransformersEmbeddings.from_model_id(model_id=model_path) +ipex_llm = TransformersLLM.from_model_id(model_id=model_path, ...) + +doc_chain = load_qa_chain(ipex_llm, ...) +output = doc_chain.run(...) +``` + +```eval_rst +.. seealso:: + + See the examples `here `_. 
+``` + +## Using Native INT4 Format + +You may also convert Hugging Face *Transformers* models into native INT4 format, and then run the converted models using the LangChain API as follows. + +```eval_rst +.. note:: + + * Currently only llama/bloom/gptneox/starcoder model families are supported; for other models, you may use the Hugging Face ``transformers`` INT4 format as described `above <./langchain_api.html#using-hugging-face-transformers-int4-format>`_. + + * You may choose the corresponding API developed for specific native models to load the converted model. +``` + +```python +from ipex_llm.langchain.llms import LlamaLLM +from ipex_llm.langchain.embeddings import LlamaEmbeddings +from langchain.chains.question_answering import load_qa_chain + +# switch to GptneoxEmbeddings/BloomEmbeddings/StarcoderEmbeddings to load other models +embeddings = LlamaEmbeddings(model_path='/path/to/converted/model.bin') +# switch to GptneoxLLM/BloomLLM/StarcoderLLM to load other models +ipex_llm = LlamaLLM(model_path='/path/to/converted/model.bin') + +doc_chain = load_qa_chain(ipex_llm, ...) +doc_chain.run(...) +``` + +```eval_rst +.. seealso:: + + See the examples `here `_. +``` diff --git a/docs/mddocs/Overview/KeyFeatures/multi_gpus_selection.md b/docs/mddocs/Overview/KeyFeatures/multi_gpus_selection.md new file mode 100644 index 00000000..1bacc1e8 --- /dev/null +++ b/docs/mddocs/Overview/KeyFeatures/multi_gpus_selection.md @@ -0,0 +1,86 @@ +# Multi Intel GPUs selection + +In [Inference on GPU](inference_on_gpu.md) and [Finetune (QLoRA)](finetune.md), you have known how to run inference and finetune on Intel GPUs. In this section, we will show you two approaches to select GPU devices. + +## List devices + +The `sycl-ls` tool enumerates a list of devices available in the system. You can use it after you setup oneapi environment: + +```eval_rst +.. tabs:: + .. tab:: Windows + + Please make sure you are using CMD (Miniforge Prompt if using conda): + + .. code-block:: cmd + + call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" + sycl-ls + + .. tab:: Linux + + .. code-block:: bash + + source /opt/intel/oneapi/setvars.sh + sycl-ls +``` + +If you have two Arc770 GPUs, you can get something like below: +``` +[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000] +[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) i9-14900K 3.0 [2023.16.7.0.21_160000] +[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33] +[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33] +[opencl:gpu:4] Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 770 3.0 [23.17.26241.33] +[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241] +[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241] +[ext_oneapi_level_zero:gpu:2] Intel(R) Level-Zero, Intel(R) UHD Graphics 770 1.3 [1.3.26241] +``` +This output shows there are two Arc A770 GPUs as well as an Intel iGPU on this machine. + +## Devices selection +To enable xpu, you should convert your model and input to xpu by below code: +``` +model = model.to('xpu') +input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu') +``` +To select the desired devices, there are two ways: one is changing the code, another is adding an environment variable. See: + +### 1. 
Select device in Python
+To specify a particular XPU device, change `to('xpu')` to `to('xpu:[device_id]')`, where `device_id` is counted from zero.
+If you want to use the second device, you can change the code like this:
+```
+model = model.to('xpu:1')
+input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu:1')
+```
+
+### 2. OneAPI device selector
+The device selection environment variable, `ONEAPI_DEVICE_SELECTOR`, can be used to limit the choice of Intel GPU devices. As the `sycl-ls` output above shows, the last three lines are three Level Zero GPU devices. So we can use `ONEAPI_DEVICE_SELECTOR=level_zero:[gpu_id]` to select devices.
+For example, if you want to use the second A770 GPU, you can run your Python script like this:
+
+```eval_rst
+.. tabs::
+   .. tab:: Windows
+
+      .. code-block:: cmd
+
+         set ONEAPI_DEVICE_SELECTOR=level_zero:1
+         python generate.py
+
+      Through ``set ONEAPI_DEVICE_SELECTOR=level_zero:1``, only the second A770 GPU will be available for the current environment.
+
+   .. tab:: Linux
+
+      .. code-block:: bash
+
+         ONEAPI_DEVICE_SELECTOR=level_zero:1 python generate.py
+
+      ``ONEAPI_DEVICE_SELECTOR=level_zero:1`` in the command above only affects the current Python program. Alternatively, you can export the environment variable and then run your Python script:
+
+      .. code-block:: bash
+
+         export ONEAPI_DEVICE_SELECTOR=level_zero:1
+         python generate.py
+
+```
diff --git a/docs/mddocs/Overview/KeyFeatures/native_format.md b/docs/mddocs/Overview/KeyFeatures/native_format.md
new file mode 100644
index 00000000..6a0847c0
--- /dev/null
+++ b/docs/mddocs/Overview/KeyFeatures/native_format.md
@@ -0,0 +1,32 @@
+# Native Format
+
+You may also convert Hugging Face *Transformers* models into native INT4 format for maximum performance as follows.
+
+```eval_rst
+.. note::
+
+   Currently only llama/bloom/gptneox/starcoder/chatglm model families are supported; you may use the corresponding API to load the converted model. (For other models, you can use the Hugging Face ``transformers`` format as described `here <./hugging_face_format.html>`_).
+```
+
+```python
+# convert the model
+from ipex_llm import llm_convert
+ipex_llm_path = llm_convert(model='/path/to/model/',
+                            outfile='/path/to/output/', outtype='int4', model_family="llama")
+
+# load the converted model
+# switch to ChatGLMForCausalLM/GptneoxForCausalLM/BloomForCausalLM/StarcoderForCausalLM to load other models
+from ipex_llm.transformers import LlamaForCausalLM
+llm = LlamaForCausalLM.from_pretrained("/path/to/output/model.bin", native=True, ...)
+
+# run the converted model
+input_ids = llm.tokenize(prompt)
+output_ids = llm.generate(input_ids, ...)
+output = llm.batch_decode(output_ids)
+```
+
+```eval_rst
+.. seealso::
+
+   See the complete example `here `_
+```
\ No newline at end of file
diff --git a/docs/mddocs/Overview/KeyFeatures/optimize_model.md b/docs/mddocs/Overview/KeyFeatures/optimize_model.md
new file mode 100644
index 00000000..f6d3c02b
--- /dev/null
+++ b/docs/mddocs/Overview/KeyFeatures/optimize_model.md
@@ -0,0 +1,69 @@
+## PyTorch API
+
+In general, you just need one line of `optimize_model` to easily optimize any loaded PyTorch model, regardless of the library or API you are using. With IPEX-LLM, PyTorch models (in FP16/BF16/FP32) can be optimized with low-bit quantizations (supported precisions include INT4, INT5, INT8, etc.).
+
+### Optimize model
+
+First, use any PyTorch APIs you like to load your model. 
To help you better understand the process, here we use [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) library `LlamaForCausalLM` to load a popular model [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) as an example: + +```python +# Create or load any Pytorch model, take Llama-2-7b-chat-hf as an example +from transformers import LlamaForCausalLM +model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', torch_dtype='auto', low_cpu_mem_usage=True) +``` + +Then, just need to call `optimize_model` to optimize the loaded model and INT4 optimization is applied on model by default: +```python +from ipex_llm import optimize_model + +# With only one line to enable IPEX-LLM INT4 optimization +model = optimize_model(model) +``` + +After optimizing the model, IPEX-LLM does not require any change in the inference code. You can use any libraries to run the optimized model with very low latency. + +### More Precisions + +In the [Optimize Model](#optimize-model), symmetric INT4 optimization is applied by default. You may apply other low bit optimizations (INT5, INT8, etc) by specifying the ``low_bit`` parameter. + +Currently, ``low_bit`` supports options 'sym_int4', 'asym_int4', 'sym_int5', 'asym_int5' or 'sym_int8', in which 'sym' and 'asym' differentiate between symmetric and asymmetric quantization. Symmetric quantization allocates bits for positive and negative values equally, whereas asymmetric quantization allows different bit allocations for positive and negative values. + +You may apply symmetric INT8 optimization as follows: + +```python +from ipex_llm import optimize_model + +# Apply symmetric INT8 optimization +model = optimize_model(model, low_bit="sym_int8") +``` + +### Save & Load Optimized Model + +The loading process of the original model may be time-consuming and memory-intensive. For example, the [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) model is stored with float16 precision, resulting in large memory usage when loaded using `LlamaForCausalLM`. To avoid high resource consumption and expedite loading process, you can use `save_low_bit` to store the model after low-bit optimization. Then, in subsequent uses, you can opt to use the `load_low_bit` API to directly load the optimized model. Besides, saving and loading operations are platform-independent, regardless of their operating systems. +#### Save + +Continuing with the [example of Llama-2-7b-chat-hf](#optimize-model), we can save the previously optimized model as follows: +```python +saved_dir='./llama-2-ipex-llm-4-bit' +model.save_low_bit(saved_dir) +``` +#### Load + +We recommend to use the context manager `low_memory_init` to quickly initiate a model instance with low cost, and then use `load_low_bit` to load the optimized low-bit model as follows: +```python +from ipex_llm.optimize import low_memory_init, load_low_bit +with low_memory_init(): # Fast and low cost by loading model on meta device + model = LlamaForCausalLM.from_pretrained(saved_dir, + torch_dtype="auto", + trust_remote_code=True) +model = load_low_bit(model, saved_dir) # Load the optimized model +``` + + +```eval_rst +.. seealso:: + + * Please refer to the `API documentation `_ for more details. + + * We also provide detailed examples on how to run PyTorch models (e.g., Openai Whisper, LLaMA2, ChatGLM2, Falcon, MPT, Baichuan2, etc.) using IPEX-LLM. See the complete CPU examples `here `_ and GPU examples `here `_. 
+``` diff --git a/docs/mddocs/Overview/KeyFeatures/transformers_style_api.rst b/docs/mddocs/Overview/KeyFeatures/transformers_style_api.rst new file mode 100644 index 00000000..07fad70b --- /dev/null +++ b/docs/mddocs/Overview/KeyFeatures/transformers_style_api.rst @@ -0,0 +1,10 @@ +``transformers``-style API +================================ + +You may run the LLMs using ``transformers``-style API in ``ipex-llm``. + +* |hugging_face_transformers_format|_ +* `Native Format <./native_format.html>`_ + +.. |hugging_face_transformers_format| replace:: Hugging Face ``transformers`` Format +.. _hugging_face_transformers_format: ./hugging_face_format.html \ No newline at end of file diff --git a/docs/mddocs/Overview/examples.rst b/docs/mddocs/Overview/examples.rst new file mode 100644 index 00000000..89e9a8dd --- /dev/null +++ b/docs/mddocs/Overview/examples.rst @@ -0,0 +1,9 @@ +IPEX-LLM Examples +================================ + +You can use IPEX-LLM to run any PyTorch model with INT4 optimizations on Intel XPU (from Laptop to GPU to Cloud). + +Here, we provide examples to help you quickly get started using IPEX-LLM to run some popular open-source models in the community. Please refer to the appropriate guide based on your device: + +* `CPU <./examples_cpu.html>`_ +* `GPU <./examples_gpu.html>`_ diff --git a/docs/mddocs/Overview/examples_cpu.md b/docs/mddocs/Overview/examples_cpu.md new file mode 100644 index 00000000..f715e638 --- /dev/null +++ b/docs/mddocs/Overview/examples_cpu.md @@ -0,0 +1,64 @@ +# IPEX-LLM Examples: CPU + +Here, we provide some examples on how you could apply IPEX-LLM INT4 optimizations on popular open-source models in the community. + +To run these examples, please first refer to [here](./install_cpu.html) for more information about how to install ``ipex-llm``, requirements and best practices for setting up your environment. + +The following models have been verified on either servers or laptops with Intel CPUs. + +## Example of PyTorch API + +| Model | Example of PyTorch API | +|------------|-------------------------------------------------------| +| LLaMA 2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/PyTorch-Models/Model/llama2) | +| ChatGLM | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/PyTorch-Models/Model/chatglm) | +| Mistral | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/PyTorch-Models/Model/mistral) | +| Bark | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/PyTorch-Models/Model/bark) | +| BERT | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/PyTorch-Models/Model/bert) | +| Openai Whisper | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/PyTorch-Models/Model/openai-whisper) | + +```eval_rst +.. important:: + + In addition to INT4 optimization, IPEX-LLM also provides other low bit optimizations (such as INT8, INT5, NF4, etc.). You may apply other low bit optimizations through PyTorch API as `example `_. 
+``` + + +## Example of `transformers`-style API + +| Model | Example of `transformers`-style API | +|------------|-------------------------------------------------------| +| LLaMA *(such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.)* | [link1](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/Native-Models), [link2](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/vicuna) | +| LLaMA 2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/PyTorch-Models/Model/llama2) | [link1](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/Native-Models), [link2](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/llama2) | +| ChatGLM | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/PyTorch-Models/Model/chatglm) | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/chatglm) | +| ChatGLM2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/chatglm2) | +| Mistral | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/mistral) | +| Falcon | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/falcon) | +| MPT | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/mpt) | +| Dolly-v1 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/dolly_v1) | +| Dolly-v2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/dolly_v2) | +| Replit Code| [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/replit) | +| RedPajama | [link1](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/Native-Models), [link2](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/redpajama) | +| Phoenix | [link1](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/Native-Models), [link2](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/phoenix) | +| StarCoder | [link1](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/Native-Models), [link2](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/starcoder) | +| Baichuan | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/baichuan) | +| Baichuan2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/baichuan2) | +| InternLM | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/internlm) | +| Qwen | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen) | +| Aquila | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/aquila) | +| MOSS | 
[link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/moss) | +| Whisper | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/whisper) | + +```eval_rst +.. important:: + + In addition to INT4 optimization, IPEX-LLM also provides other low bit optimizations (such as INT8, INT5, NF4, etc.). You may apply other low bit optimizations through ``transformers``-style API as `example `_. +``` + + +```eval_rst +.. seealso:: + + See the complete examples `here `_. +``` + diff --git a/docs/mddocs/Overview/examples_gpu.md b/docs/mddocs/Overview/examples_gpu.md new file mode 100644 index 00000000..8eea9f9f --- /dev/null +++ b/docs/mddocs/Overview/examples_gpu.md @@ -0,0 +1,70 @@ +# IPEX-LLM Examples: GPU + +Here, we provide some examples on how you could apply IPEX-LLM INT4 optimizations on popular open-source models in the community. + +To run these examples, please first refer to [here](./install_gpu.html) for more information about how to install ``ipex-llm``, requirements and best practices for setting up your environment. + +```eval_rst +.. important:: + + Only Linux system is supported now, Ubuntu 22.04 is prefered. +``` + +The following models have been verified on either servers or laptops with Intel GPUs. + +## Example of PyTorch API + +| Model | Example of PyTorch API | +|------------|-------------------------------------------------------| +| LLaMA 2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/Model/llama2) | +| ChatGLM 2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/Model/chatglm2) | +| Mistral | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/Model/mistral) | +| Baichuan | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/Model/baichuan) | +| Baichuan2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/Model/baichuan2) | +| Replit | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/Model/replit) | +| StarCoder | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/Model/starcoder) | +| Dolly-v1 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/Model/dolly-v1) | +| Dolly-v2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/Model/dolly-v2) | + +```eval_rst +.. important:: + + In addition to INT4 optimization, IPEX-LLM also provides other low bit optimizations (such as INT8, INT5, NF4, etc.). You may apply other low bit optimizations through PyTorch API as `example `_. 
+``` + + +## Example of `transformers`-style API + +| Model | Example of `transformers`-style API | +|------------|-------------------------------------------------------| +| LLaMA *(such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.)* |[link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/vicuna)| +| LLaMA 2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2) | +| ChatGLM2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm2) | +| Mistral | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral) | +| Falcon | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon) | +| MPT | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/mpt) | +| Dolly-v1 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/dolly_v1) | +| Dolly-v2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/dolly_v2) | +| Replit | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/replit) | +| StarCoder | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/starcoder) | +| Baichuan | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/baichuan) | +| Baichuan2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan2) | +| InternLM | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm) | +| Qwen | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen) | +| Aquila | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila) | +| Whisper | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/whisper) | +| Chinese Llama2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chinese-llama2) | +| GPT-J | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/gpt-j) | + +```eval_rst +.. important:: + + In addition to INT4 optimization, IPEX-LLM also provides other low bit optimizations (such as INT8, INT5, NF4, etc.). You may apply other low bit optimizations through ``transformers``-style API as `example `_. +``` + + +```eval_rst +.. seealso:: + + See the complete examples `here `_. +``` diff --git a/docs/mddocs/Overview/install.rst b/docs/mddocs/Overview/install.rst new file mode 100644 index 00000000..ff2d94e1 --- /dev/null +++ b/docs/mddocs/Overview/install.rst @@ -0,0 +1,7 @@ +IPEX-LLM Installation +================================ + +Here, we provide instructions on how to install ``ipex-llm`` and best practices for setting up your environment. 
Please refer to the appropriate guide based on your device: + +* `CPU <./install_cpu.html>`_ +* `GPU <./install_gpu.html>`_ \ No newline at end of file diff --git a/docs/mddocs/Overview/install_cpu.md b/docs/mddocs/Overview/install_cpu.md new file mode 100644 index 00000000..990e3f09 --- /dev/null +++ b/docs/mddocs/Overview/install_cpu.md @@ -0,0 +1,100 @@ +# IPEX-LLM Installation: CPU + +## Quick Installation + +Install IPEX-LLM for CPU supports using pip through: + +```eval_rst +.. tabs:: + + .. tab:: Linux + + .. code-block:: bash + + pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu + + .. tab:: Windows + + .. code-block:: cmd + + pip install --pre --upgrade ipex-llm[all] +``` + +Please refer to [Environment Setup](#environment-setup) for more information. + +```eval_rst +.. note:: + + ``all`` option will trigger installation of all the dependencies for common LLM application development. + +.. important:: + + ``ipex-llm`` is tested with Python 3.9, 3.10 and 3.11; Python 3.11 is recommended for best practices. +``` + +## Recommended Requirements + +Here list the recommended hardware and OS for smooth IPEX-LLM optimization experiences on CPU: + +* Hardware + + * PCs equipped with 12th Gen Intel® Core™ processor or higher, and at least 16GB RAM + * Servers equipped with Intel® Xeon® processors, at least 32G RAM. + +* Operating System + + * Ubuntu 20.04 or later + * CentOS 7 or later + * Windows 10/11, with or without WSL + +## Environment Setup + +For optimal performance with LLM models using IPEX-LLM optimizations on Intel CPUs, here are some best practices for setting up environment: + +First we recommend using [Conda](https://conda-forge.org/download/) to create a python 3.11 enviroment: + +```eval_rst +.. tabs:: + + .. tab:: Linux + + .. code-block:: bash + + conda create -n llm python=3.11 + conda activate llm + + pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu + + .. tab:: Windows + + .. code-block:: cmd + + conda create -n llm python=3.11 + conda activate llm + + pip install --pre --upgrade ipex-llm[all] +``` + +Then for running a LLM model with IPEX-LLM optimizations (taking an `example.py` an example): + +```eval_rst +.. tabs:: + + .. tab:: Client + + It is recommended to run directly with full utilization of all CPU cores: + + .. code-block:: bash + + python example.py + + .. tab:: Server + + It is recommended to run with all the physical cores of a single socket: + + .. code-block:: bash + + # e.g. for a server with 48 cores per socket + export OMP_NUM_THREADS=48 + numactl -C 0-47 -m 0 python example.py +``` \ No newline at end of file diff --git a/docs/mddocs/Overview/install_gpu.md b/docs/mddocs/Overview/install_gpu.md new file mode 100644 index 00000000..52303ef5 --- /dev/null +++ b/docs/mddocs/Overview/install_gpu.md @@ -0,0 +1,666 @@ +# IPEX-LLM Installation: GPU + +## Windows + +### Prerequisites + +IPEX-LLM on Windows supports Intel iGPU and dGPU. + +```eval_rst +.. important:: + + IPEX-LLM on Windows only supports PyTorch 2.1. +``` + +To apply Intel GPU acceleration, please first verify your GPU driver version. + +```eval_rst +.. note:: + + The GPU driver version of your device can be checked in the "Task Manager" -> GPU 0 (or GPU 1, etc.) -> Driver version. 
+``` + +If you have driver version lower than `31.0.101.5122`, it is recommended to [**update your GPU driver to the latest**](https://www.intel.com/content/www/us/en/download/785597/intel-arc-iris-xe-graphics-windows.html): + + + +### Install IPEX-LLM +#### Install IPEX-LLM From PyPI + +We recommend using [Miniforge](https://conda-forge.org/download/) to create a python 3.11 enviroment. + +```eval_rst +.. important:: + + ``ipex-llm`` is tested with Python 3.9, 3.10 and 3.11. Python 3.11 is recommended for best practices. +``` + +The easiest ways to install `ipex-llm` is the following commands, choosing either US or CN website for `extra-index-url`: + +```eval_rst +.. tabs:: + .. tab:: US + + .. code-block:: cmd + + conda create -n llm python=3.11 libuv + conda activate llm + + pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ + + .. tab:: CN + + .. code-block:: cmd + + conda create -n llm python=3.11 libuv + conda activate llm + + pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/ +``` + +#### Install IPEX-LLM From Wheel + +If you encounter network issues when installing IPEX, you can also install IPEX-LLM dependencies for Intel XPU from source archives. First you need to download and install torch/torchvision/ipex from wheels listed below before installing `ipex-llm`. + +Download the wheels on Windows system: + +``` +wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/torch-2.1.0a0%2Bcxx11.abi-cp311-cp311-win_amd64.whl +wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/torchvision-0.16.0a0%2Bcxx11.abi-cp311-cp311-win_amd64.whl +wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/intel_extension_for_pytorch-2.1.10%2Bxpu-cp311-cp311-win_amd64.whl +``` + +You may install dependencies directly from the wheel archives and then install `ipex-llm` using following commands: + +``` +pip install torch-2.1.0a0+cxx11.abi-cp311-cp311-win_amd64.whl +pip install torchvision-0.16.0a0+cxx11.abi-cp311-cp311-win_amd64.whl +pip install intel_extension_for_pytorch-2.1.10+xpu-cp311-cp311-win_amd64.whl + +pip install --pre --upgrade ipex-llm[xpu] +``` + +```eval_rst +.. note:: + + All the wheel packages mentioned here are for Python 3.11. If you would like to use Python 3.9 or 3.10, you should modify the wheel names for ``torch``, ``torchvision``, and ``intel_extension_for_pytorch`` by replacing ``cp11`` with ``cp39`` or ``cp310``, respectively. +``` + +### Runtime Configuration + +To use GPU acceleration on Windows, several environment variables are required before running a GPU example: + + + +```eval_rst +.. tabs:: + .. tab:: Intel iGPU + + .. code-block:: cmd + + set SYCL_CACHE_PERSISTENT=1 + set BIGDL_LLM_XMX_DISABLED=1 + + .. tab:: Intel Arc™ A-Series Graphics + + .. code-block:: cmd + + set SYCL_CACHE_PERSISTENT=1 +``` + +```eval_rst +.. note:: + + For **the first time** that **each model** runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +``` + +### Troubleshooting + +#### 1. Error loading `intel_extension_for_pytorch` + +If you met error when importing `intel_extension_for_pytorch`, please ensure that you have completed the following steps: + +* Ensure that you have installed Visual Studio with "Desktop development with C++" workload. + +* Make sure that the correct version of oneAPI, specifically 2024.0, is installed. 
+ +* Ensure that `libuv` is installed in your conda environment. This can be done during the creation of the environment with the command: + ```cmd + conda create -n llm python=3.11 libuv + ``` + If you missed `libuv`, you can add it to your existing environment through + ```cmd + conda install libuv + ``` + + + +## Linux + +### Prerequisites + +IPEX-LLM GPU support on Linux has been verified on: + +* Intel Arc™ A-Series Graphics +* Intel Data Center GPU Flex Series +* Intel Data Center GPU Max Series + +```eval_rst +.. important:: + + IPEX-LLM on Linux supports PyTorch 2.0 and PyTorch 2.1. + + .. warning:: + + IPEX-LLM support for Pytorch 2.0 is deprecated as of ``ipex-llm >= 2.1.0b20240511``. +``` + +```eval_rst +.. important:: + + We currently support the Ubuntu 20.04 operating system and later. +``` + +```eval_rst +.. tabs:: + .. tab:: PyTorch 2.1 + + To enable IPEX-LLM for Intel GPUs with PyTorch 2.1, here are several prerequisite steps for tools installation and environment preparation: + + + * Step 1: Install Intel GPU Driver version >= stable_775_20_20231219. We highly recommend installing the latest version of intel-i915-dkms using apt. + + .. seealso:: + + Please refer to our `driver installation `_ for general purpose GPU capabilities. + + See `release page `_ for latest version. + + .. note:: + + For Intel Core™ Ultra integrated GPU, please make sure level_zero version >= 1.3.28717. The level_zero version can be checked with ``sycl-ls``, and verison will be tagged be ``[ext_oneapi_level_zero:gpu]``. + + .. code-block:: + + [opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.12.0.12_195853.xmain-hotfix] + [opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 5 125H OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix] + [opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) Graphics OpenCL 3.0 NEO [24.09.28717.12] + [ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) Graphics 1.3 [1.3.28717] + + If you have level_zero version < 1.3.28717, you could update as follows: + + .. code-block:: bash + + wget https://github.com/intel/intel-graphics-compiler/releases/download/igc-1.0.16238.4/intel-igc-core_1.0.16238.4_amd64.deb + wget https://github.com/intel/intel-graphics-compiler/releases/download/igc-1.0.16238.4/intel-igc-opencl_1.0.16238.4_amd64.deb + wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/intel-level-zero-gpu-dbgsym_1.3.28717.12_amd64.ddeb + wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/intel-level-zero-gpu_1.3.28717.12_amd64.deb + wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/intel-opencl-icd-dbgsym_24.09.28717.12_amd64.ddeb + wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/intel-opencl-icd_24.09.28717.12_amd64.deb + wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/libigdgmm12_22.3.17_amd64.deb + sudo dpkg -i *.deb + + * Step 2: Download and install `Intel® oneAPI Base Toolkit `_ with version 2024.0. OneDNN, OneMKL and DPC++ compiler are needed, others are optional. + + Intel® oneAPI Base Toolkit 2024.0 installation methods: + + .. tabs:: + + .. tab:: APT installer + + Step 1: Set up repository + + .. 
code-block:: bash + + wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null + echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list + sudo apt update + + Step 2: Install the package + + .. code-block:: bash + + sudo apt install intel-oneapi-common-vars=2024.0.0-49406 \ + intel-oneapi-common-oneapi-vars=2024.0.0-49406 \ + intel-oneapi-diagnostics-utility=2024.0.0-49093 \ + intel-oneapi-compiler-dpcpp-cpp=2024.0.2-49895 \ + intel-oneapi-dpcpp-ct=2024.0.0-49381 \ + intel-oneapi-mkl=2024.0.0-49656 \ + intel-oneapi-mkl-devel=2024.0.0-49656 \ + intel-oneapi-mpi=2021.11.0-49493 \ + intel-oneapi-mpi-devel=2021.11.0-49493 \ + intel-oneapi-dal=2024.0.1-25 \ + intel-oneapi-dal-devel=2024.0.1-25 \ + intel-oneapi-ippcp=2021.9.1-5 \ + intel-oneapi-ippcp-devel=2021.9.1-5 \ + intel-oneapi-ipp=2021.10.1-13 \ + intel-oneapi-ipp-devel=2021.10.1-13 \ + intel-oneapi-tlt=2024.0.0-352 \ + intel-oneapi-ccl=2021.11.2-5 \ + intel-oneapi-ccl-devel=2021.11.2-5 \ + intel-oneapi-dnnl-devel=2024.0.0-49521 \ + intel-oneapi-dnnl=2024.0.0-49521 \ + intel-oneapi-tcm-1.0=1.0.0-435 + + .. note:: + + You can uninstall the package by running the following command: + + .. code-block:: bash + + sudo apt autoremove intel-oneapi-common-vars + + .. tab:: PIP installer + + Step 1: Install oneAPI in a user-defined folder, e.g., ``~/intel/oneapi``. + + .. code-block:: bash + + export PYTHONUSERBASE=~/intel/oneapi + pip install dpcpp-cpp-rt==2024.0.2 mkl-dpcpp==2024.0.0 onednn==2024.0.0 --user + + .. note:: + + The oneAPI packages are visible in ``pip list`` only if ``PYTHONUSERBASE`` is properly set. + + Step 2: Configure your working conda environment (e.g. with name ``llm``) to append oneAPI path (e.g. ``~/intel/oneapi/lib``) to the environment variable ``LD_LIBRARY_PATH``. + + .. code-block:: bash + + conda env config vars set LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/intel/oneapi/lib -n llm + + .. note:: + You can view the configured environment variables for your environment (e.g. with name ``llm``) by running ``conda env config vars list -n llm``. + You can continue with your working conda environment and install ``ipex-llm`` as guided in the next section. + + .. note:: + + You are recommended not to install other pip packages in the user-defined folder for oneAPI (e.g. ``~/intel/oneapi``). + You can uninstall the oneAPI package by simply deleting the package folder, and unsetting the configuration of your working conda environment (e.g., with name ``llm``). + + .. code-block:: bash + + rm -r ~/intel/oneapi + conda env config vars unset LD_LIBRARY_PATH -n llm + + .. tab:: Offline installer + + Using the offline installer allows you to customize the installation path. + + .. code-block:: bash + + wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/20f4e6a1-6b0b-4752-b8c1-e5eacba10e01/l_BaseKit_p_2024.0.0.49564_offline.sh + sudo sh ./l_BaseKit_p_2024.0.0.49564_offline.sh + + .. note:: + + You can also modify the installation or uninstall the package by running the following commands: + + .. code-block:: bash + + cd /opt/intel/oneapi/installer + sudo ./installer + + .. 
tab:: PyTorch 2.0 (deprecated for versions ``ipex-llm >= 2.1.0b20240511``) + + To enable IPEX-LLM for Intel GPUs with PyTorch 2.0, here're several prerequisite steps for tools installation and environment preparation: + + + * Step 1: Install Intel GPU Driver version >= stable_775_20_20231219. Highly recommend installing the latest version of intel-i915-dkms using apt. + + .. seealso:: + + Please refer to our `driver installation `_ for general purpose GPU capabilities. + + See `release page `_ for latest version. + + * Step 2: Download and install `Intel® oneAPI Base Toolkit `_ with version 2023.2. OneDNN, OneMKL and DPC++ compiler are needed, others are optional. + + Intel® oneAPI Base Toolkit 2023.2 installation methods: + + .. tabs:: + .. tab:: APT installer + + Step 1: Set up repository + + .. code-block:: bash + + wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null + echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list + sudo apt update + + Step 2: Install the packages + + .. code-block:: bash + + sudo apt install -y intel-oneapi-common-vars=2023.2.0-49462 \ + intel-oneapi-compiler-cpp-eclipse-cfg=2023.2.0-49495 intel-oneapi-compiler-dpcpp-eclipse-cfg=2023.2.0-49495 \ + intel-oneapi-diagnostics-utility=2022.4.0-49091 \ + intel-oneapi-compiler-dpcpp-cpp=2023.2.0-49495 \ + intel-oneapi-mkl=2023.2.0-49495 intel-oneapi-mkl-devel=2023.2.0-49495 \ + intel-oneapi-mpi=2021.10.0-49371 intel-oneapi-mpi-devel=2021.10.0-49371 \ + intel-oneapi-tbb=2021.10.0-49541 intel-oneapi-tbb-devel=2021.10.0-49541\ + intel-oneapi-ccl=2021.10.0-49084 intel-oneapi-ccl-devel=2021.10.0-49084\ + intel-oneapi-dnnl-devel=2023.2.0-49516 intel-oneapi-dnnl=2023.2.0-49516 + + .. note:: + + You can uninstall the package by running the following command: + + .. code-block:: bash + + sudo apt autoremove intel-oneapi-common-vars + + .. tab:: PIP installer + + Step 1: Install oneAPI in a user-defined folder, e.g., ``~/intel/oneapi`` + + .. code-block:: bash + + export PYTHONUSERBASE=~/intel/oneapi + pip install dpcpp-cpp-rt==2023.2.0 mkl-dpcpp==2023.2.0 onednn-cpu-dpcpp-gpu-dpcpp==2023.2.0 --user + + .. note:: + + The oneAPI packages are visible in ``pip list`` only if ``PYTHONUSERBASE`` is properly set. + + Step 2: Configure your working conda environment (e.g. with name ``llm``) to append oneAPI path (e.g. ``~/intel/oneapi/lib``) to the environment variable ``LD_LIBRARY_PATH``. + + .. code-block:: bash + + conda env config vars set LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/intel/oneapi/lib -n llm + + .. note:: + You can view the configured environment variables for your environment (e.g. with name ``llm``) by running ``conda env config vars list -n llm``. + You can continue with your working conda environment and install ``ipex-llm`` as guided in the next section. + + .. note:: + + You are recommended not to install other pip packages in the user-defined folder for oneAPI (e.g. ``~/intel/oneapi``). + You can uninstall the oneAPI package by simply deleting the package folder, and unsetting the configuration of your working conda environment (e.g., with name ``llm``). + + .. code-block:: bash + + rm -r ~/intel/oneapi + conda env config vars unset LD_LIBRARY_PATH -n llm + + .. tab:: Offline installer + + Using the offline installer allows you to customize the installation path. + + .. 
code-block:: bash + + wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/992857b9-624c-45de-9701-f6445d845359/l_BaseKit_p_2023.2.0.49397_offline.sh + sudo sh ./l_BaseKit_p_2023.2.0.49397_offline.sh + + .. note:: + + You can also modify the installation or uninstall the package by running the following commands: + + .. code-block:: bash + + cd /opt/intel/oneapi/installer + sudo ./installer +``` + +### Install IPEX-LLM +#### Install IPEX-LLM From PyPI + +We recommend using [Miniforge](https://conda-forge.org/download/ to create a python 3.11 enviroment: + +```eval_rst +.. important:: + + ``ipex-llm`` is tested with Python 3.9, 3.10 and 3.11. Python 3.11 is recommended for best practices. +``` + +```eval_rst +.. important:: + Make sure you install matching versions of ipex-llm/pytorch/IPEX and oneAPI Base Toolkit. IPEX-LLM with Pytorch 2.1 should be used with oneAPI Base Toolkit version 2024.0. IPEX-LLM with Pytorch 2.0 should be used with oneAPI Base Toolkit version 2023.2. +``` + +```eval_rst +.. tabs:: + .. tab:: PyTorch 2.1 + Choose either US or CN website for ``extra-index-url``: + + .. tabs:: + .. tab:: US + + .. code-block:: bash + + conda create -n llm python=3.11 + conda activate llm + + pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ + + .. note:: + + The ``xpu`` option will install IPEX-LLM with PyTorch 2.1 by default, which is equivalent to + + .. code-block:: bash + + pip install --pre --upgrade ipex-llm[xpu_2.1] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ + + .. tab:: CN + + .. code-block:: bash + + conda create -n llm python=3.11 + conda activate llm + + pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/ + + .. note:: + + The ``xpu`` option will install IPEX-LLM with PyTorch 2.1 by default, which is equivalent to + + .. code-block:: bash + + pip install --pre --upgrade ipex-llm[xpu_2.1] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/ + + + .. tab:: PyTorch 2.0 (deprecated for versions ``ipex-llm >= 2.1.0b20240511``) + Choose either US or CN website for ``extra-index-url``: + + .. tabs:: + .. tab:: US + + .. code-block:: bash + + conda create -n llm python=3.11 + conda activate llm + + pip install --pre --upgrade ipex-llm[xpu_2.0]==2.1.0b20240510 --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ + + .. tab:: CN + + .. code-block:: bash + + conda create -n llm python=3.11 + conda activate llm + + pip install --pre --upgrade ipex-llm[xpu_2.0]==2.1.0b20240510 --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/ + +``` + +#### Install IPEX-LLM From Wheel + +If you encounter network issues when installing IPEX, you can also install IPEX-LLM dependencies for Intel XPU from source archives. First you need to download and install torch/torchvision/ipex from wheels listed below before installing `ipex-llm`. + +```eval_rst +.. tabs:: + .. tab:: PyTorch 2.1 + + .. 
code-block:: bash + + # get the wheels on Linux system for IPEX 2.1.10+xpu + wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/torch-2.1.0a0%2Bcxx11.abi-cp311-cp311-linux_x86_64.whl + wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/torchvision-0.16.0a0%2Bcxx11.abi-cp311-cp311-linux_x86_64.whl + wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/intel_extension_for_pytorch-2.1.10%2Bxpu-cp311-cp311-linux_x86_64.whl + + Then you may install directly from the wheel archives using following commands: + + .. code-block:: bash + + # install the packages from the wheels + pip install torch-2.1.0a0+cxx11.abi-cp311-cp311-linux_x86_64.whl + pip install torchvision-0.16.0a0+cxx11.abi-cp311-cp311-linux_x86_64.whl + pip install intel_extension_for_pytorch-2.1.10+xpu-cp311-cp311-linux_x86_64.whl + + # install ipex-llm for Intel GPU + pip install --pre --upgrade ipex-llm[xpu] + + .. tab:: PyTorch 2.0 (deprecated for versions ``ipex-llm >= 2.1.0b20240511``) + + .. code-block:: bash + + # get the wheels on Linux system for IPEX 2.0.110+xpu + wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/torch-2.0.1a0%2Bcxx11.abi-cp311-cp311-linux_x86_64.whl + wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/torchvision-0.15.2a0%2Bcxx11.abi-cp311-cp311-linux_x86_64.whl + wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/intel_extension_for_pytorch-2.0.110%2Bxpu-cp311-cp311-linux_x86_64.whl + + Then you may install directly from the wheel archives using following commands: + + .. code-block:: bash + + # install the packages from the wheels + pip install torch-2.0.1a0+cxx11.abi-cp311-cp311-linux_x86_64.whl + pip install torchvision-0.15.2a0+cxx11.abi-cp311-cp311-linux_x86_64.whl + pip install intel_extension_for_pytorch-2.0.110+xpu-cp311-cp311-linux_x86_64.whl + + # install ipex-llm for Intel GPU + pip install --pre --upgrade ipex-llm[xpu_2.0]==2.1.0b20240510 + +``` + +```eval_rst +.. note:: + + All the wheel packages mentioned here are for Python 3.11. If you would like to use Python 3.9 or 3.10, you should modify the wheel names for ``torch``, ``torchvision``, and ``intel_extension_for_pytorch`` by replacing ``cp11`` with ``cp39`` or ``cp310``, respectively. +``` + +### Runtime Configuration + +To use GPU acceleration on Linux, several environment variables are required or recommended before running a GPU example. + +```eval_rst +.. tabs:: + .. tab:: Intel Arc™ A-Series and Intel Data Center GPU Flex + + For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series, we recommend: + + .. code-block:: bash + + # Configure oneAPI environment variables. Required step for APT or offline installed oneAPI. + # Skip this step for PIP-installed oneAPI since the environment has already been configured in LD_LIBRARY_PATH. + source /opt/intel/oneapi/setvars.sh + + # Recommended Environment Variables for optimal performance + export USE_XETLA=OFF + export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 + export SYCL_CACHE_PERSISTENT=1 + + .. tab:: Intel Data Center GPU Max + + For Intel Data Center GPU Max Series, we recommend: + + .. code-block:: bash + + # Configure oneAPI environment variables. Required step for APT or offline installed oneAPI. + # Skip this step for PIP-installed oneAPI since the environment has already been configured in LD_LIBRARY_PATH. 
+ source /opt/intel/oneapi/setvars.sh + + # Recommended Environment Variables for optimal performance + export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so + export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 + export SYCL_CACHE_PERSISTENT=1 + export ENABLE_SDP_FUSION=1 + + Please note that ``libtcmalloc.so`` can be installed by ``conda install -c conda-forge -y gperftools=2.10`` + + .. tab:: Intel iGPU + + .. code-block:: bash + + # Configure oneAPI environment variables. Required step for APT or offline installed oneAPI. + # Skip this step for PIP-installed oneAPI since the environment has already been configured in LD_LIBRARY_PATH. + source /opt/intel/oneapi/setvars.sh + + export SYCL_CACHE_PERSISTENT=1 + export BIGDL_LLM_XMX_DISABLED=1 + +``` + +```eval_rst +.. note:: + + For **the first time** that **each model** runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. +``` + +### Known issues + +#### 1. Potential suboptimal performance with Linux kernel 6.2.0 + +For Ubuntu 22.04 and driver version < stable_775_20_20231219, the performance on Linux kernel 6.2.0 is worse than Linux kernel 5.19.0. You can use `sudo apt update && sudo apt install -y intel-i915-dkms intel-fw-gpu` to install the latest driver to solve this issue (need to reboot OS). + +Tips: You can use `sudo apt list --installed | grep intel-i915-dkms` to check your intel-i915-dkms's version, the version should be latest and >= `1.23.9.11.231003.15+i19-1`. + +#### 2. Driver installation unmet dependencies error: intel-i915-dkms + +The last apt install command of the driver installation may produce the following error: + +``` +The following packages have unmet dependencies: + intel-i915-dkms : Conflicts: intel-platform-cse-dkms + Conflicts: intel-platform-vsec-dkms +``` + +You can use `sudo apt install -y intel-i915-dkms intel-fw-gpu` to install instead. As the intel-platform-cse-dkms and intel-platform-vsec-dkms are already provided by intel-i915-dkms. + +### Troubleshooting + +#### 1. Cannot open shared object file: No such file or directory + +Error where libmkl file is not found, for example, + +``` +OSError: libmkl_intel_lp64.so.2: cannot open shared object file: No such file or directory +``` +``` +Error: libmkl_sycl_blas.so.4: cannot open shared object file: No such file or directory +``` + +The reason for such errors is that oneAPI has not been initialized properly before running IPEX-LLM code or before importing IPEX package. + +* For oneAPI installed using APT or Offline Installer, make sure you execute `setvars.sh` of oneAPI Base Toolkit before running IPEX-LLM. +* For PIP-installed oneAPI, activate your working environment and run ``echo $LD_LIBRARY_PATH`` to check if the installation path is properly configured for the environment. If the output does not contain oneAPI path (e.g. ``~/intel/oneapi/lib``), check [Prerequisites](#id1) to re-install oneAPI with PIP installer. +* Make sure you install matching versions of ipex-llm/pytorch/IPEX and oneAPI Base Toolkit. IPEX-LLM with PyTorch 2.1 should be used with oneAPI Base Toolkit version 2024.0. IPEX-LLM with PyTorch 2.0 should be used with oneAPI Base Toolkit version 2023.2. 
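+Once the library paths are resolved, a quick end-to-end check can confirm that the GPU installation works. The snippet below is a minimal sanity-check sketch (it assumes your working conda environment, e.g. `llm`, is active and that oneAPI has been configured as described above); it verifies that PyTorch, IPEX and your Intel GPU are all visible from Python:
+
+```python
+import torch
+import intel_extension_for_pytorch as ipex  # registers the 'xpu' device with PyTorch
+
+print(torch.__version__)  # e.g. a 2.1.0a0+cxx11.abi build for the PyTorch 2.1 option
+print(ipex.__version__)   # e.g. a 2.1.10+xpu build for the PyTorch 2.1 option
+
+# List the Intel GPUs visible to PyTorch; an empty list usually points to a
+# driver or oneAPI environment issue (see the sections above).
+print(torch.xpu.is_available())
+for i in range(torch.xpu.device_count()):
+    print(f"[{i}] {torch.xpu.get_device_name(i)}")
+```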
diff --git a/docs/mddocs/Overview/known_issues.md b/docs/mddocs/Overview/known_issues.md new file mode 100644 index 00000000..5b2621db --- /dev/null +++ b/docs/mddocs/Overview/known_issues.md @@ -0,0 +1 @@ +# IPEX-LLM Known Issues \ No newline at end of file diff --git a/docs/mddocs/Overview/llm.md b/docs/mddocs/Overview/llm.md new file mode 100644 index 00000000..ef0cba3a --- /dev/null +++ b/docs/mddocs/Overview/llm.md @@ -0,0 +1,68 @@ +# IPEX-LLM in 5 minutes + +You can use IPEX-LLM to run any [*Hugging Face Transformers*](https://huggingface.co/docs/transformers/index) PyTorch model. It automatically optimizes and accelerates LLMs using low-precision (INT4/INT5/INT8) techniques, modern hardware accelerations and latest software optimizations. + +Hugging Face transformers-based applications can run on IPEX-LLM with one-line code change, and you'll immediately observe significant speedup[1]. + +Here, let's take a relatively small LLM model, i.e [open_llama_3b_v2](https://huggingface.co/openlm-research/open_llama_3b_v2), and IPEX-LLM INT4 optimizations as an example. + +## Load a Pretrained Model + +Simply use one-line `transformers`-style API in `ipex-llm` to load `open_llama_3b_v2` with INT4 optimization (by specifying `load_in_4bit=True`) as follows: + +```python +from ipex_llm.transformers import AutoModelForCausalLM + +model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path="openlm-research/open_llama_3b_v2", + load_in_4bit=True) +``` + +```eval_rst +.. tip:: + + `open_llama_3b_v2 `_ is a pretrained large language model hosted on Hugging Face. ``openlm-research/open_llama_3b_v2`` is its Hugging Face model id. ``from_pretrained`` will automatically download the model from Hugging Face to a local cache path (e.g. ``~/.cache/huggingface``), load the model, and converted it to ``ipex-llm`` INT4 format. + + It may take a long time to download the model using API. You can also download the model yourself, and set ``pretrained_model_name_or_path`` to the local path of the downloaded model. This way, ``from_pretrained`` will load and convert directly from local path without download. +``` +## Load Tokenizer + +You also need a tokenizer for inference. Just use the official `transformers` API to load `LlamaTokenizer`: + +```python +from transformers import LlamaTokenizer + +tokenizer = LlamaTokenizer.from_pretrained(pretrained_model_name_or_path="openlm-research/open_llama_3b_v2") +``` + +## Run LLM + +Now you can do model inference exactly the same way as using official `transformers` API: + +```python +import torch + +with torch.inference_mode(): + prompt = 'Q: What is CPU?\nA:' + + # tokenize the input prompt from string to token ids + input_ids = tokenizer.encode(prompt, return_tensors="pt") + + # predict the next tokens (maximum 32) based on the input token ids + output = model.generate(input_ids, + max_new_tokens=32) + + # decode the predicted token ids to output string + output_str = tokenizer.decode(output[0], skip_special_tokens=True) + + print(output_str) +``` + +------ + +
+
+[1] Performance varies by use, configuration and other factors. ipex-llm may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.
+
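+
+If you plan to reuse the INT4 model from the example above, `ipex-llm` also provides `save_low_bit` and `load_low_bit` to save the converted weights and load them back later, which avoids repeating the conversion on every run. Below is a minimal sketch; the directory `./open-llama-3b-v2-int4` is only an illustrative local path.
+
+```python
+from ipex_llm.transformers import AutoModelForCausalLM
+
+# convert once and save the INT4 weights (illustrative local path)
+model = AutoModelForCausalLM.from_pretrained("openlm-research/open_llama_3b_v2",
+                                             load_in_4bit=True)
+model.save_low_bit("./open-llama-3b-v2-int4")
+
+# on later runs, load the already-converted weights directly
+model = AutoModelForCausalLM.load_low_bit("./open-llama-3b-v2-int4")
+```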
diff --git a/docs/mddocs/Quickstart/axolotl_quickstart.md b/docs/mddocs/Quickstart/axolotl_quickstart.md new file mode 100644 index 00000000..4a2cbb3a --- /dev/null +++ b/docs/mddocs/Quickstart/axolotl_quickstart.md @@ -0,0 +1,314 @@ +# Finetune LLM with Axolotl on Intel GPU + +[Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) is a popular tool designed to streamline the fine-tuning of various AI models, offering support for multiple configurations and architectures. You can now use [`ipex-llm`](https://github.com/intel-analytics/ipex-llm) as an accelerated backend for `Axolotl` running on Intel **GPU** *(e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max)*. + +See the demo of finetuning LLaMA2-7B on Intel Arc GPU below. + + + +## Quickstart + +### 0. Prerequisites + +IPEX-LLM's support for [Axolotl v0.4.0](https://github.com/OpenAccess-AI-Collective/axolotl/tree/v0.4.0) is only available for Linux system. We recommend Ubuntu 20.04 or later (Ubuntu 22.04 is preferred). + +Visit the [Install IPEX-LLM on Linux with Intel GPU](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html), follow [Install Intel GPU Driver](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#install-intel-gpu-driver) and [Install oneAPI](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#install-oneapi) to install GPU driver and Intel® oneAPI Base Toolkit 2024.0. + +### 1. Install IPEX-LLM for Axolotl + +Create a new conda env, and install `ipex-llm[xpu]`. + +```cmd +conda create -n axolotl python=3.11 +conda activate axolotl +# install ipex-llm +pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ +``` + +Install [axolotl v0.4.0](https://github.com/OpenAccess-AI-Collective/axolotl/tree/v0.4.0) from git. + +```cmd +# install axolotl v0.4.0 +git clone https://github.com/OpenAccess-AI-Collective/axolotl/tree/v0.4.0 +cd axolotl +# replace requirements.txt +remove requirements.txt +wget -O requirements.txt https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/requirements-xpu.txt +pip install -e . +pip install transformers==4.36.0 +# to avoid https://github.com/OpenAccess-AI-Collective/axolotl/issues/1544 +pip install datasets==2.15.0 +# prepare axolotl entrypoints +wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/finetune.py +wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/train.py +``` + +**After the installation, you should have created a conda environment, named `axolotl` for instance, for running `Axolotl` commands with IPEX-LLM.** + +### 2. Example: Finetune Llama-2-7B with Axolotl + +The following example will introduce finetuning [Llama-2-7B](https://huggingface.co/meta-llama/Llama-2-7b) with [alpaca_2k_test](https://huggingface.co/datasets/mhenrichsen/alpaca_2k_test) dataset using LoRA and QLoRA. + +Note that you don't need to write any code in this example. + +| Model | Dataset | Finetune method | +|-------|-------|-------| +| Llama-2-7B | alpaca_2k_test | LoRA (Low-Rank Adaptation) | +| Llama-2-7B | alpaca_2k_test | QLoRA (Quantized Low-Rank Adaptation) | + +For more technical details, please refer to [Llama 2](https://arxiv.org/abs/2307.09288), [LoRA](https://arxiv.org/abs/2106.09685) and [QLoRA](https://arxiv.org/abs/2305.14314). 
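+
+Before downloading the model and dataset, it can be worth confirming that the `axolotl` conda environment can actually see the Intel GPU. A minimal check (sketch) is:
+
+```bash
+# run inside the activated `axolotl` env; expects `True` and a device count >= 1
+python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.xpu.is_available(), torch.xpu.device_count())"
+```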
+ +#### 2.1 Download Llama-2-7B and alpaca_2k_test + +By default, Axolotl will automatically download models and datasets from Huggingface. Please ensure you have login to Huggingface. + +```cmd +huggingface-cli login +``` + +If you prefer offline models and datasets, please download [Llama-2-7B](https://huggingface.co/meta-llama/Llama-2-7b) and [alpaca_2k_test](https://huggingface.co/datasets/mhenrichsen/alpaca_2k_test). Then, set `HF_HUB_OFFLINE=1` to avoid connecting to Huggingface. + +```cmd +export HF_HUB_OFFLINE=1 +``` + +#### 2.2 Set Environment Variables + +```eval_rst +.. note:: + + This is a required step on for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI. +``` + +Configure oneAPI variables by running the following command: + +```eval_rst +.. tabs:: + .. tab:: Linux + + .. code-block:: bash + + source /opt/intel/oneapi/setvars.sh + +``` + +Configure accelerate to avoid training with CPU. You can download a default `default_config.yaml` with `use_cpu: false`. + +```cmd +mkdir -p ~/.cache/huggingface/accelerate/ +wget -O ~/.cache/huggingface/accelerate/default_config.yaml https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/default_config.yaml +``` + +As an alternative, you can config accelerate based on your requirements. + +```cmd +accelerate config +``` + +Please answer `NO` in option `Do you want to run your training on CPU only (even if a GPU / Apple Silicon device is available)? [yes/NO]:`. + +After finishing accelerate config, check if `use_cpu` is disabled (i.e., `use_cpu: false`) in accelerate config file (`~/.cache/huggingface/accelerate/default_config.yaml`). + +#### 2.3 LoRA finetune + +Prepare `lora.yml` for Axolotl LoRA finetune. You can download a template from github. + +```cmd +wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/lora.yml +``` + +**If you are using the offline model and dataset in local env**, please modify the model path and dataset path in `lora.yml`. Otherwise, keep them unchanged. + +```yaml +# Please change to local path if model is offline, e.g., /path/to/model/Llama-2-7b-hf +base_model: NousResearch/Llama-2-7b-hf +datasets: + # Please change to local path if dataset is offline, e.g., /path/to/dataset/alpaca_2k_test + - path: mhenrichsen/alpaca_2k_test + type: alpaca +``` + +Modify LoRA parameters, such as `lora_r` and `lora_alpha`, etc. + +```yaml +adapter: lora +lora_model_dir: + +lora_r: 32 +lora_alpha: 16 +lora_dropout: 0.05 +lora_target_linear: true +lora_fan_in_fan_out: +``` + +Launch LoRA training with the following command. + +```cmd +accelerate launch finetune.py lora.yml +``` + +In Axolotl v0.4.0, you can use `train.py` instead of `-m axolotl.cli.train` or `finetune.py`. + +```cmd +accelerate launch train.py lora.yml +``` + +#### 2.4 QLoRA finetune + +Prepare `lora.yml` for QLoRA finetune. You can download a template from github. + +```cmd +wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/qlora.yml +``` + +**If you are using the offline model and dataset in local env**, please modify the model path and dataset path in `qlora.yml`. Otherwise, keep them unchanged. 
+ +```yaml +# Please change to local path if model is offline, e.g., /path/to/model/Llama-2-7b-hf +base_model: NousResearch/Llama-2-7b-hf +datasets: + # Please change to local path if dataset is offline, e.g., /path/to/dataset/alpaca_2k_test + - path: mhenrichsen/alpaca_2k_test + type: alpaca +``` + +Modify QLoRA parameters, such as `lora_r` and `lora_alpha`, etc. + +```yaml +adapter: qlora +lora_model_dir: + +lora_r: 32 +lora_alpha: 16 +lora_dropout: 0.05 +lora_target_modules: +lora_target_linear: true +lora_fan_in_fan_out: +``` + +Launch LoRA training with the following command. + +```cmd +accelerate launch finetune.py qlora.yml +``` + +In Axolotl v0.4.0, you can use `train.py` instead of `-m axolotl.cli.train` or `finetune.py`. + +```cmd +accelerate launch train.py qlora.yml +``` + +### 3. Finetune Llama-3-8B (Experimental) + +Warning: this section will install axolotl main ([796a085](https://github.com/OpenAccess-AI-Collective/axolotl/tree/796a085b2f688f4a5efe249d95f53ff6833bf009)) for new features, e.g., Llama-3-8B. + +#### 3.1 Install Axolotl main in conda + +Axolotl main has lots of new dependencies. Please setup a new conda env for this version. + +```cmd +conda create -n llm python=3.11 +conda activate llm +# install axolotl main +git clone https://github.com/OpenAccess-AI-Collective/axolotl +cd axolotl && git checkout 796a085 +pip install -e . +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ +# install transformers etc +# to avoid https://github.com/OpenAccess-AI-Collective/axolotl/issues/1544 +pip install datasets==2.15.0 +pip install transformers==4.37.0 +``` + +Config accelerate and oneAPIs, according to [Set Environment Variables](#22-set-environment-variables). + +#### 3.2 Alpaca QLoRA + +Based on [axolotl Llama-3 QLoRA example](https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/examples/llama-3/qlora.yml). + +Prepare `llama3-qlora.yml` for QLoRA finetune. You can download a template from github. + +```cmd +wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/llama3-qlora.yml +``` + +**If you are using the offline model and dataset in local env**, please modify the model path and dataset path in `llama3-qlora.yml`. Otherwise, keep them unchanged. + +```yaml +# Please change to local path if model is offline, e.g., /path/to/model/Meta-Llama-3-8B +base_model: meta-llama/Meta-Llama-3-8B +datasets: + # Please change to local path if dataset is offline, e.g., /path/to/dataset/alpaca_2k_test + - path: aaditya/alpaca_subset_1 + type: alpaca +``` + +Modify QLoRA parameters, such as `lora_r` and `lora_alpha`, etc. + +```yaml +adapter: qlora +lora_model_dir: + +sequence_len: 256 +sample_packing: true +pad_to_sequence_len: true + +lora_r: 32 +lora_alpha: 16 +lora_dropout: 0.05 +lora_target_modules: +lora_target_linear: true +lora_fan_in_fan_out: +``` + +```cmd +accelerate launch finetune.py llama3-qlora.yml +``` + +You can also use `train.py` instead of `-m axolotl.cli.train` or `finetune.py`. 
+ +```cmd +accelerate launch train.py llama3-qlora.yml +``` + +Expected output + +```cmd +{'loss': 0.237, 'learning_rate': 1.2254711850265387e-06, 'epoch': 3.77} +{'loss': 0.6068, 'learning_rate': 1.1692453482951115e-06, 'epoch': 3.77} +{'loss': 0.2926, 'learning_rate': 1.1143322458989303e-06, 'epoch': 3.78} +{'loss': 0.2475, 'learning_rate': 1.0607326072295087e-06, 'epoch': 3.78} +{'loss': 0.1531, 'learning_rate': 1.008447144232094e-06, 'epoch': 3.79} +{'loss': 0.1799, 'learning_rate': 9.57476551396197e-07, 'epoch': 3.79} +{'loss': 0.2724, 'learning_rate': 9.078215057463868e-07, 'epoch': 3.79} +{'loss': 0.2534, 'learning_rate': 8.594826668332445e-07, 'epoch': 3.8} +{'loss': 0.3388, 'learning_rate': 8.124606767246579e-07, 'epoch': 3.8} +{'loss': 0.3867, 'learning_rate': 7.667561599972505e-07, 'epoch': 3.81} +{'loss': 0.2108, 'learning_rate': 7.223697237281668e-07, 'epoch': 3.81} +{'loss': 0.0792, 'learning_rate': 6.793019574868775e-07, 'epoch': 3.82} +``` + +## Troubleshooting + +#### TypeError: PosixPath + +Error message: `TypeError: argument of type 'PosixPath' is not iterable` + +This issue is related to [axolotl #1544](https://github.com/OpenAccess-AI-Collective/axolotl/issues/1544). It can be fixed by downgrading datasets to 2.15.0. + +```cmd +pip install datasets==2.15.0 +``` + +#### RuntimeError: out of device memory + +Error message: `RuntimeError: Allocation is out of device memory on current platform.` + +This issue is caused by running out of GPU memory. Please reduce `lora_r` or `micro_batch_size` in `qlora.yml` or `lora.yml`, or reduce data using in training. + +#### OSError: libmkl_intel_lp64.so.2 + +Error message: `OSError: libmkl_intel_lp64.so.2: cannot open shared object file: No such file or directory` + +oneAPI environment is not correctly set. Please refer to [Set Environment Variables](#set-environment-variables). diff --git a/docs/mddocs/Quickstart/benchmark_quickstart.md b/docs/mddocs/Quickstart/benchmark_quickstart.md new file mode 100644 index 00000000..ba26b770 --- /dev/null +++ b/docs/mddocs/Quickstart/benchmark_quickstart.md @@ -0,0 +1,174 @@ +# Run Performance Benchmarking with IPEX-LLM + +We can perform benchmarking for IPEX-LLM on Intel CPUs and GPUs using the benchmark scripts we provide. + +## Prepare The Environment + +You can refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install.html) to install IPEX-LLM in your environment. The following dependencies are also needed to run the benchmark scripts. + +``` +pip install pandas +pip install omegaconf +``` + +## Prepare The Scripts + +Navigate to your local workspace and then download IPEX-LLM from GitHub. Modify the `config.yaml` under `all-in-one` folder for your benchmark configurations. + +``` +cd your/local/workspace +git clone https://github.com/intel-analytics/ipex-llm.git +cd ipex-llm/python/llm/dev/benchmark/all-in-one/ +``` + +## config.yaml + + +```yaml +repo_id: + - 'meta-llama/Llama-2-7b-chat-hf' +local_model_hub: 'path to your local model hub' +warm_up: 1 # must set >=2 when run "pipeline_parallel_gpu" test_api +num_trials: 3 +num_beams: 1 # default to greedy search +low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. 
symmetric int4) +batch_size: 1 # default to 1 +in_out_pairs: + - '32-32' + - '1024-128' + - '2048-256' +test_api: + - "transformer_int4_gpu" # on Intel GPU, transformer-like API, (qtype=int4) +cpu_embedding: False # whether put embedding to CPU +streaming: False # whether output in streaming way (only avaiable now for gpu win related test_api) +task: 'continuation' # task can be 'continuation', 'QA' and 'summarize' +``` + +Some parameters in the yaml file that you can configure: + + +- `repo_id`: The name of the model and its organization. +- `local_model_hub`: The folder path where the models are stored on your machine. Replace 'path to your local model hub' with /llm/models. +- `warm_up`: The number of warmup trials before performance benchmarking (must set to >= 2 when using "pipeline_parallel_gpu" test_api). +- `num_trials`: The number of runs for performance benchmarking (the final result is the average of all trials). +- `low_bit`: The low_bit precision you want to convert to for benchmarking. +- `batch_size`: The number of samples on which the models make predictions in one forward pass. +- `in_out_pairs`: Input sequence length and output sequence length combined by '-'. +- `test_api`: Different test functions for different machines. + - `transformer_int4_gpu` on Intel GPU for Linux + - `transformer_int4_gpu_win` on Intel GPU for Windows + - `transformer_int4` on Intel CPU +- `cpu_embedding`: Whether to put embedding on CPU (only available for windows GPU-related test_api). +- `streaming`: Whether to output in a streaming way (only available for GPU Windows-related test_api). +- `use_fp16_torch_dtype`: Whether to use fp16 for the non-linear layer (only available for "pipeline_parallel_gpu" test_api). +- `n_gpu`: Number of GPUs to use (only available for "pipeline_parallel_gpu" test_api). +- `task`: There are three tasks: `continuation`, `QA` and `summarize`. `continuation` refers to writing additional content based on prompt. `QA` refers to answering questions based on prompt. `summarize` refers to summarizing the prompt. + + +```eval_rst +.. note:: + + If you want to benchmark the performance without warmup, you can set ``warm_up: 0`` and ``num_trials: 1`` in ``config.yaml``, and run each single model and in_out_pair separately. +``` + + +## Run on Windows + +Please refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#runtime-configuration) to configure oneAPI environment variables. + +```eval_rst +.. tabs:: + .. tab:: Intel iGPU + + .. code-block:: bash + + set SYCL_CACHE_PERSISTENT=1 + set BIGDL_LLM_XMX_DISABLED=1 + + python run.py + + .. tab:: Intel Arc™ A300-Series or Pro A60 + + .. code-block:: bash + + set SYCL_CACHE_PERSISTENT=1 + python run.py + + .. tab:: Other Intel dGPU Series + + .. code-block:: bash + + # e.g. Arc™ A770 + python run.py + +``` + +## Run on Linux + +```eval_rst +.. tabs:: + .. tab:: Intel Arc™ A-Series and Intel Data Center GPU Flex + + For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series, we recommend: + + .. code-block:: bash + + ./run-arc.sh + + .. tab:: Intel iGPU + + For Intel iGPU, we recommend: + + .. code-block:: bash + + ./run-igpu.sh + + .. tab:: Intel Data Center GPU Max + + Please note that you need to run ``conda install -c conda-forge -y gperftools=2.10`` before running the benchmark script on Intel Data Center GPU Max Series. + + .. code-block:: bash + + ./run-max-gpu.sh + + .. tab:: Intel SPR + + For Intel SPR machine, we recommend: + + .. 
code-block:: bash + + ./run-spr.sh + + The scipt uses a default numactl strategy. If you want to customize it, please use ``lscpu`` or ``numactl -H`` to check how cpu indexs are assigned to numa node, and make sure the run command is binded to only one socket. + + .. tab:: Intel HBM + + For Intel HBM machine, we recommend: + + .. code-block:: bash + + ./run-hbm.sh + + The scipt uses a default numactl strategy. If you want to customize it, please use ``numactl -H`` to check how the index of hbm node and cpu are assigned. + + For example: + + + .. code-block:: bash + + node 0 1 2 3 + 0: 10 21 13 23 + 1: 21 10 23 13 + 2: 13 23 10 23 + 3: 23 13 23 10 + + + here hbm node is the node whose distance from the checked node is 13, node 2 is node 0's hbm node. + + And make sure the run command is binded to only one socket. + +``` + +## Result + +After the benchmarking is completed, you can obtain a CSV result file under the current folder. You can mainly look at the results of columns `1st token avg latency (ms)` and `2+ avg latency (ms/token)` for the benchmark results. You can also check whether the column `actual input/output tokens` is consistent with the column `input/output tokens` and whether the parameters you specified in `config.yaml` have been successfully applied in the benchmarking. diff --git a/docs/mddocs/Quickstart/bigdl_llm_migration.md b/docs/mddocs/Quickstart/bigdl_llm_migration.md new file mode 100644 index 00000000..a1ef5051 --- /dev/null +++ b/docs/mddocs/Quickstart/bigdl_llm_migration.md @@ -0,0 +1,63 @@ +# `bigdl-llm` Migration Guide + +This guide helps you migrate your `bigdl-llm` application to use `ipex-llm`. + +## Upgrade `bigdl-llm` package to `ipex-llm` + +```eval_rst +.. note:: + This step assumes you have already installed `bigdl-llm`. +``` +You need to uninstall `bigdl-llm` and install `ipex-llm`With your `bigdl-llm` conda environment activated, execute the following command according to your device type and location: + +### For CPU + +```bash +pip uninstall -y bigdl-llm +pip install --pre --upgrade ipex-llm[all] # for cpu +``` + +### For GPU +Choose either US or CN website for `extra-index-url`: +```eval_rst +.. tabs:: + + .. tab:: US + + .. code-block:: cmd + + pip uninstall -y bigdl-llm + pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ + + .. tab:: CN + + .. code-block:: cmd + + pip uninstall -y bigdl-llm + pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/ +``` + +## Migrate `bigdl-llm` code to `ipex-llm` +There are two options to migrate `bigdl-llm` code to `ipex-llm`. + +### 1. Upgrade `bigdl-llm` code to `ipex-llm` +To upgrade `bigdl-llm` code to `ipex-llm`, simply replace all `bigdl.llm` with `ipex_llm`: + +```python +#from bigdl.llm.transformers import AutoModelForCausalLM # Original line +from ipex_llm.transformers import AutoModelForCausalLM #Updated line +model = AutoModelForCausalLM.from_pretrained(model_path, + load_in_4bit=True, + trust_remote_code=True) +``` + +### 2. 
Run `bigdl-llm` code in compatible mode (experimental) +To run in the compatible mode, simply add `import ipex_llm` at the beginning of the existing `bigdl-llm` code: + +```python +import ipex_llm # Add this line before any bigdl.llm imports +from bigdl.llm.transformers import AutoModelForCausalLM +model = AutoModelForCausalLM.from_pretrained(model_path, + load_in_4bit=True, + trust_remote_code=True) +``` diff --git a/docs/mddocs/Quickstart/chatchat_quickstart.md b/docs/mddocs/Quickstart/chatchat_quickstart.md new file mode 100644 index 00000000..e482751a --- /dev/null +++ b/docs/mddocs/Quickstart/chatchat_quickstart.md @@ -0,0 +1,82 @@ +# Run Local RAG using Langchain-Chatchat on Intel CPU and GPU + +[chatchat-space/Langchain-Chatchat](https://github.com/chatchat-space/Langchain-Chatchat) is a Knowledge Base QA application using RAG pipeline; by porting it to [`ipex-llm`](https://github.com/intel-analytics/ipex-llm), users can now easily run ***local RAG pipelines*** using [Langchain-Chatchat](https://github.com/intel-analytics/Langchain-Chatchat) with LLMs and Embedding models on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max). + +*See the demos of running LLaMA2-7B (English) and ChatGLM-3-6B (Chinese) on an Intel Core Ultra laptop below.* + + + + + + + + + + +
+*(Demo videos: English | 简体中文)*
+ +>You can change the UI language in the left-side menu. We currently support **English** and **简体中文** (see video demos below). + +## Langchain-Chatchat Architecture + +See the Langchain-Chatchat architecture below ([source](https://github.com/chatchat-space/Langchain-Chatchat/blob/master/img/langchain%2Bchatglm.png)). + + + +## Quickstart + +### Install and Run + +Follow the guide that corresponds to your specific system and device from the links provided below: + +- For systems with Intel Core Ultra integrated GPU: [Windows Guide](https://github.com/intel-analytics/Langchain-Chatchat/blob/ipex-llm/INSTALL_win_mtl.md#) | [Linux Guide](https://github.com/intel-analytics/Langchain-Chatchat/blob/ipex-llm/INSTALL_linux_mtl.md#) +- For systems with Intel Arc A-Series GPU: [Windows Guide](https://github.com/intel-analytics/Langchain-Chatchat/blob/ipex-llm/INSTALL_win_arc.md#) | [Linux Guide](https://github.com/intel-analytics/Langchain-Chatchat/blob/ipex-llm/INSTALL_linux_arc.md#) +- For systems with Intel Data Center Max Series GPU: [Linux Guide](https://github.com/intel-analytics/Langchain-Chatchat/blob/ipex-llm/INSTALL_linux_max.md#) +- For systems with Xeon-Series CPU: [Linux Guide](https://github.com/intel-analytics/Langchain-Chatchat/blob/ipex-llm/INSTALL_linux_xeon.md#) + +### How to use RAG + +#### Step 1: Create Knowledge Base + +- Select `Manage Knowledge Base` from the menu on the left, then choose `New Knowledge Base` from the dropdown menu on the right side. + + + rag-menu + + +- Fill in the name of your new knowledge base (example: "test") and press the `Create` button. Adjust any other settings as needed. + + + rag-menu + + +- Upload knowledge files from your computer and allow some time for the upload to complete. Once finished, click on `Add files to Knowledge Base` button to build the vector store. Note: this process may take several minutes. + + + rag-menu + + +#### Step 2: Chat with RAG + +You can now click `Dialogue` on the left-side menu to return to the chat UI. Then in `Knowledge base settings` menu, choose the Knowledge Base you just created, e.g, "test". Now you can start chatting. + + + rag-menu + + +
+ +For more information about how to use Langchain-Chatchat, refer to Official Quickstart guide in [English](./README_en.md#), [Chinese](./README_chs.md#), or the [Wiki](https://github.com/chatchat-space/Langchain-Chatchat/wiki/). + +### Trouble Shooting & Tips + +#### 1. Version Compatibility + +Ensure that you have installed `ipex-llm>=2.1.0b20240327`. To upgrade `ipex-llm`, use +```bash +pip install --pre --upgrade ipex-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu +``` + +#### 2. Prompt Templates + +In the left-side menu, you have the option to choose a prompt template. There're several pre-defined templates - those ending with '_cn' are Chinese templates, and those ending with '_en' are English templates. You can also define your own prompt templates in `configs/prompt_config.py`. Remember to restart the service to enable these changes. diff --git a/docs/mddocs/Quickstart/continue_quickstart.md b/docs/mddocs/Quickstart/continue_quickstart.md new file mode 100644 index 00000000..68623118 --- /dev/null +++ b/docs/mddocs/Quickstart/continue_quickstart.md @@ -0,0 +1,169 @@ + +# Run Coding Copilot in VSCode with Intel GPU + +[**Continue**](https://marketplace.visualstudio.com/items?itemName=Continue.continue) is a coding copilot extension in [Microsoft Visual Studio Code](https://code.visualstudio.com/); by integrating it with [`ipex-llm`](https://github.com/intel-analytics/ipex-llm), users can now easily leverage local LLMs running on Intel GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max) for code explanation, code generation/completion, etc. + +Below is a demo of using `Continue` with [CodeQWen1.5-7B](https://huggingface.co/Qwen/CodeQwen1.5-7B-Chat) running on Intel A770 GPU. This demo illustrates how a programmer used `Continue` to find a solution for the [Kaggle's _Titanic_ challenge](https://www.kaggle.com/competitions/titanic/), which involves asking `Continue` to complete the code for model fitting, evaluation, hyper parameter tuning, feature engineering, and explain generated code. + + + +## Quickstart + +This guide walks you through setting up and running **Continue** within _Visual Studio Code_, empowered by local large language models served via [Ollama](./ollama_quickstart.html) with `ipex-llm` optimizations. + +### 1. Install and Run Ollama Serve + +Visit [Run Ollama with IPEX-LLM on Intel GPU](./ollama_quickstart.html), and follow the steps 1) [Install IPEX-LLM for Ollama](./ollama_quickstart.html#install-ipex-llm-for-ollama), 2) [Initialize Ollama](./ollama_quickstart.html#initialize-ollama) 3) [Run Ollama Serve](./ollama_quickstart.html#run-ollama-serve) to install, init and start the Ollama Service. + + +```eval_rst +.. important:: + + If the `Continue` plugin is not installed on the same machine where Ollama is running (which means `Continue` needs to connect to a remote Ollama service), you must configure the Ollama service to accept connections from any IP address. To achieve this, set or export the environment variable `OLLAMA_HOST=0.0.0.0` before executing the command `ollama serve`. + +.. tip:: + + If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance before executing `ollama serve`: + + .. code-block:: bash + + export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 +``` + +### 2. Pull and Prepare the Model + +#### 2.1 Pull Model + +Now we need to pull a model for coding. 
Here we use [CodeQWen1.5-7B](https://huggingface.co/Qwen/CodeQwen1.5-7B-Chat) model as an example. Open a new terminal window, run the following command to pull [`codeqwen:latest`](https://ollama.com/library/codeqwen). + + +```eval_rst +.. tabs:: + .. tab:: Linux + + .. code-block:: bash + + export no_proxy=localhost,127.0.0.1 + ./ollama pull codeqwen:latest + + .. tab:: Windows + + Please run the following command in Miniforge Prompt. + + .. code-block:: cmd + + set no_proxy=localhost,127.0.0.1 + ollama pull codeqwen:latest + +.. seealso:: + + Besides CodeQWen, there are other coding models you might want to explore, such as Magicoder, Wizardcoder, Codellama, Codegemma, Starcoder, Starcoder2, and etc. You can find these models in the `Ollama model library `_. Simply search for the model, pull it in a similar manner, and give it a try. +``` + + +#### 2.2 Prepare the Model and Pre-load + +To make `Continue` run more smoothly with Ollama, we will create a new model in Ollama using the original model with an adjusted num_ctx parameter of 4096. + +Start by creating a file named `Modelfile` with the following content: + + +```dockerfile +FROM codeqwen:latest +PARAMETER num_ctx 4096 +``` +Next, use the following commands in the terminal (Linux) or Miniforge Prompt (Windows) to create a new model in Ollama named `codeqwen:latest-continue`: + + +```bash + ollama create codeqwen:latest-continue -f Modelfile +``` + +After creation, run `ollama list` to see `codeqwen:latest-continue` in the list of models. + +Finally, preload the new model by executing the following command in a new terminal (Linux) or Miniforge Prompt (Windows): + +```bash +ollama run codeqwen:latest-continue +``` + + + +### 3. Install `Continue` Extension + +Search for `Continue` in the VSCode `Extensions Marketplace` and install it just like any other extension. + + + + +
+ +Once installed, the `Continue` icon will appear on the left sidebar. You can drag and drop the icon to the right sidebar for easy access to the `Continue` view. + + + + +
+ +If the icon does not appear or you cannot open the view, press `Ctrl+Shift+L` or follow the steps below to open the `Continue` view on the right side. + + + + +
+ +Once you have successfully opened the `Continue` view, you will see the welcome screen as shown below. Select **Fully local** -> **Continue** -> **Continue** as illustrated. + + + + +When you see the screen below, your plug-in is ready to use. + + + + + +### 4. `Continue` Configuration + +Once `Continue` is installed and ready, simply select the model "`Ollama - codeqwen:latest-continue`" from the bottom of the `Continue` view (all models in `ollama list` will appear in the format `Ollama-xxx`). + +Now you can start using `Continue`. + +#### Connecting to Remote Ollama Service + +You can configure `Continue` by clicking the small gear icon located at the bottom right of the `Continue` view to open `config.json`. In `config.json`, you will find all necessary configuration settings. + +If you are running Ollama on the same machine as `Continue`, no changes are necessary. If Ollama is running on a different machine, you'll need to update the `apiBase` key in `Ollama` item in `config.json` to point to the remote Ollama URL, as shown in the example below and in the figure. + +```json + { + "title": "Ollama", + "provider": "ollama", + "model": "AUTODETECT", + "apiBase": "http://your-ollama-service-ip:11434" + } +``` + + + + + + + +### 5. How to Use `Continue` +For detailed tutorials please refer to [this link](https://continue.dev/docs/how-to-use-continue). Here we are only showing the most common scenarios. + +#### Q&A over specific code +If you don't understand how some code works, highlight(press `Ctrl+Shift+L`) it and ask "how does this code work?" + + + + + +#### Editing code +You can ask Continue to edit your highlighted code with the command `/edit`. + + + + + diff --git a/docs/mddocs/Quickstart/deepspeed_autotp_fastapi_quickstart.md b/docs/mddocs/Quickstart/deepspeed_autotp_fastapi_quickstart.md new file mode 100644 index 00000000..f99c6731 --- /dev/null +++ b/docs/mddocs/Quickstart/deepspeed_autotp_fastapi_quickstart.md @@ -0,0 +1,102 @@ +# Run IPEX-LLM serving on Multiple Intel GPUs using DeepSpeed AutoTP and FastApi + +This example demonstrates how to run IPEX-LLM serving on multiple [Intel GPUs](../README.md) by leveraging DeepSpeed AutoTP. + +## Requirements + +To run this example with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information. For this particular example, you will need at least two GPUs on your machine. + +## Example + +### 1. Install + +```bash +conda create -n llm python=3.11 +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ +pip install oneccl_bind_pt==2.1.100 --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ +# configures OneAPI environment variables +source /opt/intel/oneapi/setvars.sh +pip install git+https://github.com/microsoft/DeepSpeed.git@ed8aed5 +pip install git+https://github.com/intel/intel-extension-for-deepspeed.git@0eb734b +pip install mpi4py fastapi uvicorn +conda install -c conda-forge -y gperftools=2.10 # to enable tcmalloc +``` + +> **Important**: IPEX 2.1.10+xpu requires Intel® oneAPI Base Toolkit's version == 2024.0. Please make sure you have installed the correct version. + +### 2. 
Run tensor parallel inference on multiple GPUs + +When we run the model in a distributed manner across two GPUs, the memory consumption of each GPU is only half of what it was originally, and the GPUs can work simultaneously during inference computation. + +We provide example usage for `Llama-2-7b-chat-hf` model running on Arc A770 + +Run Llama-2-7b-chat-hf on two Intel Arc A770: + +```bash + +# Before run this script, you should adjust the YOUR_REPO_ID_OR_MODEL_PATH in last line +# If you want to change server port, you can set port parameter in last line + +# To avoid GPU OOM, you could adjust --max-num-seqs and --max-num-batched-tokens parameters in below script +bash run_llama2_7b_chat_hf_arc_2_card.sh +``` + +If you successfully run the serving, you can get output like this: + +```bash +[0] INFO: Started server process [120071] +[0] INFO: Waiting for application startup. +[0] INFO: Application startup complete. +[0] INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) +``` + +> **Note**: You could change `NUM_GPUS` to the number of GPUs you have on your machine. And you could also specify other low bit optimizations through `--low-bit`. + +### 3. Sample Input and Output + +We can use `curl` to test serving api + +```bash +# Set http_proxy and https_proxy to null to ensure that requests are not forwarded by a proxy. +export http_proxy= +export https_proxy= + +curl -X 'POST' \ + 'http://127.0.0.1:8000/generate/' \ + -H 'accept: application/json' \ + -H 'Content-Type: application/json' \ + -d '{ + "prompt": "What is AI?", + "n_predict": 32 +}' +``` + +And you should get output like this: + +```json +{ + "generated_text": "What is AI? Artificial intelligence (AI) refers to the development of computer systems able to perform tasks that would normally require human intelligence, such as visual perception, speech", + "generate_time": "0.45149803161621094s" +} + +``` + +**Important**: The first token latency is much larger than rest token latency, you could use [our benchmark tool](https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/dev/benchmark/README.md) to obtain more details about first and rest token latency. + +### 4. Benchmark with wrk + +We use wrk for testing end-to-end throughput, check [here](https://github.com/wg/wrk). + +You can install by: +```bash +sudo apt install wrk +``` + +Please change the test url accordingly. + +```bash +# set t/c to the number of concurrencies to test full throughput. +wrk -t1 -c1 -d5m -s ./wrk_script_1024.lua http://127.0.0.1:8000/generate/ --timeout 1m +``` \ No newline at end of file diff --git a/docs/mddocs/Quickstart/dify_quickstart.md b/docs/mddocs/Quickstart/dify_quickstart.md new file mode 100644 index 00000000..97e4ae2d --- /dev/null +++ b/docs/mddocs/Quickstart/dify_quickstart.md @@ -0,0 +1,150 @@ +# Run Dify on Intel GPU + + +[**Dify**](https://dify.ai/) is an open-source production-ready LLM app development platform; by integrating it with [`ipex-llm`](https://github.com/intel-analytics/ipex-llm), users can now easily leverage local LLMs running on Intel GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max) for building complex AI workflows (e.g. RAG). + + +*See the demo of a RAG workflow in Dify running LLaMA2-7B on Intel A770 GPU below.* + + + + +## Quickstart + +### 1. Install and Start `Ollama` Service on Intel GPU + +Follow the steps in [Run Ollama on Intel GPU Guide](./ollama_quickstart.md) to install and run Ollama on Intel GPU. 
Ensure that `ollama serve` is running correctly and can be accessed through a local URL (e.g., `https://127.0.0.1:11434`) or a remote URL (e.g., `http://your_ip:11434`). + +We recommend pulling the desired model before proceeding with Dify. For instance, to pull the LLaMA2-7B model, you can use the following command: + +```bash +ollama pull llama2:7b +``` + +### 2. Install and Start `Dify` + + +#### 2.1 Download `Dify` + +You can either clone the repository or download the source zip from [github](https://github.com/langgenius/dify/archive/refs/heads/main.zip): +```bash +git clone https://github.com/langgenius/dify.git +``` + +#### 2.2 Setup Redis and PostgreSQL + +Next, deploy PostgreSQL and Redis. You can choose to utilize Docker, following the steps in the [Local Source Code Start Guide](https://docs.dify.ai/getting-started/install-self-hosted/local-source-code#clone-dify), or proceed without Docker using the following instructions: + + +- Install Redis by executing `sudo apt-get install redis-server`. Refer to [this guide](https://www.hostinger.com/tutorials/how-to-install-and-setup-redis-on-ubuntu/) for Redis environment setup, including password configuration and daemon settings. + +- Install PostgreSQL by following either [the Official PostgreSQL Tutorial](https://www.postgresql.org/docs/current/tutorial.html) or [a PostgreSQL Quickstart Guide](https://www.digitalocean.com/community/tutorials/how-to-install-postgresql-on-ubuntu-20-04-quickstart). After installation, proceed with the following PostgreSQL commands for setting up Dify. These commands create a username/password for Dify (e.g., `dify_user`, change `'your_password'` as desired), create a new database named `dify`, and grant privileges: + ```sql + CREATE USER dify_user WITH PASSWORD 'your_password'; + CREATE DATABASE dify; + GRANT ALL PRIVILEGES ON DATABASE dify TO dify_user; + ``` + +Configure Redis and PostgreSQL settings in the `.env` file located under dify source folder `dify/api/`: + +```bash dify/api/.env +### Example dify/api/.env +## Redis settings +REDIS_HOST=localhost +REDIS_PORT=6379 +REDIS_USERNAME=your_redis_user_name # change if needed +REDIS_PASSWORD=your_redis_password # change if needed +REDIS_DB=0 + +## postgreSQL settings +DB_USERNAME=dify_user # change if needed +DB_PASSWORD=your_dify_password # change if needed +DB_HOST=localhost +DB_PORT=5432 +DB_DATABASE=dify # change if needed +``` + +#### 2.3 Server Deployment + +Follow the steps in the [`Server Deployment` section in Local Source Code Start Guide](https://docs.dify.ai/getting-started/install-self-hosted/local-source-code#server-deployment) to deploy and start the Dify Server. + +Upon successful deployment, you will see logs in the terminal similar to the following: + + +```bash +INFO:werkzeug: +* Running on all addresses (0.0.0.0) +* Running on http://127.0.0.1:5001 +* Running on http://10.239.44.83:5001 +INFO:werkzeug:Press CTRL+C to quit +INFO:werkzeug: * Restarting with stat +WARNING:werkzeug: * Debugger is active! +INFO:werkzeug: * Debugger PIN: 227-697-894 +``` + + +#### 2.4 Deploy the frontend page + +Refer to the instructions provided in the [`Deploy the frontend page` section in Local Source Code Start Guide](https://docs.dify.ai/getting-started/install-self-hosted/local-source-code#deploy-the-frontend-page) to deploy the frontend page. 
+ +Below is an example of environment variable configuration found in `dify/web/.env.local`: + + +```bash +# For production release, change this to PRODUCTION +NEXT_PUBLIC_DEPLOY_ENV=DEVELOPMENT +NEXT_PUBLIC_EDITION=SELF_HOSTED +NEXT_PUBLIC_API_PREFIX=http://localhost:5001/console/api +NEXT_PUBLIC_PUBLIC_API_PREFIX=http://localhost:5001/api +NEXT_PUBLIC_SENTRY_DSN= +``` + +```eval_rst + +.. note:: + + If you encounter connection problems, you may run `export no_proxy=localhost,127.0.0.1` before starting API servcie, Worker service and frontend. + +``` + + +### 3. How to Use `Dify` + +For comprehensive usage instructions of Dify, please refer to the [Dify Documentation](https://docs.dify.ai/). In this section, we'll only highlight a few key steps for local LLM setup. + + +#### Setup Ollama + +Open your browser and access the Dify UI at `http://localhost:3000`. + + +Configure the Ollama URL in `Settings > Model Providers > Ollama`. For detailed instructions on how to do this, see the [Ollama Guide in the Dify Documentation](https://docs.dify.ai/tutorials/model-configuration/ollama). + + +

+*(screenshot: Ollama model provider settings in Dify)*
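+
+Before saving the provider settings, you can confirm that the Ollama endpoint you entered is reachable from the machine running Dify. A quick check (sketch, assuming the default local URL) is:
+
+```bash
+# should return a JSON list of the locally pulled models, e.g. llama2:7b
+curl http://localhost:11434/api/tags
+```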

+ +Once Ollama is successfully connected, you will see a list of Ollama models similar to the following: +

+*(screenshot: list of Ollama models detected by Dify)*

+ + + +#### Run a simple RAG + +- Select the text summarization workflow template from the studio. +

+*(screenshot: selecting the text summarization workflow template)*

+ +- Add a knowledge base and specify the LLM or embedding model to use. +

+*(screenshot: knowledge base and model settings)*

+ +- Enter your input in the workflow and execute it. You'll find retrieval results and generated answers on the right. +

+*(screenshot: workflow execution with retrieval results and generated answer)*

+ + diff --git a/docs/mddocs/Quickstart/fastchat_quickstart.md b/docs/mddocs/Quickstart/fastchat_quickstart.md new file mode 100644 index 00000000..b154026d --- /dev/null +++ b/docs/mddocs/Quickstart/fastchat_quickstart.md @@ -0,0 +1,421 @@ +# Serving using IPEX-LLM and FastChat + +FastChat is an open platform for training, serving, and evaluating large language model based chatbots. You can find the detailed information at their [homepage](https://github.com/lm-sys/FastChat). + +IPEX-LLM can be easily integrated into FastChat so that user can use `IPEX-LLM` as a serving backend in the deployment. + +## Quick Start + +This quickstart guide walks you through installing and running `FastChat` with `ipex-llm`. + +## 1. Install IPEX-LLM with FastChat + +To run on CPU, you can install ipex-llm as follows: + +```bash +pip install --pre --upgrade ipex-llm[serving,all] +``` + +To add GPU support for FastChat, you may install **`ipex-llm`** as follows: + +```bash +pip install --pre --upgrade ipex-llm[xpu,serving] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ + +``` + +## 2. Start the service + +### Launch controller + +You need first run the fastchat controller + +```bash +python3 -m fastchat.serve.controller +``` + +If the controller run successfully, you can see the output like this: + +```bash +Uvicorn running on http://localhost:21001 +``` + +### Launch model worker(s) and load models + +Using IPEX-LLM in FastChat does not impose any new limitations on model usage. Therefore, all Hugging Face Transformer models can be utilized in FastChat. + +#### IPEX-LLM worker + +To integrate IPEX-LLM with `FastChat` efficiently, we have provided a new model_worker implementation named `ipex_llm_worker.py`. + +```bash +# On CPU +# Available low_bit format including sym_int4, sym_int8, bf16 etc. +python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "cpu" + +# On GPU +# Available low_bit format including sym_int4, sym_int8, fp16 etc. +source /opt/intel/oneapi/setvars.sh +export USE_XETLA=OFF +export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 + +python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "xpu" +``` + +We have also provided an option `--load-low-bit-model` to load models that have been converted and saved into disk using the `save_low_bit` interface as introduced in this [document](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/KeyFeatures/hugging_face_format.html#save-load). + +Check the following examples: + +```bash +# Or --device "cpu" +python -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path /Low/Bit/Model/Path --trust-remote-code --device "xpu" --load-low-bit-model +``` + +#### For self-speculative decoding example: + +You can use IPEX-LLM to run `self-speculative decoding` example. Refer to [here](https://github.com/intel-analytics/ipex-llm/tree/c9fac8c26bf1e1e8f7376fa9a62b32951dd9e85d/python/llm/example/GPU/Speculative-Decoding) for more details on intel MAX GPUs. Refer to [here](https://github.com/intel-analytics/ipex-llm/tree/c9fac8c26bf1e1e8f7376fa9a62b32951dd9e85d/python/llm/example/GPU/Speculative-Decoding) for more details on intel CPUs. + +```bash +# Available low_bit format only including bf16 on CPU. 
+source ipex-llm-init -t +python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path lmsys/vicuna-7b-v1.5 --low-bit "bf16" --trust-remote-code --device "cpu" --speculative + +# Available low_bit format only including fp16 on GPU. +source /opt/intel/oneapi/setvars.sh +export ENABLE_SDP_FUSION=1 +export SYCL_CACHE_PERSISTENT=1 +export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 +python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path lmsys/vicuna-7b-v1.5 --low-bit "fp16" --trust-remote-code --device "xpu" --speculative +``` + +You can get output like this: + +```bash +2024-04-12 18:18:09 | INFO | ipex_llm.transformers.utils | Converting the current model to sym_int4 format...... +2024-04-12 18:18:11 | INFO | model_worker | Register to controller +2024-04-12 18:18:11 | ERROR | stderr | INFO: Started server process [126133] +2024-04-12 18:18:11 | ERROR | stderr | INFO: Waiting for application startup. +2024-04-12 18:18:11 | ERROR | stderr | INFO: Application startup complete. +2024-04-12 18:18:11 | ERROR | stderr | INFO: Uvicorn running on http://localhost:21002 +``` + +For a full list of accepted arguments, you can refer to the main method of the `ipex_llm_worker.py` + +#### IPEX-LLM vLLM worker + +We also provide the `vllm_worker` which uses the [vLLM](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/vLLM-Serving) engine for better hardware utilization. + +To run using the `vLLM_worker`, we don't need to change model name, just simply uses the following command: + +```bash +# On CPU +python3 -m ipex_llm.serving.fastchat.vllm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --device cpu + +# On GPU +source /opt/intel/oneapi/setvars.sh +export USE_XETLA=OFF +export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 +python3 -m ipex_llm.serving.fastchat.vllm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --device xpu --load-in-low-bit "sym_int4" --enforce-eager +``` + +#### Launch multiple workers + +Sometimes we may want to start multiple workers for the best performance. For running in CPU, you may want to seperate multiple workers in different sockets. Assuming each socket have 48 physicall cores, then you may want to start two workers using the following example: + +```bash +export OMP_NUM_THREADS=48 +numactl -C 0-47 -m 0 python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "cpu" & + +# All the workers other than the first worker need to specify a different worker port and corresponding worker-address +numactl -C 48-95 -m 1 python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "cpu" --port 21003 --worker-address "http://localhost:21003" & +``` + +For GPU, we may want to start two workers using different GPUs. To achieve this, you should use `ZE_AFFINITY_MASK` environment variable to select different GPUs for different workers. 
Below shows an example: + +```bash +ZE_AFFINITY_MASK=1 python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "xpu" & + +# All the workers other than the first worker need to specify a different worker port and corresponding worker-address +ZE_AFFINITY_MASK=2 python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "xpu" --port 21003 --worker-address "http://localhost:21003" & +``` + +If you are not sure the effect of `ZE_AFFINITY_MASK`, then you could set `ZE_AFFINITY_MASK` and check the result of `sycl-ls`. + +### Launch Gradio web server + +When you have started the controller and the worker, you can start web server as follows: + +```bash +python3 -m fastchat.serve.gradio_web_server +``` + +This is the user interface that users will interact with. + + + + + +By following these steps, you will be able to serve your models using the web UI with IPEX-LLM as the backend. You can open your browser and chat with a model now. + +### Launch TGI Style API server + +When you have started the controller and the worker, you can start TGI Style API server as follows: + +```bash +python3 -m ipex_llm.serving.fastchat.tgi_api_server --host localhost --port 8000 +``` +You can use `curl` for observing the output of the api + +#### Using /generate API + +This is to send a sentence as inputs in the request, and is expected to receive a response containing model-generated answer. + +```bash +curl -X POST -H "Content-Type: application/json" -d '{ + "inputs": "What is AI?", + "parameters": { + "best_of": 1, + "decoder_input_details": true, + "details": true, + "do_sample": true, + "frequency_penalty": 0.1, + "grammar": { + "type": "json", + "value": "string" + }, + "max_new_tokens": 32, + "repetition_penalty": 1.03, + "return_full_text": false, + "seed": 0.1, + "stop": [ + "photographer" + ], + "temperature": 0.5, + "top_k": 10, + "top_n_tokens": 5, + "top_p": 0.95, + "truncate": true, + "typical_p": 0.95, + "watermark": true + } +}' http://localhost:8000/generate +``` + +Sample output: +```bash +{ + "details": { + "best_of_sequences": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "\nArtificial Intelligence (AI) is a branch of computer science that attempts to simulate the way that the human brain works. It is a branch of computer " + }, + "finish_reason": "length", + "generated_text": "\nArtificial Intelligence (AI) is a branch of computer science that attempts to simulate the way that the human brain works. It is a branch of computer ", + "generated_tokens": 31 + } + ] + }, + "generated_text": "\nArtificial Intelligence (AI) is a branch of computer science that attempts to simulate the way that the human brain works. It is a branch of computer ", + "usage": { + "prompt_tokens": 4, + "total_tokens": 35, + "completion_tokens": 31 + } +} +``` + +#### Using /generate_stream API + +This is to send a sentence as inputs in the request, and a long connection will be opened to continuously receive multiple responses containing model-generated answer. 
+ +```bash +curl -X POST -H "Content-Type: application/json" -d '{ + "inputs": "What is AI?", + "parameters": { + "best_of": 1, + "decoder_input_details": true, + "details": true, + "do_sample": true, + "frequency_penalty": 0.1, + "grammar": { + "type": "json", + "value": "string" + }, + "max_new_tokens": 32, + "repetition_penalty": 1.03, + "return_full_text": false, + "seed": 0.1, + "stop": [ + "photographer" + ], + "temperature": 0.5, + "top_k": 10, + "top_n_tokens": 5, + "top_p": 0.95, + "truncate": true, + "typical_p": 0.95, + "watermark": true + } +}' http://localhost:8000/generate_stream +``` + +Sample output: +```bash +data: {"token": {"id": 663359, "text": "", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null} + +data: {"token": {"id": 300560, "text": "\n", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null} + +data: {"token": {"id": 725120, "text": "Artificial Intelligence ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null} + +data: {"token": {"id": 734609, "text": "(AI) is ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null} + +data: {"token": {"id": 362235, "text": "a branch of computer ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null} + +data: {"token": {"id": 380983, "text": "science that attempts to ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null} + +data: {"token": {"id": 249979, "text": "simulate the way that ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null} + +data: {"token": {"id": 972663, "text": "the human brain ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null} + +data: {"token": {"id": 793301, "text": "works. It is a ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null} + +data: {"token": {"id": 501380, "text": "branch of computer ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null} + +data: {"token": {"id": 673232, "text": "", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null} + +data: {"token": {"id": 2, "text": "
", "logprob": 0.0, "special": true}, "generated_text": "\nArtificial Intelligence (AI) is a branch of computer science that attempts to simulate the way that the human brain works. It is a branch of computer ", "details": {"finish_reason": "eos_token", "generated_tokens": 31, "prefill_tokens": 4, "seed": 2023}, "special_ret": {"tensor": []}} +``` + + +### Launch RESTful API server + +To start an OpenAI API server that provides compatible APIs using IPEX-LLM backend, you can launch the `openai_api_server` and follow this [doc](https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md) to use it. + +When you have started the controller and the worker, you can start RESTful API server as follows: + +```bash +python3 -m fastchat.serve.openai_api_server --host localhost --port 8000 +``` + +You can use `curl` for observing the output of the api + +You can format the output using `jq` + +#### List Models + +```bash +curl http://localhost:8000/v1/models | jq +``` + +Example output + +```json + +{ + "object": "list", + "data": [ + { + "id": "Llama-2-7b-chat-hf", + "object": "model", + "created": 1712919071, + "owned_by": "fastchat", + "root": "Llama-2-7b-chat-hf", + "parent": null, + "permission": [ + { + "id": "modelperm-XpFyEE7Sewx4XYbEcdbCVz", + "object": "model_permission", + "created": 1712919071, + "allow_create_engine": false, + "allow_sampling": true, + "allow_logprobs": true, + "allow_search_indices": true, + "allow_view": true, + "allow_fine_tuning": false, + "organization": "*", + "group": null, + "is_blocking": false + } + ] + } + ] +} +``` + +#### Chat Completions + +```bash +curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "Llama-2-7b-chat-hf", + "messages": [{"role": "user", "content": "Hello! What is your name?"}] + }' | jq +``` + +Example output + +```json +{ + "id": "chatcmpl-jJ9vKSGkcDMTxKfLxK7q2x", + "object": "chat.completion", + "created": 1712919092, + "model": "Llama-2-7b-chat-hf", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": " Hello! My name is LLaMA, I'm a large language model trained by a team of researcher at Meta AI. Unterscheidung. 😊" + }, + "finish_reason": "stop" + } + ], + "usage": { + "prompt_tokens": 15, + "total_tokens": 53, + "completion_tokens": 38 + } +} + +``` + +#### Text Completions + +```bash +curl http://localhost:8000/v1/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "Llama-2-7b-chat-hf", + "prompt": "Once upon a time", + "max_tokens": 41, + "temperature": 0.5 + }' | jq +``` + +Example Output: + +```json +{ + "id": "cmpl-PsAkpTWMmBLzWCTtM4r97Y", + "object": "text_completion", + "created": 1712919307, + "model": "Llama-2-7b-chat-hf", + "choices": [ + { + "index": 0, + "text": ", in a far-off land, there was a magical kingdom called \"Happily Ever Laughter.\" It was a place where laughter was the key to happiness, and everyone who ", + "logprobs": null, + "finish_reason": "length" + } + ], + "usage": { + "prompt_tokens": 5, + "total_tokens": 45, + "completion_tokens": 40 + } +} + +``` diff --git a/docs/mddocs/Quickstart/index.rst b/docs/mddocs/Quickstart/index.rst new file mode 100644 index 00000000..2e82acde --- /dev/null +++ b/docs/mddocs/Quickstart/index.rst @@ -0,0 +1,33 @@ +IPEX-LLM Quickstart +================================ + +.. note:: + + We are adding more Quickstart guide. 
+ +This section includes efficient guide to show you how to: + + +* |bigdl_llm_migration_guide|_ +* `Install IPEX-LLM on Linux with Intel GPU <./install_linux_gpu.html>`_ +* `Install IPEX-LLM on Windows with Intel GPU <./install_windows_gpu.html>`_ +* `Install IPEX-LLM in Docker on Windows with Intel GPU <./docker_windows_gpu.html>`_ +* `Run PyTorch Inference on Intel GPU using Docker (on Linux or WSL) <./docker_benchmark_quickstart.html>`_ +* `Run Performance Benchmarking with IPEX-LLM <./benchmark_quickstart.html>`_ +* `Run Local RAG using Langchain-Chatchat on Intel GPU <./chatchat_quickstart.html>`_ +* `Run Text Generation WebUI on Intel GPU <./webui_quickstart.html>`_ +* `Run Open WebUI on Intel GPU <./open_webui_with_ollama_quickstart.html>`_ +* `Run PrivateGPT with IPEX-LLM on Intel GPU <./privateGPT_quickstart.html>`_ +* `Run Coding Copilot (Continue) in VSCode with Intel GPU <./continue_quickstart.html>`_ +* `Run Dify on Intel GPU <./dify_quickstart.html>`_ +* `Run llama.cpp with IPEX-LLM on Intel GPU <./llama_cpp_quickstart.html>`_ +* `Run Ollama with IPEX-LLM on Intel GPU <./ollama_quickstart.html>`_ +* `Run Llama 3 on Intel GPU using llama.cpp and ollama with IPEX-LLM <./llama3_llamacpp_ollama_quickstart.html>`_ +* `Run IPEX-LLM Serving with FastChat <./fastchat_quickstart.html>`_ +* `Run IPEX-LLM Serving with vLLM on Intel GPU <./vLLM_quickstart.html>`_ +* `Finetune LLM with Axolotl on Intel GPU <./axolotl_quickstart.html>`_ +* `Run IPEX-LLM serving on Multiple Intel GPUs using DeepSpeed AutoTP and FastApi <./deepspeed_autotp_fastapi_quickstart.html>`_ + + +.. |bigdl_llm_migration_guide| replace:: ``bigdl-llm`` Migration Guide +.. _bigdl_llm_migration_guide: bigdl_llm_migration.html diff --git a/docs/mddocs/Quickstart/install_linux_gpu.md b/docs/mddocs/Quickstart/install_linux_gpu.md new file mode 100644 index 00000000..47d8f4a3 --- /dev/null +++ b/docs/mddocs/Quickstart/install_linux_gpu.md @@ -0,0 +1,313 @@ +# Install IPEX-LLM on Linux with Intel GPU + +This guide demonstrates how to install IPEX-LLM on Linux with Intel GPUs. It applies to Intel Data Center GPU Flex Series and Max Series, as well as Intel Arc Series GPU. + +IPEX-LLM currently supports the Ubuntu 20.04 operating system and later, and supports PyTorch 2.0 and PyTorch 2.1 on Linux. This page demonstrates IPEX-LLM with PyTorch 2.1. Check the [Installation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#linux) page for more details. 
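+
+Before you start, you may want to confirm your Ubuntu release and the kernel you are running, since the GPU driver steps below differ between Linux kernel 6.2 and kernel 6.5. A quick check from a terminal:
+
+```bash
+# Print the Ubuntu release and the running kernel version
+lsb_release -a
+uname -r
+```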
+ +## Install Prerequisites + +### Install GPU Driver + +#### For Linux kernel 6.2 + +* Install wget, gpg-agent + ```bash + sudo apt-get install -y gpg-agent wget + wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | \ + sudo gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg + echo "deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy client" | \ + sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list + ``` + + + +* Install drivers + + ```bash + sudo apt-get update + sudo apt-get -y install \ + gawk \ + dkms \ + linux-headers-$(uname -r) \ + libc6-dev + sudo apt install intel-i915-dkms intel-fw-gpu + sudo apt-get install -y gawk libc6-dev udev\ + intel-opencl-icd intel-level-zero-gpu level-zero \ + intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \ + libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \ + libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \ + mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo + + sudo reboot + ``` + + + + + + +* Configure permissions + ```bash + sudo gpasswd -a ${USER} render + newgrp render + + # Verify the device is working with i915 driver + sudo apt-get install -y hwinfo + hwinfo --display + ``` + +#### For Linux kernel 6.5 + +* Install wget, gpg-agent + ```bash + sudo apt-get install -y gpg-agent wget + wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | \ + sudo gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg + echo "deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy client" | \ + sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list + ``` + + + +* Install drivers + + ```bash + sudo apt-get update + sudo apt-get -y install \ + gawk \ + dkms \ + linux-headers-$(uname -r) \ + libc6-dev + + sudo apt-get install -y gawk libc6-dev udev\ + intel-opencl-icd intel-level-zero-gpu level-zero \ + intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \ + libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \ + libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \ + mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo + + sudo apt install -y intel-i915-dkms intel-fw-gpu + + sudo reboot + ``` + + + + +#### (Optional) Update Level Zero on Intel Core™ Ultra iGPU +For Intel Core™ Ultra integrated GPU, please make sure level_zero version >= 1.3.28717. The level_zero version can be checked with `sycl-ls`, and verison will be tagged behind `[ext_oneapi_level_zero:gpu]`. 
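+
+Note that `sycl-ls` is typically available only after the oneAPI environment has been sourced (oneAPI is installed in the *Install oneAPI* section below), so a check along these lines is assumed:
+
+```bash
+# Make sycl-ls available in the current shell, then list SYCL devices
+source /opt/intel/oneapi/setvars.sh
+sycl-ls
+```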
+ +Here are the sample output of `sycl-ls`: +``` +[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.12.0.12_195853.xmain-hotfix] +[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 5 125H OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix] +[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) Graphics OpenCL 3.0 NEO [24.09.28717.12] +[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) Graphics 1.3 [1.3.28717] +``` + +If you have level_zero version < 1.3.28717, you could update as follows: +```bash +wget https://github.com/intel/intel-graphics-compiler/releases/download/igc-1.0.16238.4/intel-igc-core_1.0.16238.4_amd64.deb +wget https://github.com/intel/intel-graphics-compiler/releases/download/igc-1.0.16238.4/intel-igc-opencl_1.0.16238.4_amd64.deb +wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/intel-level-zero-gpu-dbgsym_1.3.28717.12_amd64.ddeb +wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/intel-level-zero-gpu_1.3.28717.12_amd64.deb +wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/intel-opencl-icd-dbgsym_24.09.28717.12_amd64.ddeb +wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/intel-opencl-icd_24.09.28717.12_amd64.deb +wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/libigdgmm12_22.3.17_amd64.deb +sudo dpkg -i *.deb +``` + +### Install oneAPI + ``` + wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null + + echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list + + sudo apt update + + sudo apt install intel-oneapi-common-vars=2024.0.0-49406 \ + intel-oneapi-common-oneapi-vars=2024.0.0-49406 \ + intel-oneapi-diagnostics-utility=2024.0.0-49093 \ + intel-oneapi-compiler-dpcpp-cpp=2024.0.2-49895 \ + intel-oneapi-dpcpp-ct=2024.0.0-49381 \ + intel-oneapi-mkl=2024.0.0-49656 \ + intel-oneapi-mkl-devel=2024.0.0-49656 \ + intel-oneapi-mpi=2021.11.0-49493 \ + intel-oneapi-mpi-devel=2021.11.0-49493 \ + intel-oneapi-dal=2024.0.1-25 \ + intel-oneapi-dal-devel=2024.0.1-25 \ + intel-oneapi-ippcp=2021.9.1-5 \ + intel-oneapi-ippcp-devel=2021.9.1-5 \ + intel-oneapi-ipp=2021.10.1-13 \ + intel-oneapi-ipp-devel=2021.10.1-13 \ + intel-oneapi-tlt=2024.0.0-352 \ + intel-oneapi-ccl=2021.11.2-5 \ + intel-oneapi-ccl-devel=2021.11.2-5 \ + intel-oneapi-dnnl-devel=2024.0.0-49521 \ + intel-oneapi-dnnl=2024.0.0-49521 \ + intel-oneapi-tcm-1.0=1.0.0-435 + ``` + image-20240221102252565 + + image-20240221102252565 + +### Setup Python Environment + +Download and install the Miniforge as follows if you don't have conda installed on your machine: + ```bash + wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh + bash Miniforge3-Linux-x86_64.sh + source ~/.bashrc + ``` + +You can use `conda --version` to verify you conda installation. + +After installation, create a new python environment `llm`: +```cmd +conda create -n llm python=3.11 +``` +Activate the newly created environment `llm`: +```cmd +conda activate llm +``` + + +## Install `ipex-llm` + +With the `llm` environment active, use `pip` to install `ipex-llm` for GPU. +Choose either US or CN website for `extra-index-url`: + +```eval_rst +.. tabs:: + .. 
tab:: US + + .. code-block:: cmd + + pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ + + .. tab:: CN + + .. code-block:: cmd + + pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/ +``` + +```eval_rst +.. note:: + + If you encounter network issues while installing IPEX, refer to `this guide `_ for troubleshooting advice. +``` + +## Verify Installation +* You can verify if `ipex-llm` is successfully installed by simply importing a few classes from the library. For example, execute the following import command in the terminal: + ```bash + source /opt/intel/oneapi/setvars.sh + + python + + > from ipex_llm.transformers import AutoModel, AutoModelForCausalLM + ``` + +## Runtime Configurations + +To use GPU acceleration on Linux, several environment variables are required or recommended before running a GPU example. + +```eval_rst +.. tabs:: + .. tab:: Intel Arc™ A-Series and Intel Data Center GPU Flex + + For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series, we recommend: + + .. code-block:: bash + + # Configure oneAPI environment variables. Required step for APT or offline installed oneAPI. + # Skip this step for PIP-installed oneAPI since the environment has already been configured in LD_LIBRARY_PATH. + source /opt/intel/oneapi/setvars.sh + + # Recommended Environment Variables for optimal performance + export USE_XETLA=OFF + export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 + export SYCL_CACHE_PERSISTENT=1 + + .. tab:: Intel Data Center GPU Max + + For Intel Data Center GPU Max Series, we recommend: + + .. code-block:: bash + + # Configure oneAPI environment variables. Required step for APT or offline installed oneAPI. + # Skip this step for PIP-installed oneAPI since the environment has already been configured in LD_LIBRARY_PATH. + source /opt/intel/oneapi/setvars.sh + + # Recommended Environment Variables for optimal performance + export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so + export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 + export SYCL_CACHE_PERSISTENT=1 + export ENABLE_SDP_FUSION=1 + + Please note that ``libtcmalloc.so`` can be installed by ``conda install -c conda-forge -y gperftools=2.10`` + +``` + + ```eval_rst + .. seealso:: + + Please refer to `this guide <../Overview/install_gpu.html#id5>`_ for more details regarding runtime configuration. + ``` + +## A Quick Example + +Now let's play with a real LLM. We'll be using the [phi-1.5](https://huggingface.co/microsoft/phi-1_5) model, a 1.3 billion parameter LLM for this demostration. Follow the steps below to setup and run the model, and observe how it responds to a prompt "What is AI?". + +* Step 1: Activate the Python environment `llm` you previously created: + ```bash + conda activate llm + ``` +* Step 2: Follow [Runtime Configurations Section](#runtime-configurations) above to prepare your runtime environment. +* Step 3: Create a new file named `demo.py` and insert the code snippet below. 
+  ```python
+  # Copy/Paste the contents to a new file demo.py
+  import torch
+  from ipex_llm.transformers import AutoModelForCausalLM
+  from transformers import AutoTokenizer, GenerationConfig
+  generation_config = GenerationConfig(use_cache=True)
+
+  tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
+  # Load the phi-1.5 model with ipex-llm 4-bit optimization and move it to the GPU
+  model = AutoModelForCausalLM.from_pretrained(
+      "microsoft/phi-1_5", load_in_4bit=True, cpu_embedding=True, trust_remote_code=True)
+  model = model.to('xpu')
+
+  # Format the prompt
+  question = "What is AI?"
+  prompt = " Question:{prompt}\n\n Answer:".format(prompt=question)
+  # Generate predicted tokens
+  with torch.inference_mode():
+      input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
+      # Warm up one more time before the actual generation task for the first run; see details in `Tips & Troubleshooting`
+      # output = model.generate(input_ids, do_sample=False, max_new_tokens=32, generation_config=generation_config)
+      output = model.generate(input_ids, do_sample=False, max_new_tokens=32, generation_config=generation_config).cpu()
+      output_str = tokenizer.decode(output[0], skip_special_tokens=True)
+      print(output_str)
+  ```
+  > Note: when running LLMs on Intel iGPUs with limited memory size, we recommend setting `cpu_embedding=True` in the `from_pretrained` function.
+  > This will allow the memory-intensive embedding layer to utilize the CPU instead of the GPU.
+
+* Step 4: Run `demo.py` within the activated Python environment using the following command:
+  ```bash
+  python demo.py
+  ```
+
+  ### Example output
+
+  Example output on a system equipped with an 11th Gen Intel Core i7 CPU and Iris Xe Graphics iGPU:
+  ```
+  Question:What is AI?
+  Answer: AI stands for Artificial Intelligence, which is the simulation of human intelligence in machines.
+  ```
+
+## Tips & Troubleshooting
+
+### Warm-up for optimal performance on first run
+When running LLMs on GPU for the first time, you might notice that the performance is lower than expected, with delays of up to several minutes before the first token is generated. This delay occurs because the GPU kernels require compilation and initialization, which varies across different GPU types. To achieve optimal and consistent performance, we recommend a one-time warm-up by running `model.generate(...)` an additional time before starting your actual generation tasks. If you're developing an application, you can incorporate this warm-up step into your start-up or loading routine to enhance the user experience.
+
diff --git a/docs/mddocs/Quickstart/install_windows_gpu.md b/docs/mddocs/Quickstart/install_windows_gpu.md
new file mode 100644
index 00000000..fe94002f
--- /dev/null
+++ b/docs/mddocs/Quickstart/install_windows_gpu.md
@@ -0,0 +1,305 @@
+# Install IPEX-LLM on Windows with Intel GPU
+
+This guide demonstrates how to install IPEX-LLM on Windows with Intel GPUs.
+
+It applies to Intel Core Ultra and 11th to 14th Gen Intel Core integrated GPUs (iGPUs), as well as Intel Arc Series GPUs.
+
+## Install Prerequisites
+
+### (Optional) Update GPU Driver
+
+```eval_rst
+.. tip::
+
+   It is recommended to update your GPU driver if your driver version is lower than ``31.0.101.5122``. Refer to `here <../Overview/install_gpu.html#prerequisites>`_ for more information.
+```
+
+Download and install the latest GPU driver from the [official Intel download page](https://www.intel.com/content/www/us/en/download/785597/intel-arc-iris-xe-graphics-windows.html). A system reboot is necessary to apply the changes after the installation is complete.
+
+```eval_rst
+.. note::
+
+   The process could take around 10 minutes. After reboot, check for the **Intel Arc Control** application to verify that the driver has been installed correctly. If the installation was successful, you should see an **Arc Control** interface similar to the figure below.
+```
+
+### Setup Python Environment
+
+Visit the [Miniforge installation page](https://conda-forge.org/download/), download the **Miniforge installer for Windows**, and follow the instructions to complete the installation.
+
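+To double-check that Miniforge is installed correctly, you can open the **Miniforge Prompt** and query the conda version (the same check the Linux guide suggests):
+
+```cmd
+conda --version
+```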
+ +After installation, open the **Miniforge Prompt**, create a new python environment `llm`: +```cmd +conda create -n llm python=3.11 libuv +``` +Activate the newly created environment `llm`: +```cmd +conda activate llm +``` + +## Install `ipex-llm` + +With the `llm` environment active, use `pip` to install `ipex-llm` for GPU. Choose either US or CN website for `extra-index-url`: + +```eval_rst +.. tabs:: + .. tab:: US + + .. code-block:: cmd + + pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ + + .. tab:: CN + + .. code-block:: cmd + + pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/ +``` + +```eval_rst +.. note:: + + If you encounter network issues while installing IPEX, refer to `this guide `_ for troubleshooting advice. +``` + +## Verify Installation +You can verify if `ipex-llm` is successfully installed following below steps. + +### Step 1: Runtime Configurations +* Open the **Miniforge Prompt** and activate the Python environment `llm` you previously created: + ```cmd + conda activate llm + ``` + +* Set the following environment variables according to your device: + + ```eval_rst + .. tabs:: + .. tab:: Intel iGPU + + .. code-block:: cmd + + set SYCL_CACHE_PERSISTENT=1 + set BIGDL_LLM_XMX_DISABLED=1 + + .. tab:: Intel Arc™ A770 + + .. code-block:: cmd + + set SYCL_CACHE_PERSISTENT=1 + ``` + + ```eval_rst + .. seealso:: + + For other Intel dGPU Series, please refer to `this guide <../Overview/install_gpu.html#runtime-configuration>`_ for more details regarding runtime configuration. + ``` + +### Step 2: Run Python Code + +* Launch the Python interactive shell by typing `python` in the Miniforge Prompt window and then press Enter. + +* Copy following code to Miniforge Prompt **line by line** and press Enter **after copying each line**. + ```python + import torch + from ipex_llm.transformers import AutoModel,AutoModelForCausalLM + tensor_1 = torch.randn(1, 1, 40, 128).to('xpu') + tensor_2 = torch.randn(1, 1, 128, 40).to('xpu') + print(torch.matmul(tensor_1, tensor_2).size()) + ``` + It will output following content at the end: + ``` + torch.Size([1, 1, 40, 40]) + ``` + + ```eval_rst + .. seealso:: + + If you encounter any problem, please refer to `here `_ for help. + ``` +* To exit the Python interactive shell, simply press Ctrl+Z then press Enter (or input `exit()` then press Enter). + +## Monitor GPU Status +To monitor your GPU's performance and status (e.g. memory consumption, utilization, etc.), you can use either the **Windows Task Manager (in `Performance` Tab)** (see the left side of the figure below) or the **Arc Control** application (see the right side of the figure below) + + + +## A Quick Example + +Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://huggingface.co/Qwen/Qwen-1_8B-Chat) model, a 1.8 billion parameter LLM for this demonstration. Follow the steps below to setup and run the model, and observe how it responds to a prompt "What is AI?". + +* Step 1: Follow [Runtime Configurations Section](#step-1-runtime-configurations) above to prepare your runtime environment. +* Step 2: Install additional package required for Qwen-1.8B-Chat to conduct: + ```cmd + pip install tiktoken transformers_stream_generator einops + ``` +* Step 3: Create code file. IPEX-LLM supports loading model from Hugging Face or ModelScope. Please choose according to your requirements. + ```eval_rst + .. tabs:: + .. 
tab:: Hugging Face + Create a new file named ``demo.py`` and insert the code snippet below to run `Qwen-1.8B-Chat `_ model with IPEX-LLM optimizations. + + .. code-block:: python + + # Copy/Paste the contents to a new file demo.py + import torch + from ipex_llm.transformers import AutoModelForCausalLM + from transformers import AutoTokenizer, GenerationConfig + generation_config = GenerationConfig(use_cache=True) + + print('Now start loading Tokenizer and optimizing Model...') + tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat", + trust_remote_code=True) + + # Load Model using ipex-llm and load it to GPU + model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat", + load_in_4bit=True, + cpu_embedding=True, + trust_remote_code=True) + model = model.to('xpu') + print('Successfully loaded Tokenizer and optimized Model!') + + # Format the prompt + question = "What is AI?" + prompt = "user: {prompt}\n\nassistant:".format(prompt=question) + + # Generate predicted tokens + with torch.inference_mode(): + input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu') + + print('--------------------------------------Note-----------------------------------------') + print('| For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or |') + print('| Pro A60, it may take several minutes for GPU kernels to compile and initialize. |') + print('| Please be patient until it finishes warm-up... |') + print('-----------------------------------------------------------------------------------') + + # To achieve optimal and consistent performance, we recommend a one-time warm-up by running `model.generate(...)` an additional time before starting your actual generation tasks. + # If you're developing an application, you can incorporate this warm-up step into start-up or loading routine to enhance the user experience. + output = model.generate(input_ids, + do_sample=False, + max_new_tokens=32, + generation_config=generation_config) # warm-up + + print('Successfully finished warm-up, now start generation...') + + output = model.generate(input_ids, + do_sample=False, + max_new_tokens=32, + generation_config=generation_config).cpu() + output_str = tokenizer.decode(output[0], skip_special_tokens=True) + print(output_str) + + .. tab:: ModelScope + + Please first run following command in Miniforge Prompt to install ModelScope: + + .. code-block:: cmd + + pip install modelscope==1.11.0 + + Create a new file named ``demo.py`` and insert the code snippet below to run `Qwen-1.8B-Chat `_ model with IPEX-LLM optimizations. + + .. code-block:: python + + # Copy/Paste the contents to a new file demo.py + import torch + from ipex_llm.transformers import AutoModelForCausalLM + from transformers import GenerationConfig + from modelscope import AutoTokenizer + generation_config = GenerationConfig(use_cache=True) + + print('Now start loading Tokenizer and optimizing Model...') + tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat", + trust_remote_code=True) + + # Load Model using ipex-llm and load it to GPU + model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat", + load_in_4bit=True, + cpu_embedding=True, + trust_remote_code=True, + model_hub='modelscope') + model = model.to('xpu') + print('Successfully loaded Tokenizer and optimized Model!') + + # Format the prompt + question = "What is AI?" 
+ prompt = "user: {prompt}\n\nassistant:".format(prompt=question) + + # Generate predicted tokens + with torch.inference_mode(): + input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu') + + print('--------------------------------------Note-----------------------------------------') + print('| For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or |') + print('| Pro A60, it may take several minutes for GPU kernels to compile and initialize. |') + print('| Please be patient until it finishes warm-up... |') + print('-----------------------------------------------------------------------------------') + + # To achieve optimal and consistent performance, we recommend a one-time warm-up by running `model.generate(...)` an additional time before starting your actual generation tasks. + # If you're developing an application, you can incorporate this warm-up step into start-up or loading routine to enhance the user experience. + output = model.generate(input_ids, + do_sample=False, + max_new_tokens=32, + generation_config=generation_config) # warm-up + + print('Successfully finished warm-up, now start generation...') + + output = model.generate(input_ids, + do_sample=False, + max_new_tokens=32, + generation_config=generation_config).cpu() + output_str = tokenizer.decode(output[0], skip_special_tokens=True) + print(output_str) + + + .. tip:: + + Please note that the repo id on ModelScope may be different from Hugging Face for some models. + + ``` + + ```eval_rst + .. note:: + + When running LLMs on Intel iGPUs with limited memory size, we recommend setting ``cpu_embedding=True`` in the ``from_pretrained`` function. + This will allow the memory-intensive embedding layer to utilize the CPU instead of GPU. + ``` + +* Step 4. Run `demo.py` within the activated Python environment using the following command: + ```cmd + python demo.py + ``` + + ### Example output + + Example output on a system equipped with an Intel Core Ultra 5 125H CPU and Intel Arc Graphics iGPU: + ``` + user: What is AI? + + assistant: AI stands for Artificial Intelligence, which refers to the development of computer systems that can perform tasks that typically require human intelligence, such as visual perception, speech recognition, + ``` + +## Tips & Troubleshooting + +### Warm-up for optimal performance on first run +When running LLMs on GPU for the first time, you might notice the performance is lower than expected, with delays up to several minutes before the first token is generated. This delay occurs because the GPU kernels require compilation and initialization, which varies across different GPU types. To achieve optimal and consistent performance, we recommend a one-time warm-up by running `model.generate(...)` an additional time before starting your actual generation tasks. If you're developing an application, you can incorporate this warm-up step into start-up or loading routine to enhance the user experience. 
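+
+Below is a minimal sketch of this warm-up pattern, reusing the Qwen-1.8B-Chat setup from the example above (the model id and prompt are illustrative; adapt them to your own model):
+
+```python
+import torch
+from ipex_llm.transformers import AutoModelForCausalLM
+from transformers import AutoTokenizer
+
+# Load the model in 4-bit with ipex-llm and move it to the Intel GPU
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat", trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat",
+                                             load_in_4bit=True,
+                                             cpu_embedding=True,
+                                             trust_remote_code=True).to('xpu')
+
+prompt = "user: What is AI?\n\nassistant:"
+input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
+
+with torch.inference_mode():
+    # One-time warm-up: triggers GPU kernel compilation and initialization; output is discarded
+    model.generate(input_ids, do_sample=False, max_new_tokens=32)
+    # Subsequent generations now run at the expected speed
+    output = model.generate(input_ids, do_sample=False, max_new_tokens=32).cpu()
+
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+```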
diff --git a/docs/mddocs/Quickstart/llama3_llamacpp_ollama_quickstart.md b/docs/mddocs/Quickstart/llama3_llamacpp_ollama_quickstart.md new file mode 100644 index 00000000..0576cc98 --- /dev/null +++ b/docs/mddocs/Quickstart/llama3_llamacpp_ollama_quickstart.md @@ -0,0 +1,201 @@ +# Run Llama 3 on Intel GPU using llama.cpp and ollama with IPEX-LLM + +[Llama 3](https://llama.meta.com/llama3/) is the latest Large Language Models released by [Meta](https://llama.meta.com/) which provides state-of-the-art performance and excels at language nuances, contextual understanding, and complex tasks like translation and dialogue generation. + +Now, you can easily run Llama 3 on Intel GPU using `llama.cpp` and `Ollama` with IPEX-LLM. + +See the demo of running Llama-3-8B-Instruct on Intel Arc GPU using `Ollama` below. + + + +## Quick Start +This quickstart guide walks you through how to run Llama 3 on Intel GPU using `llama.cpp` / `Ollama` with IPEX-LLM. + +### 1. Run Llama 3 using llama.cpp + +#### 1.1 Install IPEX-LLM for llama.cpp and Initialize + +Visit [Run llama.cpp with IPEX-LLM on Intel GPU Guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html), and follow the instructions in section [Prerequisites](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html#prerequisites) to setup and section [Install IPEX-LLM for llama.cpp](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html#install-ipex-llm-for-llama-cpp) to install the IPEX-LLM with llama.cpp binaries, then follow the instructions in section [Initialize llama.cpp with IPEX-LLM](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html#initialize-llama-cpp-with-ipex-llm) to initialize. + +**After above steps, you should have created a conda environment, named `llm-cpp` for instance and have llama.cpp binaries in your current directory.** + +**Now you can use these executable files by standard llama.cpp usage.** + +#### 1.2 Download Llama3 + +There already are some GGUF models of Llama3 in community, here we take [Meta-Llama-3-8B-Instruct-GGUF](https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF) for example. + +Suppose you have downloaded a [Meta-Llama-3-8B-Instruct-Q4_K_M.gguf](https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf) model from [Meta-Llama-3-8B-Instruct-GGUF](https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF) and put it under ``. + +#### 1.3 Run Llama3 on Intel GPU using llama.cpp + +#### Runtime Configuration + +To use GPU acceleration, several environment variables are required or recommended before running `llama.cpp`. + +```eval_rst +.. tabs:: + .. tab:: Linux + + .. code-block:: bash + + source /opt/intel/oneapi/setvars.sh + export SYCL_CACHE_PERSISTENT=1 + + .. tab:: Windows + + .. code-block:: bash + + set SYCL_CACHE_PERSISTENT=1 + +``` + +```eval_rst +.. tip:: + + If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance: + + .. code-block:: bash + + export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 + +``` + +##### Run llama3 + +Under your current directory, exceuting below command to do inference with Llama3: + +```eval_rst +.. tabs:: + .. tab:: Linux + + .. 
code-block:: bash + + ./main -m /Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun doing something" -t 8 -e -ngl 33 --color --no-mmap + + .. tab:: Windows + + Please run the following command in Miniforge Prompt. + + .. code-block:: bash + + main -m /Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun doing something" -e -ngl 33 --color --no-mmap +``` + +Under your current directory, you can also execute below command to have interactive chat with Llama3: + +```eval_rst +.. tabs:: + .. tab:: Linux + + .. code-block:: bash + + ./main -ngl 33 --interactive-first --color -e --in-prefix '<|start_header_id|>user<|end_header_id|>\n\n' --in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' -r '<|eot_id|>' -m /Meta-Llama-3-8B-Instruct-Q4_K_M.gguf + + .. tab:: Windows + + Please run the following command in Miniforge Prompt. + + .. code-block:: bash + + main -ngl 33 --interactive-first --color -e --in-prefix "<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -r "<|eot_id|>" -m /Meta-Llama-3-8B-Instruct-Q4_K_M.gguf +``` + +Below is a sample output on Intel Arc GPU: + + +### 2. Run Llama3 using Ollama + +#### 2.1 Install IPEX-LLM for Ollama and Initialize + +Visit [Run Ollama with IPEX-LLM on Intel GPU](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/ollama_quickstart.html), and follow the instructions in section [Install IPEX-LLM for llama.cpp](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html#install-ipex-llm-for-llama-cpp) to install the IPEX-LLM with Ollama binary, then follow the instructions in section [Initialize Ollama](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/ollama_quickstart.html#initialize-ollama) to initialize. + +**After above steps, you should have created a conda environment, named `llm-cpp` for instance and have ollama binary file in your current directory.** + +**Now you can use this executable file by standard Ollama usage.** + +#### 2.2 Run Llama3 on Intel GPU using Ollama + +[ollama/ollama](https://github.com/ollama/ollama) has alreadly added [Llama3](https://ollama.com/library/llama3) into its library, so it's really easy to run Llama3 using ollama now. + +##### 2.2.1 Run Ollama Serve + +Launch the Ollama service: + +```eval_rst +.. tabs:: + .. tab:: Linux + + .. code-block:: bash + + export no_proxy=localhost,127.0.0.1 + export ZES_ENABLE_SYSMAN=1 + export OLLAMA_NUM_GPU=999 + source /opt/intel/oneapi/setvars.sh + export SYCL_CACHE_PERSISTENT=1 + + ./ollama serve + + .. tab:: Windows + + Please run the following command in Miniforge Prompt. + + .. code-block:: bash + + set no_proxy=localhost,127.0.0.1 + set ZES_ENABLE_SYSMAN=1 + set OLLAMA_NUM_GPU=999 + set SYCL_CACHE_PERSISTENT=1 + + ollama serve + +``` + +```eval_rst +.. tip:: + + If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance before executing `ollama serve`: + + .. code-block:: bash + + export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 + +``` + +```eval_rst +.. 
note:: + + To allow the service to accept connections from all IP addresses, use `OLLAMA_HOST=0.0.0.0 ./ollama serve` instead of just `./ollama serve`. +``` + +##### 2.2.2 Using Ollama Run Llama3 + +Keep the Ollama service on and open another terminal and run llama3 with `ollama run`: + +```eval_rst +.. tabs:: + .. tab:: Linux + + .. code-block:: bash + + export no_proxy=localhost,127.0.0.1 + ./ollama run llama3:8b-instruct-q4_K_M + + .. tab:: Windows + + Please run the following command in Miniforge Prompt. + + .. code-block:: bash + + set no_proxy=localhost,127.0.0.1 + ollama run llama3:8b-instruct-q4_K_M +``` + +```eval_rst +.. note:: + + Here we just take `llama3:8b-instruct-q4_K_M` for example, you can replace it with any other Llama3 model you want. +``` + +Below is a sample output on Intel Arc GPU : + diff --git a/docs/mddocs/Quickstart/llama_cpp_quickstart.md b/docs/mddocs/Quickstart/llama_cpp_quickstart.md new file mode 100644 index 00000000..1373a781 --- /dev/null +++ b/docs/mddocs/Quickstart/llama_cpp_quickstart.md @@ -0,0 +1,333 @@ +# Run llama.cpp with IPEX-LLM on Intel GPU + +[ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp) prvoides fast LLM inference in in pure C++ across a variety of hardware; you can now use the C++ interface of [`ipex-llm`](https://github.com/intel-analytics/ipex-llm) as an accelerated backend for `llama.cpp` running on Intel **GPU** *(e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max)*. + +See the demo of running LLaMA2-7B on Intel Arc GPU below. + + + +```eval_rst +.. note:: + + `ipex-llm[cpp]==2.5.0b20240527` is consistent with `c780e75 `_ of llama.cpp. + + Our current version is consistent with `62bfef5 `_ of llama.cpp. +``` + +## Quick Start +This quickstart guide walks you through installing and running `llama.cpp` with `ipex-llm`. + +### 0 Prerequisites +IPEX-LLM's support for `llama.cpp` now is available for Linux system and Windows system. + +#### Linux +For Linux system, we recommend Ubuntu 20.04 or later (Ubuntu 22.04 is preferred). + +Visit the [Install IPEX-LLM on Linux with Intel GPU](./install_linux_gpu.html), follow [Install Intel GPU Driver](./install_linux_gpu.html#install-intel-gpu-driver) and [Install oneAPI](./install_linux_gpu.html#install-oneapi) to install GPU driver and Intel® oneAPI Base Toolkit 2024.0. + +#### Windows (Optional) + +IPEX-LLM backend for llama.cpp only supports the more recent GPU drivers. Please make sure your GPU driver version is equal or newer than `31.0.101.5333`, otherwise you might find gibberish output. + +If you have lower GPU driver version, visit the [Install IPEX-LLM on Windows with Intel GPU Guide](./install_windows_gpu.html), and follow [Update GPU driver](./install_windows_gpu.html#optional-update-gpu-driver). + +### 1 Install IPEX-LLM for llama.cpp + +To use `llama.cpp` with IPEX-LLM, first ensure that `ipex-llm[cpp]` is installed. + +```eval_rst +.. tabs:: + .. tab:: Linux + + .. code-block:: bash + + conda create -n llm-cpp python=3.11 + conda activate llm-cpp + pip install --pre --upgrade ipex-llm[cpp] + + .. tab:: Windows + + .. note:: + + Please run the following command in Miniforge Prompt. + + .. 
code-block:: cmd + + conda create -n llm-cpp python=3.11 + conda activate llm-cpp + pip install --pre --upgrade ipex-llm[cpp] + +``` + +**After the installation, you should have created a conda environment, named `llm-cpp` for instance, for running `llama.cpp` commands with IPEX-LLM.** + +### 2 Setup for running llama.cpp + +First you should create a directory to use `llama.cpp`, for instance, use following command to create a `llama-cpp` directory and enter it. +```cmd +mkdir llama-cpp +cd llama-cpp +``` + +#### Initialize llama.cpp with IPEX-LLM + +Then you can use following command to initialize `llama.cpp` with IPEX-LLM: +```eval_rst +.. tabs:: + .. tab:: Linux + + .. code-block:: bash + + init-llama-cpp + + After ``init-llama-cpp``, you should see many soft links of ``llama.cpp``'s executable files and a ``convert.py`` in current directory. + + .. image:: https://llm-assets.readthedocs.io/en/latest/_images/init_llama_cpp_demo_image.png + + .. tab:: Windows + + Please run the following command with **administrator privilege in Miniforge Prompt**. + + .. code-block:: bash + + init-llama-cpp.bat + + After ``init-llama-cpp.bat``, you should see many soft links of ``llama.cpp``'s executable files and a ``convert.py`` in current directory. + + .. image:: https://llm-assets.readthedocs.io/en/latest/_images/init_llama_cpp_demo_image_windows.png + +``` + +```eval_rst +.. note:: + + ``init-llama-cpp`` will create soft links of llama.cpp's executable files to current directory, if you want to use these executable files in other places, don't forget to run above commands again. +``` + +```eval_rst +.. note:: + + If you have installed higher version ``ipex-llm[cpp]`` and want to upgrade your binary file, don't forget to remove old binary files first and initialize again with ``init-llama-cpp`` or ``init-llama-cpp.bat``. +``` + +**Now you can use these executable files by standard llama.cpp's usage.** + +#### Runtime Configuration + +To use GPU acceleration, several environment variables are required or recommended before running `llama.cpp`. + +```eval_rst +.. tabs:: + .. tab:: Linux + + .. code-block:: bash + + source /opt/intel/oneapi/setvars.sh + export SYCL_CACHE_PERSISTENT=1 + + .. tab:: Windows + + Please run the following command in Miniforge Prompt. + + .. code-block:: bash + + set SYCL_CACHE_PERSISTENT=1 + +``` + +```eval_rst +.. tip:: + + If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance: + + .. code-block:: bash + + export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 + +``` + +### 3 Example: Running community GGUF models with IPEX-LLM + +Here we provide a simple example to show how to run a community GGUF model with IPEX-LLM. + +#### Model Download +Before running, you should download or copy community GGUF model to your current directory. For instance, `mistral-7b-instruct-v0.1.Q4_K_M.gguf` of [Mistral-7B-Instruct-v0.1-GGUF](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/tree/main). + +#### Run the quantized model + +```eval_rst +.. tabs:: + .. tab:: Linux + + .. code-block:: bash + + ./main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 33 --color + + .. note:: + + For more details about meaning of each parameter, you can use ``./main -h``. + + .. 
tab:: Windows + + Please run the following command in Miniforge Prompt. + + .. code-block:: bash + + main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 33 --color + + .. note:: + + For more details about meaning of each parameter, you can use ``main -h``. +``` + +#### Sample Output +``` +Log start +main: build = 1 (38bcbd4) +main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.0.0 (2024.0.0.20231017) for x86_64-unknown-linux-gnu +main: seed = 1710359960 +ggml_init_sycl: GGML_SYCL_DEBUG: 0 +ggml_init_sycl: GGML_SYCL_F16: no +found 8 SYCL devices: +|ID| Name |compute capability|Max compute units|Max work group|Max sub group|Global mem size| +|--|---------------------------------------------|------------------|-----------------|--------------|-------------|---------------| +| 0| Intel(R) Arc(TM) A770 Graphics| 1.3| 512| 1024| 32| 16225243136| +| 1| Intel(R) FPGA Emulation Device| 1.2| 32| 67108864| 64| 67181625344| +| 2| 13th Gen Intel(R) Core(TM) i9-13900K| 3.0| 32| 8192| 64| 67181625344| +| 3| Intel(R) Arc(TM) A770 Graphics| 3.0| 512| 1024| 32| 16225243136| +| 4| Intel(R) Arc(TM) A770 Graphics| 3.0| 512| 1024| 32| 16225243136| +| 5| Intel(R) UHD Graphics 770| 3.0| 32| 512| 32| 53745299456| +| 6| Intel(R) Arc(TM) A770 Graphics| 1.3| 512| 1024| 32| 16225243136| +| 7| Intel(R) UHD Graphics 770| 1.3| 32| 512| 32| 53745299456| +detect 2 SYCL GPUs: [0,6] with Max compute units:512 +llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from ~/mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2) +llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. +llama_model_loader: - kv 0: general.architecture str = llama +llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-instruct-v0.1 +llama_model_loader: - kv 2: llama.context_length u32 = 32768 +llama_model_loader: - kv 3: llama.embedding_length u32 = 4096 +llama_model_loader: - kv 4: llama.block_count u32 = 32 +llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 +llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128 +llama_model_loader: - kv 7: llama.attention.head_count u32 = 32 +llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8 +llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 +llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000 +llama_model_loader: - kv 11: general.file_type u32 = 15 +llama_model_loader: - kv 12: tokenizer.ggml.model str = llama +llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<... +llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000... +llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... +llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1 +llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2 +llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0 +llama_model_loader: - kv 19: general.quantization_version u32 = 2 +llama_model_loader: - type f32: 65 tensors +llama_model_loader: - type q4_K: 193 tensors +llama_model_loader: - type q6_K: 33 tensors +llm_load_vocab: special tokens definition check successful ( 259/32000 ). 
+llm_load_print_meta: format = GGUF V2 +llm_load_print_meta: arch = llama +llm_load_print_meta: vocab type = SPM +llm_load_print_meta: n_vocab = 32000 +llm_load_print_meta: n_merges = 0 +llm_load_print_meta: n_ctx_train = 32768 +llm_load_print_meta: n_embd = 4096 +llm_load_print_meta: n_head = 32 +llm_load_print_meta: n_head_kv = 8 +llm_load_print_meta: n_layer = 32 +llm_load_print_meta: n_rot = 128 +llm_load_print_meta: n_embd_head_k = 128 +llm_load_print_meta: n_embd_head_v = 128 +llm_load_print_meta: n_gqa = 4 +llm_load_print_meta: n_embd_k_gqa = 1024 +llm_load_print_meta: n_embd_v_gqa = 1024 +llm_load_print_meta: f_norm_eps = 0.0e+00 +llm_load_print_meta: f_norm_rms_eps = 1.0e-05 +llm_load_print_meta: f_clamp_kqv = 0.0e+00 +llm_load_print_meta: f_max_alibi_bias = 0.0e+00 +llm_load_print_meta: n_ff = 14336 +llm_load_print_meta: n_expert = 0 +llm_load_print_meta: n_expert_used = 0 +llm_load_print_meta: causal attm = 1 +llm_load_print_meta: pooling type = 0 +llm_load_print_meta: rope type = 0 +llm_load_print_meta: rope scaling = linear +llm_load_print_meta: freq_base_train = 10000.0 +llm_load_print_meta: freq_scale_train = 1 +llm_load_print_meta: n_yarn_orig_ctx = 32768 +llm_load_print_meta: rope_finetuned = unknown +llm_load_print_meta: ssm_d_conv = 0 +llm_load_print_meta: ssm_d_inner = 0 +llm_load_print_meta: ssm_d_state = 0 +llm_load_print_meta: ssm_dt_rank = 0 +llm_load_print_meta: model type = 7B +llm_load_print_meta: model ftype = Q4_K - Medium +llm_load_print_meta: model params = 7.24 B +llm_load_print_meta: model size = 4.07 GiB (4.83 BPW) +llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.1 +llm_load_print_meta: BOS token = 1 '' +llm_load_print_meta: EOS token = 2 '' +llm_load_print_meta: UNK token = 0 '' +llm_load_print_meta: LF token = 13 '<0x0A>' +get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory +get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory +llm_load_tensors: ggml ctx size = 0.33 MiB +llm_load_tensors: offloading 32 repeating layers to GPU +llm_load_tensors: offloading non-repeating layers to GPU +llm_load_tensors: offloaded 33/33 layers to GPU +llm_load_tensors: SYCL0 buffer size = 2113.28 MiB +llm_load_tensors: SYCL6 buffer size = 1981.77 MiB +llm_load_tensors: SYCL_Host buffer size = 70.31 MiB +............................................................................................... 
+
+llama_new_context_with_model: n_ctx = 512
+llama_new_context_with_model: freq_base = 10000.0
+llama_new_context_with_model: freq_scale = 1
+llama_kv_cache_init: SYCL0 KV buffer size = 34.00 MiB
+llama_kv_cache_init: SYCL6 KV buffer size = 30.00 MiB
+llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
+llama_new_context_with_model: SYCL_Host input buffer size = 10.01 MiB
+llama_new_context_with_model: SYCL0 compute buffer size = 73.00 MiB
+llama_new_context_with_model: SYCL6 compute buffer size = 73.00 MiB
+llama_new_context_with_model: SYCL_Host compute buffer size = 8.00 MiB
+llama_new_context_with_model: graph splits (measure): 3
+system_info: n_threads = 8 / 32 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
+sampling:
+        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
+        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
+        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
+sampling order:
+CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
+generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 1
+ Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun exploring the world around her. Her parents were kind and let her do what she wanted, as long as she stayed safe.
+One day, the little
+llama_print_timings: load time = 10096.78 ms
+llama_print_timings: sample time = x.xx ms / 32 runs ( xx.xx ms per token, xx.xx tokens per second)
+llama_print_timings: prompt eval time = xx.xx ms / 31 tokens ( xx.xx ms per token, xx.xx tokens per second)
+llama_print_timings: eval time = xx.xx ms / 31 runs ( xx.xx ms per token, xx.xx tokens per second)
+llama_print_timings: total time = xx.xx ms / 62 tokens
+Log end
+```
+
+### Troubleshooting
+
+#### Failure to quantize the model
+If you encounter `main: failed to quantize model from xxx`, please make sure you have created the related output directory.
+
+#### Program hangs during model loading
+If your program hangs after `llm_load_tensors: SYCL_Host buffer size = xx.xx MiB`, you can add `--no-mmap` to your command.
+
+#### How to set the `-ngl` parameter
+`-ngl` is the number of layers to store in VRAM. If your VRAM is large enough, we recommend putting all layers on the GPU; you can simply set `-ngl` to a large number like 999 to achieve this.
+
+If `-ngl` is set to 0, the entire model runs on the CPU. If `-ngl` is greater than 0 but less than the number of model layers, you get a mixed GPU + CPU scenario.
+
+#### How to specify a GPU
+If your machine has multiple GPUs, `llama.cpp` will use all of them by default, which may slow down inference for a model that fits on a single GPU. You can add `-sm none` to your command to use only one GPU.
+
+Also, you can use `ONEAPI_DEVICE_SELECTOR=level_zero:[gpu_id]` to select the device before executing your command; more details are available [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/KeyFeatures/multi_gpus_selection.html#oneapi-device-selector).
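+
+For example, to pin the Mistral example above to a single GPU (the device index is illustrative; pick yours from the `sycl-ls` output):
+
+```bash
+# Select one Level Zero GPU and disable multi-GPU layer splitting
+ONEAPI_DEVICE_SELECTOR=level_zero:0 ./main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf \
+  -n 32 --prompt "Once upon a time, there existed a little girl" -t 8 -e -ngl 999 --color -sm none
+```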
+ +#### Program crash with Chinese prompt +If you run the llama.cpp program on Windows and find that your program crashes or outputs abnormally when accepting Chinese prompts, you can open `Region->Administrative->Change System locale..`, check `Beta: Use Unicode UTF-8 for worldwide language support` option and then restart your computer. + +For detailed instructions on how to do this, see [this issue](https://github.com/intel-analytics/ipex-llm/issues/10989#issuecomment-2105600469). diff --git a/docs/mddocs/Quickstart/ollama_quickstart.md b/docs/mddocs/Quickstart/ollama_quickstart.md new file mode 100644 index 00000000..fa81d73a --- /dev/null +++ b/docs/mddocs/Quickstart/ollama_quickstart.md @@ -0,0 +1,204 @@ +# Run Ollama with IPEX-LLM on Intel GPU + +[ollama/ollama](https://github.com/ollama/ollama) is popular framework designed to build and run language models on a local machine; you can now use the C++ interface of [`ipex-llm`](https://github.com/intel-analytics/ipex-llm) as an accelerated backend for `ollama` running on Intel **GPU** *(e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max)*. + +See the demo of running LLaMA2-7B on Intel Arc GPU below. + + + +```eval_rst +.. note:: + + `ipex-llm[cpp]==2.5.0b20240527` is consistent with `v0.1.34 `_ of ollama. + + Our current version is consistent with `v0.1.39 `_ of ollama. +``` + +## Quickstart + +### 1 Install IPEX-LLM for Ollama + +IPEX-LLM's support for `ollama` now is available for Linux system and Windows system. + +Visit [Run llama.cpp with IPEX-LLM on Intel GPU Guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html), and follow the instructions in section [Prerequisites](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html#prerequisites) to setup and section [Install IPEX-LLM cpp](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html#install-ipex-llm-for-llama-cpp) to install the IPEX-LLM with Ollama binaries. + +**After the installation, you should have created a conda environment, named `llm-cpp` for instance, for running `ollama` commands with IPEX-LLM.** + +### 2. Initialize Ollama + +Activate the `llm-cpp` conda environment and initialize Ollama by executing the commands below. A symbolic link to `ollama` will appear in your current directory. + +```eval_rst +.. tabs:: + .. tab:: Linux + + .. code-block:: bash + + conda activate llm-cpp + init-ollama + + .. tab:: Windows + + Please run the following command with **administrator privilege in Miniforge Prompt**. + + .. code-block:: bash + + conda activate llm-cpp + init-ollama.bat + +``` + +```eval_rst +.. note:: + + If you have installed higher version ``ipex-llm[cpp]`` and want to upgrade your ollama binary file, don't forget to remove old binary files first and initialize again with ``init-ollama`` or ``init-ollama.bat``. +``` + +**Now you can use this executable file by standard ollama's usage.** + +### 3 Run Ollama Serve + +You may launch the Ollama service as below: + +```eval_rst +.. tabs:: + .. tab:: Linux + + .. code-block:: bash + + export OLLAMA_NUM_GPU=999 + export no_proxy=localhost,127.0.0.1 + export ZES_ENABLE_SYSMAN=1 + source /opt/intel/oneapi/setvars.sh + export SYCL_CACHE_PERSISTENT=1 + + ./ollama serve + + .. tab:: Windows + + Please run the following command in Miniforge Prompt. + + .. 
code-block:: bash + + set OLLAMA_NUM_GPU=999 + set no_proxy=localhost,127.0.0.1 + set ZES_ENABLE_SYSMAN=1 + set SYCL_CACHE_PERSISTENT=1 + + ollama serve + +``` + +```eval_rst +.. note:: + + Please set environment variable ``OLLAMA_NUM_GPU`` to ``999`` to make sure all layers of your model are running on Intel GPU, otherwise, some layers may run on CPU. +``` + +```eval_rst +.. tip:: + + If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance before executing `ollama serve`: + + .. code-block:: bash + + export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 + +``` + +```eval_rst +.. note:: + + To allow the service to accept connections from all IP addresses, use `OLLAMA_HOST=0.0.0.0 ./ollama serve` instead of just `./ollama serve`. +``` + +The console will display messages similar to the following: + + + + + + +### 4 Pull Model +Keep the Ollama service on and open another terminal and run `./ollama pull ` in Linux (`ollama.exe pull ` in Windows) to automatically pull a model. e.g. `dolphin-phi:latest`: + + + + + + +### 5 Using Ollama + +#### Using Curl + +Using `curl` is the easiest way to verify the API service and model. Execute the following commands in a terminal. **Replace the with your pulled +model**, e.g. `dolphin-phi`. + +```eval_rst +.. tabs:: + .. tab:: Linux + + .. code-block:: bash + + curl http://localhost:11434/api/generate -d ' + { + "model": "", + "prompt": "Why is the sky blue?", + "stream": false + }' + + .. tab:: Windows + + Please run the following command in Miniforge Prompt. + + .. code-block:: bash + + curl http://localhost:11434/api/generate -d " + { + \"model\": \"\", + \"prompt\": \"Why is the sky blue?\", + \"stream\": false + }" + +``` + + +#### Using Ollama Run GGUF models + +Ollama supports importing GGUF models in the Modelfile, for example, suppose you have downloaded a `mistral-7b-instruct-v0.1.Q4_K_M.gguf` from [Mistral-7B-Instruct-v0.1-GGUF](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/tree/main), then you can create a file named `Modelfile`: + +```bash +FROM ./mistral-7b-instruct-v0.1.Q4_K_M.gguf +TEMPLATE [INST] {{ .Prompt }} [/INST] +PARAMETER num_predict 64 +``` + +Then you can create the model in Ollama by `ollama create example -f Modelfile` and use `ollama run` to run the model directly on console. + +```eval_rst +.. tabs:: + .. tab:: Linux + + .. code-block:: bash + + export no_proxy=localhost,127.0.0.1 + ./ollama create example -f Modelfile + ./ollama run example + + .. tab:: Windows + + Please run the following command in Miniforge Prompt. + + .. 
code-block:: bash + + set no_proxy=localhost,127.0.0.1 + ollama create example -f Modelfile + ollama run example + +``` + +An example process of interacting with model with `ollama run example` looks like the following: + + + + diff --git a/docs/mddocs/Quickstart/open_webui_with_ollama_quickstart.md b/docs/mddocs/Quickstart/open_webui_with_ollama_quickstart.md new file mode 100644 index 00000000..1eb2ec05 --- /dev/null +++ b/docs/mddocs/Quickstart/open_webui_with_ollama_quickstart.md @@ -0,0 +1,208 @@ +# Run Open WebUI with Intel GPU + +[Open WebUI](https://github.com/open-webui/open-webui) is a user friendly GUI for running LLM locally; by porting it to [`ipex-llm`](https://github.com/intel-analytics/ipex-llm), users can now easily run LLM in [Open WebUI](https://github.com/open-webui/open-webui) on Intel **GPU** *(e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max)*. + +*See the demo of running Mistral:7B on Intel Arc A770 below.* + + + +## Quickstart + +This quickstart guide walks you through setting up and using [Open WebUI](https://github.com/open-webui/open-webui) with Ollama (using the C++ interface of [`ipex-llm`](https://github.com/intel-analytics/ipex-llm) as an accelerated backend). + + +### 1 Run Ollama with Intel GPU + +Follow the instructions on the [Run Ollama with Intel GPU](ollama_quickstart.html) to install and run "Ollama Serve". Please ensure that the Ollama server continues to run while you're using the Open WebUI. + +### 2 Install the Open-Webui + +#### Install Node.js & npm + +```eval_rst +.. note:: + + Package version requirements for running Open WebUI: Node.js (>= 20.10) or Bun (>= 1.0.21), Python (>= 3.11) +``` + +Please install Node.js & npm as below: + +```eval_rst +.. tabs:: + .. tab:: Linux + + Run below commands to install Node.js & npm. Once the installation is complete, verify the installation by running ```node -v``` and ```npm -v``` to check the versions of Node.js and npm, respectively. + + .. code-block:: bash + + sudo apt update + sudo apt install nodejs + sudo apt install npm + + .. tab:: Windows + + You may download Node.js installation package from https://nodejs.org/dist/v20.12.2/node-v20.12.2-x64.msi, which will install both Node.js & npm on your system. + + Once the installation is complete, verify the installation by running ```node -v``` and ```npm -v``` to check the versions of Node.js and npm, respectively. +``` + + +#### Download the Open-Webui + +Use `git` to clone the [open-webui repo](https://github.com/open-webui/open-webui.git), or download the open-webui source code zip from [this link](https://github.com/open-webui/open-webui/archive/refs/heads/main.zip) and unzip it to a directory, e.g. `~/open-webui`. + + +#### Install Dependencies + +You may run below commands to install Open WebUI dependencies: +```eval_rst +.. tabs:: + .. tab:: Linux + + .. code-block:: bash + + cd ~/open-webui/ + cp -RPp .env.example .env # Copy required .env file + + # Build frontend + npm i + npm run build + + # Install Dependencies + cd ./backend + pip install -r requirements.txt -U + + .. tab:: Windows + + .. code-block:: bash + + cd ~\open-webui\ + copy .env.example .env + + # Build frontend + npm install + npm run build + + # Install Dependencies + cd .\backend + pip install -r requirements.txt -U +``` + +### 3. Start the Open-WebUI + +#### Start the service + +Run below commands to start the service: + +```eval_rst +.. tabs:: + .. tab:: Linux + + .. code-block:: bash + + export no_proxy=localhost,127.0.0.1 + bash start.sh + + .. 
note: + + If you have difficulty accessing the huggingface repositories, you may use a mirror, e.g. add `export HF_ENDPOINT=https://hf-mirror.com` before running `bash start.sh`. + + + .. tab:: Windows + + .. code-block:: bash + + set no_proxy=localhost,127.0.0.1 + start_windows.bat + + .. note: + + If you have difficulty accessing the huggingface repositories, you may use a mirror, e.g. add `set HF_ENDPOINT=https://hf-mirror.com` before running `start_windows.bat`. +``` + + +#### Access the WebUI +Upon successful launch, URLs to access the WebUI will be displayed in the terminal. Open the provided local URL in your browser to interact with the WebUI, e.g. http://localhost:8080/. + + + +### 4. Using the Open-Webui + +```eval_rst +.. note:: + + For detailed information about how to use Open WebUI, visit the README of `open-webui official repository `_. + +``` + +#### Log-in + +If this is your first time using it, you need to register. After registering, log in with the registered account to access the interface. + + + + + + + + + + +#### Configure `Ollama` service URL + +Access the Ollama settings through **Settings -> Connections** in the menu. By default, the **Ollama Base URL** is preset to https://localhost:11434, as illustrated in the snapshot below. To verify the status of the Ollama service connection, click the **Refresh button** located next to the textbox. If the WebUI is unable to establish a connection with the Ollama server, you will see an error message stating, `WebUI could not connect to Ollama`. + + + + + + +If the connection is successful, you will see a message stating `Service Connection Verified`, as illustrated below. + + + + + +```eval_rst +.. note:: + + If you want to use an Ollama server hosted at a different URL, simply update the **Ollama Base URL** to the new URL and press the **Refresh** button to re-confirm the connection to Ollama. +``` + +#### Pull Model + +Go to **Settings -> Models** in the menu, choose a model under **Pull a model from Ollama.com** using the drop-down menu, and then hit the **Download** button on the right. Ollama will automatically download the selected model for you. + + + + + + +#### Chat with the Model + +Start new conversations with **New chat** in the left-side menu. + +On the right-side, choose a downloaded model from the **Select a model** drop-down menu at the top, input your questions into the **Send a Message** textbox at the bottom, and click the button on the right to get responses. + + + + + + +
+Additionally, you can drag and drop a document into the textbox, allowing the LLM to access its contents. The LLM will then generate answers based on the document provided. + + + + + +#### Exit Open-Webui + +To shut down the open-webui server, use **Ctrl+C** in the terminal where the open-webui server is runing, then close your browser tab. + + +### 5. Troubleshooting + +##### Error `No module named 'torch._C` + +When you encounter the error ``ModuleNotFoundError: No module named 'torch._C'`` after executing ```bash start.sh```, you can resolve it by reinstalling PyTorch. First, use ```pip uninstall torch``` to remove the existing PyTorch installation, and then reinstall it along with its dependencies by running ```pip install torch torchvision torchaudio```. diff --git a/docs/mddocs/Quickstart/privateGPT_quickstart.md b/docs/mddocs/Quickstart/privateGPT_quickstart.md new file mode 100644 index 00000000..0d605068 --- /dev/null +++ b/docs/mddocs/Quickstart/privateGPT_quickstart.md @@ -0,0 +1,129 @@ +# Run PrivateGPT with IPEX-LLM on Intel GPU + +[PrivateGPT](https://github.com/zylon-ai/private-gpt) is a production-ready AI project that allows users to chat over documents, etc.; by integrating it with [`ipex-llm`](https://github.com/intel-analytics/ipex-llm), users can now easily leverage local LLMs running on Intel GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max). + +*See the demo of privateGPT running Mistral:7B on Intel Arc A770 below.* + + + + +## Quickstart + + +### 1. Install and Start `Ollama` Service on Intel GPU + +Follow the steps in [Run Ollama on Intel GPU Guide](./ollama_quickstart.md) to install and run Ollama on Intel GPU. Ensure that `ollama serve` is running correctly and can be accessed through a local URL (e.g., `https://127.0.0.1:11434`) or a remote URL (e.g., `http://your_ip:11434`). + +We recommend pulling the desired model before proceeding with PrivateGPT. For instance, to pull the Mistral:7B model, you can use the following command: + +```bash +ollama pull mistral:7b +``` + +### 2. Install PrivateGPT + +#### Download PrivateGPT + +You can either clone the repository or download the source zip from [github](https://github.com/zylon-ai/private-gpt/archive/refs/heads/main.zip): +```bash +git clone https://github.com/zylon-ai/private-gpt +``` + +#### Install Dependencies + +Execute the following commands in a terminal to install the dependencies of PrivateGPT: + +```cmd +cd private-gpt +pip install poetry +pip install ffmpy==0.3.1 +poetry install --extras "ui llms-ollama embeddings-ollama vector-stores-qdrant" +``` +For more details, refer to the [PrivateGPT installation Guide](https://docs.privategpt.dev/installation/getting-started/main-concepts). + + +### 3. Start PrivateGPT + +#### Configure PrivateGPT + +To configure PrivateGPT to use Ollama for running local LLMs, you should edit the `private-gpt/settings-ollama.yaml` file. Modify the `ollama` section by setting the `llm_model` and `embedding_model` you wish to use, and updating the `api_base` and `embedding_api_base` to direct to your Ollama URL. + +Below is an example of how `settings-ollama.yaml` should look. + + +
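+In text form, the relevant `ollama` section might look roughly like the sketch below. Treat it as an illustration only: keep the other sections of your `settings-ollama.yaml` unchanged, `mistral:7b` simply matches the model pulled in step 1, and the embedding model and URLs shown here are placeholders that depend on your own setup. The screenshot that follows shows the same file:
+
+```yaml
+# Illustrative values only -- adjust to the models and Ollama URL you actually use
+ollama:
+  llm_model: mistral:7b                        # model pulled via `ollama pull mistral:7b`
+  embedding_model: nomic-embed-text            # example embedding model served by Ollama
+  api_base: http://localhost:11434             # your `ollama serve` URL
+  embedding_api_base: http://localhost:11434   # usually the same Ollama instance
+```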

+*(Screenshot: an example `settings-ollama.yaml` configuration)*
+ + +```eval_rst + +.. note:: + + `settings-ollama.yaml` is loaded when the Ollama profile is specified in the PGPT_PROFILES environment variable. This can override configurations from the default `settings.yaml`. + +``` + +For more information on configuring PrivateGPT, please visit the [PrivateGPT Main Concepts](https://docs.privategpt.dev/installation/getting-started/main-concepts) page. + + +#### Start the service +Please ensure that the Ollama server continues to run in a terminal while you're using the PrivateGPT. + +Run below commands to start the service in another terminal: + +```eval_rst +.. tabs:: + .. tab:: Linux + + .. code-block:: bash + + export no_proxy=localhost,127.0.0.1 + PGPT_PROFILES=ollama make run + + .. note: + + Setting ``PGPT_PROFILES=ollama`` will load the configuration from ``settings.yaml`` and ``settings-ollama.yaml``. + + .. tab:: Windows + + .. code-block:: bash + + set no_proxy=localhost,127.0.0.1 + set PGPT_PROFILES=ollama + make run + + .. note: + + Setting ``PGPT_PROFILES=ollama`` will load the configuration from ``settings.yaml`` and ``settings-ollama.yaml``. +``` + +Upon successful deployment, you will see logs in the terminal similar to the following: + +

+*(Screenshot: PrivateGPT startup logs in the terminal)*
+
+Open a browser (if it doesn't open automatically) and navigate to the URL displayed in the terminal. If it shows http://0.0.0.0:8001, you can access it locally via `http://127.0.0.1:8001` or remotely via `http://your_ip:8001`.
+
+
+### 4. Using PrivateGPT
+
+#### Chat with the Model
+
+To chat with the LLM, select the "LLM Chat" option located in the upper left corner of the page. Type your messages at the bottom of the page and click the "Submit" button to receive responses from the model.
+

+*(Screenshot: the "LLM Chat" page in PrivateGPT)*
+ + + +#### Chat over Documents (RAG) + +To interact with documents, select the "Query Files" option in the upper left corner of the page. Click the "Upload File(s)" button to upload documents. After the documents have been vectorized, you can type your messages at the bottom of the page and click the "Submit" button to receive responses from the model based on the uploaded content. + + + + diff --git a/docs/mddocs/Quickstart/vLLM_quickstart.md b/docs/mddocs/Quickstart/vLLM_quickstart.md new file mode 100644 index 00000000..71e34834 --- /dev/null +++ b/docs/mddocs/Quickstart/vLLM_quickstart.md @@ -0,0 +1,276 @@ +# Serving using IPEX-LLM and vLLM on Intel GPU + +vLLM is a fast and easy-to-use library for LLM inference and serving. You can find the detailed information at their [homepage](https://github.com/vllm-project/vllm). + +IPEX-LLM can be integrated into vLLM so that user can use `IPEX-LLM` to boost the performance of vLLM engine on Intel **GPUs** *(e.g., local PC with descrete GPU such as Arc, Flex and Max)*. + +Currently, IPEX-LLM integrated vLLM only supports the following models: + +- Qwen series models +- Llama series models +- ChatGLM series models +- Baichuan series models + + +## Quick Start + +This quickstart guide walks you through installing and running `vLLM` with `ipex-llm`. + +### 1. Install IPEX-LLM for vLLM + +IPEX-LLM's support for `vLLM` now is available for only Linux system. + +Visit [Install IPEX-LLM on Linux with Intel GPU](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html) and follow the instructions in section [Install Prerequisites](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#install-prerequisites) to isntall prerequisites that are needed for running code on Intel GPUs. + +Then,follow instructions in section [Install ipex-llm](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#install-ipex-llm) to install `ipex-llm[xpu]` and setup the recommended runtime configurations. + +**After the installation, you should have created a conda environment, named `ipex-vllm` for instance, for running `vLLM` commands with IPEX-LLM.** + +### 2. Install vLLM + +Currently, we maintain a specific branch of vLLM, which only works on Intel GPUs. + +Activate the `ipex-vllm` conda environment and install vLLM by execcuting the commands below. + +```bash +conda activate ipex-vllm +source /opt/intel/oneapi/setvars.sh +git clone -b sycl_xpu https://github.com/analytics-zoo/vllm.git +cd vllm +pip install -r requirements-xpu.txt +pip install --no-deps xformers +VLLM_BUILD_XPU_OPS=1 pip install --no-build-isolation -v -e . +pip install outlines==0.0.34 --no-deps +pip install interegular cloudpickle diskcache joblib lark nest-asyncio numba scipy +# For Qwen model support +pip install transformers_stream_generator einops tiktoken +``` + +**Now you are all set to use vLLM with IPEX-LLM** + +## 3. Offline inference/Service + +### Offline inference + +To run offline inference using vLLM for a quick impression, use the following example. + +```eval_rst +.. note:: + + Please modify the MODEL_PATH in offline_inference.py to use your chosen model. + You can try modify load_in_low_bit to different values in **[sym_int4, fp6, fp8, fp8_e4m3, fp16]** to use different quantization dtype. 
+``` + +```bash +#!/bin/bash +wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/vLLM-Serving/offline_inference.py +python offline_inference.py +``` + +For instructions on how to change the `load_in_low_bit` value in `offline_inference.py`, check the following example: + +```bash +llm = LLM(model="YOUR_MODEL", + device="xpu", + dtype="float16", + enforce_eager=True, + # Simply change here for the desired load_in_low_bit value + load_in_low_bit="sym_int4", + tensor_parallel_size=1, + trust_remote_code=True) +``` + +The result of executing `Baichuan2-7B-Chat` model with `sym_int4` low-bit format is shown as follows: + +``` +Prompt: 'Hello, my name is', Generated text: ' [Your Name] and I am a [Your Job Title] at [Your' +Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government in the United States. The president leads' +Prompt: 'The capital of France is', Generated text: ' Paris.\nThe capital of France is Paris.' +Prompt: 'The future of AI is', Generated text: " bright, but it's not without challenges. As AI continues to evolve," +``` + +### Service + +```eval_rst +.. note:: + + Because of using JIT compilation for kernels. We recommend to send a few requests for warmup before using the service for the best performance. +``` + +To fully utilize the continuous batching feature of the `vLLM`, you can send requests to the service using `curl` or other similar methods. The requests sent to the engine will be batched at token level. Queries will be executed in the same `forward` step of the LLM and be removed when they are finished instead of waiting for all sequences to be finished. + + +For vLLM, you can start the service using the following command: + +```bash +#!/bin/bash +model="YOUR_MODEL_PATH" +served_model_name="YOUR_MODEL_NAME" + + # You may need to adjust the value of + # --max-model-len, --max-num-batched-tokens, --max-num-seqs + # to acquire the best performance + + # Change value --load-in-low-bit to [fp6, fp8, fp8_e4m3, fp16] to use different low-bit formats +python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \ + --served-model-name $served_model_name \ + --port 8000 \ + --model $model \ + --trust-remote-code \ + --gpu-memory-utilization 0.75 \ + --device xpu \ + --dtype float16 \ + --enforce-eager \ + --load-in-low-bit sym_int4 \ + --max-model-len 4096 \ + --max-num-batched-tokens 10240 \ + --max-num-seqs 12 \ + --tensor-parallel-size 1 +``` + +You can tune the service using these four arguments: + +1. `--gpu-memory-utilization`: The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. For example, a value of 0.5 would imply 50% GPU memory utilization. If unspecified, will use the default value of 0.9. +2. `--max-model-len`: Model context length. If unspecified, will be automatically derived from the model config. +3. `--max-num-batched-token`: Maximum number of batched tokens per iteration. +4. `--max-num-seq`: Maximum number of sequences per iteration. Default: 256 + +For longer input prompt, we would suggest to use `--max-num-batched-token` to restrict the service. The reason behind this logic is that the `peak GPU memory usage` will appear when generating first token. By using `--max-num-batched-token`, we can restrict the input size when generating first token. + +`--max-num-seqs` will restrict the generation for both first token and rest token. It will restrict the maximum batch size to the value set by `--max-num-seqs`. 
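+For example, if your requests mostly carry long prompts and memory peaks while the first token is being generated, you might start the service with smaller limits and raise them gradually. The command below is only an illustration: it is the same booting script as above with nothing but the two limits lowered, and the exact values need to be tuned for your own GPU and workload:
+
+```bash
+#!/bin/bash
+model="YOUR_MODEL_PATH"
+served_model_name="YOUR_MODEL_NAME"
+
+# Illustrative values only: lower --max-num-batched-tokens to cap the prompt tokens
+# processed per step (first-token peak memory), and lower --max-num-seqs to cap how
+# many sequences are decoded together.
+python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
+  --served-model-name $served_model_name \
+  --port 8000 \
+  --model $model \
+  --trust-remote-code \
+  --gpu-memory-utilization 0.75 \
+  --device xpu \
+  --dtype float16 \
+  --enforce-eager \
+  --load-in-low-bit sym_int4 \
+  --max-model-len 4096 \
+  --max-num-batched-tokens 4096 \
+  --max-num-seqs 8 \
+  --tensor-parallel-size 1
+```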
+ +When out-of-memory error occurs, the most obvious solution is to reduce the `gpu-memory-utilization`. Other ways to resolve this error is to set `--max-num-batched-token` if peak memory occurs when generating first token or using `--max-num-seq` if peak memory occurs when generating rest tokens. + +If the service have been booted successfully, the console will display messages similar to the following: + + + + + + +After the service has been booted successfully, you can send a test request using `curl`. Here, `YOUR_MODEL` should be set equal to `$served_model_name` in your booting script, e.g. `Qwen1.5`. + + +```bash +curl http://localhost:8000/v1/completions \ +-H "Content-Type: application/json" \ +-d '{ + "model": "YOUR_MODEL", + "prompt": "San Francisco is a", + "max_tokens": 128, + "temperature": 0 +}' | jq '.choices[0].text' +``` + +Below shows an example output using `Qwen1.5-7B-Chat` with low-bit format `sym_int4`: + + + + + +```eval_rst +.. tip:: + + If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance before starting the service: + + .. code-block:: bash + + export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 + +``` + +## 4. About Tensor parallel + +> Note: We recommend to use docker for tensor parallel deployment. Check our serving docker image `intelanalytics/ipex-llm-serving-xpu`. + +We have also supported tensor parallel by using multiple Intel GPU cards. To enable tensor parallel, you will need to install `libfabric-dev` in your environment. In ubuntu, you can install it by: + +```bash +sudo apt-get install libfabric-dev +``` + +To deploy your model across multiple cards, simplely change the value of `--tensor-parallel-size` to the desired value. + + +For instance, if you have two Arc A770 cards in your environment, then you can set this value to 2. Some OneCCL environment variable settings are also needed, check the following example: + +```bash +#!/bin/bash +model="YOUR_MODEL_PATH" +served_model_name="YOUR_MODEL_NAME" + +# CCL needed environment variables +export CCL_WORKER_COUNT=2 +export FI_PROVIDER=shm +export CCL_ATL_TRANSPORT=ofi +export CCL_ZE_IPC_EXCHANGE=sockets +export CCL_ATL_SHM=1 + # You may need to adjust the value of + # --max-model-len, --max-num-batched-tokens, --max-num-seqs + # to acquire the best performance + +python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \ + --served-model-name $served_model_name \ + --port 8000 \ + --model $model \ + --trust-remote-code \ + --gpu-memory-utilization 0.75 \ + --device xpu \ + --dtype float16 \ + --enforce-eager \ + --load-in-low-bit sym_int4 \ + --max-model-len 4096 \ + --max-num-batched-tokens 10240 \ + --max-num-seqs 12 \ + --tensor-parallel-size 2 +``` + +If the service have booted successfully, you should see the output similar to the following figure: + + + + + +## 5.Performing benchmark + +To perform benchmark, you can use the **benchmark_throughput** script that is originally provided by vLLM repo. 
+ +```bash +conda activate ipex-vllm + +source /opt/intel/oneapi/setvars.sh + +wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json + +wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/docker/llm/serving/xpu/docker/benchmark_vllm_throughput.py -O benchmark_throughput.py + +export MODEL="YOUR_MODEL" + +# You can change load-in-low-bit from values in [sym_int4, fp6, fp8, fp8_e4m3, fp16] + +python3 ./benchmark_throughput.py \ + --backend vllm \ + --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json \ + --model $MODEL \ + --num-prompts 1000 \ + --seed 42 \ + --trust-remote-code \ + --enforce-eager \ + --dtype float16 \ + --device xpu \ + --load-in-low-bit sym_int4 \ + --gpu-memory-utilization 0.85 +``` + +The following figure shows the result of benchmarking `Llama-2-7b-chat-hf` using 50 prompts: + + + + + + +```eval_rst +.. tip:: + + To find the best config that fits your workload, you may need to start the service and use tools like `wrk` or `jmeter` to perform a stress tests. +``` diff --git a/docs/mddocs/Quickstart/webui_quickstart.md b/docs/mddocs/Quickstart/webui_quickstart.md new file mode 100644 index 00000000..3aab9589 --- /dev/null +++ b/docs/mddocs/Quickstart/webui_quickstart.md @@ -0,0 +1,217 @@ +# Run Text Generation WebUI on Intel GPU + +The [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) provides a user friendly GUI for anyone to run LLM locally; by porting it to [`ipex-llm`](https://github.com/intel-analytics/ipex-llm), users can now easily run LLM in [Text Generation WebUI](https://github.com/intel-analytics/text-generation-webui) on Intel **GPU** *(e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max)*. + +See the demo of running LLaMA2-7B on an Intel Core Ultra laptop below. + + + +## Quickstart +This quickstart guide walks you through setting up and using the [Text Generation WebUI](https://github.com/intel-analytics/text-generation-webui) with `ipex-llm`. + +A preview of the WebUI in action is shown below: + + + + + + +### 1 Install IPEX-LLM + +To use the WebUI, first ensure that IPEX-LLM is installed. Follow the instructions on the [IPEX-LLM Installation Quickstart for Windows with Intel GPU](install_windows_gpu.html). + +**After the installation, you should have created a conda environment, named `llm` for instance, for running `ipex-llm` applications.** + +### 2 Install the WebUI + + +#### Download the WebUI +Download the `text-generation-webui` with IPEX-LLM integrations from [this link](https://github.com/intel-analytics/text-generation-webui/archive/refs/heads/ipex-llm.zip). Unzip the content into a directory, e.g.,`C:\text-generation-webui`. + +#### Install Dependencies + +Open **Miniforge Prompt** and activate the conda environment you have created in [section 1](#1-install-ipex-llm), e.g., `llm`. +``` +conda activate llm +``` +Then, change to the directory of WebUI (e.g.,`C:\text-generation-webui`) and install the necessary dependencies: +```cmd +cd C:\text-generation-webui +pip install -r requirements_cpu_only.txt +pip install -r extensions/openai/requirements.txt +``` + +```eval_rst +.. note:: + + `extensions/openai/requirements.txt` is for API service. If you don't need the API service, you can omit this command. +``` + +### 3 Start the WebUI Server + +#### Set Environment Variables +Configure oneAPI variables by running the following command in **Miniforge Prompt**: + +```eval_rst +.. 
note:: + + For more details about runtime configurations, refer to `this guide `_ +``` + +```cmd +set SYCL_CACHE_PERSISTENT=1 +``` +If you're running on iGPU, set additional environment variables by running the following commands: +```cmd +set BIGDL_LLM_XMX_DISABLED=1 +``` + +#### Launch the Server +In **Miniforge Prompt** with the conda environment `llm` activated, navigate to the `text-generation-webui` folder and execute the following commands (You can optionally lanch the server with or without the API service): + +##### without API service + ```cmd + python server.py --load-in-4bit + ``` +##### with API service + ``` + python server.py --load-in-4bit --api --api-port 5000 --listen + ``` +```eval_rst +.. note:: + + with ``--load-in-4bit`` option, the models will be optimized and run at 4-bit precision. For configuration for other formats and precisions, refer to `this link `_ +``` + +```eval_rst +.. note:: + + The API service allows user to access models using OpenAI-compatible API. For usage examples, refer to [this link](https://github.com/oobabooga/text-generation-webui/wiki/12-%E2%80%90-OpenAI-API#examples) +``` + +```eval_rst +.. note:: + + The API server will by default use port ``5000``. To change the port, use ``--api-port 1234`` in the command above. You can also specify using SSL or API Key in the command. Please see `this guide `_ for the full list of arguments. +``` + + +#### Access the WebUI +Upon successful launch, URLs to access the WebUI will be displayed in the terminal as shown below. Open the provided local URL in your browser to interact with the WebUI. + + + + + +### 4. Using the WebUI + +#### Model Download + +Place Huggingface models in `C:\text-generation-webui\models` by either copying locally or downloading via the WebUI. To download, navigate to the **Model** tab, enter the model's huggingface id (for instance, `microsoft/phi-1_5`) in the **Download model or LoRA** section, and click **Download**, as illustrated below. + + + + + +After copying or downloading the models, click on the blue **refresh** button to update the **Model** drop-down menu. Then, choose your desired model from the newly updated list. + + + + + +#### Load Model + +Default settings are recommended for most users. Click **Load** to activate the model. Address any errors by installing missing packages as prompted, and ensure compatibility with your version of the transformers package. Refer to [troubleshooting section](#troubleshooting) for more details. + +If everything goes well, you will get a message as shown below. + + + + + +```eval_rst +.. note:: + + Model loading might take a few minutes as it includes a **warm-up** phase. This `warm-up` step is used to improve the speed of subsequent model uses. +``` + +#### Chat with the Model + +In the **Chat** tab, start new conversations with **New chat**. + +Enter prompts into the textbox at the bottom and press the **Generate** button to receive responses. + + + + + + + +#### Exit the WebUI + +To shut down the WebUI server, use **Ctrl+C** in the **Miniforge Prompt** terminal where the WebUI Server is runing, then close your browser tab. + + +### 5. Advanced Usage +#### Using Instruct mode +Instruction-following models are models that are fine-tuned with specific prompt formats. +For these models, you should ideally use the `instruct` chat mode. +Under this mode, the model receives user prompts that are formatted according to prompt formats it was trained with. 
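+For instance, Llama-2 chat models were trained on prompts wrapped in `[INST] ... [/INST]` markers (with an optional `<<SYS>> ... <</SYS>>` system message). In `instruct` mode the WebUI applies the matching template for you, so a plain question is sent to the model roughly in the form below; this is shown purely for illustration, and with the correct instruction template loaded you never type this markup yourself:
+
+```
+[INST] <<SYS>>
+You are a helpful assistant.
+<</SYS>>
+
+What is the capital of France? [/INST]
+```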
+ +To use `instruct` chat mode, select `chat` tab, scroll down the page, and then select `instruct` under `Mode`. + + + + + +When a model is loaded, its corresponding instruction template, which contains prompt formatting, is automatically loaded. +If chat responses are poor, the loaded instruction template might be incorrect. +In this case, go to `Parameters` tab and then `Instruction template` tab. + + + + + +You can verify and edit the loaded instruction template in the `Instruction template` field. +You can also manually select an instruction template from `Saved instruction templates` and click `load` to load it into `Instruction template`. +You can add custom template files to this list in `/instruction-templates/` [folder](https://github.com/intel-analytics/text-generation-webui/tree/ipex-llm/instruction-templates). + + +#### Tested models +We have tested the following models with `ipex-llm` using Text Generation WebUI. + +| Model | Notes | +|-------|-------| +| llama-2-7b-chat-hf | | +| chatglm3-6b | Manually load ChatGLM3 template for Instruct chat mode | +| Mistral-7B-v0.1 | | +| qwen-7B-Chat | | + + +### Troubleshooting + +### Potentially slower first response + +The first response to user prompt might be slower than expected, with delays of up to several minutes before the response is generated. This delay occurs because the GPU kernels require compilation and initialization, which varies across different GPU types. + +#### Missing Required Dependencies + +During model loading, you may encounter an **ImportError** like `ImportError: This modeling file requires the following packages that were not found in your environment`. This indicates certain packages required by the model are absent from your environment. Detailed instructions for installing these necessary packages can be found at the bottom of the error messages. Take the following steps to fix these errors: + +- Exit the WebUI Server by pressing **Ctrl+C** in the **Miniforge Prompt** terminal. +- Install the missing pip packages as specified in the error message +- Restart the WebUI Server. + +If there are still errors on missing packages, repeat the installation process for any additional required packages. + +#### Compatiblity issues +If you encounter **AttributeError** errors like `AttributeError: 'BaichuanTokenizer' object has no attribute 'sp_model'`, it may be due to some models being incompatible with the current version of the transformers package because the models are outdated. In such instances, using a more recent model is recommended. +