Add initial md docs (#11371)

This commit is contained in:
Yuwen Hu 2024-06-20 13:47:49 +08:00 committed by GitHub
parent 9601fae5d5
commit 769728c1eb
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
47 changed files with 6406 additions and 0 deletions

View file

@@ -0,0 +1,221 @@
# Run llama.cpp/Ollama/Open-WebUI on an Intel GPU via Docker
## Quick Start
### Install Docker
1. **Linux**: Follow the instructions in this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/docker_windows_gpu.html#linux) to install Docker on Linux.
2. **Windows**: Refer to this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/docker_windows_gpu.html#install-docker-desktop-for-windows).
#### Docker settings on Windows
You need to enable `--net=host`; follow [this guide](https://docs.docker.com/network/drivers/host/#docker-desktop) so that you can easily access the services running inside the Docker container. The [WSL kernel v6.1.x](https://learn.microsoft.com/en-us/community/content/wsl-user-msft-kernel-v6#1---building-the-microsoft-linux-kernel-v61x) is recommended; otherwise, you may encounter a blocking issue before the model is loaded onto the GPU.
### Pull the latest image
```bash
# This image will be updated every day
docker pull intelanalytics/ipex-llm-inference-cpp-xpu:latest
```
### Start Docker Container
```eval_rst
.. tabs::
.. tab:: Linux
To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container. Set `DEVICE` to the device type you are running on (Max, Flex, Arc or iGPU), and change `/path/to/models` to the directory where your models are stored. `bench_model` is used for a quick benchmark; if you want to benchmark, make sure the model file is under `/path/to/models`.
.. code-block:: bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-inference-cpp-xpu:latest
export CONTAINER_NAME=ipex-llm-inference-cpp-xpu-container
sudo docker run -itd \
--net=host \
--device=/dev/dri \
-v /path/to/models:/models \
-e no_proxy=localhost,127.0.0.1 \
--memory="32G" \
--name=$CONTAINER_NAME \
-e bench_model="mistral-7b-v0.1.Q4_0.gguf" \
-e DEVICE=Arc \
--shm-size="16g" \
$DOCKER_IMAGE
.. tab:: Windows
To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container, and change `/path/to/models` to the directory where your models are stored. On Windows (WSL), you also need to add `--privileged` and map `/usr/lib/wsl` into the container.
.. code-block:: bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-inference-cpp-xpu:latest
export CONTAINER_NAME=ipex-llm-inference-cpp-xpu-container
sudo docker run -itd \
--net=host \
--device=/dev/dri \
--privileged \
-v /path/to/models:/models \
-v /usr/lib/wsl:/usr/lib/wsl \
-e no_proxy=localhost,127.0.0.1 \
--memory="32G" \
--name=$CONTAINER_NAME \
-e bench_model="mistral-7b-v0.1.Q4_0.gguf" \
-e DEVICE=Arc \
--shm-size="16g" \
$DOCKER_IMAGE
```
After the container is booted, you could get into the container through `docker exec`.
```bash
docker exec -it ipex-llm-inference-cpp-xpu-container /bin/bash
```
To verify the device is successfully mapped into the container, run `sycl-ls` to check the result. In a machine with Arc A770, the sampled output is:
```bash
root@arda-arc12:/# sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
```
### Quick benchmark for llama.cpp
Note that performance in a Windows WSL Docker container is slightly lower than on the Windows host; this is caused by the WSL kernel implementation.
```bash
bash /llm/scripts/benchmark_llama-cpp.sh
```
The benchmark runs three times to warm up and obtain accurate results; the example output looks like:
```bash
llama_print_timings: load time = xxx ms
llama_print_timings: sample time = xxx ms / 128 runs ( xxx ms per token, xxx tokens per second)
llama_print_timings: prompt eval time = xxx ms / xxx tokens ( xxx ms per token, xxx tokens per second)
llama_print_timings: eval time = xxx ms / 127 runs ( xxx ms per token, xxx tokens per second)
llama_print_timings: total time = xxx ms / xxx tokens
```
### Running llama.cpp inference with IPEX-LLM on Intel GPU
```bash
cd /llm/scripts/
# set the recommended Env
source ipex-llm-init --gpu --device $DEVICE
# mount models and change the model_path in `start-llama-cpp.sh`
bash start-llama-cpp.sh
```
The example output looks like:
```bash
llama_print_timings: load time = xxx ms
llama_print_timings: sample time = xxx ms / 32 runs ( xxx ms per token, xxx tokens per second)
llama_print_timings: prompt eval time = xxx ms / xxx tokens ( xxx ms per token, xxx tokens per second)
llama_print_timings: eval time = xxx ms / 31 runs ( xxx ms per token, xxx tokens per second)
llama_print_timings: total time = xxx ms / xxx tokens
```
Please refer to this [documentation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html) for more details.
### Running Ollama serving with IPEX-LLM on Intel GPU
Run Ollama in the background; you can check its log at `/root/ollama/ollama.log`.
```bash
cd /llm/scripts/
# set the recommended Env
source ipex-llm-init --gpu --device $DEVICE
bash start-ollama.sh # press Ctrl+C to exit; the Ollama service will keep running in the background
```
Sample output:
```bash
time=2024-05-16T10:45:33.536+08:00 level=INFO source=images.go:697 msg="total blobs: 0"
time=2024-05-16T10:45:33.536+08:00 level=INFO source=images.go:704 msg="total unused blobs removed: 0"
time=2024-05-16T10:45:33.536+08:00 level=INFO source=routes.go:1044 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-05-16T10:45:33.537+08:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama751325299/runners
time=2024-05-16T10:45:33.565+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"
time=2024-05-16T10:45:33.565+08:00 level=INFO source=gpu.go:122 msg="Detecting GPUs"
time=2024-05-16T10:45:33.566+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
```
#### Run Ollama models (interactive)
```bash
cd /llm/ollama
# create a file named Modelfile with the following content
cat <<'EOF' > Modelfile
FROM /models/mistral-7b-v0.1.Q4_0.gguf
TEMPLATE [INST] {{ .Prompt }} [/INST]
PARAMETER num_predict 64
EOF
# create the example model and run it in the console
./ollama create example -f Modelfile
./ollama run example
```
An example interactive session with `ollama run example` looks like the following:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/ollama_gguf_demo_image.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/ollama_gguf_demo_image.png" width=100%; />
</a>
#### Pull models from ollama to serve
```bash
cd /llm/ollama
./ollama pull llama2
```
Use `curl` to test:
```bash
curl http://localhost:11434/api/generate -d '
{
"model": "llama2",
"prompt": "What is AI?",
"stream": false
}'
```
Sample output:
```bash
{"model":"llama2","created_at":"2024-05-16T02:52:18.972296097Z","response":"\nArtificial intelligence (AI) refers to the development of computer systems that can perform tasks that typically require human intelligence, such as learning, problem-solving, and decision-making. AI systems use algorithms and data to mimic human behavior and perform tasks such as:\n\n1. Image recognition: AI can identify objects in images and classify them into different categories.\n2. Natural Language Processing (NLP): AI can understand and generate human language, allowing it to interact with humans through voice assistants or chatbots.\n3. Predictive analytics: AI can analyze data to make predictions about future events, such as stock prices or weather patterns.\n4. Robotics: AI can control robots that perform tasks such as assembly, maintenance, and logistics.\n5. Recommendation systems: AI can suggest products or services based on a user's past behavior or preferences.\n6. Autonomous vehicles: AI can control self-driving cars that can navigate through roads and traffic without human intervention.\n7. Fraud detection: AI can identify and flag fraudulent transactions, such as credit card purchases or insurance claims.\n8. Personalized medicine: AI can analyze genetic data to provide personalized medical recommendations, such as drug dosages or treatment plans.\n9. Virtual assistants: AI can interact with users through voice or text interfaces, providing information or completing tasks.\n10. Sentiment analysis: AI can analyze text or speech to determine the sentiment or emotional tone of a message.\n\nThese are just a few examples of what AI can do. As the technology continues to evolve, we can expect to see even more innovative applications of AI in various industries and aspects of our lives.","done":true,"context":[xxx,xxx],"total_duration":12831317190,"load_duration":6453932096,"prompt_eval_count":25,"prompt_eval_duration":254970000,"eval_count":390,"eval_duration":6079077000}
```
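You can send the same request from Python as well. Below is a minimal sketch using the `requests` package (assuming it is installed where you run it), mirroring the `curl` call above:
```python
import requests

# same request as the curl example above; stream=False returns a single JSON response
payload = {
    "model": "llama2",
    "prompt": "What is AI?",
    "stream": False,
}
resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json()["response"])
```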
Please refer to this [documentation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/ollama_quickstart.html#pull-model) for more details.
### Running Open WebUI with Intel GPU
Start Ollama and load the model first, then use Open WebUI to chat.
If you have difficulty accessing the Hugging Face repositories, you may use a mirror, e.g. add `export HF_ENDPOINT=https://hf-mirror.com` before running the start script below.
```bash
cd /llm/scripts/
bash start-open-webui.sh
```
Sample output:
```bash
INFO: Started server process [1055]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
```
<a href="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_signup.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_signup.png" width="100%" />
</a>
For how to log in and other guides, please refer to this [documentation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/open_webui_with_ollama_quickstart.html) for more details.

View file

@@ -0,0 +1,171 @@
# Python Inference using IPEX-LLM on Intel GPU
We can run PyTorch Inference Benchmark, Chat Service and PyTorch Examples on Intel GPUs within Docker (on Linux or WSL).
```eval_rst
.. note::
The current Windows + WSL + Docker solution only supports Arc series dGPU. For Windows users with MTL iGPU, it is recommended to install directly via pip install in Miniforge Prompt. Refer to `this guide <https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_windows_gpu.html>`_.
```
## Install Docker
Follow the [Docker installation Guide](./docker_windows_gpu.html#install-docker) to install docker on either Linux or Windows.
## Launch Docker
Prepare ipex-llm-xpu Docker Image:
```bash
docker pull intelanalytics/ipex-llm-xpu:latest
```
Start ipex-llm-xpu Docker Container:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest
export CONTAINER_NAME=my_container
export MODEL_PATH=/llm/models  # change to your model path
docker run -itd \
--net=host \
--device=/dev/dri \
--memory="32G" \
--name=$CONTAINER_NAME \
--shm-size="16g" \
-v $MODEL_PATH:/llm/models \
$DOCKER_IMAGE
.. tab:: Windows WSL
.. code-block:: bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest
export CONTAINER_NAME=my_container
export MODEL_PATH=/llm/models  # change to your model path
sudo docker run -itd \
--net=host \
--privileged \
--device /dev/dri \
--memory="32G" \
--name=$CONTAINER_NAME \
--shm-size="16g" \
-v $MODEL_PATH:/llm/llm-models \
-v /usr/lib/wsl:/usr/lib/wsl \
$DOCKER_IMAGE
```
Access the container:
```
docker exec -it $CONTAINER_NAME bash
```
To verify the device is successfully mapped into the container, run `sycl-ls` to check the result. In a machine with Arc A770, the sampled output is:
```bash
root@arda-arc12:/# sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
```
```eval_rst
.. tip::
You can run the Env-Check script to verify your ipex-llm installation and runtime environment.
.. code-block:: bash
cd /ipex-llm/python/llm/scripts
bash env-check.sh
```
## Run Inference Benchmark
Navigate to the benchmark directory, and modify the `config.yaml` under the `all-in-one` folder for benchmark configurations.
```bash
cd /benchmark/all-in-one
vim config.yaml
```
In the `config.yaml`, change `repo_id` to the model you want to test and `local_model_hub` to point to your model hub path.
```yaml
...
repo_id:
- 'meta-llama/Llama-2-7b-chat-hf'
local_model_hub: '/path/to/your/model/folder'
...
```
After modifying `config.yaml`, run the following commands to run benchmarking:
```bash
source ipex-llm-init --gpu --device <value>
python run.py
```
**Result Interpretation**
After benchmarking completes, you can obtain a CSV result file in the current folder. The columns `1st token avg latency (ms)` and `2+ avg latency (ms/token)` contain the main benchmark results. You can also check whether the column `actual input/output tokens` is consistent with `input/output tokens`, and whether the parameters you specified in `config.yaml` were successfully applied in the benchmarking.
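If you prefer to inspect the results programmatically, below is a minimal sketch using `pandas` (assuming it is available in the container); the CSV file name is hypothetical, so replace it with the file actually generated under the current folder:
```python
import pandas as pd

# hypothetical file name -- use the CSV produced by run.py under the current folder
df = pd.read_csv("gpu-benchmark-results.csv")

# key latency columns, plus a sanity check on the token counts
cols = [
    "1st token avg latency (ms)",
    "2+ avg latency (ms/token)",
    "input/output tokens",
    "actual input/output tokens",
]
print(df[cols])
```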
## Run Chat Service
We provide `chat.py` for conversational AI.
For example, if your model is `Llama-2-7b-chat-hf` and mounted at `/llm/models`, you can execute the following command to start a conversation:
```bash
cd /llm
python chat.py --model-path /llm/models/Llama-2-7b-chat-hf
```
Here is a demonstration:
<a align="left" href="https://llm-assets.readthedocs.io/en/latest/_images/llm-inference-cpu-docker-chatpy-demo.gif">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/llm-inference-cpu-docker-chatpy-demo.gif" width='60%' />
</a><br>
## Run PyTorch Examples
We provide several PyTorch examples that show how to apply IPEX-LLM INT4 optimizations to models on Intel GPUs.
For example, if your model is `Llama-2-7b-chat-hf` and mounted at `/llm/models`, you can navigate to the `/examples/llama2` directory and execute the following command to run the example:
```bash
cd /examples/<model_dir>
python ./generate.py --repo-id-or-model-path /llm/models/Llama-2-7b-chat-hf --prompt PROMPT --n-predict N_PREDICT
```
Arguments info:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the Hugging Face repo id of the Llama2 model (e.g. `meta-llama/Llama-2-7b-chat-hf` or `meta-llama/Llama-2-13b-chat-hf`) to be downloaded, or the path to the Hugging Face checkpoint folder. It defaults to `'meta-llama/Llama-2-7b-chat-hf'`.
- `--prompt PROMPT`: argument defining the prompt to be inferred (with the integrated prompt format for chat). It defaults to `'What is AI?'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
**Sample Output**
```log
Inference time: xxxx s
-------------------- Prompt --------------------
<s>[INST] <<SYS>>
<</SYS>>
What is AI? [/INST]
-------------------- Output --------------------
[INST] <<SYS>>
<</SYS>>
What is AI? [/INST] Artificial intelligence (AI) is the broader field of research and development aimed at creating machines that can perform tasks that typically require human intelligence,
```

View file

@@ -0,0 +1,139 @@
# Run/Develop PyTorch in VSCode with Docker on Intel GPU
An IPEX-LLM container is a pre-configured environment that includes all necessary dependencies for running LLMs on Intel GPUs.
This guide provides steps to run/develop PyTorch examples in VSCode with Docker on Intel GPUs.
```eval_rst
.. note::
This guide assumes you have already installed VSCode in your environment.
To run/develop on Windows, install VSCode and then follow the steps below.
To run/develop on Linux, you might open VSCode first and SSH to a remote Linux machine, then proceed with the following steps.
```
## Install Docker
Follow the [Docker installation Guide](./docker_windows_gpu.html#install-docker) to install docker on either Linux or Windows.
## Install Extensions for VSCode
#### Install Dev Containers Extension
For both Linux and Windows, you will need to install the Dev Containers extension.
Open the Extensions view in VSCode (you can use the shortcut `Ctrl+Shift+X`), then search for and install the `Dev Containers` extension.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/install_dev_container_extension_in_vscode.gif" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/install_dev_container_extension_in_vscode.gif" width=100%; />
</a>
#### Install WSL Extension for Windows
For Windows, you will need to install the WSL extension to connect to the WSL environment. Open the Extensions view in VSCode (you can use the shortcut `Ctrl+Shift+X`), then search for and install the `WSL` extension.
Press F1 to bring up the Command Palette, type in `WSL: Connect to WSL Using Distro...`, select it, and then select a specific WSL distro, e.g. `Ubuntu`.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/install_wsl_extention_in_vscode.gif" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/install_wsl_extention_in_vscode.gif" width=100%; />
</a>
## Launch Container
Open the Terminal in VSCode (you can use the shortcut `` Ctrl+Shift+` ``), then pull ipex-llm-xpu Docker Image:
```bash
docker pull intelanalytics/ipex-llm-xpu:latest
```
Start ipex-llm-xpu Docker Container:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest
export CONTAINER_NAME=my_container
export MODEL_PATH=/llm/models  # change to your model path
docker run -itd \
--net=host \
--device=/dev/dri \
--memory="32G" \
--name=$CONTAINER_NAME \
--shm-size="16g" \
-v $MODEL_PATH:/llm/models \
$DOCKER_IMAGE
.. tab:: Windows WSL
.. code-block:: bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest
export CONTAINER_NAME=my_container
export MODEL_PATH=/llm/models  # change to your model path
sudo docker run -itd \
--net=host \
--privileged \
--device /dev/dri \
--memory="32G" \
--name=$CONTAINER_NAME \
--shm-size="16g" \
-v $MODEL_PATH:/llm/llm-models \
-v /usr/lib/wsl:/usr/lib/wsl \
$DOCKER_IMAGE
```
## Run/Develop PyTorch Examples
Press F1 to bring up the Command Palette, type in `Dev Containers: Attach to Running Container...`, select it, and then select `my_container`.
Now you are inside a running Docker container. Open the folder `/ipex-llm/python/llm/example/GPU/HF-Transformers-AutoModels/Model/`.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/run_example_in_vscode.gif" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/run_example_in_vscode.gif" width=100%; />
</a>
In this folder, we provide several PyTorch examples that show how to apply IPEX-LLM INT4 optimizations to models on Intel GPUs.
For example, if your model is `Llama-2-7b-chat-hf` and mounted at `/llm/models`, you can navigate to the `llama2` directory and execute the following command to run the example:
```bash
cd <model_dir>
python ./generate.py --repo-id-or-model-path /llm/models/Llama-2-7b-chat-hf --prompt PROMPT --n-predict N_PREDICT
```
Arguments info:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the Hugging Face repo id of the Llama2 model (e.g. `meta-llama/Llama-2-7b-chat-hf` or `meta-llama/Llama-2-13b-chat-hf`) to be downloaded, or the path to the Hugging Face checkpoint folder. It defaults to `'meta-llama/Llama-2-7b-chat-hf'`.
- `--prompt PROMPT`: argument defining the prompt to be inferred (with the integrated prompt format for chat). It defaults to `'What is AI?'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
**Sample Output**
```log
Inference time: xxxx s
-------------------- Prompt --------------------
<s>[INST] <<SYS>>
<</SYS>>
What is AI? [/INST]
-------------------- Output --------------------
[INST] <<SYS>>
<</SYS>>
What is AI? [/INST] Artificial intelligence (AI) is the broader field of research and development aimed at creating machines that can perform tasks that typically require human intelligence,
```
You can develop your own PyTorch example based on these examples.

View file

@@ -0,0 +1,111 @@
# Overview of IPEX-LLM Containers for Intel GPU
An IPEX-LLM container is a pre-configured environment that includes all necessary dependencies for running LLMs on Intel GPUs.
This guide provides general instructions for setting up the IPEX-LLM Docker containers with Intel GPU. It begins with instructions and tips for Docker installation, and then introduces the available IPEX-LLM containers and their uses.
## Install Docker
### Linux
Follow the instructions in the [Official Docker Guide](https://www.docker.com/get-started/) to install Docker on Linux.
### Windows
```eval_rst
.. tip::
The installation requires at least 35GB of free disk space on the C drive.
```
```eval_rst
.. note::
Detailed installation instructions for Windows, including steps for enabling WSL2, can be found on the `Docker Desktop for Windows installation page <https://docs.docker.com/desktop/install/windows-install/>`_.
```
#### Install Docker Desktop for Windows
Follow the instructions in [this guide](https://docs.docker.com/desktop/install/windows-install/) to install **Docker Desktop for Windows**. Restart your machine after the installation is complete.
#### Install WSL2
Follow the instructions in [this guide](https://docs.microsoft.com/en-us/windows/wsl/install) to install **Windows Subsystem for Linux 2 (WSL2)**.
```eval_rst
.. tip::
You may verify WSL2 installation by running the command `wsl --list` in PowerShell or Command Prompt. If WSL2 is installed, you will see a list of installed Linux distributions.
```
#### Enable Docker integration with WSL2
Open **Docker Desktop**, then select `Settings` -> `Resources` -> `WSL integration`, turn on the `Ubuntu` toggle, and click `Apply & restart`.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/docker_desktop_new.png">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/docker_desktop_new.png" width=100%; />
</a>
```eval_rst
.. tip::
If you encounter **Docker Engine stopped** when opening Docker Desktop, you can reopen it in administrator mode.
```
#### Verify Docker is enabled in WSL2
Execute the following commands in PowerShell or Command Prompt to verify that Docker is enabled in WSL2:
```bash
wsl -d Ubuntu # Run Ubuntu WSL distribution
docker version # Check if Docker is enabled in WSL
```
You should see output similar to the following:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/docker_wsl.png">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/docker_wsl.png" width=100%; />
</a>
```eval_rst
.. tip::
While using Docker in WSL, Docker Desktop needs to be kept open.
```
## IPEX-LLM Docker Containers
We have several docker images available for running LLMs on Intel GPUs. The following table lists the available images and their uses:
| Image Name | Description | Use Case |
|------------|-------------|----------|
| intelanalytics/ipex-llm-cpu:latest | CPU Inference |For development and running LLMs using llama.cpp, Ollama and Python|
| intelanalytics/ipex-llm-xpu:latest | GPU Inference |For development and running LLMs using llama.cpp, Ollama and Python|
| intelanalytics/ipex-llm-serving-cpu:latest | CPU Serving|For serving multiple users/requests through REST APIs using vLLM/FastChat|
| intelanalytics/ipex-llm-serving-xpu:latest | GPU Serving|For serving multiple users/requests through REST APIs using vLLM/FastChat|
| intelanalytics/ipex-llm-finetune-qlora-cpu-standalone:latest | CPU Finetuning via Docker|For fine-tuning LLMs using QLora/Lora, etc. |
|intelanalytics/ipex-llm-finetune-qlora-cpu-k8s:latest|CPU Finetuning via Kubernetes|For fine-tuning LLMs using QLora/Lora, etc. |
| intelanalytics/ipex-llm-finetune-qlora-xpu:latest| GPU Finetuning|For fine-tuning LLMs using QLora/Lora, etc.|
We have also provided several quickstarts for various usage scenarios:
- [Run and develop LLM applications in PyTorch](./docker_pytorch_inference_gpu.html)
... to be added soon.
## Troubleshooting
If your machine has both an integrated GPU (iGPU) and a dedicated GPU (dGPU) such as Arc, you may encounter the following issue:
```bash
Abort was called at 62 line in file:
./shared/source/os_interface/os_interface.h
LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM_TARGET: adl [Intel(R) Core(TM) i7-14700K]
Registry and code: 13 MB
Command: python chat.py --model-path /llm/llm-models/chatglm2-6b/
Uptime: 29.349235 s
Aborted
```
To resolve this problem, you can disable the iGPU in Device Manager on Windows. For details, refer to [this guide](https://www.elevenforum.com/t/enable-or-disable-integrated-graphics-igpu-in-windows-11.18616/).

View file

@@ -0,0 +1,117 @@
# FastChat Serving with IPEX-LLM on Intel GPUs via docker
This guide demonstrates how to run `FastChat` serving with `IPEX-LLM` on Intel GPUs via Docker.
## Install docker
Follow the instructions in this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/docker_windows_gpu.html#linux) to install Docker on Linux.
## Pull the latest image
```bash
# This image will be updated every day
docker pull intelanalytics/ipex-llm-serving-xpu:latest
```
## Start Docker Container
To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container. Change the `/path/to/models` to mount the models.
```
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest
export CONTAINER_NAME=ipex-llm-serving-xpu-container
sudo docker run -itd \
--net=host \
--device=/dev/dri \
-v /path/to/models:/llm/models \
-e no_proxy=localhost,127.0.0.1 \
--memory="32G" \
--name=$CONTAINER_NAME \
--shm-size="16g" \
$DOCKER_IMAGE
```
After the container is booted, you could get into the container through `docker exec`.
```bash
docker exec -it ipex-llm-serving-xpu-container /bin/bash
```
To verify the device is successfully mapped into the container, run `sycl-ls` to check the result. In a machine with Arc A770, the sampled output is:
```bash
root@arda-arc12:/# sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
```
## Running FastChat serving with IPEX-LLM on Intel GPU in Docker
For convenience, we have provided a script named `/llm/start-fastchat-service.sh` for you to start the service.
However, the script only covers the most common scenarios. If it doesn't meet your needs, you can always find the complete guidance for FastChat at [Serving using IPEX-LLM and FastChat](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/fastchat_quickstart.html#start-the-service).
Before starting the service, you can refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#runtime-configurations) to set up our recommended runtime configurations.
Now you can start the FastChat service with the provided script `/llm/start-fastchat-service.sh` as follows:
```bash
# Only the MODEL_PATH needs to be set, other parameters have default values
export MODEL_PATH=YOUR_SELECTED_MODEL_PATH
export LOW_BIT_FORMAT=sym_int4
export CONTROLLER_HOST=localhost
export CONTROLLER_PORT=21001
export WORKER_HOST=localhost
export WORKER_PORT=21002
export API_HOST=localhost
export API_PORT=8000
# Use the default model_worker
bash /llm/start-fastchat-service.sh -w model_worker
```
If everything goes smoothly, the result should be similar to the following figure:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/start-fastchat.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/start-fastchat.png" width=100%; />
</a>
By default, we are using the `ipex_llm_worker` as the backend engine. You can also use `vLLM` as the backend engine. Try the following examples:
```bash
# Only the MODEL_PATH needs to be set, other parameters have default values
export MODEL_PATH=YOUR_SELECTED_MODEL_PATH
export LOW_BIT_FORMAT=sym_int4
export CONTROLLER_HOST=localhost
export CONTROLLER_PORT=21001
export WORKER_HOST=localhost
export WORKER_PORT=21002
export API_HOST=localhost
export API_PORT=8000
# Use the vllm_worker
bash /llm/start-fastchat-service.sh -w vllm_worker
```
The `vllm_worker` may start more slowly than the normal `ipex_llm_worker`. The booted service should look similar to the following figure:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/fastchat-vllm-worker.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/fastchat-vllm-worker.png" width=100%; />
</a>
```eval_rst
.. note::
To verify/use the service booted by the script, follow the instructions in `this guide <https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/fastchat_quickstart.html#launch-restful-api-serve>`_.
```
After a request has been sent to the `openai_api_server`, the corresponding inference latency can be found in the worker log, as shown below:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/fastchat-benchmark.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/fastchat-benchmark.png" width=100%; />
</a>
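You can also send such a request from Python through the OpenAI-compatible API started by `openai_api_server`. Below is a minimal sketch using `requests`, assuming the default `API_HOST`/`API_PORT` from the script above; `"YOUR_MODEL"` is a placeholder for the model name FastChat registered for your model:
```python
import requests

payload = {
    "model": "YOUR_MODEL",  # placeholder -- use the model name registered by FastChat
    "messages": [{"role": "user", "content": "What is AI?"}],
    "max_tokens": 64,
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```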

View file

@@ -0,0 +1,15 @@
IPEX-LLM Docker Container User Guides
=====================================
In this section, you will find guides related to using IPEX-LLM with Docker, covering how to:
* `Overview of IPEX-LLM Containers <./docker_windows_gpu.html>`_
* Inference in Python/C++
* `GPU Inference in Python with IPEX-LLM <./docker_pytorch_inference_gpu.html>`_
* `VSCode LLM Development with IPEX-LLM on Intel GPU <./docker_pytorch_inference_gpu.html>`_
* `llama.cpp/Ollama/Open-WebUI with IPEX-LLM on Intel GPU <./docker_cpp_xpu_quickstart.html>`_
* Serving
* `FastChat with IPEX-LLM on Intel GPU <./fastchat_docker_quickstart.html>`_
* `vLLM with IPEX-LLM on Intel GPU <./vllm_docker_quickstart.html>`_
* `vLLM with IPEX-LLM on Intel CPU <./vllm_cpu_docker_quickstart.html>`_

View file

@@ -0,0 +1,118 @@
# vLLM Serving with IPEX-LLM on Intel CPU via Docker
This guide demonstrates how to run `vLLM` serving with `ipex-llm` on Intel CPU via Docker.
## Install docker
Follow the instructions in this [guide](https://www.docker.com/get-started/) to install Docker on Linux.
## Pull the latest image
*Note: For running vLLM serving on Intel CPUs, you can currently use either the `intelanalytics/ipex-llm-serving-cpu:latest` or `intelanalytics/ipex-llm-serving-vllm-cpu:latest` Docker image.*
```bash
# This image will be updated every day
docker pull intelanalytics/ipex-llm-serving-cpu:latest
```
## Start Docker Container
To make full use of your Intel CPU for running vLLM inference and serving, start the container as follows:
```
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-cpu:latest
export CONTAINER_NAME=ipex-llm-serving-cpu-container
sudo docker run -itd \
--net=host \
--cpuset-cpus="0-47" \
--cpuset-mems="0" \
-v /path/to/models:/llm/models \
-e no_proxy=localhost,127.0.0.1 \
--memory="64G" \
--name=$CONTAINER_NAME \
--shm-size="16g" \
$DOCKER_IMAGE
```
After the container is booted, you could get into the container through `docker exec`.
```bash
docker exec -it ipex-llm-serving-cpu-container /bin/bash
```
## Running vLLM serving with IPEX-LLM on Intel CPU in Docker
We have included multiple vLLM-related files in `/llm/`:
1. `vllm_offline_inference.py`: used for the vLLM offline inference example
2. `benchmark_vllm_throughput.py`: used for benchmarking throughput
3. `payload-1024.lua`: used for testing requests per second with requests of roughly 1024 input tokens and 128 output tokens
4. `start-vllm-service.sh`: a template for starting the vLLM service
Before performing benchmarks or starting the service, you can refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_cpu.html#environment-setup) to set up our recommended runtime configurations.
### Service
A script named `/llm/start-vllm-service.sh` has been included in the image for starting the service conveniently.
Modify `model` and `served_model_name` in the script to fit your requirements. The `served_model_name` indicates the model name used in the API.
Then start the service using `bash /llm/start-vllm-service.sh`. If the service boots successfully, you should see output similar to the following figure:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/start-vllm-service.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/start-vllm-service.png" width=100%; />
</a>
#### Verify
After the service has been booted successfully, you can send a test request using `curl`. Here, `YOUR_MODEL` should be set equal to `served_model_name` in your booting script, e.g. `Qwen1.5`.
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "YOUR_MODEL",
"prompt": "San Francisco is a",
"max_tokens": 128,
"temperature": 0
}' | jq '.choices[0].text'
```
Below shows an example output using `Qwen1.5-7B-Chat` with low-bit format `sym_int4`:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/vllm-curl-result.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/vllm-curl-result.png" width=100%; />
</a>
#### Tuning
You can tune the service using the following arguments:
- `--max-model-len`
- `--max-num-batched-token`
- `--max-num-seq`
You can refer to this [doc](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#service) for a detailed explanation of these parameters.
### Benchmark
#### Online benchmark through api_server
We can benchmark the api_server to get an estimation about TPS (transactions per second). To do so, you need to start the service first according to the instructions mentioned above.
Then in the container, do the following:
1. Modify `/llm/payload-1024.lua` so that the `"model"` attribute is correct. By default, we use a prompt that is roughly 1024 tokens long; you can change it if needed.
2. Start the benchmark with `wrk` using the script below:
```bash
cd /llm
# warmup
wrk -t4 -c4 -d3m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h
# You can change -t and -c to control the concurrency.
# By default, we use 8 connections to benchmark the service.
wrk -t8 -c8 -d15m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h
```
#### Offline benchmark through benchmark_vllm_throughput.py
Please refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#performing-benchmark) on how to use `benchmark_vllm_throughput.py` for benchmarking.

View file

@@ -0,0 +1,146 @@
# vLLM Serving with IPEX-LLM on Intel GPUs via Docker
This guide demonstrates how to run `vLLM` serving with `IPEX-LLM` on Intel GPUs via Docker.
## Install docker
Follow the instructions in this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/docker_windows_gpu.html#linux) to install Docker on Linux.
## Pull the latest image
*Note: For running vLLM serving on Intel GPUs, you can currently use either the `intelanalytics/ipex-llm-serving-xpu:latest` or `intelanalytics/ipex-llm-serving-vllm-xpu:latest` Docker image.*
```bash
# This image will be updated every day
docker pull intelanalytics/ipex-llm-serving-xpu:latest
```
## Start Docker Container
To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container. Change the `/path/to/models` to mount the models.
```
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest
export CONTAINER_NAME=ipex-llm-serving-xpu-container
sudo docker run -itd \
--net=host \
--device=/dev/dri \
-v /path/to/models:/llm/models \
-e no_proxy=localhost,127.0.0.1 \
--memory="32G" \
--name=$CONTAINER_NAME \
--shm-size="16g" \
$DOCKER_IMAGE
```
After the container is booted, you could get into the container through `docker exec`.
```bash
docker exec -it ipex-llm-serving-xpu-container /bin/bash
```
To verify the device is successfully mapped into the container, run `sycl-ls` to check the result. In a machine with Arc A770, the sampled output is:
```bash
root@arda-arc12:/# sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
```
## Running vLLM serving with IPEX-LLM on Intel GPU in Docker
We have included multiple vLLM-related files in `/llm/`:
1. `vllm_offline_inference.py`: used for the vLLM offline inference example
2. `benchmark_vllm_throughput.py`: used for benchmarking throughput
3. `payload-1024.lua`: used for testing requests per second with requests of roughly 1024 input tokens and 128 output tokens
4. `start-vllm-service.sh`: a template for starting the vLLM service
Before performing benchmarks or starting the service, you can refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#runtime-configurations) to set up our recommended runtime configurations.
### Service
#### Single card serving
A script named `/llm/start-vllm-service.sh` has been included in the image for starting the service conveniently.
Modify `model` and `served_model_name` in the script to fit your requirements. The `served_model_name` indicates the model name used in the API.
Then start the service using `bash /llm/start-vllm-service.sh`. If the service boots successfully, you should see output similar to the following figure:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/start-vllm-service.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/start-vllm-service.png" width=100%; />
</a>
#### Multi-card serving
vLLM supports utilizing multiple cards through tensor parallelism.
You can refer to this [documentation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#about-tensor-parallel) on how to utilize the `tensor-parallel` feature and start the service.
#### Verify
After the service has been booted successfully, you can send a test request using `curl`. Here, `YOUR_MODEL` should be set equal to `served_model_name` in your booting script, e.g. `Qwen1.5`.
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "YOUR_MODEL",
"prompt": "San Francisco is a",
"max_tokens": 128,
"temperature": 0
}' | jq '.choices[0].text'
```
Below shows an example output using `Qwen1.5-7B-Chat` with low-bit format `sym_int4`:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/vllm-curl-result.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/vllm-curl-result.png" width=100%; />
</a>
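Since the endpoint is OpenAI-compatible, you can also query it from Python. Below is a minimal sketch using the `openai` client (v1 or later, assuming it is installed), with the same placeholder model name as the `curl` example:
```python
from openai import OpenAI

# the local server does not check the API key, but the client requires a value
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="YOUR_MODEL",  # set this to the served_model_name in your booting script
    prompt="San Francisco is a",
    max_tokens=128,
    temperature=0,
)
print(completion.choices[0].text)
```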
#### Tuning
You can tune the service using these four arguments:
- `--gpu-memory-utilization`
- `--max-model-len`
- `--max-num-batched-token`
- `--max-num-seq`
You can refer to this [doc](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#service) for a detailed explanation of these parameters.
### Benchmark
#### Online benchmark through api_server
We can benchmark the api_server to get an estimation about TPS (transactions per second). To do so, you need to start the service first according to the instructions mentioned above.
Then in the container, do the following:
1. Modify `/llm/payload-1024.lua` so that the `"model"` attribute is correct. By default, we use a prompt that is roughly 1024 tokens long; you can change it if needed.
2. Start the benchmark with `wrk` using the script below:
```bash
cd /llm
# warmup due to JIT compilation
wrk -t4 -c4 -d3m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h
# You can change -t and -c to control the concurrency.
# By default, we use 12 connections to benchmark the service.
wrk -t12 -c12 -d15m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h
```
The following figure shows performing benchmark on `Llama-2-7b-chat-hf` using the above script:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/service-benchmark-result.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/service-benchmark-result.png" width=100%; />
</a>
#### Offline benchmark through benchmark_vllm_throughput.py
Please refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#performing-benchmark) on how to use `benchmark_vllm_throughput.py` for benchmarking.

View file

@@ -0,0 +1,23 @@
# Self-Speculative Decoding
### Speculative Decoding in Practice
In [speculative](https://arxiv.org/abs/2302.01318) [decoding](https://arxiv.org/abs/2211.17192), a small (draft) model quickly generates multiple draft tokens, which are then verified in parallel by the large (target) model. While speculative decoding can effectively speed up the target model, ***in practice it is difficult to maintain or even obtain a proper draft model***, especially when the target model is finetuned with customized data.
### Self-Speculative Decoding
Built on top of the concept of “[self-speculative decoding](https://arxiv.org/abs/2309.08168)”, IPEX-LLM can now accelerate the original FP16 or BF16 model ***without the need for a separate draft model or model finetuning***; instead, it automatically converts the original model to INT4 and uses the INT4 model as the draft model behind the scenes. In practice, this brings ***~30% speedup*** for FP16 and BF16 LLM inference latency on Intel GPUs and CPUs respectively.
### Using IPEX-LLM Self-Speculative Decoding
Please refer to IPEX-LLM self-speculative decoding code snippets below, and the detailed [GPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/Speculative-Decoding) and [CPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/Speculative-Decoding) examples in the project repo.
```python
import torch
from ipex_llm.transformers import AutoModelForCausalLM

# model_path, input_ids and args below are placeholders from the surrounding example
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             optimize_model=True,
                                             torch_dtype=torch.float16,  # use torch.bfloat16 on CPU
                                             load_in_low_bit="fp16",     # use "bf16" on CPU
                                             speculative=True,           # set speculative to True
                                             trust_remote_code=True,
                                             use_cache=True)
output = model.generate(input_ids,
                        max_new_tokens=args.n_predict,
                        do_sample=False)
```

View file

@@ -0,0 +1,79 @@
# Frequently Asked Questions (FAQ)
## General Info & Concepts
### GGUF format usage with IPEX-LLM?
IPEX-LLM supports running GGUF/AWQ/GPTQ models on both [CPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Advanced-Quantizations) and [GPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations).
Please also refer to [here](https://github.com/intel-analytics/ipex-llm?tab=readme-ov-file#latest-update-) for our latest support.
## How to Resolve Errors
### Fail to install `ipex-llm` through `pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/` or `pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/`
You could try to install IPEX-LLM dependencies for Intel XPU from source archives:
- For Windows system, refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#install-ipex-llm-from-wheel) for the steps.
- For Linux system, refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#id3) for the steps.
### PyTorch is not linked with support for xpu devices
1. Before running on Intel GPUs, please make sure you've prepared your environment following the [installation instructions](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html).
2. If you are using an older version of `ipex-llm` (specifically, older than 2.5.0b20240104), you need to manually add `import intel_extension_for_pytorch as ipex` at the beginning of your code.
3. After optimizing the model with IPEX-LLM, you need to move the model to the GPU through `model = model.to('xpu')`.
4. If you have multiple GPUs, you could refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/KeyFeatures/multi_gpus_selection.html) for details about GPU selection.
5. If you do inference using the optimized model on Intel GPUs, you also need to move the input tensors with `to('xpu')`, as shown in the sketch after this list.
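Putting the points above together, here is a minimal sketch of the expected flow; the model name and prompt are only examples:
```python
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM
# for ipex-llm older than 2.5.0b20240104, also add:
# import intel_extension_for_pytorch as ipex

model_path = "meta-llama/Llama-2-7b-chat-hf"  # example model

model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
model = model.to('xpu')  # move the optimized model to the Intel GPU

tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer.encode("What is AI?", return_tensors="pt").to('xpu')  # inputs must be on xpu too
output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```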
### Import `intel_extension_for_pytorch` error on Windows GPU
Please refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#error-loading-intel-extension-for-pytorch) for a detailed guide. We list the possible missing requirements in your environment that could lead to this error.
### XPU device count is zero
It's recommended to reinstall driver:
- For Windows system, refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#prerequisites) for the steps.
- For Linux system, refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#id1) for the steps.
### Error such as `The size of tensor a (33) must match the size of tensor b (17) at non-singleton dimension 2` during the attention forward function
If you are using the IPEX-LLM PyTorch API, please try setting `optimize_llm=False` manually when calling the `optimize_model` function to work around it. As for the IPEX-LLM `transformers`-style API, try setting `optimize_model=False` manually when calling the `from_pretrained` function, as sketched below.
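A minimal sketch of both workarounds (the model path is a placeholder):
```python
# PyTorch API: `model` is a PyTorch model you have already loaded
from ipex_llm import optimize_model
model = optimize_model(model, optimize_llm=False)

# transformers-style API: disable model optimization at load time
from ipex_llm.transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("/path/to/model", load_in_4bit=True,
                                             optimize_model=False)
```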
### ValueError: Unrecognized configuration class
This error is not quite relevant to IPEX-LLM. It could be that you are using the incorrect AutoClass, that the transformers version is not up to date, or that transformers does not support loading this model with AutoClasses. You need to refer to the model card on Hugging Face to confirm this information. Besides, if you load the model from a local path, please also make sure you have downloaded the complete model files.
### `mixed dtype (CPU): expect input to have scalar type of BFloat16` during inference
You could solve this error by converting the optimized model to `bf16` through `model.to(torch.bfloat16)` before inference.
### Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)
This error is caused by running out of GPU memory. Some possible ways to decrease GPU memory usage:
1. If you run several models sequentially, please make sure you release the GPU memory of the previous model with `del model` in time.
2. You could use `model = model.half()` or `model = model.bfloat16()` before moving the model to the GPU to use less GPU memory.
3. You could try setting `cpu_embedding=True` when calling `from_pretrained` of the AutoClass or the `optimize_model` function (see the sketch after this list).
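A sketch combining these options (the model path is a placeholder):
```python
import gc
from ipex_llm.transformers import AutoModelForCausalLM

# option 3: keep the memory-intensive embedding layer on the CPU at load time
model = AutoModelForCausalLM.from_pretrained("/path/to/model", load_in_4bit=True,
                                             cpu_embedding=True)

# option 2: use half precision before moving the model to the GPU
model = model.half()  # or model.bfloat16()
model = model.to('xpu')

# ... run inference ...

# option 1: release the GPU memory before loading the next model
del model
gc.collect()
```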
### Failed to enable AMX
You could use `export BIGDL_LLM_AMX_DISABLED=1` to disable AMX manually and solve this error.
### oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
You may encounter this error during finetuning on multi GPUs. Please try `sudo apt install level-zero-dev` to fix it.
### Random and unreadable output of Gemma-7b-it on Arc 770 Ubuntu 22.04 due to driver and oneAPI mismatch
A mismatch between the driver and oneAPI versions can lead to errors when IPEX-LLM uses XMX (for short prompts) to speed up inference.
The output of `What's AI?` may look like the following:
```
wiedzy Artificial Intelligence meliti: Artificial Intelligence undenti beng beng beng beng beng beng beng beng beng beng beng beng beng beng beng beng beng beng beng beng beng beng
```
If you meet this error, please check your driver and oneAPI versions with `sudo apt list --installed | egrep "intel-basekit|intel-level-zero-gpu"`.
Make sure `intel-basekit>=2024.0.1-43` and `intel-level-zero-gpu>=1.3.27191.42-775~22.04`.
### Too many open files
You may encounter this error during finetuning, especially when running a 70B model. Please raise the system open file limit using `ulimit -n 1048576`.
### `RuntimeError: could not create a primitive` on Windows
This error may happen when multiple GPUs exist on Windows. To solve this error, you can open Device Manager (search "Device Manager" in your start menu), click the "Display adapters" option, and disable all the GPU devices you do not want to use. Restart your computer and try again; IPEX-LLM should work fine this time.

View file

@@ -0,0 +1,40 @@
# CLI (Command Line Interface) Tool
```eval_rst
.. note::
Currently ``ipex-llm`` CLI supports *LLaMA* (e.g., vicuna), *GPT-NeoX* (e.g., redpajama), *BLOOM* (e.g., phoenix) and *GPT2* (e.g., starcoder) model architectures; for other models, you may use the ``transformers``-style or LangChain APIs.
```
## Convert Model
You may convert the downloaded model into native INT4 format using `llm-convert`.
```bash
# convert PyTorch (fp16 or fp32) model;
# llama/bloom/gptneox/starcoder model family is currently supported
llm-convert "/path/to/model/" --model-format pth --model-family "bloom" --outfile "/path/to/output/"
# convert GPTQ-4bit model
# only llama model family is currently supported
llm-convert "/path/to/model/" --model-format gptq --model-family "llama" --outfile "/path/to/output/"
```
## Run Model
You may run the converted model using `llm-cli` or `llm-chat` (built on top of `main.cpp` in [`llama.cpp`](https://github.com/ggerganov/llama.cpp)).
```bash
# help
# llama/bloom/gptneox/starcoder model family is currently supported
llm-cli -x gptneox -h
# text completion
# llama/bloom/gptneox/starcoder model family is currently supported
llm-cli -t 16 -x gptneox -m "/path/to/output/model.bin" -p 'Once upon a time,'
# chat mode
# llama/gptneox model family is currently supported
llm-chat -m "/path/to/output/model.bin" -x llama
```

View file

@@ -0,0 +1,64 @@
# Finetune (QLoRA)
We also support finetuning LLMs (large language models) using QLoRA with IPEX-LLM 4bit optimizations on Intel GPUs.
```eval_rst
.. note::
Currently, QLoRA finetuning is only supported for Hugging Face Transformers models.
```
To help you better understand the finetuning process, here we use model [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) as an example.
**Make sure you have prepared environment following instructions [here](../install_gpu.html).**
```eval_rst
.. note::
If you are using an older version of ``ipex-llm`` (specifically, older than 2.5.0b20240104), you need to manually add ``import intel_extension_for_pytorch as ipex`` at the beginning of your code.
```
First, load the model using the `transformers`-style API and **move it to the GPU with `to('xpu')`**. We specify `load_in_low_bit="nf4"` here to apply 4-bit NormalFloat optimization. According to the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf), using `"nf4"` could yield better model quality than `"int4"`.
```python
import torch
from ipex_llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                             load_in_low_bit="nf4",
                                             optimize_model=False,
                                             torch_dtype=torch.float16,
                                             modules_to_not_convert=["lm_head"])
model = model.to('xpu')
```
Then, we have to apply some preprocessing to the model to prepare it for training.
```python
from ipex_llm.transformers.qlora import prepare_model_for_kbit_training
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
```
Next, we can obtain a Peft model from the optimized model and a configuration object containing the parameters as follows:
```python
from ipex_llm.transformers.qlora import get_peft_model
from peft import LoraConfig
config = LoraConfig(r=8,
lora_alpha=32,
target_modules=["q_proj", "k_proj", "v_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM")
model = get_peft_model(model, config)
```
```eval_rst
.. important::
Instead of ``from peft import prepare_model_for_kbit_training, get_peft_model`` as we did for regular QLoRA using bitsandbytes and CUDA, we import them from ``ipex_llm.transformers.qlora`` here to get an IPEX-LLM compatible Peft model. The rest is just the same as the regular LoRA finetuning process using ``peft``.
```
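From here the workflow is the standard `peft`/`transformers` training loop. Below is a minimal sketch, assuming a tokenized dataset `train_data` and a `tokenizer` (both hypothetical names) have already been prepared:
```python
import transformers

trainer = transformers.Trainer(
    model=model,
    train_dataset=train_data,  # hypothetical tokenized dataset
    args=transformers.TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=20,
        max_steps=200,
        learning_rate=2e-4,
        logging_steps=20,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # avoid warnings during training; re-enable for inference
trainer.train()
```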
```eval_rst
.. seealso::
See the complete examples `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU>`_
```

View file

@@ -0,0 +1,14 @@
GPU Supports
================================
IPEX-LLM not only supports running large language models for inference, but also supports QLoRA finetuning on Intel GPUs.
* |inference_on_gpu|_
* `Finetune (QLoRA) <./finetune.html>`_
* `Multi GPUs selection <./multi_gpus_selection.html>`_
.. |inference_on_gpu| replace:: Inference on GPU
.. _inference_on_gpu: ./inference_on_gpu.html
.. |multi_gpus_selection| replace:: Multi GPUs selection
.. _multi_gpus_selection: ./multi_gpus_selection.html

View file

@@ -0,0 +1,54 @@
# Hugging Face ``transformers`` Format
## Load in Low Precision
You may apply INT4 optimizations to any Hugging Face *Transformers* models as follows:
```python
# load Hugging Face Transformers model with INT4 optimizations
from ipex_llm.transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)
```
After loading the Hugging Face *Transformers* model, you may easily run the optimized model as follows:
```python
# run the optimized model
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer.encode(input_str, ...)
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids)
```
```eval_rst
.. seealso::
See the complete CPU examples `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels>`_ and GPU examples `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels>`_.
.. note::
You may apply more low bit optimizations (including INT8, INT5 and INT4) as follows:
.. code-block:: python
model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int5")
See the CPU example `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/More-Data-Types>`_ and GPU example `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/More-Data-Types>`_.
```
## Save & Load
After the model is optimized using INT4 (or INT8/INT5), you may save and load the optimized model as follows:
```python
model.save_low_bit(model_path)
new_model = AutoModelForCausalLM.load_low_bit(model_path)
```
```eval_rst
.. seealso::
See the CPU example `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Save-Load>`_ and GPU example `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Save-Load>`_
```

View file

@@ -0,0 +1,33 @@
IPEX-LLM Key Features
================================
You may run the LLMs using ``ipex-llm`` through one of the following APIs:
* `PyTorch API <./optimize_model.html>`_
* |transformers_style_api|_
* |hugging_face_transformers_format|_
* `Native Format <./native_format.html>`_
* `LangChain API <./langchain_api.html>`_
* |gpu_supports|_
* |inference_on_gpu|_
* `Finetune (QLoRA) <./finetune.html>`_
* `Multi GPUs selection <./multi_gpus_selection.html>`_
.. |transformers_style_api| replace:: ``transformers``-style API
.. _transformers_style_api: ./transformers_style_api.html
.. |hugging_face_transformers_format| replace:: Hugging Face ``transformers`` Format
.. _hugging_face_transformers_format: ./hugging_face_format.html
.. |gpu_supports| replace:: GPU Supports
.. _gpu_supports: ./gpu_supports.html
.. |inference_on_gpu| replace:: Inference on GPU
.. _inference_on_gpu: ./inference_on_gpu.html
.. |multi_gpus_selection| replace:: Multi GPUs selection
.. _multi_gpus_selection: ./multi_gpus_selection.html

View file

@ -0,0 +1,128 @@
# Inference on GPU
Apart from the significant acceleration capabilities on Intel CPUs, IPEX-LLM also supports optimizations and acceleration for running LLMs (large language models) on Intel GPUs. With IPEX-LLM, PyTorch models (in FP16/BF16/FP32) can be optimized with low-bit quantizations (supported precisions include INT4, INT5, INT8, etc.).
Compared with running on Intel CPUs, some additional operations are required on Intel GPUs. To help you better understand the process, here we use a popular model [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) as an example.
**Make sure you have prepared environment following instructions [here](../install_gpu.html).**
```eval_rst
.. note::
If you are using an older version of ``ipex-llm`` (specifically, older than 2.5.0b20240104), you need to manually add ``import intel_extension_for_pytorch as ipex`` at the beginning of your code.
```
## Load and Optimize Model
You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-style API](./transformers_style_api.html) on Intel GPUs according to your preference.
**Once you have obtained the model with IPEX-LLM low-bit optimization, move it to the GPU with `to('xpu')`.**
```eval_rst
.. tabs::
.. tab:: PyTorch API
You could optimize any PyTorch model with a one-line code change, and the loading and optimizing process on Intel GPUs may be as follows:
.. code-block:: python
# Take Llama-2-7b-chat-hf as an example
from transformers import LlamaForCausalLM
from ipex_llm import optimize_model
model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', torch_dtype='auto', low_cpu_mem_usage=True)
model = optimize_model(model) # With only one line to enable IPEX-LLM INT4 optimization
model = model.to('xpu') # Important after obtaining the optimized model
.. tip::
For Windows users running LLMs on Intel iGPUs, we recommend setting ``cpu_embedding=True`` in the ``optimize_model`` function. This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
See the `API doc <../../../PythonAPI/LLM/optimize.html#ipex_llm.optimize_model>`_ for ``optimize_model`` to find more information.
In particular, if you have saved the optimized model following the steps `here <./optimize_model.html#save>`_, the loading process on Intel GPUs may be as follows:
.. code-block:: python
from transformers import LlamaForCausalLM
from ipex_llm.optimize import low_memory_init, load_low_bit
saved_dir='./llama-2-ipex-llm-4-bit'
with low_memory_init(): # Fast and low cost by loading model on meta device
model = LlamaForCausalLM.from_pretrained(saved_dir,
torch_dtype="auto",
trust_remote_code=True)
model = load_low_bit(model, saved_dir) # Load the optimized model
model = model.to('xpu') # Important after obtaining the optimized model
.. tab:: ``transformers``-style API
You could run any Hugging Face Transformers model with the ``transformers``-style API, and the loading and optimizing process on Intel GPUs may be as follows:
.. code-block:: python
# Take Llama-2-7b-chat-hf as an example
from ipex_llm.transformers import AutoModelForCausalLM
# Load model in 4 bit, which converts the relevant layers in the model into INT4 format
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', load_in_4bit=True)
model = model.to('xpu') # Important after obtaining the optimized model
.. tip::
For Windows users running LLMs on Intel iGPUs, we recommend setting ``cpu_embedding=True`` in the ``from_pretrained`` function. This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
See the `API doc <../../../PythonAPI/LLM/transformers.html#hugging-face-transformers-automodel>`_ to find more information.
In particular, if you have saved the optimized model following the steps `here <./hugging_face_format.html#save-load>`_, the loading process on Intel GPUs may be as follows:
.. code-block:: python
from ipex_llm.transformers import AutoModelForCausalLM
saved_dir='./llama-2-ipex-llm-4-bit'
model = AutoModelForCausalLM.load_low_bit(saved_dir) # Load the optimized model
model = model.to('xpu') # Important after obtaining the optimized model
.. tip::
For Windows users running saved optimized models on Intel iGPUs, we also recommend setting ``cpu_embedding=True`` in the ``load_low_bit`` function.
```
## Run Optimized Model
You could then do inference using the optimized model on Intel GPUs in almost the same way as on CPUs. **The only difference is to set `to('xpu')` for input tensors.**
Continuing with the [example of Llama-2-7b-chat-hf](#load-and-optimize-model), you may run inference as follows:
```python
import torch
from transformers import LlamaTokenizer

# load the tokenizer corresponding to Llama-2-7b-chat-hf
tokenizer = LlamaTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')

with torch.inference_mode():
    prompt = 'Q: What is CPU?\nA:'
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu') # With .to('xpu') specifically for inference on Intel GPUs
    output = model.generate(input_ids, max_new_tokens=32)
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
```
```eval_rst
.. note::
The initial generation of optimized LLMs on Intel GPUs could be slow. Therefore, it's recommended to perform a **warm-up** run before the actual generation.
```
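For reference, a warm-up could simply be a throw-away `generate` call before the measured one. The sketch below is an illustration only and assumes the `model`, `tokenizer` and `input_ids` from the example above:
```python
import time
import torch

with torch.inference_mode():
    # warm-up run: the first generation triggers GPU kernel compilation and caching
    _ = model.generate(input_ids, max_new_tokens=32)

    # actual generation, now reflecting steady-state performance
    start = time.time()
    output = model.generate(input_ids, max_new_tokens=32)
    print(f"Generation took {time.time() - start:.2f} seconds")
```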
```eval_rst
.. note::
If you are a Windows user, please also note that for **the first time** that **each model** runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
```
```eval_rst
.. seealso::
See the complete examples `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU>`_
```

View file

@ -0,0 +1,57 @@
# LangChain API
You may run the models using the LangChain API in `ipex-llm`.
## Using Hugging Face `transformers` INT4 Format
You may run any Hugging Face *Transformers* model (with INT4 optimizations applied) using the LangChain API as follows:
```python
from ipex_llm.langchain.llms import TransformersLLM
from ipex_llm.langchain.embeddings import TransformersEmbeddings
from langchain.chains.question_answering import load_qa_chain
embeddings = TransformersEmbeddings.from_model_id(model_id=model_path)
ipex_llm = TransformersLLM.from_model_id(model_id=model_path, ...)
doc_chain = load_qa_chain(ipex_llm, ...)
output = doc_chain.run(...)
```
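To make the skeleton above more concrete, below is a minimal question-answering sketch. The model path, the sample document, the question text and the `chain_type="stuff"` argument are illustrative assumptions rather than part of the official example:
```python
from ipex_llm.langchain.llms import TransformersLLM
from langchain.chains.question_answering import load_qa_chain
from langchain.schema import Document

model_path = '/path/to/model/'  # hypothetical local model path

# load the INT4-optimized LLM through the ipex-llm LangChain wrapper
ipex_llm = TransformersLLM.from_model_id(model_id=model_path)

# "stuff" simply concatenates the documents into the prompt (assumed here for illustration)
doc_chain = load_qa_chain(ipex_llm, chain_type="stuff")

docs = [Document(page_content="IPEX-LLM accelerates LLM inference on Intel CPUs and GPUs.")]
output = doc_chain.run(input_documents=docs, question="What does IPEX-LLM do?")
print(output)
```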
```eval_rst
.. seealso::
See the examples `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/LangChain/transformers_int4>`_.
```
## Using Native INT4 Format
You may also convert Hugging Face *Transformers* models into native INT4 format, and then run the converted models using the LangChain API as follows.
```eval_rst
.. note::
* Currently only llama/bloom/gptneox/starcoder model families are supported; for other models, you may use the Hugging Face ``transformers`` INT4 format as described `above <./langchain_api.html#using-hugging-face-transformers-int4-format>`_.
* You may choose the corresponding API developed for specific native models to load the converted model.
```
```python
from ipex_llm.langchain.llms import LlamaLLM
from ipex_llm.langchain.embeddings import LlamaEmbeddings
from langchain.chains.question_answering import load_qa_chain
# switch to GptneoxEmbeddings/BloomEmbeddings/StarcoderEmbeddings to load other models
embeddings = LlamaEmbeddings(model_path='/path/to/converted/model.bin')
# switch to GptneoxLLM/BloomLLM/StarcoderLLM to load other models
ipex_llm = LlamaLLM(model_path='/path/to/converted/model.bin')
doc_chain = load_qa_chain(ipex_llm, ...)
doc_chain.run(...)
```
```eval_rst
.. seealso::
See the examples `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/LangChain/native_int4>`_.
```

View file

@ -0,0 +1,86 @@
# Multi Intel GPUs selection
In [Inference on GPU](inference_on_gpu.md) and [Finetune (QLoRA)](finetune.md), you have learned how to run inference and finetuning on Intel GPUs. In this section, we will show you two approaches to select GPU devices.
## List devices
The `sycl-ls` tool enumerates the devices available in the system. You can use it after you set up the oneAPI environment:
```eval_rst
.. tabs::
.. tab:: Windows
Please make sure you are using CMD (Miniforge Prompt if using conda):
.. code-block:: cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
sycl-ls
.. tab:: Linux
.. code-block:: bash
source /opt/intel/oneapi/setvars.sh
sycl-ls
```
If you have two Arc A770 GPUs, you may get output like the following:
```
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) i9-14900K 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[opencl:gpu:4] Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 770 3.0 [23.17.26241.33]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:2] Intel(R) Level-Zero, Intel(R) UHD Graphics 770 1.3 [1.3.26241]
```
This output shows there are two Arc A770 GPUs as well as an Intel iGPU on this machine.
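If you prefer to enumerate the devices from Python rather than `sycl-ls`, a small sketch like the following should work once `intel_extension_for_pytorch` is installed (the device names printed depend on your machine):
```python
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' backend

# list the XPU devices visible to PyTorch
for i in range(torch.xpu.device_count()):
    print(f"xpu:{i} -> {torch.xpu.get_device_name(i)}")
```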
## Devices selection
To run on an XPU device, move both your model and input tensors to the XPU as shown below:
```python
model = model.to('xpu')
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
```
There are two ways to select the desired device: one is changing the code, the other is setting an environment variable.
### 1. Select device in Python
To specify an XPU, change `to('xpu')` to `to('xpu:[device_id]')`, where `device_id` is counted from zero.
If you want to use the second device, you can change the code like this:
```python
model = model.to('xpu:1')
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu:1')
```
### 2. OneAPI device selector
The device selection environment variable `ONEAPI_DEVICE_SELECTOR` can be used to limit the choice of Intel GPU devices. As the `sycl-ls` output above shows, the last three lines are three Level Zero GPU devices, so we can use `ONEAPI_DEVICE_SELECTOR=level_zero:[gpu_id]` to select devices.
For example, if you want to use the second A770 GPU, you can run your Python script like this:
```eval_rst
.. tabs::
.. tab:: Windows
.. code-block:: cmd
set ONEAPI_DEVICE_SELECTOR=level_zero:1
python generate.py
Through ``set ONEAPI_DEVICE_SELECTOR=level_zero:1``, only the second A770 GPU will be available in the current environment.
.. tab:: Linux
.. code-block:: bash
ONEAPI_DEVICE_SELECTOR=level_zero:1 python generate.py
``ONEAPI_DEVICE_SELECTOR=level_zero:1`` in the above command only affects the current Python program. Alternatively, you can export the environment variable and then run your Python script:
.. code-block:: bash
export ONEAPI_DEVICE_SELECTOR=level_zero:1
python generate.py
```

View file

@ -0,0 +1,32 @@
# Native Format
You may also convert Hugging Face *Transformers* models into native INT4 format for maximum performance as follows.
```eval_rst
.. note::
Currently only llama/bloom/gptneox/starcoder/chatglm model families are supported; you may use the corresponding API to load the converted model. (For other models, you can use the Hugging Face ``transformers`` format as described `here <./hugging_face_format.html>`_).
```
```python
# convert the model
from ipex_llm import llm_convert
ipex_llm_path = llm_convert(model='/path/to/model/',
outfile='/path/to/output/', outtype='int4', model_family="llama")
# load the converted model
# switch to ChatGLMForCausalLM/GptneoxForCausalLM/BloomForCausalLM/StarcoderForCausalLM to load other models
from ipex_llm.transformers import LlamaForCausalLM
llm = LlamaForCausalLM.from_pretrained("/path/to/output/model.bin", native=True, ...)
# run the converted model
input_ids = llm.tokenize(prompt)
output_ids = llm.generate(input_ids, ...)
output = llm.batch_decode(output_ids)
```
```eval_rst
.. seealso::
See the complete example `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/Native-Models>`_
```

View file

@ -0,0 +1,69 @@
## PyTorch API
In general, you just need a one-line `optimize_model` call to easily optimize any loaded PyTorch model, regardless of the library or API you are using. With IPEX-LLM, PyTorch models (in FP16/BF16/FP32) can be optimized with low-bit quantizations (supported precisions include INT4, INT5, INT8, etc.).
### Optimize model
First, use any PyTorch APIs you like to load your model. To help you better understand the process, here we use [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) library `LlamaForCausalLM` to load a popular model [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) as an example:
```python
# Create or load any Pytorch model, take Llama-2-7b-chat-hf as an example
from transformers import LlamaForCausalLM
model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', torch_dtype='auto', low_cpu_mem_usage=True)
```
Then, you just need to call `optimize_model` to optimize the loaded model; INT4 optimization is applied to the model by default:
```python
from ipex_llm import optimize_model
# With only one line to enable IPEX-LLM INT4 optimization
model = optimize_model(model)
```
After optimizing the model, IPEX-LLM does not require any change in the inference code. You can use any libraries to run the optimized model with very low latency.
### More Precisions
In [Optimize Model](#optimize-model) above, symmetric INT4 optimization is applied by default. You may apply other low bit optimizations (INT5, INT8, etc.) by specifying the ``low_bit`` parameter.
Currently, ``low_bit`` supports options 'sym_int4', 'asym_int4', 'sym_int5', 'asym_int5' or 'sym_int8', in which 'sym' and 'asym' differentiate between symmetric and asymmetric quantization. Symmetric quantization allocates bits for positive and negative values equally, whereas asymmetric quantization allows different bit allocations for positive and negative values.
You may apply symmetric INT8 optimization as follows:
```python
from ipex_llm import optimize_model
# Apply symmetric INT8 optimization
model = optimize_model(model, low_bit="sym_int8")
```
### Save & Load Optimized Model
The loading process of the original model may be time-consuming and memory-intensive. For example, the [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) model is stored with float16 precision, resulting in large memory usage when loaded using `LlamaForCausalLM`. To avoid high resource consumption and to expedite the loading process, you can use `save_low_bit` to store the model after low-bit optimization. Then, in subsequent uses, you can opt to use the `load_low_bit` API to directly load the optimized model. Besides, the saving and loading operations are platform-independent, regardless of operating system.
#### Save
Continuing with the [example of Llama-2-7b-chat-hf](#optimize-model), we can save the previously optimized model as follows:
```python
saved_dir='./llama-2-ipex-llm-4-bit'
model.save_low_bit(saved_dir)
```
#### Load
We recommend using the context manager `low_memory_init` to quickly initiate a model instance with low cost, and then using `load_low_bit` to load the optimized low-bit model as follows:
```python
from ipex_llm.optimize import low_memory_init, load_low_bit
with low_memory_init(): # Fast and low cost by loading model on meta device
model = LlamaForCausalLM.from_pretrained(saved_dir,
torch_dtype="auto",
trust_remote_code=True)
model = load_low_bit(model, saved_dir) # Load the optimized model
```
```eval_rst
.. seealso::
* Please refer to the `API documentation <https://ipex-llm.readthedocs.io/en/latest/doc/PythonAPI/LLM/optimize.html>`_ for more details.
* We also provide detailed examples on how to run PyTorch models (e.g., Openai Whisper, LLaMA2, ChatGLM2, Falcon, MPT, Baichuan2, etc.) using IPEX-LLM. See the complete CPU examples `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/PyTorch-Models>`_ and GPU examples `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models>`_.
```

View file

@ -0,0 +1,10 @@
``transformers``-style API
================================
You may run the LLMs using ``transformers``-style API in ``ipex-llm``.
* |hugging_face_transformers_format|_
* `Native Format <./native_format.html>`_
.. |hugging_face_transformers_format| replace:: Hugging Face ``transformers`` Format
.. _hugging_face_transformers_format: ./hugging_face_format.html

View file

@ -0,0 +1,9 @@
IPEX-LLM Examples
================================
You can use IPEX-LLM to run any PyTorch model with INT4 optimizations on Intel XPU (from Laptop to GPU to Cloud).
Here, we provide examples to help you quickly get started using IPEX-LLM to run some popular open-source models in the community. Please refer to the appropriate guide based on your device:
* `CPU <./examples_cpu.html>`_
* `GPU <./examples_gpu.html>`_

View file

@ -0,0 +1,64 @@
# IPEX-LLM Examples: CPU
Here, we provide some examples on how you could apply IPEX-LLM INT4 optimizations on popular open-source models in the community.
To run these examples, please first refer to [here](./install_cpu.html) for more information about how to install ``ipex-llm``, requirements and best practices for setting up your environment.
The following models have been verified on either servers or laptops with Intel CPUs.
## Example of PyTorch API
| Model | Example of PyTorch API |
|------------|-------------------------------------------------------|
| LLaMA 2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/PyTorch-Models/Model/llama2) |
| ChatGLM | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/PyTorch-Models/Model/chatglm) |
| Mistral | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/PyTorch-Models/Model/mistral) |
| Bark | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/PyTorch-Models/Model/bark) |
| BERT | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/PyTorch-Models/Model/bert) |
| Openai Whisper | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/PyTorch-Models/Model/openai-whisper) |
```eval_rst
.. important::
In addition to INT4 optimization, IPEX-LLM also provides other low bit optimizations (such as INT8, INT5, NF4, etc.). You may apply other low bit optimizations through PyTorch API as `example <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/PyTorch-Models/More-Data-Types>`_.
```
## Example of `transformers`-style API
| Model | Example of `transformers`-style API |
|------------|-------------------------------------------------------|
| LLaMA *(such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.)* | [link1](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/Native-Models), [link2](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/vicuna) |
| LLaMA 2 | [link1](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/Native-Models), [link2](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/llama2) |
| ChatGLM | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/chatglm) |
| ChatGLM2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/chatglm2) |
| Mistral | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/mistral) |
| Falcon | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/falcon) |
| MPT | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/mpt) |
| Dolly-v1 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/dolly_v1) |
| Dolly-v2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/dolly_v2) |
| Replit Code| [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/replit) |
| RedPajama | [link1](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/Native-Models), [link2](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/redpajama) |
| Phoenix | [link1](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/Native-Models), [link2](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/phoenix) |
| StarCoder | [link1](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/Native-Models), [link2](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/starcoder) |
| Baichuan | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/baichuan) |
| Baichuan2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/baichuan2) |
| InternLM | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/internlm) |
| Qwen | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen) |
| Aquila | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/aquila) |
| MOSS | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/moss) |
| Whisper | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/whisper) |
```eval_rst
.. important::
In addition to INT4 optimization, IPEX-LLM also provides other low bit optimizations (such as INT8, INT5, NF4, etc.). You may apply other low bit optimizations through ``transformers``-style API as `example <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/More-Data-Types>`_.
```
```eval_rst
.. seealso::
See the complete examples `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU>`_.
```

View file

@ -0,0 +1,70 @@
# IPEX-LLM Examples: GPU
Here, we provide some examples on how you could apply IPEX-LLM INT4 optimizations on popular open-source models in the community.
To run these examples, please first refer to [here](./install_gpu.html) for more information about how to install ``ipex-llm``, requirements and best practices for setting up your environment.
```eval_rst
.. important::
Only Linux systems are supported now; Ubuntu 22.04 is preferred.
```
The following models have been verified on either servers or laptops with Intel GPUs.
## Example of PyTorch API
| Model | Example of PyTorch API |
|------------|-------------------------------------------------------|
| LLaMA 2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/Model/llama2) |
| ChatGLM 2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/Model/chatglm2) |
| Mistral | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/Model/mistral) |
| Baichuan | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/Model/baichuan) |
| Baichuan2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/Model/baichuan2) |
| Replit | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/Model/replit) |
| StarCoder | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/Model/starcoder) |
| Dolly-v1 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/Model/dolly-v1) |
| Dolly-v2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/Model/dolly-v2) |
```eval_rst
.. important::
In addition to INT4 optimization, IPEX-LLM also provides other low bit optimizations (such as INT8, INT5, NF4, etc.). You may apply other low bit optimizations through PyTorch API as `example <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/More-Data-Types>`_.
```
## Example of `transformers`-style API
| Model | Example of `transformers`-style API |
|------------|-------------------------------------------------------|
| LLaMA *(such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.)* |[link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/vicuna)|
| LLaMA 2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2) |
| ChatGLM2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm2) |
| Mistral | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral) |
| Falcon | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon) |
| MPT | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/mpt) |
| Dolly-v1 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/dolly_v1) |
| Dolly-v2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/dolly_v2) |
| Replit | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/replit) |
| StarCoder | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/starcoder) |
| Baichuan | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/baichuan) |
| Baichuan2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan2) |
| InternLM | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm) |
| Qwen | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen) |
| Aquila | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila) |
| Whisper | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/whisper) |
| Chinese Llama2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chinese-llama2) |
| GPT-J | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/gpt-j) |
```eval_rst
.. important::
In addition to INT4 optimization, IPEX-LLM also provides other low bit optimizations (such as INT8, INT5, NF4, etc.). You may apply other low bit optimizations through ``transformers``-style API as `example <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/More-Data-Types>`_.
```
```eval_rst
.. seealso::
See the complete examples `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU>`_.
```

View file

@ -0,0 +1,7 @@
IPEX-LLM Installation
================================
Here, we provide instructions on how to install ``ipex-llm`` and best practices for setting up your environment. Please refer to the appropriate guide based on your device:
* `CPU <./install_cpu.html>`_
* `GPU <./install_gpu.html>`_

View file

@ -0,0 +1,100 @@
# IPEX-LLM Installation: CPU
## Quick Installation
Install IPEX-LLM for CPU support using pip through:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
.. tab:: Windows
.. code-block:: cmd
pip install --pre --upgrade ipex-llm[all]
```
Please refer to [Environment Setup](#environment-setup) for more information.
```eval_rst
.. note::
The ``all`` option will trigger installation of all the dependencies for common LLM application development.
.. important::
``ipex-llm`` is tested with Python 3.9, 3.10 and 3.11; Python 3.11 is recommended for best practices.
```
## Recommended Requirements
Here we list the recommended hardware and OS for a smooth IPEX-LLM optimization experience on CPU:
* Hardware
* PCs equipped with 12th Gen Intel® Core™ processor or higher, and at least 16GB RAM
* Servers equipped with Intel® Xeon® processors and at least 32GB RAM
* Operating System
* Ubuntu 20.04 or later
* CentOS 7 or later
* Windows 10/11, with or without WSL
## Environment Setup
For optimal performance with LLM models using IPEX-LLM optimizations on Intel CPUs, here are some best practices for setting up your environment:
First, we recommend using [Conda](https://conda-forge.org/download/) to create a Python 3.11 environment:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
conda create -n llm python=3.11
conda activate llm
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
.. tab:: Windows
.. code-block:: cmd
conda create -n llm python=3.11
conda activate llm
pip install --pre --upgrade ipex-llm[all]
```
Then, to run an LLM with IPEX-LLM optimizations (taking an `example.py` as an example; a minimal sketch of such a script is shown after the tabs below):
```eval_rst
.. tabs::
.. tab:: Client
It is recommended to run directly with full utilization of all CPU cores:
.. code-block:: bash
python example.py
.. tab:: Server
It is recommended to run with all the physical cores of a single socket:
.. code-block:: bash
# e.g. for a server with 48 cores per socket
export OMP_NUM_THREADS=48
numactl -C 0-47 -m 0 python example.py
```
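For reference, a minimal `example.py` might look like the following sketch; the model id and prompt here are placeholders, so substitute any model supported by `ipex-llm`:
```python
# example.py: a minimal sketch of running an INT4-optimized model with ipex-llm on CPU
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import LlamaTokenizer

model_path = "openlm-research/open_llama_3b_v2"  # placeholder model id

# load the model with INT4 optimization, plus its tokenizer
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
tokenizer = LlamaTokenizer.from_pretrained(model_path)

prompt = "Q: What is CPU?\nA:"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```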

View file

@ -0,0 +1,666 @@
# IPEX-LLM Installation: GPU
## Windows
### Prerequisites
IPEX-LLM on Windows supports Intel iGPU and dGPU.
```eval_rst
.. important::
IPEX-LLM on Windows only supports PyTorch 2.1.
```
To apply Intel GPU acceleration, please first verify your GPU driver version.
```eval_rst
.. note::
The GPU driver version of your device can be checked in the "Task Manager" -> GPU 0 (or GPU 1, etc.) -> Driver version.
```
If you have driver version lower than `31.0.101.5122`, it is recommended to [**update your GPU driver to the latest**](https://www.intel.com/content/www/us/en/download/785597/intel-arc-iris-xe-graphics-windows.html):
<!-- Intel® oneAPI Base Toolkit 2024.0 installation methods:
```eval_rst
.. tabs::
.. tab:: Offline installer
Download and install `Intel® oneAPI Base Toolkit <https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html?operatingsystem=window&distributions=offline>`_ version 2024.0 through Offline Installer.
During installation, you could just continue with "Recommended Installation". If you would like to continue with "Custom Installation", please note that oneAPI Deep Neural Network Library, oneAPI Math Kernel Library, and oneAPI DPC++/C++ Compiler are required, the other components are optional.
.. tab:: PIP installer
Pip install oneAPI in your working conda environment.
.. code-block:: bash
pip install dpcpp-cpp-rt==2024.0.2 mkl-dpcpp==2024.0.0 onednn==2024.0.0
.. note::
Activating your working conda environment will automatically configure oneAPI environment variables.
``` -->
### Install IPEX-LLM
#### Install IPEX-LLM From PyPI
We recommend using [Miniforge](https://conda-forge.org/download/) to create a Python 3.11 environment.
```eval_rst
.. important::
``ipex-llm`` is tested with Python 3.9, 3.10 and 3.11. Python 3.11 is recommended for best practices.
```
The easiest way to install `ipex-llm` is through the following commands, choosing either the US or CN website for `extra-index-url`:
```eval_rst
.. tabs::
.. tab:: US
.. code-block:: cmd
conda create -n llm python=3.11 libuv
conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
.. tab:: CN
.. code-block:: cmd
conda create -n llm python=3.11 libuv
conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/
```
#### Install IPEX-LLM From Wheel
If you encounter network issues when installing IPEX, you can also install IPEX-LLM dependencies for Intel XPU from the wheel archives. First you need to download and install torch/torchvision/ipex from the wheels listed below before installing `ipex-llm`.
Download the wheels on Windows system:
```
wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/torch-2.1.0a0%2Bcxx11.abi-cp311-cp311-win_amd64.whl
wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/torchvision-0.16.0a0%2Bcxx11.abi-cp311-cp311-win_amd64.whl
wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/intel_extension_for_pytorch-2.1.10%2Bxpu-cp311-cp311-win_amd64.whl
```
You may install dependencies directly from the wheel archives and then install `ipex-llm` using following commands:
```
pip install torch-2.1.0a0+cxx11.abi-cp311-cp311-win_amd64.whl
pip install torchvision-0.16.0a0+cxx11.abi-cp311-cp311-win_amd64.whl
pip install intel_extension_for_pytorch-2.1.10+xpu-cp311-cp311-win_amd64.whl
pip install --pre --upgrade ipex-llm[xpu]
```
```eval_rst
.. note::
All the wheel packages mentioned here are for Python 3.11. If you would like to use Python 3.9 or 3.10, you should modify the wheel names for ``torch``, ``torchvision``, and ``intel_extension_for_pytorch`` by replacing ``cp311`` with ``cp39`` or ``cp310``, respectively.
```
### Runtime Configuration
To use GPU acceleration on Windows, several environment variables are required before running a GPU example:
<!-- Make sure you are using CMD (Miniforge Prompt if using conda) as PowerShell is not supported, and configure oneAPI environment variables with:
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
Please also set the following environment variable if you would like to run LLMs on: -->
```eval_rst
.. tabs::
.. tab:: Intel iGPU
.. code-block:: cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
.. tab:: Intel Arc™ A-Series Graphics
.. code-block:: cmd
set SYCL_CACHE_PERSISTENT=1
```
```eval_rst
.. note::
For **the first time** that **each model** runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
```
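As an optional sanity check (not an official installation step), you could verify from Python that the XPU device is visible, for example:
```python
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' backend

# True if an Intel GPU is available to PyTorch through IPEX
print(torch.xpu.is_available())
print(torch.xpu.get_device_name(0))
```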
### Troubleshooting
#### 1. Error loading `intel_extension_for_pytorch`
If you encounter an error when importing `intel_extension_for_pytorch`, please ensure that you have completed the following steps:
* Ensure that you have installed Visual Studio with "Desktop development with C++" workload.
* Make sure that the correct version of oneAPI, specifically 2024.0, is installed.
* Ensure that `libuv` is installed in your conda environment. This can be done during the creation of the environment with the command:
```cmd
conda create -n llm python=3.11 libuv
```
If you missed `libuv`, you can add it to your existing environment through
```cmd
conda install libuv
```
<!-- * For oneAPI installed using the Offline installer, make sure you have configured oneAPI environment variables in your Miniforge Prompt through
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
Please note that you need to set these environment variables again once you have a new Miniforge Prompt window. -->
## Linux
### Prerequisites
IPEX-LLM GPU support on Linux has been verified on:
* Intel Arc™ A-Series Graphics
* Intel Data Center GPU Flex Series
* Intel Data Center GPU Max Series
```eval_rst
.. important::
IPEX-LLM on Linux supports PyTorch 2.0 and PyTorch 2.1.
.. warning::
IPEX-LLM support for Pytorch 2.0 is deprecated as of ``ipex-llm >= 2.1.0b20240511``.
```
```eval_rst
.. important::
We currently support the Ubuntu 20.04 operating system and later.
```
```eval_rst
.. tabs::
.. tab:: PyTorch 2.1
To enable IPEX-LLM for Intel GPUs with PyTorch 2.1, here are several prerequisite steps for tools installation and environment preparation:
* Step 1: Install Intel GPU Driver version >= stable_775_20_20231219. We highly recommend installing the latest version of intel-i915-dkms using apt.
.. seealso::
Please refer to our `driver installation <https://dgpu-docs.intel.com/driver/installation.html>`_ for general purpose GPU capabilities.
See `release page <https://dgpu-docs.intel.com/releases/index.html>`_ for latest version.
.. note::
For Intel Core™ Ultra integrated GPU, please make sure the level_zero version is >= 1.3.28717. The level_zero version can be checked with ``sycl-ls``, and the version will be tagged with ``[ext_oneapi_level_zero:gpu]``.
.. code-block::
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 5 125H OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) Graphics OpenCL 3.0 NEO [24.09.28717.12]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) Graphics 1.3 [1.3.28717]
If you have level_zero version < 1.3.28717, you could update as follows:
.. code-block:: bash
wget https://github.com/intel/intel-graphics-compiler/releases/download/igc-1.0.16238.4/intel-igc-core_1.0.16238.4_amd64.deb
wget https://github.com/intel/intel-graphics-compiler/releases/download/igc-1.0.16238.4/intel-igc-opencl_1.0.16238.4_amd64.deb
wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/intel-level-zero-gpu-dbgsym_1.3.28717.12_amd64.ddeb
wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/intel-level-zero-gpu_1.3.28717.12_amd64.deb
wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/intel-opencl-icd-dbgsym_24.09.28717.12_amd64.ddeb
wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/intel-opencl-icd_24.09.28717.12_amd64.deb
wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/libigdgmm12_22.3.17_amd64.deb
sudo dpkg -i *.deb
* Step 2: Download and install `Intel® oneAPI Base Toolkit <https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html>`_ with version 2024.0. OneDNN, OneMKL and DPC++ compiler are needed, others are optional.
Intel® oneAPI Base Toolkit 2024.0 installation methods:
.. tabs::
.. tab:: APT installer
Step 1: Set up repository
.. code-block:: bash
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
Step 2: Install the package
.. code-block:: bash
sudo apt install intel-oneapi-common-vars=2024.0.0-49406 \
intel-oneapi-common-oneapi-vars=2024.0.0-49406 \
intel-oneapi-diagnostics-utility=2024.0.0-49093 \
intel-oneapi-compiler-dpcpp-cpp=2024.0.2-49895 \
intel-oneapi-dpcpp-ct=2024.0.0-49381 \
intel-oneapi-mkl=2024.0.0-49656 \
intel-oneapi-mkl-devel=2024.0.0-49656 \
intel-oneapi-mpi=2021.11.0-49493 \
intel-oneapi-mpi-devel=2021.11.0-49493 \
intel-oneapi-dal=2024.0.1-25 \
intel-oneapi-dal-devel=2024.0.1-25 \
intel-oneapi-ippcp=2021.9.1-5 \
intel-oneapi-ippcp-devel=2021.9.1-5 \
intel-oneapi-ipp=2021.10.1-13 \
intel-oneapi-ipp-devel=2021.10.1-13 \
intel-oneapi-tlt=2024.0.0-352 \
intel-oneapi-ccl=2021.11.2-5 \
intel-oneapi-ccl-devel=2021.11.2-5 \
intel-oneapi-dnnl-devel=2024.0.0-49521 \
intel-oneapi-dnnl=2024.0.0-49521 \
intel-oneapi-tcm-1.0=1.0.0-435
.. note::
You can uninstall the package by running the following command:
.. code-block:: bash
sudo apt autoremove intel-oneapi-common-vars
.. tab:: PIP installer
Step 1: Install oneAPI in a user-defined folder, e.g., ``~/intel/oneapi``.
.. code-block:: bash
export PYTHONUSERBASE=~/intel/oneapi
pip install dpcpp-cpp-rt==2024.0.2 mkl-dpcpp==2024.0.0 onednn==2024.0.0 --user
.. note::
The oneAPI packages are visible in ``pip list`` only if ``PYTHONUSERBASE`` is properly set.
Step 2: Configure your working conda environment (e.g. with name ``llm``) to append oneAPI path (e.g. ``~/intel/oneapi/lib``) to the environment variable ``LD_LIBRARY_PATH``.
.. code-block:: bash
conda env config vars set LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/intel/oneapi/lib -n llm
.. note::
You can view the configured environment variables for your environment (e.g. with name ``llm``) by running ``conda env config vars list -n llm``.
You can continue with your working conda environment and install ``ipex-llm`` as guided in the next section.
.. note::
You are recommended not to install other pip packages in the user-defined folder for oneAPI (e.g. ``~/intel/oneapi``).
You can uninstall the oneAPI package by simply deleting the package folder, and unsetting the configuration of your working conda environment (e.g., with name ``llm``).
.. code-block:: bash
rm -r ~/intel/oneapi
conda env config vars unset LD_LIBRARY_PATH -n llm
.. tab:: Offline installer
Using the offline installer allows you to customize the installation path.
.. code-block:: bash
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/20f4e6a1-6b0b-4752-b8c1-e5eacba10e01/l_BaseKit_p_2024.0.0.49564_offline.sh
sudo sh ./l_BaseKit_p_2024.0.0.49564_offline.sh
.. note::
You can also modify the installation or uninstall the package by running the following commands:
.. code-block:: bash
cd /opt/intel/oneapi/installer
sudo ./installer
.. tab:: PyTorch 2.0 (deprecated for versions ``ipex-llm >= 2.1.0b20240511``)
To enable IPEX-LLM for Intel GPUs with PyTorch 2.0, here are several prerequisite steps for tools installation and environment preparation:
* Step 1: Install Intel GPU Driver version >= stable_775_20_20231219. We highly recommend installing the latest version of intel-i915-dkms using apt.
.. seealso::
Please refer to our `driver installation <https://dgpu-docs.intel.com/driver/installation.html>`_ for general purpose GPU capabilities.
See `release page <https://dgpu-docs.intel.com/releases/index.html>`_ for latest version.
* Step 2: Download and install `Intel® oneAPI Base Toolkit <https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html>`_ with version 2023.2. OneDNN, OneMKL and DPC++ compiler are needed, others are optional.
Intel® oneAPI Base Toolkit 2023.2 installation methods:
.. tabs::
.. tab:: APT installer
Step 1: Set up repository
.. code-block:: bash
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
Step 2: Install the packages
.. code-block:: bash
sudo apt install -y intel-oneapi-common-vars=2023.2.0-49462 \
intel-oneapi-compiler-cpp-eclipse-cfg=2023.2.0-49495 intel-oneapi-compiler-dpcpp-eclipse-cfg=2023.2.0-49495 \
intel-oneapi-diagnostics-utility=2022.4.0-49091 \
intel-oneapi-compiler-dpcpp-cpp=2023.2.0-49495 \
intel-oneapi-mkl=2023.2.0-49495 intel-oneapi-mkl-devel=2023.2.0-49495 \
intel-oneapi-mpi=2021.10.0-49371 intel-oneapi-mpi-devel=2021.10.0-49371 \
intel-oneapi-tbb=2021.10.0-49541 intel-oneapi-tbb-devel=2021.10.0-49541\
intel-oneapi-ccl=2021.10.0-49084 intel-oneapi-ccl-devel=2021.10.0-49084\
intel-oneapi-dnnl-devel=2023.2.0-49516 intel-oneapi-dnnl=2023.2.0-49516
.. note::
You can uninstall the package by running the following command:
.. code-block:: bash
sudo apt autoremove intel-oneapi-common-vars
.. tab:: PIP installer
Step 1: Install oneAPI in a user-defined folder, e.g., ``~/intel/oneapi``
.. code-block:: bash
export PYTHONUSERBASE=~/intel/oneapi
pip install dpcpp-cpp-rt==2023.2.0 mkl-dpcpp==2023.2.0 onednn-cpu-dpcpp-gpu-dpcpp==2023.2.0 --user
.. note::
The oneAPI packages are visible in ``pip list`` only if ``PYTHONUSERBASE`` is properly set.
Step 2: Configure your working conda environment (e.g. with name ``llm``) to append oneAPI path (e.g. ``~/intel/oneapi/lib``) to the environment variable ``LD_LIBRARY_PATH``.
.. code-block:: bash
conda env config vars set LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/intel/oneapi/lib -n llm
.. note::
You can view the configured environment variables for your environment (e.g. with name ``llm``) by running ``conda env config vars list -n llm``.
You can continue with your working conda environment and install ``ipex-llm`` as guided in the next section.
.. note::
You are recommended not to install other pip packages in the user-defined folder for oneAPI (e.g. ``~/intel/oneapi``).
You can uninstall the oneAPI package by simply deleting the package folder, and unsetting the configuration of your working conda environment (e.g., with name ``llm``).
.. code-block:: bash
rm -r ~/intel/oneapi
conda env config vars unset LD_LIBRARY_PATH -n llm
.. tab:: Offline installer
Using the offline installer allows you to customize the installation path.
.. code-block:: bash
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/992857b9-624c-45de-9701-f6445d845359/l_BaseKit_p_2023.2.0.49397_offline.sh
sudo sh ./l_BaseKit_p_2023.2.0.49397_offline.sh
.. note::
You can also modify the installation or uninstall the package by running the following commands:
.. code-block:: bash
cd /opt/intel/oneapi/installer
sudo ./installer
```
### Install IPEX-LLM
#### Install IPEX-LLM From PyPI
We recommend using [Miniforge](https://conda-forge.org/download/) to create a Python 3.11 environment:
```eval_rst
.. important::
``ipex-llm`` is tested with Python 3.9, 3.10 and 3.11. Python 3.11 is recommended for best practices.
```
```eval_rst
.. important::
Make sure you install matching versions of ipex-llm/pytorch/IPEX and oneAPI Base Toolkit. IPEX-LLM with PyTorch 2.1 should be used with oneAPI Base Toolkit version 2024.0. IPEX-LLM with PyTorch 2.0 should be used with oneAPI Base Toolkit version 2023.2.
```
```eval_rst
.. tabs::
.. tab:: PyTorch 2.1
Choose either US or CN website for ``extra-index-url``:
.. tabs::
.. tab:: US
.. code-block:: bash
conda create -n llm python=3.11
conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
.. note::
The ``xpu`` option will install IPEX-LLM with PyTorch 2.1 by default, which is equivalent to
.. code-block:: bash
pip install --pre --upgrade ipex-llm[xpu_2.1] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
.. tab:: CN
.. code-block:: bash
conda create -n llm python=3.11
conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/
.. note::
The ``xpu`` option will install IPEX-LLM with PyTorch 2.1 by default, which is equivalent to
.. code-block:: bash
pip install --pre --upgrade ipex-llm[xpu_2.1] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/
.. tab:: PyTorch 2.0 (deprecated for versions ``ipex-llm >= 2.1.0b20240511``)
Choose either US or CN website for ``extra-index-url``:
.. tabs::
.. tab:: US
.. code-block:: bash
conda create -n llm python=3.11
conda activate llm
pip install --pre --upgrade ipex-llm[xpu_2.0]==2.1.0b20240510 --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
.. tab:: CN
.. code-block:: bash
conda create -n llm python=3.11
conda activate llm
pip install --pre --upgrade ipex-llm[xpu_2.0]==2.1.0b20240510 --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/
```
#### Install IPEX-LLM From Wheel
If you encounter network issues when installing IPEX, you can also install IPEX-LLM dependencies for Intel XPU from the wheel archives. First you need to download and install torch/torchvision/ipex from the wheels listed below before installing `ipex-llm`.
```eval_rst
.. tabs::
.. tab:: PyTorch 2.1
.. code-block:: bash
# get the wheels on Linux system for IPEX 2.1.10+xpu
wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/torch-2.1.0a0%2Bcxx11.abi-cp311-cp311-linux_x86_64.whl
wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/torchvision-0.16.0a0%2Bcxx11.abi-cp311-cp311-linux_x86_64.whl
wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/intel_extension_for_pytorch-2.1.10%2Bxpu-cp311-cp311-linux_x86_64.whl
Then you may install directly from the wheel archives using following commands:
.. code-block:: bash
# install the packages from the wheels
pip install torch-2.1.0a0+cxx11.abi-cp311-cp311-linux_x86_64.whl
pip install torchvision-0.16.0a0+cxx11.abi-cp311-cp311-linux_x86_64.whl
pip install intel_extension_for_pytorch-2.1.10+xpu-cp311-cp311-linux_x86_64.whl
# install ipex-llm for Intel GPU
pip install --pre --upgrade ipex-llm[xpu]
.. tab:: PyTorch 2.0 (deprecated for versions ``ipex-llm >= 2.1.0b20240511``)
.. code-block:: bash
# get the wheels on Linux system for IPEX 2.0.110+xpu
wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/torch-2.0.1a0%2Bcxx11.abi-cp311-cp311-linux_x86_64.whl
wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/torchvision-0.15.2a0%2Bcxx11.abi-cp311-cp311-linux_x86_64.whl
wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/intel_extension_for_pytorch-2.0.110%2Bxpu-cp311-cp311-linux_x86_64.whl
Then you may install directly from the wheel archives using following commands:
.. code-block:: bash
# install the packages from the wheels
pip install torch-2.0.1a0+cxx11.abi-cp311-cp311-linux_x86_64.whl
pip install torchvision-0.15.2a0+cxx11.abi-cp311-cp311-linux_x86_64.whl
pip install intel_extension_for_pytorch-2.0.110+xpu-cp311-cp311-linux_x86_64.whl
# install ipex-llm for Intel GPU
pip install --pre --upgrade ipex-llm[xpu_2.0]==2.1.0b20240510
```
```eval_rst
.. note::
All the wheel packages mentioned here are for Python 3.11. If you would like to use Python 3.9 or 3.10, you should modify the wheel names for ``torch``, ``torchvision``, and ``intel_extension_for_pytorch`` by replacing ``cp311`` with ``cp39`` or ``cp310``, respectively.
```
### Runtime Configuration
To use GPU acceleration on Linux, several environment variables are required or recommended before running a GPU example.
```eval_rst
.. tabs::
.. tab:: Intel Arc™ A-Series and Intel Data Center GPU Flex
For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series, we recommend:
.. code-block:: bash
# Configure oneAPI environment variables. Required step for APT or offline installed oneAPI.
# Skip this step for PIP-installed oneAPI since the environment has already been configured in LD_LIBRARY_PATH.
source /opt/intel/oneapi/setvars.sh
# Recommended Environment Variables for optimal performance
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
.. tab:: Intel Data Center GPU Max
For Intel Data Center GPU Max Series, we recommend:
.. code-block:: bash
# Configure oneAPI environment variables. Required step for APT or offline installed oneAPI.
# Skip this step for PIP-installed oneAPI since the environment has already been configured in LD_LIBRARY_PATH.
source /opt/intel/oneapi/setvars.sh
# Recommended Environment Variables for optimal performance
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export ENABLE_SDP_FUSION=1
Please note that ``libtcmalloc.so`` can be installed by ``conda install -c conda-forge -y gperftools=2.10``
.. tab:: Intel iGPU
.. code-block:: bash
# Configure oneAPI environment variables. Required step for APT or offline installed oneAPI.
# Skip this step for PIP-installed oneAPI since the environment has already been configured in LD_LIBRARY_PATH.
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
export BIGDL_LLM_XMX_DISABLED=1
```
```eval_rst
.. note::
For **the first time** that **each model** runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
```
### Known issues
#### 1. Potential suboptimal performance with Linux kernel 6.2.0
For Ubuntu 22.04 and driver version < stable_775_20_20231219, the performance on Linux kernel 6.2.0 is worse than on Linux kernel 5.19.0. You can use `sudo apt update && sudo apt install -y intel-i915-dkms intel-fw-gpu` to install the latest driver to solve this issue (a reboot of the OS is required).
Tip: You can use `sudo apt list --installed | grep intel-i915-dkms` to check your intel-i915-dkms version; it should be the latest and >= `1.23.9.11.231003.15+i19-1`.
#### 2. Driver installation unmet dependencies error: intel-i915-dkms
The last apt install command of the driver installation may produce the following error:
```
The following packages have unmet dependencies:
intel-i915-dkms : Conflicts: intel-platform-cse-dkms
Conflicts: intel-platform-vsec-dkms
```
You can use `sudo apt install -y intel-i915-dkms intel-fw-gpu` to install instead, as intel-platform-cse-dkms and intel-platform-vsec-dkms are already provided by intel-i915-dkms.
### Troubleshooting
#### 1. Cannot open shared object file: No such file or directory
You may see errors where a libmkl file cannot be found, for example:
```
OSError: libmkl_intel_lp64.so.2: cannot open shared object file: No such file or directory
```
```
Error: libmkl_sycl_blas.so.4: cannot open shared object file: No such file or directory
```
The reason for such errors is that oneAPI has not been initialized properly before running IPEX-LLM code or before importing the IPEX package. A quick sanity check is sketched after the list below.
* For oneAPI installed using APT or Offline Installer, make sure you execute `setvars.sh` of oneAPI Base Toolkit before running IPEX-LLM.
* For PIP-installed oneAPI, activate your working environment and run ``echo $LD_LIBRARY_PATH`` to check if the installation path is properly configured for the environment. If the output does not contain oneAPI path (e.g. ``~/intel/oneapi/lib``), check [Prerequisites](#id1) to re-install oneAPI with PIP installer.
* Make sure you install matching versions of ipex-llm/pytorch/IPEX and oneAPI Base Toolkit. IPEX-LLM with PyTorch 2.1 should be used with oneAPI Base Toolkit version 2024.0. IPEX-LLM with PyTorch 2.0 should be used with oneAPI Base Toolkit version 2023.2.
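As that quick sanity check (an illustration, not an official step), you could verify from Python that the oneAPI libraries are on the library path and that IPEX imports cleanly:
```python
import os

# the oneAPI library path (e.g. ~/intel/oneapi/lib for PIP-installed oneAPI) should appear here
print([p for p in os.environ.get("LD_LIBRARY_PATH", "").split(":") if "oneapi" in p.lower()])

# importing IPEX fails with the libmkl errors above if oneAPI is not initialized properly
import torch
import intel_extension_for_pytorch as ipex
print(torch.__version__, ipex.__version__)
```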

View file

@ -0,0 +1 @@
# IPEX-LLM Known Issues

View file

@ -0,0 +1,68 @@
# IPEX-LLM in 5 minutes
You can use IPEX-LLM to run any [*Hugging Face Transformers*](https://huggingface.co/docs/transformers/index) PyTorch model. It automatically optimizes and accelerates LLMs using low-precision (INT4/INT5/INT8) techniques, modern hardware accelerations and the latest software optimizations.
Hugging Face transformers-based applications can run on IPEX-LLM with a one-line code change, and you'll immediately observe significant speedup<sup><a href="#footnote-perf" id="ref-perf">[1]</a></sup>.
Here, let's take a relatively small LLM, i.e. [open_llama_3b_v2](https://huggingface.co/openlm-research/open_llama_3b_v2), with IPEX-LLM INT4 optimizations as an example.
## Load a Pretrained Model
Simply use one-line `transformers`-style API in `ipex-llm` to load `open_llama_3b_v2` with INT4 optimization (by specifying `load_in_4bit=True`) as follows:
```python
from ipex_llm.transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path="openlm-research/open_llama_3b_v2",
load_in_4bit=True)
```
```eval_rst
.. tip::
`open_llama_3b_v2 <https://huggingface.co/openlm-research/open_llama_3b_v2>`_ is a pretrained large language model hosted on Hugging Face. ``openlm-research/open_llama_3b_v2`` is its Hugging Face model id. ``from_pretrained`` will automatically download the model from Hugging Face to a local cache path (e.g. ``~/.cache/huggingface``), load the model, and convert it to ``ipex-llm`` INT4 format.
It may take a long time to download the model using the API. You can also download the model yourself, and set ``pretrained_model_name_or_path`` to the local path of the downloaded model. This way, ``from_pretrained`` will load and convert directly from the local path without downloading.
```
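A hedged sketch of downloading the model in advance, assuming a recent `huggingface_hub` release that provides the `huggingface-cli download` command:
```bash
# Download open_llama_3b_v2 to a local folder, then point
# pretrained_model_name_or_path to ./open_llama_3b_v2
pip install -U huggingface_hub
huggingface-cli download openlm-research/open_llama_3b_v2 --local-dir ./open_llama_3b_v2
```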
## Load Tokenizer
You also need a tokenizer for inference. Just use the official `transformers` API to load `LlamaTokenizer`:
```python
from transformers import LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained(pretrained_model_name_or_path="openlm-research/open_llama_3b_v2")
```
## Run LLM
Now you can do model inference exactly the same way as with the official `transformers` API:
```python
import torch
with torch.inference_mode():
prompt = 'Q: What is CPU?\nA:'
# tokenize the input prompt from string to token ids
input_ids = tokenizer.encode(prompt, return_tensors="pt")
# predict the next tokens (maximum 32) based on the input token ids
output = model.generate(input_ids,
max_new_tokens=32)
# decode the predicted token ids to output string
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_str)
```
------
<div>
<p>
<sup><a href="#ref-perf" id="footnote-perf">[1]</a>
Performance varies by use, configuration and other factors. <code><span>ipex-llm</span></code> may not optimize to the same degree for non-Intel products. Learn more at <a href="https://www.Intel.com/PerformanceIndex">www.Intel.com/PerformanceIndex</a>.
</sup>
</p>
</div>

View file

@ -0,0 +1,314 @@
# Finetune LLM with Axolotl on Intel GPU
[Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) is a popular tool designed to streamline the fine-tuning of various AI models, offering support for multiple configurations and architectures. You can now use [`ipex-llm`](https://github.com/intel-analytics/ipex-llm) as an accelerated backend for `Axolotl` running on Intel **GPU** *(e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max)*.
See the demo of finetuning LLaMA2-7B on Intel Arc GPU below.
<video src="https://llm-assets.readthedocs.io/en/latest/_images/axolotl-qlora-linux-arc.mp4" width="100%" controls></video>
## Quickstart
### 0. Prerequisites
IPEX-LLM's support for [Axolotl v0.4.0](https://github.com/OpenAccess-AI-Collective/axolotl/tree/v0.4.0) is only available on Linux systems. We recommend Ubuntu 20.04 or later (Ubuntu 22.04 is preferred).
Visit [Install IPEX-LLM on Linux with Intel GPU](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html), and follow [Install Intel GPU Driver](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#install-intel-gpu-driver) and [Install oneAPI](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#install-oneapi) to install the GPU driver and Intel® oneAPI Base Toolkit 2024.0.
### 1. Install IPEX-LLM for Axolotl
Create a new conda env, and install `ipex-llm[xpu]`.
```cmd
conda create -n axolotl python=3.11
conda activate axolotl
# install ipex-llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
Install [axolotl v0.4.0](https://github.com/OpenAccess-AI-Collective/axolotl/tree/v0.4.0) from git.
```cmd
# install axolotl v0.4.0
git clone -b v0.4.0 https://github.com/OpenAccess-AI-Collective/axolotl
cd axolotl
# replace requirements.txt
rm requirements.txt
wget -O requirements.txt https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/requirements-xpu.txt
pip install -e .
pip install transformers==4.36.0
# to avoid https://github.com/OpenAccess-AI-Collective/axolotl/issues/1544
pip install datasets==2.15.0
# prepare axolotl entrypoints
wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/finetune.py
wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/train.py
```
**After the installation, you should have created a conda environment, named `axolotl` for instance, for running `Axolotl` commands with IPEX-LLM.**
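As an optional sanity check, the following is a minimal sketch (assuming the installation above succeeded, and that oneAPI variables are configured first if you installed oneAPI via APT or the offline installer, see "Set Environment Variables" below) to verify that the packages resolve in the new environment:
```bash
# Verify that PyTorch and ipex-llm can be imported in the axolotl environment
python -c "import torch; import ipex_llm; print(torch.__version__)"
```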
### 2. Example: Finetune Llama-2-7B with Axolotl
The following example will introduce finetuning [Llama-2-7B](https://huggingface.co/meta-llama/Llama-2-7b) with [alpaca_2k_test](https://huggingface.co/datasets/mhenrichsen/alpaca_2k_test) dataset using LoRA and QLoRA.
Note that you don't need to write any code in this example.
| Model | Dataset | Finetune method |
|-------|-------|-------|
| Llama-2-7B | alpaca_2k_test | LoRA (Low-Rank Adaptation) |
| Llama-2-7B | alpaca_2k_test | QLoRA (Quantized Low-Rank Adaptation) |
For more technical details, please refer to [Llama 2](https://arxiv.org/abs/2307.09288), [LoRA](https://arxiv.org/abs/2106.09685) and [QLoRA](https://arxiv.org/abs/2305.14314).
#### 2.1 Download Llama-2-7B and alpaca_2k_test
By default, Axolotl automatically downloads models and datasets from Hugging Face. Please make sure you have logged in to Hugging Face.
```cmd
huggingface-cli login
```
If you prefer offline models and datasets, please download [Llama-2-7B](https://huggingface.co/meta-llama/Llama-2-7b) and [alpaca_2k_test](https://huggingface.co/datasets/mhenrichsen/alpaca_2k_test) in advance (a download sketch is shown after the command below). Then, set `HF_HUB_OFFLINE=1` to avoid connecting to Hugging Face.
```cmd
export HF_HUB_OFFLINE=1
```
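A hedged sketch of pre-downloading the model and dataset used in the examples below, assuming a recent `huggingface_hub` that provides `huggingface-cli download` (the local paths are placeholders; point `base_model` and the dataset `path` in the yml files to wherever you store them):
```bash
# Download the non-gated Llama-2-7B checkpoint referenced in lora.yml/qlora.yml
huggingface-cli download NousResearch/Llama-2-7b-hf --local-dir /path/to/model/Llama-2-7b-hf
# Download the alpaca_2k_test dataset
huggingface-cli download --repo-type dataset mhenrichsen/alpaca_2k_test --local-dir /path/to/dataset/alpaca_2k_test
```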
#### 2.2 Set Environment Variables
```eval_rst
.. note::
This step is required for oneAPI installed via APT or the offline installer. Skip it for PIP-installed oneAPI.
```
Configure oneAPI variables by running the following command:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
source /opt/intel/oneapi/setvars.sh
```
Configure accelerate to avoid training with CPU. You can download a default `default_config.yaml` with `use_cpu: false`.
```cmd
mkdir -p ~/.cache/huggingface/accelerate/
wget -O ~/.cache/huggingface/accelerate/default_config.yaml https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/default_config.yaml
```
Alternatively, you can configure accelerate based on your requirements.
```cmd
accelerate config
```
Please answer `NO` in option `Do you want to run your training on CPU only (even if a GPU / Apple Silicon device is available)? [yes/NO]:`.
After finishing the accelerate config, check that `use_cpu` is disabled (i.e., `use_cpu: false`) in the accelerate config file (`~/.cache/huggingface/accelerate/default_config.yaml`).
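A minimal check, assuming the default accelerate config location:
```bash
# Should print a line containing "use_cpu: false"
grep use_cpu ~/.cache/huggingface/accelerate/default_config.yaml
```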
#### 2.3 LoRA finetune
Prepare `lora.yml` for Axolotl LoRA finetune. You can download a template from GitHub.
```cmd
wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/lora.yml
```
**If you are using an offline model and dataset in a local environment**, modify the model path and dataset path in `lora.yml`; otherwise, keep them unchanged.
```yaml
# Please change to local path if model is offline, e.g., /path/to/model/Llama-2-7b-hf
base_model: NousResearch/Llama-2-7b-hf
datasets:
# Please change to local path if dataset is offline, e.g., /path/to/dataset/alpaca_2k_test
- path: mhenrichsen/alpaca_2k_test
type: alpaca
```
Modify LoRA parameters, such as `lora_r` and `lora_alpha`, etc.
```yaml
adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
```
Launch LoRA training with the following command.
```cmd
accelerate launch finetune.py lora.yml
```
In Axolotl v0.4.0, you can use `train.py` instead of `-m axolotl.cli.train` or `finetune.py`.
```cmd
accelerate launch train.py lora.yml
```
#### 2.4 QLoRA finetune
Prepare `qlora.yml` for QLoRA finetune. You can download a template from GitHub.
```cmd
wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/qlora.yml
```
**If you are using an offline model and dataset in a local environment**, modify the model path and dataset path in `qlora.yml`; otherwise, keep them unchanged.
```yaml
# Please change to local path if model is offline, e.g., /path/to/model/Llama-2-7b-hf
base_model: NousResearch/Llama-2-7b-hf
datasets:
# Please change to local path if dataset is offline, e.g., /path/to/dataset/alpaca_2k_test
- path: mhenrichsen/alpaca_2k_test
type: alpaca
```
Modify QLoRA parameters, such as `lora_r` and `lora_alpha`, etc.
```yaml
adapter: qlora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:
```
Launch QLoRA training with the following command.
```cmd
accelerate launch finetune.py qlora.yml
```
In Axolotl v0.4.0, you can use `train.py` instead of `-m axolotl.cli.train` or `finetune.py`.
```cmd
accelerate launch train.py qlora.yml
```
### 3. Finetune Llama-3-8B (Experimental)
Warning: this section will install axolotl main ([796a085](https://github.com/OpenAccess-AI-Collective/axolotl/tree/796a085b2f688f4a5efe249d95f53ff6833bf009)) for new features, e.g., Llama-3-8B.
#### 3.1 Install Axolotl main in conda
Axolotl main has many new dependencies. Please set up a new conda env for this version.
```cmd
conda create -n llm python=3.11
conda activate llm
# install axolotl main
git clone https://github.com/OpenAccess-AI-Collective/axolotl
cd axolotl && git checkout 796a085
pip install -e .
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
# install transformers etc
# to avoid https://github.com/OpenAccess-AI-Collective/axolotl/issues/1544
pip install datasets==2.15.0
pip install transformers==4.37.0
```
Configure accelerate and oneAPI according to [Set Environment Variables](#22-set-environment-variables).
#### 3.2 Alpaca QLoRA
This example is based on the [axolotl Llama-3 QLoRA example](https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/examples/llama-3/qlora.yml).
Prepare `llama3-qlora.yml` for QLoRA finetune. You can download a template from GitHub.
```cmd
wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/llama3-qlora.yml
```
**If you are using an offline model and dataset in a local environment**, modify the model path and dataset path in `llama3-qlora.yml`; otherwise, keep them unchanged.
```yaml
# Please change to local path if model is offline, e.g., /path/to/model/Meta-Llama-3-8B
base_model: meta-llama/Meta-Llama-3-8B
datasets:
# Please change to local path if dataset is offline, e.g., /path/to/dataset/alpaca_2k_test
- path: aaditya/alpaca_subset_1
type: alpaca
```
Modify QLoRA parameters, such as `lora_r` and `lora_alpha`, etc.
```yaml
adapter: qlora
lora_model_dir:
sequence_len: 256
sample_packing: true
pad_to_sequence_len: true
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:
```
```cmd
accelerate launch finetune.py llama3-qlora.yml
```
You can also use `train.py` instead of `-m axolotl.cli.train` or `finetune.py`.
```cmd
accelerate launch train.py llama3-qlora.yml
```
Expected output
```cmd
{'loss': 0.237, 'learning_rate': 1.2254711850265387e-06, 'epoch': 3.77}
{'loss': 0.6068, 'learning_rate': 1.1692453482951115e-06, 'epoch': 3.77}
{'loss': 0.2926, 'learning_rate': 1.1143322458989303e-06, 'epoch': 3.78}
{'loss': 0.2475, 'learning_rate': 1.0607326072295087e-06, 'epoch': 3.78}
{'loss': 0.1531, 'learning_rate': 1.008447144232094e-06, 'epoch': 3.79}
{'loss': 0.1799, 'learning_rate': 9.57476551396197e-07, 'epoch': 3.79}
{'loss': 0.2724, 'learning_rate': 9.078215057463868e-07, 'epoch': 3.79}
{'loss': 0.2534, 'learning_rate': 8.594826668332445e-07, 'epoch': 3.8}
{'loss': 0.3388, 'learning_rate': 8.124606767246579e-07, 'epoch': 3.8}
{'loss': 0.3867, 'learning_rate': 7.667561599972505e-07, 'epoch': 3.81}
{'loss': 0.2108, 'learning_rate': 7.223697237281668e-07, 'epoch': 3.81}
{'loss': 0.0792, 'learning_rate': 6.793019574868775e-07, 'epoch': 3.82}
```
## Troubleshooting
#### TypeError: PosixPath
Error message: `TypeError: argument of type 'PosixPath' is not iterable`
This issue is related to [axolotl #1544](https://github.com/OpenAccess-AI-Collective/axolotl/issues/1544). It can be fixed by downgrading datasets to 2.15.0.
```cmd
pip install datasets==2.15.0
```
#### RuntimeError: out of device memory
Error message: `RuntimeError: Allocation is out of device memory on current platform.`
This issue is caused by running out of GPU memory. Please reduce `lora_r` or `micro_batch_size` in `qlora.yml` or `lora.yml`, or reduce the amount of training data.
#### OSError: libmkl_intel_lp64.so.2
Error message: `OSError: libmkl_intel_lp64.so.2: cannot open shared object file: No such file or directory`
The oneAPI environment is not correctly set. Please refer to [Set Environment Variables](#22-set-environment-variables).

View file

@ -0,0 +1,174 @@
# Run Performance Benchmarking with IPEX-LLM
We can perform benchmarking for IPEX-LLM on Intel CPUs and GPUs using the benchmark scripts we provide.
## Prepare The Environment
You can refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install.html) to install IPEX-LLM in your environment. The following dependencies are also needed to run the benchmark scripts.
```
pip install pandas
pip install omegaconf
```
## Prepare The Scripts
Navigate to your local workspace and then download IPEX-LLM from GitHub. Modify `config.yaml` under the `all-in-one` folder for your benchmark configurations.
```
cd your/local/workspace
git clone https://github.com/intel-analytics/ipex-llm.git
cd ipex-llm/python/llm/dev/benchmark/all-in-one/
```
## config.yaml
```yaml
repo_id:
- 'meta-llama/Llama-2-7b-chat-hf'
local_model_hub: 'path to your local model hub'
warm_up: 1 # must be set >= 2 when running the "pipeline_parallel_gpu" test_api
num_trials: 3
num_beams: 1 # default to greedy search
low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4)
batch_size: 1 # default to 1
in_out_pairs:
- '32-32'
- '1024-128'
- '2048-256'
test_api:
- "transformer_int4_gpu" # on Intel GPU, transformer-like API, (qtype=int4)
cpu_embedding: False # whether put embedding to CPU
streaming: False # whether to output in a streaming way (currently only available for Windows GPU-related test_api)
task: 'continuation' # task can be 'continuation', 'QA' and 'summarize'
```
Some parameters in the yaml file that you can configure:
- `repo_id`: The name of the model and its organization.
- `local_model_hub`: The folder path where the models are stored on your machine. Replace 'path to your local model hub' with the actual path, e.g., `/llm/models`.
- `warm_up`: The number of warmup trials before performance benchmarking (must be set to >= 2 when using the "pipeline_parallel_gpu" test_api).
- `num_trials`: The number of runs for performance benchmarking (the final result is the average of all trials).
- `low_bit`: The low_bit precision you want to convert to for benchmarking.
- `batch_size`: The number of samples on which the models make predictions in one forward pass.
- `in_out_pairs`: Input sequence length and output sequence length combined by '-'.
- `test_api`: Different test functions for different machines.
- `transformer_int4_gpu` on Intel GPU for Linux
- `transformer_int4_gpu_win` on Intel GPU for Windows
- `transformer_int4` on Intel CPU
- `cpu_embedding`: Whether to put embedding on CPU (only available for Windows GPU-related test_api).
- `streaming`: Whether to output in a streaming way (only available for Windows GPU-related test_api).
- `use_fp16_torch_dtype`: Whether to use fp16 for the non-linear layer (only available for the "pipeline_parallel_gpu" test_api).
- `n_gpu`: Number of GPUs to use (only available for the "pipeline_parallel_gpu" test_api).
- `task`: There are three tasks: `continuation`, `QA` and `summarize`. `continuation` refers to writing additional content based on the prompt, `QA` refers to answering questions based on the prompt, and `summarize` refers to summarizing the prompt.
```eval_rst
.. note::
If you want to benchmark the performance without warmup, you can set ``warm_up: 0`` and ``num_trials: 1`` in ``config.yaml``, and run each single model and in_out_pair separately.
```
## Run on Windows
Please refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#runtime-configuration) to configure oneAPI environment variables.
```eval_rst
.. tabs::
.. tab:: Intel iGPU
.. code-block:: bash
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
python run.py
.. tab:: Intel Arc™ A300-Series or Pro A60
.. code-block:: bash
set SYCL_CACHE_PERSISTENT=1
python run.py
.. tab:: Other Intel dGPU Series
.. code-block:: bash
# e.g. Arc™ A770
python run.py
```
## Run on Linux
```eval_rst
.. tabs::
.. tab:: Intel Arc™ A-Series and Intel Data Center GPU Flex
For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series, we recommend:
.. code-block:: bash
./run-arc.sh
.. tab:: Intel iGPU
For Intel iGPU, we recommend:
.. code-block:: bash
./run-igpu.sh
.. tab:: Intel Data Center GPU Max
Please note that you need to run ``conda install -c conda-forge -y gperftools=2.10`` before running the benchmark script on Intel Data Center GPU Max Series.
.. code-block:: bash
./run-max-gpu.sh
.. tab:: Intel SPR
For Intel SPR machine, we recommend:
.. code-block:: bash
./run-spr.sh
The script uses a default numactl strategy. If you want to customize it, please use ``lscpu`` or ``numactl -H`` to check how CPU indexes are assigned to NUMA nodes, and make sure the run command is bound to only one socket.
.. tab:: Intel HBM
For Intel HBM machine, we recommend:
.. code-block:: bash
./run-hbm.sh
The script uses a default numactl strategy. If you want to customize it, please use ``numactl -H`` to check how the HBM nodes and CPUs are assigned.
For example:
.. code-block:: bash
node 0 1 2 3
0: 10 21 13 23
1: 21 10 23 13
2: 13 23 10 23
3: 23 13 23 10
Here, an HBM node is the node whose distance from the checked node is 13; for example, node 2 is node 0's HBM node.
Make sure the run command is bound to only one socket.
```
## Result
After the benchmarking completes, a CSV result file is generated under the current folder. Focus mainly on the columns `1st token avg latency (ms)` and `2+ avg latency (ms/token)` for the benchmark results. Also check whether the column `actual input/output tokens` is consistent with the column `input/output tokens`, and whether the parameters you specified in `config.yaml` have been successfully applied in the benchmarking.
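If you want a quick look at the results in the terminal, the following is a minimal sketch; it only assumes that the CSV result file is written to the current folder as described above:
```bash
# Pretty-print the most recent CSV result file in the current folder
column -s, -t < "$(ls -t *.csv | head -n 1)" | less -S
```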

View file

@ -0,0 +1,63 @@
# `bigdl-llm` Migration Guide
This guide helps you migrate your `bigdl-llm` application to use `ipex-llm`.
## Upgrade `bigdl-llm` package to `ipex-llm`
```eval_rst
.. note::
This step assumes you have already installed `bigdl-llm`.
```
You need to uninstall `bigdl-llm` and install `ipex-llm`. With your `bigdl-llm` conda environment activated, execute the following command according to your device type and location:
### For CPU
```bash
pip uninstall -y bigdl-llm
pip install --pre --upgrade ipex-llm[all] # for cpu
```
### For GPU
Choose either US or CN website for `extra-index-url`:
```eval_rst
.. tabs::
.. tab:: US
.. code-block:: cmd
pip uninstall -y bigdl-llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
.. tab:: CN
.. code-block:: cmd
pip uninstall -y bigdl-llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/
```
## Migrate `bigdl-llm` code to `ipex-llm`
There are two options to migrate `bigdl-llm` code to `ipex-llm`.
### 1. Upgrade `bigdl-llm` code to `ipex-llm`
To upgrade `bigdl-llm` code to `ipex-llm`, simply replace all `bigdl.llm` with `ipex_llm`:
```python
#from bigdl.llm.transformers import AutoModelForCausalLM # Original line
from ipex_llm.transformers import AutoModelForCausalLM #Updated line
model = AutoModelForCausalLM.from_pretrained(model_path,
load_in_4bit=True,
trust_remote_code=True)
```
### 2. Run `bigdl-llm` code in compatible mode (experimental)
To run in the compatible mode, simply add `import ipex_llm` at the beginning of the existing `bigdl-llm` code:
```python
import ipex_llm # Add this line before any bigdl.llm imports
from bigdl.llm.transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_path,
load_in_4bit=True,
trust_remote_code=True)
```

View file

@ -0,0 +1,82 @@
# Run Local RAG using Langchain-Chatchat on Intel CPU and GPU
[chatchat-space/Langchain-Chatchat](https://github.com/chatchat-space/Langchain-Chatchat) is a Knowledge Base QA application using RAG pipeline; by porting it to [`ipex-llm`](https://github.com/intel-analytics/ipex-llm), users can now easily run ***local RAG pipelines*** using [Langchain-Chatchat](https://github.com/intel-analytics/Langchain-Chatchat) with LLMs and Embedding models on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max).
*See the demos of running LLaMA2-7B (English) and ChatGLM-3-6B (Chinese) on an Intel Core Ultra laptop below.*
<table border="1" width="100%">
<tr>
<td align="center" width="50%">English</td>
<td align="center" width="50%">简体中文</td>
</tr>
<tr>
<td><video src="https://llm-assets.readthedocs.io/en/latest/_images/langchain-chatchat-en.mp4" width="100%" controls></video></td>
<td><video src="https://llm-assets.readthedocs.io/en/latest/_images/langchain-chatchat-cn.mp4" width="100%" controls></video></td>
</tr>
</table>
>You can change the UI language in the left-side menu. We currently support **English** and **简体中文** (see the video demos above).
## Langchain-Chatchat Architecture
See the Langchain-Chatchat architecture below ([source](https://github.com/chatchat-space/Langchain-Chatchat/blob/master/img/langchain%2Bchatglm.png)).
<img src="https://llm-assets.readthedocs.io/en/latest/_images/langchain-arch.png" height="50%" />
## Quickstart
### Install and Run
Follow the guide that corresponds to your specific system and device from the links provided below:
- For systems with Intel Core Ultra integrated GPU: [Windows Guide](https://github.com/intel-analytics/Langchain-Chatchat/blob/ipex-llm/INSTALL_win_mtl.md#) | [Linux Guide](https://github.com/intel-analytics/Langchain-Chatchat/blob/ipex-llm/INSTALL_linux_mtl.md#)
- For systems with Intel Arc A-Series GPU: [Windows Guide](https://github.com/intel-analytics/Langchain-Chatchat/blob/ipex-llm/INSTALL_win_arc.md#) | [Linux Guide](https://github.com/intel-analytics/Langchain-Chatchat/blob/ipex-llm/INSTALL_linux_arc.md#)
- For systems with Intel Data Center Max Series GPU: [Linux Guide](https://github.com/intel-analytics/Langchain-Chatchat/blob/ipex-llm/INSTALL_linux_max.md#)
- For systems with Xeon-Series CPU: [Linux Guide](https://github.com/intel-analytics/Langchain-Chatchat/blob/ipex-llm/INSTALL_linux_xeon.md#)
### How to use RAG
#### Step 1: Create Knowledge Base
- Select `Manage Knowledge Base` from the menu on the left, then choose `New Knowledge Base` from the dropdown menu on the right side.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/new-kb.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/new-kb.png" alt="rag-menu" width="100%" align="center">
</a>
- Fill in the name of your new knowledge base (example: "test") and press the `Create` button. Adjust any other settings as needed.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/create-kb.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/create-kb.png" alt="rag-menu" width="100%" align="center">
</a>
- Upload knowledge files from your computer and allow some time for the upload to complete. Once finished, click on `Add files to Knowledge Base` button to build the vector store. Note: this process may take several minutes.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/build-kb.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/build-kb.png" alt="rag-menu" width="100%" align="center">
</a>
#### Step 2: Chat with RAG
You can now click `Dialogue` on the left-side menu to return to the chat UI. Then in `Knowledge base settings` menu, choose the Knowledge Base you just created, e.g, "test". Now you can start chatting.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/rag-menu.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/rag-menu.png" alt="rag-menu" width="100%" align="center">
</a>
<br/>
For more information about how to use Langchain-Chatchat, refer to the official Quickstart guide in [English](./README_en.md#) or [Chinese](./README_chs.md#), or the [Wiki](https://github.com/chatchat-space/Langchain-Chatchat/wiki/).
### Troubleshooting & Tips
#### 1. Version Compatibility
Ensure that you have installed `ipex-llm>=2.1.0b20240327`. To upgrade `ipex-llm`, use
```bash
pip install --pre --upgrade ipex-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 2. Prompt Templates
In the left-side menu, you have the option to choose a prompt template. There are several pre-defined templates: those ending with '_cn' are Chinese templates, and those ending with '_en' are English templates. You can also define your own prompt templates in `configs/prompt_config.py`. Remember to restart the service to enable these changes.

View file

@ -0,0 +1,169 @@
# Run Coding Copilot in VSCode with Intel GPU
[**Continue**](https://marketplace.visualstudio.com/items?itemName=Continue.continue) is a coding copilot extension in [Microsoft Visual Studio Code](https://code.visualstudio.com/); by integrating it with [`ipex-llm`](https://github.com/intel-analytics/ipex-llm), users can now easily leverage local LLMs running on Intel GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max) for code explanation, code generation/completion, etc.
Below is a demo of using `Continue` with [CodeQWen1.5-7B](https://huggingface.co/Qwen/CodeQwen1.5-7B-Chat) running on an Intel A770 GPU. This demo illustrates how a programmer used `Continue` to find a solution for [Kaggle's _Titanic_ challenge](https://www.kaggle.com/competitions/titanic/), which involves asking `Continue` to complete the code for model fitting, evaluation, hyperparameter tuning, and feature engineering, and to explain the generated code.
<video src="https://llm-assets.readthedocs.io/en/latest/_images/continue_demo_ollama_backend_arc.mp4" width="100%" controls></video>
## Quickstart
This guide walks you through setting up and running **Continue** within _Visual Studio Code_, empowered by local large language models served via [Ollama](./ollama_quickstart.html) with `ipex-llm` optimizations.
### 1. Install and Run Ollama Serve
Visit [Run Ollama with IPEX-LLM on Intel GPU](./ollama_quickstart.html), and follow the steps 1) [Install IPEX-LLM for Ollama](./ollama_quickstart.html#install-ipex-llm-for-ollama), 2) [Initialize Ollama](./ollama_quickstart.html#initialize-ollama) 3) [Run Ollama Serve](./ollama_quickstart.html#run-ollama-serve) to install, init and start the Ollama Service.
```eval_rst
.. important::
If the `Continue` plugin is not installed on the same machine where Ollama is running (which means `Continue` needs to connect to a remote Ollama service), you must configure the Ollama service to accept connections from any IP address. To achieve this, set or export the environment variable `OLLAMA_HOST=0.0.0.0` before executing the command `ollama serve`.
.. tip::
If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionally set the following environment variable for optimal performance before executing `ollama serve`:
.. code-block:: bash
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
### 2. Pull and Prepare the Model
#### 2.1 Pull Model
Now we need to pull a model for coding. Here we use [CodeQWen1.5-7B](https://huggingface.co/Qwen/CodeQwen1.5-7B-Chat) model as an example. Open a new terminal window, run the following command to pull [`codeqwen:latest`](https://ollama.com/library/codeqwen).
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
export no_proxy=localhost,127.0.0.1
./ollama pull codeqwen:latest
.. tab:: Windows
Please run the following command in Miniforge Prompt.
.. code-block:: cmd
set no_proxy=localhost,127.0.0.1
ollama pull codeqwen:latest
.. seealso::
Besides CodeQWen, there are other coding models you might want to explore, such as Magicoder, Wizardcoder, Codellama, Codegemma, Starcoder, Starcoder2, etc. You can find these models in the `Ollama model library <https://ollama.com/library>`_. Simply search for the model, pull it in a similar manner, and give it a try.
```
#### 2.2 Prepare the Model and Pre-load
To make `Continue` run more smoothly with Ollama, we will create a new model in Ollama based on the original one, with the `num_ctx` parameter adjusted to 4096.
Start by creating a file named `Modelfile` with the following content:
```dockerfile
FROM codeqwen:latest
PARAMETER num_ctx 4096
```
Next, use the following commands in the terminal (Linux) or Miniforge Prompt (Windows) to create a new model in Ollama named `codeqwen:latest-continue`:
```bash
ollama create codeqwen:latest-continue -f Modelfile
```
After creation, run `ollama list` to see `codeqwen:latest-continue` in the list of models.
Finally, preload the new model by executing the following command in a new terminal (Linux) or Miniforge Prompt (Windows):
```bash
ollama run codeqwen:latest-continue
```
### 3. Install `Continue` Extension
Search for `Continue` in the VSCode `Extensions Marketplace` and install it just like any other extension.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/continue_install.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/continue_install.png" width=100%; />
</a>
<br/>
Once installed, the `Continue` icon will appear on the left sidebar. You can drag and drop the icon to the right sidebar for easy access to the `Continue` view.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/continue_dragdrop.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/continue_dragdrop.png" width=100%; />
</a>
<br/>
If the icon does not appear or you cannot open the view, press `Ctrl+Shift+L` or follow the steps below to open the `Continue` view on the right side.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/continue_openview.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/continue_openview.png" width=100%; />
</a>
<br/>
Once you have successfully opened the `Continue` view, you will see the welcome screen as shown below. Select **Fully local** -> **Continue** -> **Continue** as illustrated.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/continue_welcome.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/continue_welcome.png" width=100%; />
</a>
When you see the screen below, your plug-in is ready to use.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/continue_ready.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/continue_ready.png" width=100%; />
</a>
### 4. `Continue` Configuration
Once `Continue` is installed and ready, simply select the model "`Ollama - codeqwen:latest-continue`" from the bottom of the `Continue` view (all models in `ollama list` will appear in the format `Ollama-xxx`).
Now you can start using `Continue`.
#### Connecting to Remote Ollama Service
You can configure `Continue` by clicking the small gear icon located at the bottom right of the `Continue` view to open `config.json`. In `config.json`, you will find all necessary configuration settings.
If you are running Ollama on the same machine as `Continue`, no changes are necessary. If Ollama is running on a different machine, you'll need to update the `apiBase` key in `Ollama` item in `config.json` to point to the remote Ollama URL, as shown in the example below and in the figure.
```json
{
"title": "Ollama",
"provider": "ollama",
"model": "AUTODETECT",
"apiBase": "http://your-ollama-service-ip:11434"
}
```
<a href="https://llm-assets.readthedocs.io/en/latest/_images/continue_config.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/continue_config.png" width=100%; />
</a>
### 5. How to Use `Continue`
For detailed tutorials please refer to [this link](https://continue.dev/docs/how-to-use-continue). Here we are only showing the most common scenarios.
#### Q&A over specific code
If you don't understand how some code works, highlight it (press `Ctrl+Shift+L`) and ask "how does this code work?"
<a href="https://llm-assets.readthedocs.io/en/latest/_images/continue_quickstart_sample_usage1.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/continue_quickstart_sample_usage1.png" width=100%; />
</a>
#### Editing code
You can ask Continue to edit your highlighted code with the command `/edit`.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/continue_quickstart_sample_usage2.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/continue_quickstart_sample_usage2.png" width=100%; />
</a>

View file

@ -0,0 +1,102 @@
# Run IPEX-LLM serving on Multiple Intel GPUs using DeepSpeed AutoTP and FastApi
This example demonstrates how to run IPEX-LLM serving on multiple [Intel GPUs](../README.md) by leveraging DeepSpeed AutoTP.
## Requirements
To run this example with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information. For this particular example, you will need at least two GPUs on your machine.
## Example
### 1. Install
```bash
conda create -n llm python=3.11
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
pip install oneccl_bind_pt==2.1.100 --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
# configures OneAPI environment variables
source /opt/intel/oneapi/setvars.sh
pip install git+https://github.com/microsoft/DeepSpeed.git@ed8aed5
pip install git+https://github.com/intel/intel-extension-for-deepspeed.git@0eb734b
pip install mpi4py fastapi uvicorn
conda install -c conda-forge -y gperftools=2.10 # to enable tcmalloc
```
> **Important**: IPEX 2.1.10+xpu requires Intel® oneAPI Base Toolkit's version == 2024.0. Please make sure you have installed the correct version.
### 2. Run tensor parallel inference on multiple GPUs
When we run the model in a distributed manner across two GPUs, the memory consumption of each GPU is only half of what it was originally, and the GPUs can work simultaneously during inference computation.
We provide an example of running the `Llama-2-7b-chat-hf` model on two Intel Arc A770 GPUs:
```bash
# Before running this script, adjust YOUR_REPO_ID_OR_MODEL_PATH in its last line
# If you want to change the server port, set the port parameter in the last line
# To avoid GPU OOM, you can adjust the --max-num-seqs and --max-num-batched-tokens parameters in the script
bash run_llama2_7b_chat_hf_arc_2_card.sh
```
If you successfully run the serving, you can get output like this:
```bash
[0] INFO: Started server process [120071]
[0] INFO: Waiting for application startup.
[0] INFO: Application startup complete.
[0] INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
> **Note**: You can change `NUM_GPUS` to the number of GPUs you have on your machine. You can also specify other low-bit optimizations through `--low-bit`.
### 3. Sample Input and Output
We can use `curl` to test the serving API:
```bash
# Set http_proxy and https_proxy to null to ensure that requests are not forwarded by a proxy.
export http_proxy=
export https_proxy=
curl -X 'POST' \
'http://127.0.0.1:8000/generate/' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"prompt": "What is AI?",
"n_predict": 32
}'
```
And you should get output like this:
```json
{
"generated_text": "What is AI? Artificial intelligence (AI) refers to the development of computer systems able to perform tasks that would normally require human intelligence, such as visual perception, speech",
"generate_time": "0.45149803161621094s"
}
```
**Important**: The first token latency is much larger than the rest token latency; you can use [our benchmark tool](https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/dev/benchmark/README.md) to obtain more details about first and rest token latency.
### 4. Benchmark with wrk
We use wrk to test end-to-end throughput; see [here](https://github.com/wg/wrk).
You can install by:
```bash
sudo apt install wrk
```
Please change the test URL accordingly.
```bash
# set -t (threads) and -c (connections) to the desired concurrency to test full throughput.
wrk -t1 -c1 -d5m -s ./wrk_script_1024.lua http://127.0.0.1:8000/generate/ --timeout 1m
```

View file

@ -0,0 +1,150 @@
# Run Dify on Intel GPU
[**Dify**](https://dify.ai/) is an open-source production-ready LLM app development platform; by integrating it with [`ipex-llm`](https://github.com/intel-analytics/ipex-llm), users can now easily leverage local LLMs running on Intel GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max) for building complex AI workflows (e.g. RAG).
*See the demo of a RAG workflow in Dify running LLaMA2-7B on Intel A770 GPU below.*
<video src="https://llm-assets.readthedocs.io/en/latest/_images/dify-rag-small.mp4" width="100%" controls></video>
## Quickstart
### 1. Install and Start `Ollama` Service on Intel GPU
Follow the steps in [Run Ollama on Intel GPU Guide](./ollama_quickstart.md) to install and run Ollama on Intel GPU. Ensure that `ollama serve` is running correctly and can be accessed through a local URL (e.g., `http://127.0.0.1:11434`) or a remote URL (e.g., `http://your_ip:11434`).
We recommend pulling the desired model before proceeding with Dify. For instance, to pull the LLaMA2-7B model, you can use the following command:
```bash
ollama pull llama2:7b
```
### 2. Install and Start `Dify`
#### 2.1 Download `Dify`
You can either clone the repository or download the source zip from [github](https://github.com/langgenius/dify/archive/refs/heads/main.zip):
```bash
git clone https://github.com/langgenius/dify.git
```
#### 2.2 Setup Redis and PostgreSQL
Next, deploy PostgreSQL and Redis. You can choose to utilize Docker, following the steps in the [Local Source Code Start Guide](https://docs.dify.ai/getting-started/install-self-hosted/local-source-code#clone-dify), or proceed without Docker using the following instructions:
- Install Redis by executing `sudo apt-get install redis-server`. Refer to [this guide](https://www.hostinger.com/tutorials/how-to-install-and-setup-redis-on-ubuntu/) for Redis environment setup, including password configuration and daemon settings.
- Install PostgreSQL by following either [the Official PostgreSQL Tutorial](https://www.postgresql.org/docs/current/tutorial.html) or [a PostgreSQL Quickstart Guide](https://www.digitalocean.com/community/tutorials/how-to-install-postgresql-on-ubuntu-20-04-quickstart). After installation, proceed with the following PostgreSQL commands for setting up Dify. These commands create a username/password for Dify (e.g., `dify_user`, change `'your_password'` as desired), create a new database named `dify`, and grant privileges:
```sql
CREATE USER dify_user WITH PASSWORD 'your_password';
CREATE DATABASE dify;
GRANT ALL PRIVILEGES ON DATABASE dify TO dify_user;
```
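A minimal sketch of running the statements above with `psql`, assuming a default local PostgreSQL installation where the `postgres` superuser is accessible via sudo:
```bash
# Create the Dify database user and database, then grant privileges
sudo -u postgres psql -c "CREATE USER dify_user WITH PASSWORD 'your_password';"
sudo -u postgres psql -c "CREATE DATABASE dify;"
sudo -u postgres psql -c "GRANT ALL PRIVILEGES ON DATABASE dify TO dify_user;"
```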
Configure Redis and PostgreSQL settings in the `.env` file located under dify source folder `dify/api/`:
```bash dify/api/.env
### Example dify/api/.env
## Redis settings
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_USERNAME=your_redis_user_name # change if needed
REDIS_PASSWORD=your_redis_password # change if needed
REDIS_DB=0
## postgreSQL settings
DB_USERNAME=dify_user # change if needed
DB_PASSWORD=your_dify_password # change if needed
DB_HOST=localhost
DB_PORT=5432
DB_DATABASE=dify # change if needed
```
#### 2.3 Server Deployment
Follow the steps in the [`Server Deployment` section in Local Source Code Start Guide](https://docs.dify.ai/getting-started/install-self-hosted/local-source-code#server-deployment) to deploy and start the Dify Server.
Upon successful deployment, you will see logs in the terminal similar to the following:
```bash
INFO:werkzeug:
* Running on all addresses (0.0.0.0)
* Running on http://127.0.0.1:5001
* Running on http://10.239.44.83:5001
INFO:werkzeug:Press CTRL+C to quit
INFO:werkzeug: * Restarting with stat
WARNING:werkzeug: * Debugger is active!
INFO:werkzeug: * Debugger PIN: 227-697-894
```
#### 2.4 Deploy the frontend page
Refer to the instructions provided in the [`Deploy the frontend page` section in Local Source Code Start Guide](https://docs.dify.ai/getting-started/install-self-hosted/local-source-code#deploy-the-frontend-page) to deploy the frontend page.
Below is an example of environment variable configuration found in `dify/web/.env.local`:
```bash
# For production release, change this to PRODUCTION
NEXT_PUBLIC_DEPLOY_ENV=DEVELOPMENT
NEXT_PUBLIC_EDITION=SELF_HOSTED
NEXT_PUBLIC_API_PREFIX=http://localhost:5001/console/api
NEXT_PUBLIC_PUBLIC_API_PREFIX=http://localhost:5001/api
NEXT_PUBLIC_SENTRY_DSN=
```
```eval_rst
.. note::
If you encounter connection problems, you may run `export no_proxy=localhost,127.0.0.1` before starting the API service, Worker service and frontend.
```
### 3. How to Use `Dify`
For comprehensive usage instructions of Dify, please refer to the [Dify Documentation](https://docs.dify.ai/). In this section, we'll only highlight a few key steps for local LLM setup.
#### Setup Ollama
Open your browser and access the Dify UI at `http://localhost:3000`.
Configure the Ollama URL in `Settings > Model Providers > Ollama`. For detailed instructions on how to do this, see the [Ollama Guide in the Dify Documentation](https://docs.dify.ai/tutorials/model-configuration/ollama).
<p align="center"><a href="https://docs.dify.ai/~gitbook/image?url=https%3A%2F%2F3866086014-files.gitbook.io%2F%7E%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FRncMhlfeYTrpujwzDIqw%252Fuploads%252Fgit-blob-351b275c8b6420ff85c77e67bf39a11aaf899b7b%252Follama-config-en.png%3Falt%3Dmedia&width=768&dpr=2&quality=100&sign=1ec95e72d9d0459384cce28665eb84ffd8ed59c906ab0fdb3f47fa67f61275dc" target="_blank" align="center"><img src="https://docs.dify.ai/~gitbook/image?url=https%3A%2F%2F3866086014-files.gitbook.io%2F%7E%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FRncMhlfeYTrpujwzDIqw%252Fuploads%252Fgit-blob-351b275c8b6420ff85c77e67bf39a11aaf899b7b%252Follama-config-en.png%3Falt%3Dmedia&width=768&dpr=2&quality=100&sign=1ec95e72d9d0459384cce28665eb84ffd8ed59c906ab0fdb3f47fa67f61275dc" alt="rag-menu" width="80%" align="center"></a></p>
Once Ollama is successfully connected, you will see a list of Ollama models similar to the following:
<p align="center"><a href="https://llm-assets.readthedocs.io/en/latest/_images/dify-p1.png" target="_blank" align="center">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/dify-p1.png" alt="image-p1" width=100%; />
</a></p>
#### Run a simple RAG
- Select the text summarization workflow template from the studio.
<p><a href="https://llm-assets.readthedocs.io/en/latest/_images/dify-p2.png" target="_blank" align="center">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/dify-p2.png" alt="image-p2" width=100%; align="center" />
</a></p>
- Add a knowledge base and specify the LLM or embedding model to use.
<p><a href="https://llm-assets.readthedocs.io/en/latest/_images/dify-p3.png" target="_blank" align="center">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/dify-p3.png" alt="image-p3" width=100%; />
</a></p>
- Enter your input in the workflow and execute it. You'll find retrieval results and generated answers on the right.
<p align="center"><a href="https://llm-assets.readthedocs.io/en/latest/_images/dify-p5.png" target="_blank" align="center">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/dify-p5.png" alt="image-20240221102252560" width=100%; align="center"/>
</a></p>

View file

@ -0,0 +1,421 @@
# Serving using IPEX-LLM and FastChat
FastChat is an open platform for training, serving, and evaluating large language model based chatbots. You can find the detailed information at their [homepage](https://github.com/lm-sys/FastChat).
IPEX-LLM can be easily integrated into FastChat so that users can use `IPEX-LLM` as a serving backend in their deployments.
## Quick Start
This quickstart guide walks you through installing and running `FastChat` with `ipex-llm`.
## 1. Install IPEX-LLM with FastChat
To run on CPU, you can install ipex-llm as follows:
```bash
pip install --pre --upgrade ipex-llm[serving,all]
```
To add GPU support for FastChat, you may install **`ipex-llm`** as follows:
```bash
pip install --pre --upgrade ipex-llm[xpu,serving] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
## 2. Start the service
### Launch controller
You first need to run the FastChat controller:
```bash
python3 -m fastchat.serve.controller
```
If the controller runs successfully, you will see output like this:
```bash
Uvicorn running on http://localhost:21001
```
### Launch model worker(s) and load models
Using IPEX-LLM in FastChat does not impose any new limitations on model usage. Therefore, all Hugging Face Transformer models can be utilized in FastChat.
#### IPEX-LLM worker
To integrate IPEX-LLM with `FastChat` efficiently, we have provided a new model_worker implementation named `ipex_llm_worker.py`.
```bash
# On CPU
# Available low_bit format including sym_int4, sym_int8, bf16 etc.
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "cpu"
# On GPU
# Available low_bit format including sym_int4, sym_int8, fp16 etc.
source /opt/intel/oneapi/setvars.sh
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "xpu"
```
We have also provided an option `--load-low-bit-model` to load models that have been converted and saved to disk using the `save_low_bit` interface, as introduced in this [document](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/KeyFeatures/hugging_face_format.html#save-load).
Check the following examples:
```bash
# Or --device "cpu"
python -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path /Low/Bit/Model/Path --trust-remote-code --device "xpu" --load-low-bit-model
```
#### Self-speculative decoding example
You can use IPEX-LLM to run a `self-speculative decoding` example. Refer to [here](https://github.com/intel-analytics/ipex-llm/tree/c9fac8c26bf1e1e8f7376fa9a62b32951dd9e85d/python/llm/example/GPU/Speculative-Decoding) for more details on Intel MAX GPUs, and refer to [here](https://github.com/intel-analytics/ipex-llm/tree/c9fac8c26bf1e1e8f7376fa9a62b32951dd9e85d/python/llm/example/GPU/Speculative-Decoding) for more details on Intel CPUs.
```bash
# Available low_bit format only including bf16 on CPU.
source ipex-llm-init -t
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path lmsys/vicuna-7b-v1.5 --low-bit "bf16" --trust-remote-code --device "cpu" --speculative
# Available low_bit format only including fp16 on GPU.
source /opt/intel/oneapi/setvars.sh
export ENABLE_SDP_FUSION=1
export SYCL_CACHE_PERSISTENT=1
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path lmsys/vicuna-7b-v1.5 --low-bit "fp16" --trust-remote-code --device "xpu" --speculative
```
You can get output like this:
```bash
2024-04-12 18:18:09 | INFO | ipex_llm.transformers.utils | Converting the current model to sym_int4 format......
2024-04-12 18:18:11 | INFO | model_worker | Register to controller
2024-04-12 18:18:11 | ERROR | stderr | INFO: Started server process [126133]
2024-04-12 18:18:11 | ERROR | stderr | INFO: Waiting for application startup.
2024-04-12 18:18:11 | ERROR | stderr | INFO: Application startup complete.
2024-04-12 18:18:11 | ERROR | stderr | INFO: Uvicorn running on http://localhost:21002
```
For a full list of accepted arguments, refer to the main method of `ipex_llm_worker.py`.
#### IPEX-LLM vLLM worker
We also provide the `vllm_worker` which uses the [vLLM](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/vLLM-Serving) engine for better hardware utilization.
To run using the `vllm_worker`, you don't need to change the model name; simply use the following commands:
```bash
# On CPU
python3 -m ipex_llm.serving.fastchat.vllm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --device cpu
# On GPU
source /opt/intel/oneapi/setvars.sh
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
python3 -m ipex_llm.serving.fastchat.vllm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --device xpu --load-in-low-bit "sym_int4" --enforce-eager
```
#### Launch multiple workers
Sometimes we may want to start multiple workers for the best performance. When running on CPU, you may want to separate the workers across different sockets. Assuming each socket has 48 physical cores, you can start two workers as in the following example:
```bash
export OMP_NUM_THREADS=48
numactl -C 0-47 -m 0 python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "cpu" &
# All the workers other than the first worker need to specify a different worker port and corresponding worker-address
numactl -C 48-95 -m 1 python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "cpu" --port 21003 --worker-address "http://localhost:21003" &
```
For GPU, we may want to start two workers using different GPUs. To achieve this, use the `ZE_AFFINITY_MASK` environment variable to select different GPUs for different workers. An example is shown below:
```bash
ZE_AFFINITY_MASK=1 python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "xpu" &
# All the workers other than the first worker need to specify a different worker port and corresponding worker-address
ZE_AFFINITY_MASK=2 python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "xpu" --port 21003 --worker-address "http://localhost:21003" &
```
If you are not sure about the effect of `ZE_AFFINITY_MASK`, you can set it and check the result of `sycl-ls`.
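For example, a minimal sketch of comparing the device list with and without a mask (the mask value `1` here is just an illustration; pick the GPU index you intend to use):
```bash
# Full device list
sycl-ls
# Devices visible to a process started with the mask
ZE_AFFINITY_MASK=1 sycl-ls
```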
### Launch Gradio web server
When you have started the controller and the worker, you can start web server as follows:
```bash
python3 -m fastchat.serve.gradio_web_server
```
This is the user interface that users will interact with.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/fastchat_gradio_web_ui.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/fastchat_gradio_web_ui.png" width=100%; />
</a>
By following these steps, you will be able to serve your models using the web UI with IPEX-LLM as the backend. You can open your browser and chat with a model now.
### Launch TGI Style API server
When you have started the controller and the worker, you can start TGI Style API server as follows:
```bash
python3 -m ipex_llm.serving.fastchat.tgi_api_server --host localhost --port 8000
```
You can use `curl` to observe the output of the API.
#### Using /generate API
This sends a sentence as input in the request, and the response is expected to contain the model-generated answer.
```bash
curl -X POST -H "Content-Type: application/json" -d '{
"inputs": "What is AI?",
"parameters": {
"best_of": 1,
"decoder_input_details": true,
"details": true,
"do_sample": true,
"frequency_penalty": 0.1,
"grammar": {
"type": "json",
"value": "string"
},
"max_new_tokens": 32,
"repetition_penalty": 1.03,
"return_full_text": false,
"seed": 0.1,
"stop": [
"photographer"
],
"temperature": 0.5,
"top_k": 10,
"top_n_tokens": 5,
"top_p": 0.95,
"truncate": true,
"typical_p": 0.95,
"watermark": true
}
}' http://localhost:8000/generate
```
Sample output:
```bash
{
"details": {
"best_of_sequences": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "\nArtificial Intelligence (AI) is a branch of computer science that attempts to simulate the way that the human brain works. It is a branch of computer "
},
"finish_reason": "length",
"generated_text": "\nArtificial Intelligence (AI) is a branch of computer science that attempts to simulate the way that the human brain works. It is a branch of computer ",
"generated_tokens": 31
}
]
},
"generated_text": "\nArtificial Intelligence (AI) is a branch of computer science that attempts to simulate the way that the human brain works. It is a branch of computer ",
"usage": {
"prompt_tokens": 4,
"total_tokens": 35,
"completion_tokens": 31
}
}
```
#### Using /generate_stream API
This sends a sentence as input in the request, and a long-lived connection is opened to continuously receive multiple responses containing the model-generated answer.
```bash
curl -X POST -H "Content-Type: application/json" -d '{
"inputs": "What is AI?",
"parameters": {
"best_of": 1,
"decoder_input_details": true,
"details": true,
"do_sample": true,
"frequency_penalty": 0.1,
"grammar": {
"type": "json",
"value": "string"
},
"max_new_tokens": 32,
"repetition_penalty": 1.03,
"return_full_text": false,
"seed": 0.1,
"stop": [
"photographer"
],
"temperature": 0.5,
"top_k": 10,
"top_n_tokens": 5,
"top_p": 0.95,
"truncate": true,
"typical_p": 0.95,
"watermark": true
}
}' http://localhost:8000/generate_stream
```
Sample output:
```bash
data: {"token": {"id": 663359, "text": "", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 300560, "text": "\n", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 725120, "text": "Artificial Intelligence ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 734609, "text": "(AI) is ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 362235, "text": "a branch of computer ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 380983, "text": "science that attempts to ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 249979, "text": "simulate the way that ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 972663, "text": "the human brain ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 793301, "text": "works. It is a ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 501380, "text": "branch of computer ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 673232, "text": "", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 2, "text": "</s>", "logprob": 0.0, "special": true}, "generated_text": "\nArtificial Intelligence (AI) is a branch of computer science that attempts to simulate the way that the human brain works. It is a branch of computer ", "details": {"finish_reason": "eos_token", "generated_tokens": 31, "prefill_tokens": 4, "seed": 2023}, "special_ret": {"tensor": []}}
```
### Launch RESTful API server
To start an OpenAI API server that provides compatible APIs using IPEX-LLM backend, you can launch the `openai_api_server` and follow this [doc](https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md) to use it.
When you have started the controller and the worker, you can start RESTful API server as follows:
```bash
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```
You can use `curl` to observe the output of the API, and format the output using `jq`.
#### List Models
```bash
curl http://localhost:8000/v1/models | jq
```
Example output
```json
{
"object": "list",
"data": [
{
"id": "Llama-2-7b-chat-hf",
"object": "model",
"created": 1712919071,
"owned_by": "fastchat",
"root": "Llama-2-7b-chat-hf",
"parent": null,
"permission": [
{
"id": "modelperm-XpFyEE7Sewx4XYbEcdbCVz",
"object": "model_permission",
"created": 1712919071,
"allow_create_engine": false,
"allow_sampling": true,
"allow_logprobs": true,
"allow_search_indices": true,
"allow_view": true,
"allow_fine_tuning": false,
"organization": "*",
"group": null,
"is_blocking": false
}
]
}
]
}
```
#### Chat Completions
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Llama-2-7b-chat-hf",
"messages": [{"role": "user", "content": "Hello! What is your name?"}]
}' | jq
```
Example output
```json
{
"id": "chatcmpl-jJ9vKSGkcDMTxKfLxK7q2x",
"object": "chat.completion",
"created": 1712919092,
"model": "Llama-2-7b-chat-hf",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " Hello! My name is LLaMA, I'm a large language model trained by a team of researcher at Meta AI. Unterscheidung. 😊"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 15,
"total_tokens": 53,
"completion_tokens": 38
}
}
```
#### Text Completions
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Llama-2-7b-chat-hf",
"prompt": "Once upon a time",
"max_tokens": 41,
"temperature": 0.5
}' | jq
```
Example Output:
```json
{
"id": "cmpl-PsAkpTWMmBLzWCTtM4r97Y",
"object": "text_completion",
"created": 1712919307,
"model": "Llama-2-7b-chat-hf",
"choices": [
{
"index": 0,
"text": ", in a far-off land, there was a magical kingdom called \"Happily Ever Laughter.\" It was a place where laughter was the key to happiness, and everyone who ",
"logprobs": null,
"finish_reason": "length"
}
],
"usage": {
"prompt_tokens": 5,
"total_tokens": 45,
"completion_tokens": 40
}
}
```
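Since the server exposes OpenAI-compatible endpoints, you can also request streaming responses; a hedged sketch (whether streaming is enabled and the exact chunk format depend on your FastChat version):

```bash
# Responses arrive as server-sent events: lines of the form "data: {...}",
# typically terminated by "data: [DONE]".
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "Hello! What is your name?"}],
    "stream": true
  }'
```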

View file

@@ -0,0 +1,33 @@
IPEX-LLM Quickstart
================================
.. note::
We are adding more Quickstart guide.
This section includes efficient guides to show you how to:
* |bigdl_llm_migration_guide|_
* `Install IPEX-LLM on Linux with Intel GPU <./install_linux_gpu.html>`_
* `Install IPEX-LLM on Windows with Intel GPU <./install_windows_gpu.html>`_
* `Install IPEX-LLM in Docker on Windows with Intel GPU <./docker_windows_gpu.html>`_
* `Run PyTorch Inference on Intel GPU using Docker (on Linux or WSL) <./docker_benchmark_quickstart.html>`_
* `Run Performance Benchmarking with IPEX-LLM <./benchmark_quickstart.html>`_
* `Run Local RAG using Langchain-Chatchat on Intel GPU <./chatchat_quickstart.html>`_
* `Run Text Generation WebUI on Intel GPU <./webui_quickstart.html>`_
* `Run Open WebUI on Intel GPU <./open_webui_with_ollama_quickstart.html>`_
* `Run PrivateGPT with IPEX-LLM on Intel GPU <./privateGPT_quickstart.html>`_
* `Run Coding Copilot (Continue) in VSCode with Intel GPU <./continue_quickstart.html>`_
* `Run Dify on Intel GPU <./dify_quickstart.html>`_
* `Run llama.cpp with IPEX-LLM on Intel GPU <./llama_cpp_quickstart.html>`_
* `Run Ollama with IPEX-LLM on Intel GPU <./ollama_quickstart.html>`_
* `Run Llama 3 on Intel GPU using llama.cpp and ollama with IPEX-LLM <./llama3_llamacpp_ollama_quickstart.html>`_
* `Run IPEX-LLM Serving with FastChat <./fastchat_quickstart.html>`_
* `Run IPEX-LLM Serving with vLLM on Intel GPU <./vLLM_quickstart.html>`_
* `Finetune LLM with Axolotl on Intel GPU <./axolotl_quickstart.html>`_
* `Run IPEX-LLM serving on Multiple Intel GPUs using DeepSpeed AutoTP and FastApi <./deepspeed_autotp_fastapi_quickstart.html>`_
.. |bigdl_llm_migration_guide| replace:: ``bigdl-llm`` Migration Guide
.. _bigdl_llm_migration_guide: bigdl_llm_migration.html

View file

@@ -0,0 +1,313 @@
# Install IPEX-LLM on Linux with Intel GPU
This guide demonstrates how to install IPEX-LLM on Linux with Intel GPUs. It applies to Intel Data Center GPU Flex Series and Max Series, as well as Intel Arc Series GPUs.
IPEX-LLM currently supports the Ubuntu 20.04 operating system and later, and supports PyTorch 2.0 and PyTorch 2.1 on Linux. This page demonstrates IPEX-LLM with PyTorch 2.1. Check the [Installation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#linux) page for more details.
## Install Prerequisites
### Install GPU Driver
#### For Linux kernel 6.2
* Install wget, gpg-agent
```bash
sudo apt-get install -y gpg-agent wget
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | \
sudo gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg
echo "deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy client" | \
sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list
```
<img src="https://llm-assets.readthedocs.io/en/latest/_images/wget.png" width=100%; />
* Install drivers
```bash
sudo apt-get update
sudo apt-get -y install \
gawk \
dkms \
linux-headers-$(uname -r) \
libc6-dev
sudo apt install intel-i915-dkms intel-fw-gpu
sudo apt-get install -y gawk libc6-dev udev \
intel-opencl-icd intel-level-zero-gpu level-zero \
intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \
libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo
sudo reboot
```
<img src="https://llm-assets.readthedocs.io/en/latest/_images/i915.png" width=100%; />
<img src="https://llm-assets.readthedocs.io/en/latest/_images/gawk.png" width=100%; />
* Configure permissions
```bash
sudo gpasswd -a ${USER} render
newgrp render
# Verify the device is working with i915 driver
sudo apt-get install -y hwinfo
hwinfo --display
```
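You can also confirm that your user has been added to the `render` group; a minimal sketch:

```bash
# The "render" group should appear in the output for your user
groups ${USER}
```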
#### For Linux kernel 6.5
* Install wget, gpg-agent
```bash
sudo apt-get install -y gpg-agent wget
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | \
sudo gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg
echo "deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy client" | \
sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list
```
<img src="https://llm-assets.readthedocs.io/en/latest/_images/wget.png" width=100%; />
* Install drivers
```bash
sudo apt-get update
sudo apt-get -y install \
gawk \
dkms \
linux-headers-$(uname -r) \
libc6-dev
sudo apt-get install -y gawk libc6-dev udev \
intel-opencl-icd intel-level-zero-gpu level-zero \
intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \
libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo
sudo apt install -y intel-i915-dkms intel-fw-gpu
sudo reboot
```
<img src="https://llm-assets.readthedocs.io/en/latest/_images/gawk.png" width=100%; />
#### (Optional) Update Level Zero on Intel Core™ Ultra iGPU
For Intel Core™ Ultra integrated GPUs, please make sure the level_zero version is >= 1.3.28717. The level_zero version can be checked with `sycl-ls`; the version is shown after `[ext_oneapi_level_zero:gpu]`.
Here is a sample output of `sycl-ls`:
```
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 5 125H OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) Graphics OpenCL 3.0 NEO [24.09.28717.12]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) Graphics 1.3 [1.3.28717]
```
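If you only want to check the Level Zero entry, you can filter the `sycl-ls` output; a minimal sketch:

```bash
# Prints just the Level Zero GPU line, e.g. "... Intel(R) Arc(TM) Graphics 1.3 [1.3.28717]"
sycl-ls | grep "ext_oneapi_level_zero"
```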
If your level_zero version is lower than 1.3.28717, you can update it as follows:
```bash
wget https://github.com/intel/intel-graphics-compiler/releases/download/igc-1.0.16238.4/intel-igc-core_1.0.16238.4_amd64.deb
wget https://github.com/intel/intel-graphics-compiler/releases/download/igc-1.0.16238.4/intel-igc-opencl_1.0.16238.4_amd64.deb
wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/intel-level-zero-gpu-dbgsym_1.3.28717.12_amd64.ddeb
wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/intel-level-zero-gpu_1.3.28717.12_amd64.deb
wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/intel-opencl-icd-dbgsym_24.09.28717.12_amd64.ddeb
wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/intel-opencl-icd_24.09.28717.12_amd64.deb
wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/libigdgmm12_22.3.17_amd64.deb
sudo dpkg -i *.deb
```
### Install oneAPI
```
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
sudo apt install intel-oneapi-common-vars=2024.0.0-49406 \
intel-oneapi-common-oneapi-vars=2024.0.0-49406 \
intel-oneapi-diagnostics-utility=2024.0.0-49093 \
intel-oneapi-compiler-dpcpp-cpp=2024.0.2-49895 \
intel-oneapi-dpcpp-ct=2024.0.0-49381 \
intel-oneapi-mkl=2024.0.0-49656 \
intel-oneapi-mkl-devel=2024.0.0-49656 \
intel-oneapi-mpi=2021.11.0-49493 \
intel-oneapi-mpi-devel=2021.11.0-49493 \
intel-oneapi-dal=2024.0.1-25 \
intel-oneapi-dal-devel=2024.0.1-25 \
intel-oneapi-ippcp=2021.9.1-5 \
intel-oneapi-ippcp-devel=2021.9.1-5 \
intel-oneapi-ipp=2021.10.1-13 \
intel-oneapi-ipp-devel=2021.10.1-13 \
intel-oneapi-tlt=2024.0.0-352 \
intel-oneapi-ccl=2021.11.2-5 \
intel-oneapi-ccl-devel=2021.11.2-5 \
intel-oneapi-dnnl-devel=2024.0.0-49521 \
intel-oneapi-dnnl=2024.0.0-49521 \
intel-oneapi-tcm-1.0=1.0.0-435
```
<img src="https://llm-assets.readthedocs.io/en/latest/_images/oneapi.png" alt="image-20240221102252565" width=100%; />
<img src="https://llm-assets.readthedocs.io/en/latest/_images/basekit.png" alt="image-20240221102252565" width=100%; />
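After the installation completes, you can optionally confirm that the oneAPI environment is usable; a minimal sketch (assuming the default `/opt/intel/oneapi` install location):

```bash
# Load the oneAPI environment, then check that the DPC++ compiler is on PATH
# and that the SYCL runtime can see your Intel GPU
source /opt/intel/oneapi/setvars.sh
icpx --version
sycl-ls
```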
### Setup Python Environment
Download and install the Miniforge as follows if you don't have conda installed on your machine:
```bash
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh
source ~/.bashrc
```
You can use `conda --version` to verify your conda installation.
After installation, create a new python environment `llm`:
```bash
conda create -n llm python=3.11
```
Activate the newly created environment `llm`:
```bash
conda activate llm
```
## Install `ipex-llm`
With the `llm` environment active, use `pip` to install `ipex-llm` for GPU.
Choose either US or CN website for `extra-index-url`:
```eval_rst
.. tabs::
.. tab:: US
.. code-block:: cmd
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
.. tab:: CN
.. code-block:: cmd
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/
```
```eval_rst
.. note::
If you encounter network issues while installing IPEX, refer to `this guide <https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#id3>`_ for troubleshooting advice.
```
## Verify Installation
* You can verify if `ipex-llm` is successfully installed by simply importing a few classes from the library. For example, execute the following import command in the terminal:
```bash
source /opt/intel/oneapi/setvars.sh
python
> from ipex_llm.transformers import AutoModel, AutoModelForCausalLM
```
## Runtime Configurations
To use GPU acceleration on Linux, several environment variables are required or recommended before running a GPU example.
```eval_rst
.. tabs::
.. tab:: Intel Arc™ A-Series and Intel Data Center GPU Flex
For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series, we recommend:
.. code-block:: bash
# Configure oneAPI environment variables. Required step for APT or offline installed oneAPI.
# Skip this step for PIP-installed oneAPI since the environment has already been configured in LD_LIBRARY_PATH.
source /opt/intel/oneapi/setvars.sh
# Recommended Environment Variables for optimal performance
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
.. tab:: Intel Data Center GPU Max
For Intel Data Center GPU Max Series, we recommend:
.. code-block:: bash
# Configure oneAPI environment variables. Required step for APT or offline installed oneAPI.
# Skip this step for PIP-installed oneAPI since the environment has already been configured in LD_LIBRARY_PATH.
source /opt/intel/oneapi/setvars.sh
# Recommended Environment Variables for optimal performance
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export ENABLE_SDP_FUSION=1
Please note that ``libtcmalloc.so`` can be installed by ``conda install -c conda-forge -y gperftools=2.10``
```
```eval_rst
.. seealso::
Please refer to `this guide <../Overview/install_gpu.html#id5>`_ for more details regarding runtime configuration.
```
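If you run GPU examples often, it can be convenient to collect the recommended variables for your device into a small script that you source before each run; a sketch for the Arc/Flex configuration above (the script name `ipex-llm-env.sh` is just an example):

```bash
# Write the recommended settings for Intel Arc / Flex into a reusable script
cat > ipex-llm-env.sh << 'EOF'
source /opt/intel/oneapi/setvars.sh
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
EOF

# Source it in every new shell before running a GPU example
source ipex-llm-env.sh
```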
## A Quick Example
Now let's play with a real LLM. We'll be using the [phi-1.5](https://huggingface.co/microsoft/phi-1_5) model, a 1.3 billion parameter LLM, for this demonstration. Follow the steps below to set up and run the model, and observe how it responds to the prompt "What is AI?".
* Step 1: Activate the Python environment `llm` you previously created:
```bash
conda activate llm
```
* Step 2: Follow [Runtime Configurations Section](#runtime-configurations) above to prepare your runtime environment.
* Step 3: Create a new file named `demo.py` and insert the code snippet below.
```python
# Copy/Paste the contents to a new file demo.py
import torch
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer, GenerationConfig
generation_config = GenerationConfig(use_cache = True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
# Load the model with ipex-llm 4-bit optimization and move it to the GPU
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-1_5", load_in_4bit=True, cpu_embedding=True, trust_remote_code=True)
model = model.to('xpu')
# Format the prompt
question = "What is AI?"
prompt = " Question:{prompt}\n\n Answer:".format(prompt=question)
# Generate predicted tokens
with torch.inference_mode():
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
# warm up one more time before the actual generation task for the first run, see details in `Tips & Troubleshooting`
# output = model.generate(input_ids, do_sample=False, max_new_tokens=32, generation_config = generation_config)
output = model.generate(input_ids, do_sample=False, max_new_tokens=32, generation_config = generation_config).cpu()
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_str)
```
> Note: when running LLMs on Intel iGPUs with limited memory size, we recommend setting `cpu_embedding=True` in the `from_pretrained` function.
> This will allow the memory-intensive embedding layer to utilize the CPU instead of GPU.
* Step 4. Run `demo.py` within the activated Python environment using the following command:
```bash
python demo.py
```
### Example output
Example output on a system equipped with an 11th Gen Intel Core i7 CPU and Iris Xe Graphics iGPU:
```
Question:What is AI?
Answer: AI stands for Artificial Intelligence, which is the simulation of human intelligence in machines.
```
## Tips & Troubleshooting
### Warm-up for optimal performance on first run
When running LLMs on GPU for the first time, you might notice the performance is lower than expected, with delays of up to several minutes before the first token is generated. This delay occurs because the GPU kernels require compilation and initialization, which varies across different GPU types. To achieve optimal and consistent performance, we recommend a one-time warm-up by running `model.generate(...)` an additional time before starting your actual generation tasks. If you're developing an application, you can incorporate this warm-up step into your start-up or loading routine to enhance the user experience.

View file

@@ -0,0 +1,305 @@
# Install IPEX-LLM on Windows with Intel GPU
This guide demonstrates how to install IPEX-LLM on Windows with Intel GPUs.
It applies to Intel Core Ultra and 11th to 14th Gen Intel Core integrated GPUs (iGPUs), as well as Intel Arc Series GPUs.
## Install Prerequisites
### (Optional) Update GPU Driver
```eval_rst
.. tip::
It is recommended to update your GPU driver, if you have driver version lower than ``31.0.101.5122``. Refer to `here <../Overview/install_gpu.html#prerequisites>`_ for more information.
```
Download and install the latest GPU driver from the [official Intel download page](https://www.intel.com/content/www/us/en/download/785597/intel-arc-iris-xe-graphics-windows.html). A system reboot is necessary to apply the changes after the installation is complete.
```eval_rst
.. note::
The process could take around 10 minutes. After reboot, check for the **Intel Arc Control** application to verify the driver has been installed correctly. If the installation was successful, you should see the **Arc Control** interface similar to the figure below
```
<img src="https://llm-assets.readthedocs.io/en/latest/_images/quickstart_windows_gpu_3.png" width=100%; />
<!-- ### Install oneAPI -->
<!-- Download and install the [**Intel oneAPI Base Toolkit 2024.0**](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html?operatingsystem=window&distributions=offline). During installation, you can continue with the default installation settings.
<img src="https://llm-assets.readthedocs.io/en/latest/_images/quickstart_windows_gpu_oneapi_offline_installer.png" width=100%; />
```eval_rst
.. tip::
If the oneAPI installation hangs at the finalization step for more than 10 minutes, the error might be due to a problematic install of Visual Studio. Please reboot your computer and then launch the Visual Studio installer. If you see installation error messages, please repair your Visual Studio installation. After the repair is done, oneAPI installation is completed successfully.
``` -->
### Setup Python Environment
Visit [Miniforge installation page](https://conda-forge.org/download/), download the **Miniforge installer for Windows**, and follow the instructions to complete the installation.
<div align="center">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/quickstart_windows_gpu_miniforge_download.png" width=80%/>
</div>
After installation, open the **Miniforge Prompt**, create a new python environment `llm`:
```cmd
conda create -n llm python=3.11 libuv
```
Activate the newly created environment `llm`:
```cmd
conda activate llm
```
## Install `ipex-llm`
With the `llm` environment active, use `pip` to install `ipex-llm` for GPU. Choose either US or CN website for `extra-index-url`:
```eval_rst
.. tabs::
.. tab:: US
.. code-block:: cmd
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
.. tab:: CN
.. code-block:: cmd
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/
```
```eval_rst
.. note::
If you encounter network issues while installing IPEX, refer to `this guide <https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#install-ipex-llm-from-wheel>`_ for troubleshooting advice.
```
## Verify Installation
You can verify if `ipex-llm` is successfully installed by following the steps below.
### Step 1: Runtime Configurations
* Open the **Miniforge Prompt** and activate the Python environment `llm` you previously created:
```cmd
conda activate llm
```
<!-- * Configure oneAPI variables by running the following command:
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
``` -->
* Set the following environment variables according to your device:
```eval_rst
.. tabs::
.. tab:: Intel iGPU
.. code-block:: cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
.. tab:: Intel Arc™ A770
.. code-block:: cmd
set SYCL_CACHE_PERSISTENT=1
```
```eval_rst
.. seealso::
For other Intel dGPU Series, please refer to `this guide <../Overview/install_gpu.html#runtime-configuration>`_ for more details regarding runtime configuration.
```
### Step 2: Run Python Code
* Launch the Python interactive shell by typing `python` in the Miniforge Prompt window and then press Enter.
* Copy the following code into the Miniforge Prompt **line by line** and press Enter **after copying each line**.
```python
import torch
from ipex_llm.transformers import AutoModel,AutoModelForCausalLM
tensor_1 = torch.randn(1, 1, 40, 128).to('xpu')
tensor_2 = torch.randn(1, 1, 128, 40).to('xpu')
print(torch.matmul(tensor_1, tensor_2).size())
```
It will output the following content at the end:
```
torch.Size([1, 1, 40, 40])
```
```eval_rst
.. seealso::
If you encounter any problem, please refer to `here <https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#troubleshooting>`_ for help.
```
* To exit the Python interactive shell, simply press Ctrl+Z then press Enter (or input `exit()` then press Enter).
## Monitor GPU Status
To monitor your GPU's performance and status (e.g. memory consumption, utilization, etc.), you can use either the **Windows Task Manager (in `Performance` Tab)** (see the left side of the figure below) or the **Arc Control** application (see the right side of the figure below)
<img src="https://llm-assets.readthedocs.io/en/latest/_images/quickstart_windows_gpu_4.png" width=100%; />
## A Quick Example
Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://huggingface.co/Qwen/Qwen-1_8B-Chat) model, a 1.8 billion parameter LLM, for this demonstration. Follow the steps below to set up and run the model, and observe how it responds to the prompt "What is AI?".
* Step 1: Follow [Runtime Configurations Section](#step-1-runtime-configurations) above to prepare your runtime environment.
* Step 2: Install the additional packages required by Qwen-1.8B-Chat:
```cmd
pip install tiktoken transformers_stream_generator einops
```
* Step 3: Create the code file. IPEX-LLM supports loading models from either Hugging Face or ModelScope. Please choose according to your requirements.
```eval_rst
.. tabs::
.. tab:: Hugging Face
Create a new file named ``demo.py`` and insert the code snippet below to run `Qwen-1.8B-Chat <https://huggingface.co/Qwen/Qwen-1_8B-Chat>`_ model with IPEX-LLM optimizations.
.. code-block:: python
# Copy/Paste the contents to a new file demo.py
import torch
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer, GenerationConfig
generation_config = GenerationConfig(use_cache=True)
print('Now start loading Tokenizer and optimizing Model...')
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat",
trust_remote_code=True)
# Load Model using ipex-llm and load it to GPU
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat",
load_in_4bit=True,
cpu_embedding=True,
trust_remote_code=True)
model = model.to('xpu')
print('Successfully loaded Tokenizer and optimized Model!')
# Format the prompt
question = "What is AI?"
prompt = "user: {prompt}\n\nassistant:".format(prompt=question)
# Generate predicted tokens
with torch.inference_mode():
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
print('--------------------------------------Note-----------------------------------------')
print('| For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or |')
print('| Pro A60, it may take several minutes for GPU kernels to compile and initialize. |')
print('| Please be patient until it finishes warm-up... |')
print('-----------------------------------------------------------------------------------')
# To achieve optimal and consistent performance, we recommend a one-time warm-up by running `model.generate(...)` an additional time before starting your actual generation tasks.
# If you're developing an application, you can incorporate this warm-up step into start-up or loading routine to enhance the user experience.
output = model.generate(input_ids,
do_sample=False,
max_new_tokens=32,
generation_config=generation_config) # warm-up
print('Successfully finished warm-up, now start generation...')
output = model.generate(input_ids,
do_sample=False,
max_new_tokens=32,
generation_config=generation_config).cpu()
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_str)
.. tab:: ModelScope
Please first run following command in Miniforge Prompt to install ModelScope:
.. code-block:: cmd
pip install modelscope==1.11.0
Create a new file named ``demo.py`` and insert the code snippet below to run `Qwen-1.8B-Chat <https://www.modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary>`_ model with IPEX-LLM optimizations.
.. code-block:: python
# Copy/Paste the contents to a new file demo.py
import torch
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import GenerationConfig
from modelscope import AutoTokenizer
generation_config = GenerationConfig(use_cache=True)
print('Now start loading Tokenizer and optimizing Model...')
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat",
trust_remote_code=True)
# Load Model using ipex-llm and load it to GPU
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat",
load_in_4bit=True,
cpu_embedding=True,
trust_remote_code=True,
model_hub='modelscope')
model = model.to('xpu')
print('Successfully loaded Tokenizer and optimized Model!')
# Format the prompt
question = "What is AI?"
prompt = "user: {prompt}\n\nassistant:".format(prompt=question)
# Generate predicted tokens
with torch.inference_mode():
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
print('--------------------------------------Note-----------------------------------------')
print('| For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or |')
print('| Pro A60, it may take several minutes for GPU kernels to compile and initialize. |')
print('| Please be patient until it finishes warm-up... |')
print('-----------------------------------------------------------------------------------')
# To achieve optimal and consistent performance, we recommend a one-time warm-up by running `model.generate(...)` an additional time before starting your actual generation tasks.
# If you're developing an application, you can incorporate this warm-up step into start-up or loading routine to enhance the user experience.
output = model.generate(input_ids,
do_sample=False,
max_new_tokens=32,
generation_config=generation_config) # warm-up
print('Successfully finished warm-up, now start generation...')
output = model.generate(input_ids,
do_sample=False,
max_new_tokens=32,
generation_config=generation_config).cpu()
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_str)
.. tip::
Please note that the repo id on ModelScope may be different from Hugging Face for some models.
```
```eval_rst
.. note::
When running LLMs on Intel iGPUs with limited memory size, we recommend setting ``cpu_embedding=True`` in the ``from_pretrained`` function.
This will allow the memory-intensive embedding layer to utilize the CPU instead of GPU.
```
* Step 4. Run `demo.py` within the activated Python environment using the following command:
```cmd
python demo.py
```
### Example output
Example output on a system equipped with an Intel Core Ultra 5 125H CPU and Intel Arc Graphics iGPU:
```
user: What is AI?
assistant: AI stands for Artificial Intelligence, which refers to the development of computer systems that can perform tasks that typically require human intelligence, such as visual perception, speech recognition,
```
## Tips & Troubleshooting
### Warm-up for optimal performance on first run
When running LLMs on GPU for the first time, you might notice the performance is lower than expected, with delays of up to several minutes before the first token is generated. This delay occurs because the GPU kernels require compilation and initialization, which varies across different GPU types. To achieve optimal and consistent performance, we recommend a one-time warm-up by running `model.generate(...)` an additional time before starting your actual generation tasks. If you're developing an application, you can incorporate this warm-up step into your start-up or loading routine to enhance the user experience.

View file

@@ -0,0 +1,201 @@
# Run Llama 3 on Intel GPU using llama.cpp and ollama with IPEX-LLM
[Llama 3](https://llama.meta.com/llama3/) is the latest family of Large Language Models released by [Meta](https://llama.meta.com/), providing state-of-the-art performance and excelling at language nuances, contextual understanding, and complex tasks like translation and dialogue generation.
Now, you can easily run Llama 3 on Intel GPU using `llama.cpp` and `Ollama` with IPEX-LLM.
See the demo of running Llama-3-8B-Instruct on Intel Arc GPU using `Ollama` below.
<video src="https://llm-assets.readthedocs.io/en/latest/_images/ollama-llama3-linux-arc.mp4" width="100%" controls></video>
## Quick Start
This quickstart guide walks you through how to run Llama 3 on Intel GPU using `llama.cpp` / `Ollama` with IPEX-LLM.
### 1. Run Llama 3 using llama.cpp
#### 1.1 Install IPEX-LLM for llama.cpp and Initialize
Visit [Run llama.cpp with IPEX-LLM on Intel GPU Guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html), and follow the instructions in section [Prerequisites](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html#prerequisites) to setup and section [Install IPEX-LLM for llama.cpp](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html#install-ipex-llm-for-llama-cpp) to install the IPEX-LLM with llama.cpp binaries, then follow the instructions in section [Initialize llama.cpp with IPEX-LLM](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html#initialize-llama-cpp-with-ipex-llm) to initialize.
**After the above steps, you should have created a conda environment (named `llm-cpp`, for instance) and have the llama.cpp binaries in your current directory.**
**Now you can use these executable files following standard llama.cpp usage.**
#### 1.2 Download Llama3
There are already some GGUF models of Llama 3 in the community; here we take [Meta-Llama-3-8B-Instruct-GGUF](https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF) as an example.
Suppose you have downloaded a [Meta-Llama-3-8B-Instruct-Q4_K_M.gguf](https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf) model from [Meta-Llama-3-8B-Instruct-GGUF](https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF) and put it under `<model_dir>`.
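If you have not downloaded it yet, a minimal sketch using `wget` and the download link above (`<model_dir>` is a placeholder for your model directory):

```bash
# Replace <model_dir> with your model directory before running
mkdir -p <model_dir>
wget -P <model_dir> https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf
```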
#### 1.3 Run Llama3 on Intel GPU using llama.cpp
#### Runtime Configuration
To use GPU acceleration, several environment variables are required or recommended before running `llama.cpp`.
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
.. tab:: Windows
.. code-block:: bash
set SYCL_CACHE_PERSISTENT=1
```
```eval_rst
.. tip::
If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance:
.. code-block:: bash
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
##### Run llama3
Under your current directory, execute the command below to run inference with Llama 3:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
./main -m <model_dir>/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun doing something" -t 8 -e -ngl 33 --color --no-mmap
.. tab:: Windows
Please run the following command in Miniforge Prompt.
.. code-block:: bash
main -m <model_dir>/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun doing something" -e -ngl 33 --color --no-mmap
```
Under your current directory, you can also execute the command below to have an interactive chat with Llama 3:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
./main -ngl 33 --interactive-first --color -e --in-prefix '<|start_header_id|>user<|end_header_id|>\n\n' --in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' -r '<|eot_id|>' -m <model_dir>/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf
.. tab:: Windows
Please run the following command in Miniforge Prompt.
.. code-block:: bash
main -ngl 33 --interactive-first --color -e --in-prefix "<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -r "<|eot_id|>" -m <model_dir>/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf
```
Below is a sample output on Intel Arc GPU:
<img src="https://llm-assets.readthedocs.io/en/latest/_images/llama3-cpp-arc-demo.png" width=100%; />
### 2. Run Llama3 using Ollama
#### 2.1 Install IPEX-LLM for Ollama and Initialize
Visit [Run Ollama with IPEX-LLM on Intel GPU](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/ollama_quickstart.html), and follow the instructions in section [Install IPEX-LLM for llama.cpp](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html#install-ipex-llm-for-llama-cpp) to install the IPEX-LLM with Ollama binary, then follow the instructions in section [Initialize Ollama](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/ollama_quickstart.html#initialize-ollama) to initialize.
**After the above steps, you should have created a conda environment (named `llm-cpp`, for instance) and have the ollama binary file in your current directory.**
**Now you can use this executable file following standard Ollama usage.**
#### 2.2 Run Llama3 on Intel GPU using Ollama
[ollama/ollama](https://github.com/ollama/ollama) has already added [Llama3](https://ollama.com/library/llama3) to its library, so it's really easy to run Llama 3 using Ollama now.
##### 2.2.1 Run Ollama Serve
Launch the Ollama service:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
export OLLAMA_NUM_GPU=999
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
./ollama serve
.. tab:: Windows
Please run the following command in Miniforge Prompt.
.. code-block:: bash
set no_proxy=localhost,127.0.0.1
set ZES_ENABLE_SYSMAN=1
set OLLAMA_NUM_GPU=999
set SYCL_CACHE_PERSISTENT=1
ollama serve
```
```eval_rst
.. tip::
If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance before executing `ollama serve`:
.. code-block:: bash
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
```eval_rst
.. note::
To allow the service to accept connections from all IP addresses, use `OLLAMA_HOST=0.0.0.0 ./ollama serve` instead of just `./ollama serve`.
```
##### 2.2.2 Run Llama 3 Using Ollama
Keep the Ollama service running, open another terminal, and run Llama 3 with `ollama run`:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
export no_proxy=localhost,127.0.0.1
./ollama run llama3:8b-instruct-q4_K_M
.. tab:: Windows
Please run the following command in Miniforge Prompt.
.. code-block:: bash
set no_proxy=localhost,127.0.0.1
ollama run llama3:8b-instruct-q4_K_M
```
```eval_rst
.. note::
Here we just take `llama3:8b-instruct-q4_K_M` for example, you can replace it with any other Llama3 model you want.
```
Below is a sample output on an Intel Arc GPU:
<img src="https://llm-assets.readthedocs.io/en/latest/_images/ollama-llama3-arc-demo.png" width=100%; />
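Besides the interactive console, you can also query the running Ollama service over its REST API; a sketch assuming the default port `11434` and the model tag pulled above:

```bash
# Send a one-shot generation request to the Ollama service
curl http://localhost:11434/api/generate -d '
{
  "model": "llama3:8b-instruct-q4_K_M",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```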

View file

@@ -0,0 +1,333 @@
# Run llama.cpp with IPEX-LLM on Intel GPU
[ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp) provides fast LLM inference in pure C++ across a variety of hardware; you can now use the C++ interface of [`ipex-llm`](https://github.com/intel-analytics/ipex-llm) as an accelerated backend for `llama.cpp` running on Intel **GPU** *(e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max)*.
See the demo of running LLaMA2-7B on Intel Arc GPU below.
<video src="https://llm-assets.readthedocs.io/en/latest/_images/llama-cpp-arc.mp4" width="100%" controls></video>
```eval_rst
.. note::
`ipex-llm[cpp]==2.5.0b20240527` is consistent with `c780e75 <https://github.com/ggerganov/llama.cpp/commit/c780e75305dba1f67691a8dc0e8bc8425838a452>`_ of llama.cpp.
Our current version is consistent with `62bfef5 <https://github.com/ggerganov/llama.cpp/commit/62bfef5194d5582486d62da3db59bf44981b7912>`_ of llama.cpp.
```
## Quick Start
This quickstart guide walks you through installing and running `llama.cpp` with `ipex-llm`.
### 0 Prerequisites
IPEX-LLM's support for `llama.cpp` is now available for both Linux and Windows systems.
#### Linux
For Linux system, we recommend Ubuntu 20.04 or later (Ubuntu 22.04 is preferred).
Visit the [Install IPEX-LLM on Linux with Intel GPU](./install_linux_gpu.html), follow [Install Intel GPU Driver](./install_linux_gpu.html#install-intel-gpu-driver) and [Install oneAPI](./install_linux_gpu.html#install-oneapi) to install GPU driver and Intel® oneAPI Base Toolkit 2024.0.
#### Windows (Optional)
The IPEX-LLM backend for llama.cpp only supports recent GPU drivers. Please make sure your GPU driver version is equal to or newer than `31.0.101.5333`; otherwise you might see gibberish output.
If you have a lower GPU driver version, visit the [Install IPEX-LLM on Windows with Intel GPU Guide](./install_windows_gpu.html), and follow [Update GPU driver](./install_windows_gpu.html#optional-update-gpu-driver).
### 1 Install IPEX-LLM for llama.cpp
To use `llama.cpp` with IPEX-LLM, first ensure that `ipex-llm[cpp]` is installed.
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
conda create -n llm-cpp python=3.11
conda activate llm-cpp
pip install --pre --upgrade ipex-llm[cpp]
.. tab:: Windows
.. note::
Please run the following command in Miniforge Prompt.
.. code-block:: cmd
conda create -n llm-cpp python=3.11
conda activate llm-cpp
pip install --pre --upgrade ipex-llm[cpp]
```
**After the installation, you should have created a conda environment, named `llm-cpp` for instance, for running `llama.cpp` commands with IPEX-LLM.**
### 2 Setup for running llama.cpp
First, create a directory for `llama.cpp`; for instance, use the following command to create a `llama-cpp` directory and enter it.
```cmd
mkdir llama-cpp
cd llama-cpp
```
#### Initialize llama.cpp with IPEX-LLM
Then you can use following command to initialize `llama.cpp` with IPEX-LLM:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
init-llama-cpp
After ``init-llama-cpp``, you should see many soft links of ``llama.cpp``'s executable files and a ``convert.py`` in current directory.
.. image:: https://llm-assets.readthedocs.io/en/latest/_images/init_llama_cpp_demo_image.png
.. tab:: Windows
Please run the following command with **administrator privilege in Miniforge Prompt**.
.. code-block:: bash
init-llama-cpp.bat
After ``init-llama-cpp.bat``, you should see many soft links of ``llama.cpp``'s executable files and a ``convert.py`` in current directory.
.. image:: https://llm-assets.readthedocs.io/en/latest/_images/init_llama_cpp_demo_image_windows.png
```
```eval_rst
.. note::
``init-llama-cpp`` will create soft links of llama.cpp's executable files to current directory, if you want to use these executable files in other places, don't forget to run above commands again.
```
```eval_rst
.. note::
If you have installed higher version ``ipex-llm[cpp]`` and want to upgrade your binary file, don't forget to remove old binary files first and initialize again with ``init-llama-cpp`` or ``init-llama-cpp.bat``.
```
**Now you can use these executable files by standard llama.cpp's usage.**
#### Runtime Configuration
To use GPU acceleration, several environment variables are required or recommended before running `llama.cpp`.
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
.. tab:: Windows
Please run the following command in Miniforge Prompt.
.. code-block:: bash
set SYCL_CACHE_PERSISTENT=1
```
```eval_rst
.. tip::
If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance:
.. code-block:: bash
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
### 3 Example: Running community GGUF models with IPEX-LLM
Here we provide a simple example to show how to run a community GGUF model with IPEX-LLM.
#### Model Download
Before running, you should download or copy a community GGUF model to your current directory, for instance `mistral-7b-instruct-v0.1.Q4_K_M.gguf` from [Mistral-7B-Instruct-v0.1-GGUF](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/tree/main).
#### Run the quantized model
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
./main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 33 --color
.. note::
For more details about meaning of each parameter, you can use ``./main -h``.
.. tab:: Windows
Please run the following command in Miniforge Prompt.
.. code-block:: bash
main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 33 --color
.. note::
For more details about meaning of each parameter, you can use ``main -h``.
```
#### Sample Output
```
Log start
main: build = 1 (38bcbd4)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.0.0 (2024.0.0.20231017) for x86_64-unknown-linux-gnu
main: seed = 1710359960
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 8 SYCL devices:
|ID| Name |compute capability|Max compute units|Max work group|Max sub group|Global mem size|
|--|---------------------------------------------|------------------|-----------------|--------------|-------------|---------------|
| 0| Intel(R) Arc(TM) A770 Graphics| 1.3| 512| 1024| 32| 16225243136|
| 1| Intel(R) FPGA Emulation Device| 1.2| 32| 67108864| 64| 67181625344|
| 2| 13th Gen Intel(R) Core(TM) i9-13900K| 3.0| 32| 8192| 64| 67181625344|
| 3| Intel(R) Arc(TM) A770 Graphics| 3.0| 512| 1024| 32| 16225243136|
| 4| Intel(R) Arc(TM) A770 Graphics| 3.0| 512| 1024| 32| 16225243136|
| 5| Intel(R) UHD Graphics 770| 3.0| 32| 512| 32| 53745299456|
| 6| Intel(R) Arc(TM) A770 Graphics| 1.3| 512| 1024| 32| 16225243136|
| 7| Intel(R) UHD Graphics 770| 1.3| 32| 512| 32| 53745299456|
detect 2 SYCL GPUs: [0,6] with Max compute units:512
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from ~/mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 15
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attm = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 4.07 GiB (4.83 BPW)
llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.1
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: ggml ctx size = 0.33 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: SYCL0 buffer size = 2113.28 MiB
llm_load_tensors: SYCL6 buffer size = 1981.77 MiB
llm_load_tensors: SYCL_Host buffer size = 70.31 MiB
...............................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: SYCL0 KV buffer size = 34.00 MiB
llama_kv_cache_init: SYCL6 KV buffer size = 30.00 MiB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_new_context_with_model: SYCL_Host input buffer size = 10.01 MiB
llama_new_context_with_model: SYCL0 compute buffer size = 73.00 MiB
llama_new_context_with_model: SYCL6 compute buffer size = 73.00 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 8.00 MiB
llama_new_context_with_model: graph splits (measure): 3
system_info: n_threads = 8 / 32 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 1
Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun exploring the world around her. Her parents were kind and let her do what she wanted, as long as she stayed safe.
One day, the little
llama_print_timings: load time = 10096.78 ms
llama_print_timings: sample time = x.xx ms / 32 runs ( xx.xx ms per token, xx.xx tokens per second)
llama_print_timings: prompt eval time = xx.xx ms / 31 tokens ( xx.xx ms per token, xx.xx tokens per second)
llama_print_timings: eval time = xx.xx ms / 31 runs ( xx.xx ms per token, xx.xx tokens per second)
llama_print_timings: total time = xx.xx ms / 62 tokens
Log end
```
### Troubleshooting
#### Fail to quantize model
If you encounter `main: failed to quantize model from xxx`, please make sure you have created the related output directory.
#### Program hangs during model loading
If your program hangs after `llm_load_tensors: SYCL_Host buffer size = xx.xx MiB`, you can add `--no-mmap` to your command.
#### How to set the `-ngl` parameter
`-ngl` means the number of layers to store in VRAM. If your VRAM is sufficient, we recommend putting all layers on the GPU; you can simply set `-ngl` to a large number like 999 to achieve this.
If `-ngl` is set to 0, the entire model will run on the CPU. If `-ngl` is greater than 0 and less than the number of model layers, it is a mixed GPU + CPU scenario.
#### How to specify GPU
If your machine has multiple GPUs, `llama.cpp` will by default use all of them, which may slow down inference for a model that can run on a single GPU. You can add `-sm none` to your command to use only one GPU.
Also, you can use `ONEAPI_DEVICE_SELECTOR=level_zero:[gpu_id]` to select a device before executing your command; more details can be found [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/KeyFeatures/multi_gpus_selection.html#oneapi-device-selector).
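For example, a sketch that combines these options (the GPU id `0` and the model file are assumptions for your setup):

```bash
# Pin llama.cpp to a single GPU and offload all layers to it
ONEAPI_DEVICE_SELECTOR=level_zero:0 ./main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf \
  -n 32 --prompt "Once upon a time" -e -ngl 999 -sm none
```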
#### Program crash with Chinese prompt
If you run the llama.cpp program on Windows and find that your program crashes or outputs abnormally when accepting Chinese prompts, you can open `Region->Administrative->Change System locale..`, check `Beta: Use Unicode UTF-8 for worldwide language support` option and then restart your computer.
For detailed instructions on how to do this, see [this issue](https://github.com/intel-analytics/ipex-llm/issues/10989#issuecomment-2105600469).

View file

@@ -0,0 +1,204 @@
# Run Ollama with IPEX-LLM on Intel GPU
[ollama/ollama](https://github.com/ollama/ollama) is a popular framework designed to build and run language models on a local machine; you can now use the C++ interface of [`ipex-llm`](https://github.com/intel-analytics/ipex-llm) as an accelerated backend for `ollama` running on Intel **GPU** *(e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max)*.
See the demo of running LLaMA2-7B on Intel Arc GPU below.
<video src="https://llm-assets.readthedocs.io/en/latest/_images/ollama-linux-arc.mp4" width="100%" controls></video>
```eval_rst
.. note::
`ipex-llm[cpp]==2.5.0b20240527` is consistent with `v0.1.34 <https://github.com/ollama/ollama/releases/tag/v0.1.34>`_ of ollama.
Our current version is consistent with `v0.1.39 <https://github.com/ollama/ollama/releases/tag/v0.1.39>`_ of ollama.
```
## Quickstart
### 1 Install IPEX-LLM for Ollama
IPEX-LLM's support for `ollama` is now available for both Linux and Windows systems.
Visit [Run llama.cpp with IPEX-LLM on Intel GPU Guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html), and follow the instructions in section [Prerequisites](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html#prerequisites) to setup and section [Install IPEX-LLM cpp](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html#install-ipex-llm-for-llama-cpp) to install the IPEX-LLM with Ollama binaries.
**After the installation, you should have created a conda environment, named `llm-cpp` for instance, for running `ollama` commands with IPEX-LLM.**
### 2 Initialize Ollama
Activate the `llm-cpp` conda environment and initialize Ollama by executing the commands below. A symbolic link to `ollama` will appear in your current directory.
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
conda activate llm-cpp
init-ollama
.. tab:: Windows
Please run the following command with **administrator privilege in Miniforge Prompt**.
.. code-block:: bash
conda activate llm-cpp
init-ollama.bat
```
```eval_rst
.. note::
If you have installed higher version ``ipex-llm[cpp]`` and want to upgrade your ollama binary file, don't forget to remove old binary files first and initialize again with ``init-ollama`` or ``init-ollama.bat``.
```
**Now you can use this executable file by standard ollama's usage.**
### 3 Run Ollama Serve
You may launch the Ollama service as below:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
./ollama serve
.. tab:: Windows
Please run the following command in Miniforge Prompt.
.. code-block:: bash
set OLLAMA_NUM_GPU=999
set no_proxy=localhost,127.0.0.1
set ZES_ENABLE_SYSMAN=1
set SYCL_CACHE_PERSISTENT=1
ollama serve
```
```eval_rst
.. note::
Please set environment variable ``OLLAMA_NUM_GPU`` to ``999`` to make sure all layers of your model are running on Intel GPU, otherwise, some layers may run on CPU.
```
```eval_rst
.. tip::
If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance before executing `ollama serve`:
.. code-block:: bash
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
```eval_rst
.. note::
To allow the service to accept connections from all IP addresses, use `OLLAMA_HOST=0.0.0.0 ./ollama serve` instead of just `./ollama serve`.
```
The console will display messages similar to the following:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/ollama_serve.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/ollama_serve.png" width=100%; />
</a>
### 4 Pull Model
Keep the Ollama service running, open another terminal, and run `./ollama pull <model_name>` on Linux (`ollama.exe pull <model_name>` on Windows) to automatically pull a model, e.g. `dolphin-phi:latest`:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/ollama_pull.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/ollama_pull.png" width=100%; />
</a>
### 5 Using Ollama
#### Using Curl
Using `curl` is the easiest way to verify the API service and model. Execute the following commands in a terminal. **Replace the <model_name> with your pulled
model**, e.g. `dolphin-phi`.
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
curl http://localhost:11434/api/generate -d '
{
"model": "<model_name>",
"prompt": "Why is the sky blue?",
"stream": false
}'
.. tab:: Windows
Please run the following command in Miniforge Prompt.
.. code-block:: bash
curl http://localhost:11434/api/generate -d "
{
\"model\": \"<model_name>\",
\"prompt\": \"Why is the sky blue?\",
\"stream\": false
}"
```
#### Using Ollama Run GGUF models
Ollama supports importing GGUF models via a Modelfile. For example, suppose you have downloaded `mistral-7b-instruct-v0.1.Q4_K_M.gguf` from [Mistral-7B-Instruct-v0.1-GGUF](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/tree/main); you can then create a file named `Modelfile`:
```bash
FROM ./mistral-7b-instruct-v0.1.Q4_K_M.gguf
TEMPLATE [INST] {{ .Prompt }} [/INST]
PARAMETER num_predict 64
```
Then you can create the model in Ollama with `ollama create example -f Modelfile` and use `ollama run` to run the model directly in the console.
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
export no_proxy=localhost,127.0.0.1
./ollama create example -f Modelfile
./ollama run example
.. tab:: Windows
Please run the following command in Miniforge Prompt.
.. code-block:: bash
set no_proxy=localhost,127.0.0.1
ollama create example -f Modelfile
ollama run example
```
An example of interacting with the model via `ollama run example` looks like the following:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/ollama_gguf_demo_image.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/ollama_gguf_demo_image.png" width=100%; />
</a>
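In addition to the interactive session shown above, `ollama run` also accepts a one-shot prompt as a command-line argument; a small sketch on Linux:

```bash
# Prints a single completion for the given prompt and then exits
./ollama run example "What is the capital of France?"
```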
# Run Open WebUI with Intel GPU
[Open WebUI](https://github.com/open-webui/open-webui) is a user-friendly GUI for running LLMs locally; by porting it to [`ipex-llm`](https://github.com/intel-analytics/ipex-llm), users can now easily run LLMs in [Open WebUI](https://github.com/open-webui/open-webui) on Intel **GPU** *(e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max)*.
*See the demo of running Mistral:7B on Intel Arc A770 below.*
<video src="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_demo.mp4" width="100%" controls></video>
## Quickstart
This quickstart guide walks you through setting up and using [Open WebUI](https://github.com/open-webui/open-webui) with Ollama (using the C++ interface of [`ipex-llm`](https://github.com/intel-analytics/ipex-llm) as an accelerated backend).
### 1. Run Ollama with Intel GPU
Follow the instructions in the [Run Ollama with Intel GPU](ollama_quickstart.html) guide to install and run the Ollama service. Please ensure that the Ollama server continues to run while you're using Open WebUI.
### 2. Install Open WebUI
#### Install Node.js & npm
```eval_rst
.. note::
Package version requirements for running Open WebUI: Node.js (>= 20.10) or Bun (>= 1.0.21), Python (>= 3.11)
```
Please install Node.js & npm as below:
```eval_rst
.. tabs::
.. tab:: Linux
Run the commands below to install Node.js & npm. Once the installation is complete, verify it by running ``node -v`` and ``npm -v`` to check the versions of Node.js and npm, respectively.
.. code-block:: bash
sudo apt update
sudo apt install nodejs
sudo apt install npm
.. tab:: Windows
You may download Node.js installation package from https://nodejs.org/dist/v20.12.2/node-v20.12.2-x64.msi, which will install both Node.js & npm on your system.
Once the installation is complete, verify the installation by running ``node -v`` and ``npm -v`` to check the versions of Node.js and npm, respectively.
```
#### Download Open WebUI
Use `git` to clone the [open-webui repo](https://github.com/open-webui/open-webui.git), or download the open-webui source code zip from [this link](https://github.com/open-webui/open-webui/archive/refs/heads/main.zip) and unzip it to a directory, e.g. `~/open-webui`.
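For instance, on Linux you might clone it as follows (the target directory `~/open-webui` is just an example):

```bash
git clone https://github.com/open-webui/open-webui.git ~/open-webui
```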
#### Install Dependencies
Run the commands below to install the Open WebUI dependencies:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
cd ~/open-webui/
cp -RPp .env.example .env # Copy required .env file
# Build frontend
npm i
npm run build
# Install Dependencies
cd ./backend
pip install -r requirements.txt -U
.. tab:: Windows
.. code-block:: bash
cd ~\open-webui\
copy .env.example .env
# Build frontend
npm install
npm run build
# Install Dependencies
cd .\backend
pip install -r requirements.txt -U
```
### 3. Start Open WebUI
#### Start the service
Run the commands below to start the service:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
export no_proxy=localhost,127.0.0.1
bash start.sh
.. note::
If you have difficulty accessing the Hugging Face repositories, you may use a mirror, e.g. add ``export HF_ENDPOINT=https://hf-mirror.com`` before running ``bash start.sh``.
.. tab:: Windows
.. code-block:: bash
set no_proxy=localhost,127.0.0.1
start_windows.bat
.. note::
If you have difficulty accessing the Hugging Face repositories, you may use a mirror, e.g. add ``set HF_ENDPOINT=https://hf-mirror.com`` before running ``start_windows.bat``.
```
#### Access the WebUI
Upon successful launch, URLs to access the WebUI will be displayed in the terminal. Open the provided local URL in your browser to interact with the WebUI, e.g. http://localhost:8080/.
### 4. Using Open WebUI
```eval_rst
.. note::
For detailed information about how to use Open WebUI, visit the README of `open-webui official repository <https://github.com/open-webui/open-webui>`_.
```
#### Log-in
If this is your first time using it, you need to register. After registering, log in with the registered account to access the interface.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_signup.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_signup.png" width="100%" />
</a>
<a href="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_login.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_login.png" width="100%" />
</a>
#### Configure `Ollama` service URL
Access the Ollama settings through **Settings -> Connections** in the menu. By default, the **Ollama Base URL** is preset to http://localhost:11434, as illustrated in the snapshot below. To verify the status of the Ollama service connection, click the **Refresh** button located next to the textbox. If the WebUI is unable to establish a connection with the Ollama server, you will see an error message stating `WebUI could not connect to Ollama`.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_settings_0.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_settings_0.png" width="100%" />
</a>
If the connection is successful, you will see a message stating `Service Connection Verified`, as illustrated below.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_settings.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_settings.png" width="100%" />
</a>
```eval_rst
.. note::
If you want to use an Ollama server hosted at a different URL, simply update the **Ollama Base URL** to the new URL and press the **Refresh** button to re-confirm the connection to Ollama.
```
#### Pull Model
Go to **Settings -> Models** in the menu, choose a model under **Pull a model from Ollama.com** using the drop-down menu, and then hit the **Download** button on the right. Ollama will automatically download the selected model for you.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_pull_models.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_pull_models.png" width="100%" />
</a>
#### Chat with the Model
Start new conversations with **New chat** in the left-side menu.
On the right side, choose a downloaded model from the **Select a model** drop-down menu at the top, input your questions into the **Send a Message** textbox at the bottom, and click the button on the right to get responses.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_select_model.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_select_model.png" width="100%" />
</a>
<br/>
Additionally, you can drag and drop a document into the textbox, allowing the LLM to access its contents. The LLM will then generate answers based on the document provided.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_chat_2.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_chat_2.png" width="100%" />
</a>
#### Exit Open WebUI
To shut down the Open WebUI server, use **Ctrl+C** in the terminal where the Open WebUI server is running, then close your browser tab.
### 5. Troubleshooting
#### Error `No module named 'torch._C'`
If you encounter the error `ModuleNotFoundError: No module named 'torch._C'` after executing `bash start.sh`, you can resolve it by reinstalling PyTorch. First, run `pip uninstall torch` to remove the existing PyTorch installation, and then reinstall it along with its dependencies by running `pip install torch torchvision torchaudio`.
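A minimal sketch of that fix (run it inside the same Python environment used to start Open WebUI):

```bash
# Remove the broken PyTorch installation, then reinstall it with its companion packages
pip uninstall -y torch
pip install torch torchvision torchaudio
```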
# Run PrivateGPT with IPEX-LLM on Intel GPU
[PrivateGPT](https://github.com/zylon-ai/private-gpt) is a production-ready AI project that allows users to chat over documents, etc.; by integrating it with [`ipex-llm`](https://github.com/intel-analytics/ipex-llm), users can now easily leverage local LLMs running on Intel GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max).
*See the demo of privateGPT running Mistral:7B on Intel Arc A770 below.*
<video src="https://llm-assets.readthedocs.io/en/latest/_images/PrivateGPT-ARC.mp4" width="100%" controls></video>
## Quickstart
### 1. Install and Start `Ollama` Service on Intel GPU
Follow the steps in the [Run Ollama on Intel GPU Guide](./ollama_quickstart.md) to install and run Ollama on Intel GPU. Ensure that `ollama serve` is running correctly and can be accessed through a local URL (e.g., `http://127.0.0.1:11434`) or a remote URL (e.g., `http://your_ip:11434`).
We recommend pulling the desired model before proceeding with PrivateGPT. For instance, to pull the Mistral:7B model, you can use the following command:
```bash
ollama pull mistral:7b
```
### 2. Install PrivateGPT
#### Download PrivateGPT
You can either clone the repository or download the source zip from [github](https://github.com/zylon-ai/private-gpt/archive/refs/heads/main.zip):
```bash
git clone https://github.com/zylon-ai/private-gpt
```
#### Install Dependencies
Execute the following commands in a terminal to install the dependencies of PrivateGPT:
```cmd
cd private-gpt
pip install poetry
pip install ffmpy==0.3.1
poetry install --extras "ui llms-ollama embeddings-ollama vector-stores-qdrant"
```
For more details, refer to the [PrivateGPT installation Guide](https://docs.privategpt.dev/installation/getting-started/main-concepts).
### 3. Start PrivateGPT
#### Configure PrivateGPT
To configure PrivateGPT to use Ollama for running local LLMs, you should edit the `private-gpt/settings-ollama.yaml` file. Modify the `ollama` section by setting the `llm_model` and `embedding_model` you wish to use, and updating the `api_base` and `embedding_api_base` to direct to your Ollama URL.
Below is an example of how `settings-ollama.yaml` should look.
<p align="center"><a href="https://llm-assets.readthedocs.io/en/latest/_images/privateGPT-ollama-setting.png" target="_blank" align="center">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/privateGPT-ollama-setting.png" alt="image-p1" width=100%; />
</a></p>
```eval_rst
.. note::
``settings-ollama.yaml`` is loaded when the Ollama profile is specified in the ``PGPT_PROFILES`` environment variable. This can override configurations from the default ``settings.yaml``.
```
For more information on configuring PrivateGPT, please visit the [PrivateGPT Main Concepts](https://docs.privategpt.dev/installation/getting-started/main-concepts) page.
#### Start the service
Please ensure that the Ollama server continues to run in a terminal while you're using PrivateGPT.
Run the commands below to start the service in another terminal:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
export no_proxy=localhost,127.0.0.1
PGPT_PROFILES=ollama make run
.. note::
Setting ``PGPT_PROFILES=ollama`` will load the configuration from ``settings.yaml`` and ``settings-ollama.yaml``.
.. tab:: Windows
.. code-block:: bash
set no_proxy=localhost,127.0.0.1
set PGPT_PROFILES=ollama
make run
.. note::
Setting ``PGPT_PROFILES=ollama`` will load the configuration from ``settings.yaml`` and ``settings-ollama.yaml``.
```
Upon successful deployment, you will see logs in the terminal similar to the following:
<p align="center"><a href="https://llm-assets.readthedocs.io/en/latest/_images/privateGPT-service-success.png" target="_blank" align="center">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/privateGPT-service-success.png" alt="image-p1" width=100%; />
</a></p>
Open a browser (if it doesn't open automatically) and navigate to the URL displayed in the terminal. If it shows http://0.0.0.0:8001, you can access it locally via `http://127.0.0.1:8001` or remotely via `http://your_ip:8001`.
### 4. Using PrivateGPT
#### Chat with the Model
To chat with the LLM, select the "LLM Chat" option located in the upper left corner of the page. Type your messages at the bottom of the page and click the "Submit" button to receive responses from the model.
<p align="center"><a href="https://llm-assets.readthedocs.io/en/latest/_images/privateGPT-LLM-Chat.png" target="_blank" align="center">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/privateGPT-LLM-Chat.png" alt="image-p1" width=100%; />
</a></p>
#### Chat over Documents (RAG)
To interact with documents, select the "Query Files" option in the upper left corner of the page. Click the "Upload File(s)" button to upload documents. After the documents have been vectorized, you can type your messages at the bottom of the page and click the "Submit" button to receive responses from the model based on the uploaded content.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/privateGPT-Query-Files.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/privateGPT-Query-Files.png" width=100%; />
</a>
# Serving using IPEX-LLM and vLLM on Intel GPU
vLLM is a fast and easy-to-use library for LLM inference and serving. You can find the detailed information at their [homepage](https://github.com/vllm-project/vllm).
IPEX-LLM can be integrated into vLLM so that users can use `IPEX-LLM` to boost the performance of the vLLM engine on Intel **GPUs** *(e.g., local PC with discrete GPU such as Arc, Flex and Max)*.
Currently, IPEX-LLM-integrated vLLM only supports the following models:
- Qwen series models
- Llama series models
- ChatGLM series models
- Baichuan series models
## Quick Start
This quickstart guide walks you through installing and running `vLLM` with `ipex-llm`.
### 1. Install IPEX-LLM for vLLM
IPEX-LLM's support for `vLLM` is currently only available on Linux.
Visit [Install IPEX-LLM on Linux with Intel GPU](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html) and follow the instructions in section [Install Prerequisites](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#install-prerequisites) to install the prerequisites needed for running code on Intel GPUs.
Then, follow the instructions in section [Install ipex-llm](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#install-ipex-llm) to install `ipex-llm[xpu]` and set up the recommended runtime configurations.
**After the installation, you should have created a conda environment, named `ipex-vllm` for instance, for running `vLLM` commands with IPEX-LLM.**
### 2. Install vLLM
Currently, we maintain a specific branch of vLLM, which only works on Intel GPUs.
Activate the `ipex-vllm` conda environment and install vLLM by executing the commands below.
```bash
conda activate ipex-vllm
source /opt/intel/oneapi/setvars.sh
git clone -b sycl_xpu https://github.com/analytics-zoo/vllm.git
cd vllm
pip install -r requirements-xpu.txt
pip install --no-deps xformers
VLLM_BUILD_XPU_OPS=1 pip install --no-build-isolation -v -e .
pip install outlines==0.0.34 --no-deps
pip install interegular cloudpickle diskcache joblib lark nest-asyncio numba scipy
# For Qwen model support
pip install transformers_stream_generator einops tiktoken
```
**Now you are all set to use vLLM with IPEX-LLM**
### 3. Offline Inference/Service
#### Offline Inference
To run offline inference using vLLM for a quick impression, use the following example.
```eval_rst
.. note::
Please modify ``MODEL_PATH`` in ``offline_inference.py`` to use your chosen model.
You can try modifying ``load_in_low_bit`` to different values in **[sym_int4, fp6, fp8, fp8_e4m3, fp16]** to use different quantization dtypes.
```
```bash
#!/bin/bash
wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/vLLM-Serving/offline_inference.py
python offline_inference.py
```
For instructions on how to change the `load_in_low_bit` value in `offline_inference.py`, check the following example:
```python
llm = LLM(model="YOUR_MODEL",
device="xpu",
dtype="float16",
enforce_eager=True,
# Simply change here for the desired load_in_low_bit value
load_in_low_bit="sym_int4",
tensor_parallel_size=1,
trust_remote_code=True)
```
The result of executing `Baichuan2-7B-Chat` model with `sym_int4` low-bit format is shown as follows:
```
Prompt: 'Hello, my name is', Generated text: ' [Your Name] and I am a [Your Job Title] at [Your'
Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government in the United States. The president leads'
Prompt: 'The capital of France is', Generated text: ' Paris.\nThe capital of France is Paris.'
Prompt: 'The future of AI is', Generated text: " bright, but it's not without challenges. As AI continues to evolve,"
```
#### Service
```eval_rst
.. note::
Because kernels are JIT-compiled, we recommend sending a few warm-up requests before using the service to get the best performance.
```
To fully utilize the continuous batching feature of `vLLM`, you can send requests to the service using `curl` or similar methods. The requests sent to the engine will be batched at the token level. Queries will be executed in the same `forward` step of the LLM and removed when they finish, instead of waiting for all sequences to finish.
For vLLM, you can start the service using the following command:
```bash
#!/bin/bash
model="YOUR_MODEL_PATH"
served_model_name="YOUR_MODEL_NAME"
# You may need to adjust the value of
# --max-model-len, --max-num-batched-tokens, --max-num-seqs
# to acquire the best performance
# Change value --load-in-low-bit to [fp6, fp8, fp8_e4m3, fp16] to use different low-bit formats
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8000 \
--model $model \
--trust-remote-code \
--gpu-memory-utilization 0.75 \
--device xpu \
--dtype float16 \
--enforce-eager \
--load-in-low-bit sym_int4 \
--max-model-len 4096 \
--max-num-batched-tokens 10240 \
--max-num-seqs 12 \
--tensor-parallel-size 1
```
You can tune the service using these four arguments:
1. `--gpu-memory-utilization`: The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. For example, a value of 0.5 would imply 50% GPU memory utilization. If unspecified, will use the default value of 0.9.
2. `--max-model-len`: Model context length. If unspecified, will be automatically derived from the model config.
3. `--max-num-batched-tokens`: Maximum number of batched tokens per iteration.
4. `--max-num-seqs`: Maximum number of sequences per iteration. Default: 256
For longer input prompts, we suggest using `--max-num-batched-tokens` to restrict the service. The reason is that peak GPU memory usage occurs when the first token is generated; restricting `--max-num-batched-tokens` limits the input size during first-token generation.
`--max-num-seqs` restricts generation for both the first token and subsequent tokens. It limits the maximum batch size to the value set by `--max-num-seqs`.
When an out-of-memory error occurs, the most obvious solution is to reduce `--gpu-memory-utilization`. Other ways to resolve this error are to lower `--max-num-batched-tokens` if peak memory occurs while generating the first token, or to lower `--max-num-seqs` if peak memory occurs while generating subsequent tokens.
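As a concrete illustration (the numbers below are placeholders rather than tuned recommendations), a more conservative launch for long prompts on a memory-constrained card might adjust only the tuning-related flags relative to the script above:

```bash
# Same entrypoint as above; only the memory/tuning flags are changed
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name \
  --port 8000 \
  --model $model \
  --device xpu --dtype float16 --enforce-eager \
  --load-in-low-bit sym_int4 \
  --gpu-memory-utilization 0.6 \
  --max-model-len 2048 \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 8
```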
If the service has been booted successfully, the console will display messages similar to the following:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/start-vllm-service.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/start-vllm-service.png" width=100%; />
</a>
After the service has been booted successfully, you can send a test request using `curl`. Here, `YOUR_MODEL` should be set equal to `$served_model_name` in your booting script, e.g. `Qwen1.5`.
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "YOUR_MODEL",
"prompt": "San Francisco is a",
"max_tokens": 128,
"temperature": 0
}' | jq '.choices[0].text'
```
Below is an example output using `Qwen1.5-7B-Chat` with the low-bit format `sym_int4`:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/vllm-curl-result.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/vllm-curl-result.png" width=100%; />
</a>
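You can also list the models the server has registered through the OpenAI-compatible `/v1/models` endpoint; a quick sanity check (assuming the same port `8000` as above):

```bash
# The returned "id" should match the name passed via --served-model-name
curl http://localhost:8000/v1/models | jq '.data[].id'
```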
```eval_rst
.. tip::
If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionally set the following environment variable for optimal performance before starting the service:
.. code-block:: bash
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
### 4. About Tensor Parallel
> Note: We recommend using Docker for tensor parallel deployment. Check our serving Docker image `intelanalytics/ipex-llm-serving-xpu`.
We also support tensor parallel across multiple Intel GPU cards. To enable tensor parallel, you will need to install `libfabric-dev` in your environment. On Ubuntu, you can install it with:
```bash
sudo apt-get install libfabric-dev
```
To deploy your model across multiple cards, simply change the value of `--tensor-parallel-size` to the desired value.
For instance, if you have two Arc A770 cards in your environment, you can set this value to 2. Some oneCCL environment variable settings are also needed; check the following example:
```bash
#!/bin/bash
model="YOUR_MODEL_PATH"
served_model_name="YOUR_MODEL_NAME"
# CCL needed environment variables
export CCL_WORKER_COUNT=2
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
# You may need to adjust the value of
# --max-model-len, --max-num-batched-tokens, --max-num-seqs
# to acquire the best performance
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8000 \
--model $model \
--trust-remote-code \
--gpu-memory-utilization 0.75 \
--device xpu \
--dtype float16 \
--enforce-eager \
--load-in-low-bit sym_int4 \
--max-model-len 4096 \
--max-num-batched-tokens 10240 \
--max-num-seqs 12 \
--tensor-parallel-size 2
```
If the service has booted successfully, you should see output similar to the following figure:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/start-vllm-service.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/start-vllm-service.png" width=100%; />
</a>
### 5. Performing Benchmark
To perform a benchmark, you can use the **benchmark_throughput** script originally provided by the vLLM repo.
```bash
conda activate ipex-vllm
source /opt/intel/oneapi/setvars.sh
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/docker/llm/serving/xpu/docker/benchmark_vllm_throughput.py -O benchmark_throughput.py
export MODEL="YOUR_MODEL"
# You can change load-in-low-bit from values in [sym_int4, fp6, fp8, fp8_e4m3, fp16]
python3 ./benchmark_throughput.py \
--backend vllm \
--dataset ./ShareGPT_V3_unfiltered_cleaned_split.json \
--model $MODEL \
--num-prompts 1000 \
--seed 42 \
--trust-remote-code \
--enforce-eager \
--dtype float16 \
--device xpu \
--load-in-low-bit sym_int4 \
--gpu-memory-utilization 0.85
```
The following figure shows the result of benchmarking `Llama-2-7b-chat-hf` using 50 prompts:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/vllm-benchmark-result.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/vllm-benchmark-result.png" width=100%; />
</a>
```eval_rst
.. tip::
To find the best config that fits your workload, you may need to start the service and use tools like ``wrk`` or ``jmeter`` to perform stress tests.
```
# Run Text Generation WebUI on Intel GPU
The [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) provides a user-friendly GUI for anyone to run LLMs locally; by porting it to [`ipex-llm`](https://github.com/intel-analytics/ipex-llm), users can now easily run LLMs in [Text Generation WebUI](https://github.com/intel-analytics/text-generation-webui) on Intel **GPU** *(e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max)*.
See the demo of running LLaMA2-7B on an Intel Core Ultra laptop below.
<video src="https://llm-assets.readthedocs.io/en/latest/_images/webui-mtl.mp4" width="100%" controls></video>
## Quickstart
This quickstart guide walks you through setting up and using the [Text Generation WebUI](https://github.com/intel-analytics/text-generation-webui) with `ipex-llm`.
A preview of the WebUI in action is shown below:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_chat.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_chat.png" width=80%; />
</a>
### 1 Install IPEX-LLM
To use the WebUI, first ensure that IPEX-LLM is installed. Follow the instructions on the [IPEX-LLM Installation Quickstart for Windows with Intel GPU](install_windows_gpu.html).
**After the installation, you should have created a conda environment, named `llm` for instance, for running `ipex-llm` applications.**
### 2 Install the WebUI
#### Download the WebUI
Download the `text-generation-webui` with IPEX-LLM integrations from [this link](https://github.com/intel-analytics/text-generation-webui/archive/refs/heads/ipex-llm.zip). Unzip the content into a directory, e.g., `C:\text-generation-webui`.
#### Install Dependencies
Open **Miniforge Prompt** and activate the conda environment you have created in [section 1](#1-install-ipex-llm), e.g., `llm`.
```
conda activate llm
```
Then, change to the WebUI directory (e.g., `C:\text-generation-webui`) and install the necessary dependencies:
```cmd
cd C:\text-generation-webui
pip install -r requirements_cpu_only.txt
pip install -r extensions/openai/requirements.txt
```
```eval_rst
.. note::
``extensions/openai/requirements.txt`` is for the API service. If you don't need the API service, you can omit this command.
```
### 3 Start the WebUI Server
#### Set Environment Variables
Configure oneAPI variables by running the following command in **Miniforge Prompt**:
```eval_rst
.. note::
For more details about runtime configurations, refer to `this guide <https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#runtime-configuration>`_
```
```cmd
set SYCL_CACHE_PERSISTENT=1
```
If you're running on iGPU, set additional environment variables by running the following commands:
```cmd
set BIGDL_LLM_XMX_DISABLED=1
```
#### Launch the Server
In **Miniforge Prompt** with the conda environment `llm` activated, navigate to the `text-generation-webui` folder and execute the following commands (you can optionally launch the server with or without the API service):
##### without API service
```cmd
python server.py --load-in-4bit
```
##### with API service
```cmd
python server.py --load-in-4bit --api --api-port 5000 --listen
```
```eval_rst
.. note::
With the ``--load-in-4bit`` option, the models will be optimized and run at 4-bit precision. For configurations of other formats and precisions, refer to `this link <https://github.com/intel-analytics/text-generation-webui?tab=readme-ov-file#32-optimizations-for-other-percisions>`_
```
```eval_rst
.. note::
The API service allows users to access models using an OpenAI-compatible API. For usage examples, refer to `this link <https://github.com/oobabooga/text-generation-webui/wiki/12-%E2%80%90-OpenAI-API#examples>`_
```
```eval_rst
.. note::
The API server will by default use port ``5000``. To change the port, use ``--api-port 1234`` in the command above. You can also specify using SSL or API Key in the command. Please see `this guide <https://github.com/intel-analytics/text-generation-webui/blob/ipex-llm/docs/12%20-%20OpenAI%20API.md>`_ for the full list of arguments.
```
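Once the server is up, you can send an OpenAI-style request as a quick sanity check of the API service. The sketch below assumes the server was launched with `--api` on the default port `5000`; adjust the port if you changed it:

```cmd
curl http://127.0.0.1:5000/v1/chat/completions -H "Content-Type: application/json" -d "{\"messages\": [{\"role\": \"user\", \"content\": \"Why is the sky blue?\"}], \"max_tokens\": 64}"
```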
#### Access the WebUI
Upon successful launch, URLs to access the WebUI will be displayed in the terminal as shown below. Open the provided local URL in your browser to interact with the WebUI.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_launch_server.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_launch_server.png" width=100%; />
</a>
### 4. Using the WebUI
#### Model Download
Place Hugging Face models in `C:\text-generation-webui\models` by either copying them locally or downloading them via the WebUI. To download, navigate to the **Model** tab, enter the model's Hugging Face ID (for instance, `microsoft/phi-1_5`) in the **Download model or LoRA** section, and click **Download**, as illustrated below.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_download_model.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_download_model.png" width=100%; />
</a>
After copying or downloading the models, click on the blue **refresh** button to update the **Model** drop-down menu. Then, choose your desired model from the newly updated list.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_select_model.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_select_model.png" width=100%; />
</a>
#### Load Model
Default settings are recommended for most users. Click **Load** to activate the model. Address any errors by installing missing packages as prompted, and ensure compatibility with your version of the transformers package. Refer to the [troubleshooting section](#troubleshooting) for more details.
If everything goes well, you will get a message as shown below.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_load_model_success.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_load_model_success.png" width=100%; />
</a>
```eval_rst
.. note::
Model loading might take a few minutes as it includes a **warm-up** phase. This `warm-up` step is used to improve the speed of subsequent model uses.
```
#### Chat with the Model
In the **Chat** tab, start new conversations with **New chat**.
Enter prompts into the textbox at the bottom and press the **Generate** button to receive responses.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_chat.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_chat.png" width=100%; />
</a>
<!-- Notes:
* Multi-turn conversations may consume GPU memory. You may specify the `Truncate the prompt up to this length` value in `Parameters` tab to reduce the GPU memory usage.
* You may switch to a single-turn conversation mode by turning off `Activate text streaming` in the Parameters tab.
* Please see [Chat-Tab Wiki](https://github.com/oobabooga/text-generation-webui/wiki/01-%E2%80%90-Chat-Tab) for more details. -->
#### Exit the WebUI
To shut down the WebUI server, use **Ctrl+C** in the **Miniforge Prompt** terminal where the WebUI server is running, then close your browser tab.
### 5. Advanced Usage
#### Using Instruct mode
Instruction-following models are models that are fine-tuned with specific prompt formats.
For these models, you should ideally use the `instruct` chat mode.
Under this mode, the model receives user prompts that are formatted according to prompt formats it was trained with.
To use `instruct` chat mode, select the `Chat` tab, scroll down the page, and then select `instruct` under `Mode`.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_chat_mode_instruct.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_chat_mode_instruct.png" width=100%; />
</a>
When a model is loaded, its corresponding instruction template, which contains prompt formatting, is automatically loaded.
If chat responses are poor, the loaded instruction template might be incorrect.
In this case, go to `Parameters` tab and then `Instruction template` tab.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_instruction_template.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_instruction_template.png" width=100%; />
</a>
You can verify and edit the loaded instruction template in the `Instruction template` field.
You can also manually select an instruction template from `Saved instruction templates` and click `load` to load it into `Instruction template`.
You can add custom template files to this list in `/instruction-templates/` [folder](https://github.com/intel-analytics/text-generation-webui/tree/ipex-llm/instruction-templates).
<!-- For instance, the automatically loaded instruction template for `chatGLM3` model is incorrect, and you should manually select the `chatGLM3` instruction template. -->
#### Tested models
We have tested the following models with `ipex-llm` using Text Generation WebUI.
| Model | Notes |
|-------|-------|
| llama-2-7b-chat-hf | |
| chatglm3-6b | Manually load ChatGLM3 template for Instruct chat mode |
| Mistral-7B-v0.1 | |
| qwen-7B-Chat | |
### Troubleshooting
#### Potentially slower first response
The first response to user prompt might be slower than expected, with delays of up to several minutes before the response is generated. This delay occurs because the GPU kernels require compilation and initialization, which varies across different GPU types.
#### Missing Required Dependencies
During model loading, you may encounter an **ImportError** like `ImportError: This modeling file requires the following packages that were not found in your environment`. This indicates certain packages required by the model are absent from your environment. Detailed instructions for installing these necessary packages can be found at the bottom of the error messages. Take the following steps to fix these errors:
- Exit the WebUI Server by pressing **Ctrl+C** in the **Miniforge Prompt** terminal.
- Install the missing pip packages as specified in the error message
- Restart the WebUI Server.
If there are still errors on missing packages, repeat the installation process for any additional required packages.
#### Compatibility issues
If you encounter **AttributeError** errors like `AttributeError: 'BaichuanTokenizer' object has no attribute 'sp_model'`, it may be due to some models being incompatible with the current version of the transformers package because the models are outdated. In such instances, using a more recent model is recommended.
<!--
<a href="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_load_model_error.png">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_load_model_error.png" width=100%; />
</a> -->