Update part of Quickstart guide in mddocs (2/2) (#11376)

* axolotl_quickstart.md

* benchmark_quickstart.md

* bigdl_llm_migration.md

* chatchat_quickstart.md

* continue_quickstart.md

* deepspeed_autotp_fastapi_quickstart.md

* dify_quickstart.md

* fastchat_quickstart.md

* adjust tab style

* fix link

* fix link

* add video preview

* Small fixes

* Small fix

---------

Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>
Authored by Jin Qiao on 2024-06-20 19:03:06 +08:00, committed by GitHub
parent 8c9f877171
commit 9a3a21e4fc
8 changed files with 154 additions and 193 deletions

axolotl_quickstart.md

@@ -4,7 +4,7 @@
 See the demo of finetuning LLaMA2-7B on Intel Arc GPU below.
-<video src="https://llm-assets.readthedocs.io/en/latest/_images/axolotl-qlora-linux-arc.mp4" width="100%" controls></video>
+[![Demo video](https://llm-assets.readthedocs.io/en/latest/_images/axolotl-qlora-linux-arc.png)](https://llm-assets.readthedocs.io/en/latest/_images/axolotl-qlora-linux-arc.mp4)
 ## Quickstart
@@ -12,13 +12,13 @@ See the demo of finetuning LLaMA2-7B on Intel Arc GPU below.
 IPEX-LLM's support for [Axolotl v0.4.0](https://github.com/OpenAccess-AI-Collective/axolotl/tree/v0.4.0) is only available for Linux system. We recommend Ubuntu 20.04 or later (Ubuntu 22.04 is preferred).
-Visit the [Install IPEX-LLM on Linux with Intel GPU](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html), follow [Install Intel GPU Driver](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#install-intel-gpu-driver) and [Install oneAPI](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#install-oneapi) to install GPU driver and Intel® oneAPI Base Toolkit 2024.0.
+Visit the [Install IPEX-LLM on Linux with Intel GPU](./install_linux_gpu.md), follow [Install Intel GPU Driver](./install_linux_gpu.md#install-gpu-driver) and [Install oneAPI](./install_linux_gpu.md#install-oneapi) to install GPU driver and Intel® oneAPI Base Toolkit 2024.0.
 ### 1. Install IPEX-LLM for Axolotl
 Create a new conda env, and install `ipex-llm[xpu]`.
-```cmd
+```bash
 conda create -n axolotl python=3.11
 conda activate axolotl
 # install ipex-llm
@@ -27,7 +27,7 @@ pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-exte
 Install [axolotl v0.4.0](https://github.com/OpenAccess-AI-Collective/axolotl/tree/v0.4.0) from git.
-```cmd
+```bash
 # install axolotl v0.4.0
 git clone https://github.com/OpenAccess-AI-Collective/axolotl/tree/v0.4.0
 cd axolotl
@@ -62,46 +62,37 @@ For more technical details, please refer to [Llama 2](https://arxiv.org/abs/2307
 By default, Axolotl will automatically download models and datasets from Huggingface. Please ensure you have login to Huggingface.
-```cmd
+```bash
 huggingface-cli login
 ```
 If you prefer offline models and datasets, please download [Llama-2-7B](https://huggingface.co/meta-llama/Llama-2-7b) and [alpaca_2k_test](https://huggingface.co/datasets/mhenrichsen/alpaca_2k_test). Then, set `HF_HUB_OFFLINE=1` to avoid connecting to Huggingface.
-```cmd
+```bash
 export HF_HUB_OFFLINE=1
 ```
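A possible way to stage both ahead of time is sketched below (this assumes the `huggingface-cli` tool from the `huggingface_hub` package and access to the gated Llama-2 repository):

```bash
# Sketch: pre-download the model and dataset named above, then switch to offline mode
huggingface-cli download meta-llama/Llama-2-7b
huggingface-cli download mhenrichsen/alpaca_2k_test --repo-type dataset
export HF_HUB_OFFLINE=1
```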
 #### 2.2 Set Environment Variables
-```eval_rst
-.. note::
-   This is a required step on for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI.
-```
+> [!NOTE]
+> This is a required step on for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI.
 Configure oneAPI variables by running the following command:
-```eval_rst
-.. tabs::
-   .. tab:: Linux
-      .. code-block:: bash
+```bash
 source /opt/intel/oneapi/setvars.sh
 ```
 Configure accelerate to avoid training with CPU. You can download a default `default_config.yaml` with `use_cpu: false`.
-```cmd
+```bash
 mkdir -p ~/.cache/huggingface/accelerate/
 wget -O ~/.cache/huggingface/accelerate/default_config.yaml https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/default_config.yaml
 ```
 As an alternative, you can config accelerate based on your requirements.
-```cmd
+```bash
 accelerate config
 ```
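Either way, a quick check that CPU-only training is disabled (path and key as used above):

```bash
# Expect the output to contain `use_cpu: false`
grep use_cpu ~/.cache/huggingface/accelerate/default_config.yaml
```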
@@ -113,7 +104,7 @@ After finishing accelerate config, check if `use_cpu` is disabled (i.e., `use_cp
 Prepare `lora.yml` for Axolotl LoRA finetune. You can download a template from github.
-```cmd
+```bash
 wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/lora.yml
 ```
@@ -143,13 +134,13 @@ lora_fan_in_fan_out:
 Launch LoRA training with the following command.
-```cmd
+```bash
 accelerate launch finetune.py lora.yml
 ```
 In Axolotl v0.4.0, you can use `train.py` instead of `-m axolotl.cli.train` or `finetune.py`.
-```cmd
+```bash
 accelerate launch train.py lora.yml
 ```
@@ -157,7 +148,7 @@ accelerate launch train.py lora.yml
 Prepare `lora.yml` for QLoRA finetune. You can download a template from github.
-```cmd
+```bash
 wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/qlora.yml
 ```
@@ -188,13 +179,13 @@ lora_fan_in_fan_out:
 Launch LoRA training with the following command.
-```cmd
+```bash
 accelerate launch finetune.py qlora.yml
 ```
 In Axolotl v0.4.0, you can use `train.py` instead of `-m axolotl.cli.train` or `finetune.py`.
-```cmd
+```bash
 accelerate launch train.py qlora.yml
 ```
@@ -206,7 +197,7 @@ Warning: this section will install axolotl main ([796a085](https://github.com/Op
 Axolotl main has lots of new dependencies. Please setup a new conda env for this version.
-```cmd
+```bash
 conda create -n llm python=3.11
 conda activate llm
 # install axolotl main
@@ -229,7 +220,7 @@ Based on [axolotl Llama-3 QLoRA example](https://github.com/OpenAccess-AI-Collec
 Prepare `llama3-qlora.yml` for QLoRA finetune. You can download a template from github.
-```cmd
+```bash
 wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/llama3-qlora.yml
 ```
@@ -262,19 +253,19 @@ lora_target_linear: true
 lora_fan_in_fan_out:
 ```
-```cmd
+```bash
 accelerate launch finetune.py llama3-qlora.yml
 ```
 You can also use `train.py` instead of `-m axolotl.cli.train` or `finetune.py`.
-```cmd
+```bash
 accelerate launch train.py llama3-qlora.yml
 ```
 Expected output
-```cmd
+```bash
 {'loss': 0.237, 'learning_rate': 1.2254711850265387e-06, 'epoch': 3.77}
 {'loss': 0.6068, 'learning_rate': 1.1692453482951115e-06, 'epoch': 3.77}
 {'loss': 0.2926, 'learning_rate': 1.1143322458989303e-06, 'epoch': 3.78}
@@ -291,24 +282,24 @@ Expected output
 ## Troubleshooting
-#### TypeError: PosixPath
+### TypeError: PosixPath
 Error message: `TypeError: argument of type 'PosixPath' is not iterable`
 This issue is related to [axolotl #1544](https://github.com/OpenAccess-AI-Collective/axolotl/issues/1544). It can be fixed by downgrading datasets to 2.15.0.
-```cmd
+```bash
 pip install datasets==2.15.0
 ```
-#### RuntimeError: out of device memory
+### RuntimeError: out of device memory
 Error message: `RuntimeError: Allocation is out of device memory on current platform.`
 This issue is caused by running out of GPU memory. Please reduce `lora_r` or `micro_batch_size` in `qlora.yml` or `lora.yml`, or reduce data using in training.
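As an illustration only, those reductions could be applied from the shell; the key names come from the config templates above, while the values are placeholders to tune for your GPU:

```bash
# Hypothetical values: lower the LoRA rank and per-device batch size to fit memory
sed -i 's/^lora_r:.*/lora_r: 8/' qlora.yml
sed -i 's/^micro_batch_size:.*/micro_batch_size: 1/' qlora.yml
```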
-#### OSError: libmkl_intel_lp64.so.2
+### OSError: libmkl_intel_lp64.so.2
 Error message: `OSError: libmkl_intel_lp64.so.2: cannot open shared object file: No such file or directory`
-oneAPI environment is not correctly set. Please refer to [Set Environment Variables](#set-environment-variables).
+oneAPI environment is not correctly set. Please refer to [Set Environment Variables](#22-set-environment-variables).
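For APT or offline oneAPI installations, re-sourcing the environment in the current shell (the same command as in the section referenced above) typically resolves this:

```bash
source /opt/intel/oneapi/setvars.sh
```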

benchmark_quickstart.md

@@ -4,7 +4,7 @@ We can perform benchmarking for IPEX-LLM on Intel CPUs and GPUs using the benchm
 ## Prepare The Environment
-You can refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install.html) to install IPEX-LLM in your environment. The following dependencies are also needed to run the benchmark scripts.
+You can refer to [here](../Overview/install.md) to install IPEX-LLM in your environment. The following dependencies are also needed to run the benchmark scripts.
 ```
 pip install pandas
@@ -65,110 +65,99 @@ Some parameters in the yaml file that you can configure:
 - `task`: There are three tasks: `continuation`, `QA` and `summarize`. `continuation` refers to writing additional content based on prompt. `QA` refers to answering questions based on prompt. `summarize` refers to summarizing the prompt.
-```eval_rst
-.. note::
-   If you want to benchmark the performance without warmup, you can set ``warm_up: 0`` and ``num_trials: 1`` in ``config.yaml``, and run each single model and in_out_pair separately.
-```
+> [!NOTE]
+> If you want to benchmark the performance without warmup, you can set ``warm_up: 0`` and ``num_trials: 1`` in ``config.yaml``, and run each single model and in_out_pair separately.
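A minimal sketch of that change, assuming `warm_up` and `num_trials` appear as top-level keys in `config.yaml`:

```bash
# Single trial, no warmup runs
sed -i 's/^warm_up:.*/warm_up: 0/' config.yaml
sed -i 's/^num_trials:.*/num_trials: 1/' config.yaml
```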
 ## Run on Windows
-Please refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#runtime-configuration) to configure oneAPI environment variables.
+Please refer to [here](../Overview/install_gpu.md#runtime-configuration) to configure oneAPI environment variables. Choose corresponding commands base on your device.
-```eval_rst
-.. tabs::
-   .. tab:: Intel iGPU
-      .. code-block:: bash
+- For **Intel iGPU**:
+  ```bash
 set SYCL_CACHE_PERSISTENT=1
 set BIGDL_LLM_XMX_DISABLED=1
 python run.py
+  ```
-   .. tab:: Intel Arc™ A300-Series or Pro A60
-      .. code-block:: bash
+- For **Intel Arc™ A300-Series or Pro A60**:
+  ```bash
 set SYCL_CACHE_PERSISTENT=1
 python run.py
+  ```
-   .. tab:: Other Intel dGPU Series
-      .. code-block:: bash
+- For **Other Intel dGPU Series**:
+  ```bash
 # e.g. Arc™ A770
 python run.py
 ```
 ## Run on Linux
-```eval_rst
-.. tabs::
+Please choose corresponding commands base on your device.
-   .. tab:: Intel Arc™ A-Series and Intel Data Center GPU Flex
+- For **Intel Arc™ A-Series and Intel Data Center GPU Flex**:
 For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series, we recommend:
-      .. code-block:: bash
+  ```bash
 ./run-arc.sh
+  ```
-   .. tab:: Intel iGPU
+- For **Intel iGPU**:
 For Intel iGPU, we recommend:
-      .. code-block:: bash
+  ```bash
 ./run-igpu.sh
+  ```
-   .. tab:: Intel Data Center GPU Max
+- For **Intel Data Center GPU Max**:
 Please note that you need to run ``conda install -c conda-forge -y gperftools=2.10`` before running the benchmark script on Intel Data Center GPU Max Series.
-      .. code-block:: bash
+  ```bash
 ./run-max-gpu.sh
+  ```
-   .. tab:: Intel SPR
+- For **Intel SPR**:
 For Intel SPR machine, we recommend:
-      .. code-block:: bash
+  ```bash
 ./run-spr.sh
+  ```
 The scipt uses a default numactl strategy. If you want to customize it, please use ``lscpu`` or ``numactl -H`` to check how cpu indexs are assigned to numa node, and make sure the run command is binded to only one socket.
-   .. tab:: Intel HBM
+- For **Intel HBM**:
 For Intel HBM machine, we recommend:
-      .. code-block:: bash
+  ```bash
 ./run-hbm.sh
+  ```
 The scipt uses a default numactl strategy. If you want to customize it, please use ``numactl -H`` to check how the index of hbm node and cpu are assigned.
 For example:
-      .. code-block:: bash
+  ```bash
 node 0 1 2 3
   0: 10 21 13 23
   1: 21 10 23 13
   2: 13 23 10 23
   3: 23 13 23 10
+  ```
 here hbm node is the node whose distance from the checked node is 13, node 2 is node 0's hbm node.
 And make sure the run command is binded to only one socket.
-```
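As a purely illustrative sketch of binding the run command to a single socket and its HBM node (node numbers taken from the sample `numactl -H` output above; adjust for your machine):

```bash
# Socket 0 CPUs, with memory allocated on its HBM node (node 2 in the example output)
numactl --cpunodebind=0 --membind=2 python run.py
```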
 ## Result
 After the benchmarking is completed, you can obtain a CSV result file under the current folder. You can mainly look at the results of columns `1st token avg latency (ms)` and `2+ avg latency (ms/token)` for the benchmark results. You can also check whether the column `actual input/output tokens` is consistent with the column `input/output tokens` and whether the parameters you specified in `config.yaml` have been successfully applied in the benchmarking.
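One way to eyeball those columns from the shell (assuming a single result file sits in the current folder; the actual file name depends on the run):

```bash
# Align the comma-separated columns for quick reading
column -s, -t < ./*.csv | less -S
```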

bigdl_llm_migration.md

@@ -4,10 +4,9 @@ This guide helps you migrate your `bigdl-llm` application to use `ipex-llm`.
 ## Upgrade `bigdl-llm` package to `ipex-llm`
-```eval_rst
-.. note::
-   This step assumes you have already installed `bigdl-llm`.
-```
+> [!NOTE]
+> This step assumes you have already installed `bigdl-llm`.
 You need to uninstall `bigdl-llm` and install `ipex-llm`With your `bigdl-llm` conda environment activated, execute the following command according to your device type and location:
 ### For CPU
### For CPU ### For CPU
@@ -19,20 +18,17 @@ pip install --pre --upgrade ipex-llm[all] # for cpu
 ### For GPU
 Choose either US or CN website for `extra-index-url`:
-```eval_rst
-.. tabs::
-   .. tab:: US
-      .. code-block:: cmd
+- For **US**:
+  ```bash
 pip uninstall -y bigdl-llm
 pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+  ```
-   .. tab:: CN
-      .. code-block:: cmd
+- For **CN**:
+  ```bash
 pip uninstall -y bigdl-llm
 pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/
 ```

chatchat_quickstart.md

@@ -10,11 +10,12 @@
 <td align="center" width="50%">简体中文</td>
 </tr>
 <tr>
-<td><video src="https://llm-assets.readthedocs.io/en/latest/_images/langchain-chatchat-en.mp4" width="100%" controls></video></td>
+<td><a href="https://llm-assets.readthedocs.io/en/latest/_images/langchain-chatchat-en.mp4"><img src="https://llm-assets.readthedocs.io/en/latest/_images/langchain-chatchat-en.png"/></a></td>
-<td><video src="https://llm-assets.readthedocs.io/en/latest/_images/langchain-chatchat-cn.mp4" width="100%" controls></video></td>
+<td><a href="https://llm-assets.readthedocs.io/en/latest/_images/langchain-chatchat-cn.mp4"><img src="https://llm-assets.readthedocs.io/en/latest/_images/langchain-chatchat-cn.png"/></a></td>
 </tr>
 </table>
+> [!NOTE]
 > You can change the UI language in the left-side menu. We currently support **English** and **简体中文** (see video demos below).
 ## Langchain-Chatchat Architecture
@@ -66,7 +67,7 @@ You can now click `Dialogue` on the left-side menu to return to the chat UI. The
 <br/>
-For more information about how to use Langchain-Chatchat, refer to Official Quickstart guide in [English](./README_en.md#), [Chinese](./README_chs.md#), or the [Wiki](https://github.com/chatchat-space/Langchain-Chatchat/wiki/).
+For more information about how to use Langchain-Chatchat, refer to Official Quickstart guide in [English](https://github.com/chatchat-space/Langchain-Chatchat/blob/master/README_en.md#), [Chinese](https://github.com/chatchat-space/Langchain-Chatchat/blob/master/README.md#), or the [Wiki](https://github.com/chatchat-space/Langchain-Chatchat/wiki/).
 ### Trouble Shooting & Tips

continue_quickstart.md

@@ -5,30 +5,25 @@
 Below is a demo of using `Continue` with [CodeQWen1.5-7B](https://huggingface.co/Qwen/CodeQwen1.5-7B-Chat) running on Intel A770 GPU. This demo illustrates how a programmer used `Continue` to find a solution for the [Kaggle's _Titanic_ challenge](https://www.kaggle.com/competitions/titanic/), which involves asking `Continue` to complete the code for model fitting, evaluation, hyper parameter tuning, feature engineering, and explain generated code.
-<video src="https://llm-assets.readthedocs.io/en/latest/_images/continue_demo_ollama_backend_arc.mp4" width="100%" controls></video>
+[![Demo video](https://llm-assets.readthedocs.io/en/latest/_images/continue_demo_ollama_backend_arc.png)](https://llm-assets.readthedocs.io/en/latest/_images/continue_demo_ollama_backend_arc.mp4)
 ## Quickstart
-This guide walks you through setting up and running **Continue** within _Visual Studio Code_, empowered by local large language models served via [Ollama](./ollama_quickstart.html) with `ipex-llm` optimizations.
+This guide walks you through setting up and running **Continue** within _Visual Studio Code_, empowered by local large language models served via [Ollama](./ollama_quickstart.md) with `ipex-llm` optimizations.
 ### 1. Install and Run Ollama Serve
-Visit [Run Ollama with IPEX-LLM on Intel GPU](./ollama_quickstart.html), and follow the steps 1) [Install IPEX-LLM for Ollama](./ollama_quickstart.html#install-ipex-llm-for-ollama), 2) [Initialize Ollama](./ollama_quickstart.html#initialize-ollama) 3) [Run Ollama Serve](./ollama_quickstart.html#run-ollama-serve) to install, init and start the Ollama Service.
+Visit [Run Ollama with IPEX-LLM on Intel GPU](./ollama_quickstart.md), and follow the steps 1) [Install IPEX-LLM for Ollama](./ollama_quickstart.md#1-install-ipex-llm-for-ollama), 2) [Initialize Ollama](./ollama_quickstart.md#2-initialize-ollama) 3) [Run Ollama Serve](./ollama_quickstart.md#3-run-ollama-serve) to install, init and start the Ollama Service.
-```eval_rst
-.. important::
-   If the `Continue` plugin is not installed on the same machine where Ollama is running (which means `Continue` needs to connect to a remote Ollama service), you must configure the Ollama service to accept connections from any IP address. To achieve this, set or export the environment variable `OLLAMA_HOST=0.0.0.0` before executing the command `ollama serve`.
-.. tip::
-   If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance before executing `ollama serve`:
-   .. code-block:: bash
-      export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
-```
+> [!IMPORTANT]
+> If the `Continue` plugin is not installed on the same machine where Ollama is running (which means `Continue` needs to connect to a remote Ollama service), you must configure the Ollama service to accept connections from any IP address. To achieve this, set or export the environment variable `OLLAMA_HOST=0.0.0.0` before executing the command `ollama serve`.
+
+> [!TIP]
+> If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance before executing `ollama serve`:
+>
+> ```bash
+> export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+> ```
 ### 2. Pull and Prepare the Model
@@ -36,30 +31,25 @@ Visit [Run Ollama with IPEX-LLM on Intel GPU](./ollama_quickstart.html), and fol
 Now we need to pull a model for coding. Here we use [CodeQWen1.5-7B](https://huggingface.co/Qwen/CodeQwen1.5-7B-Chat) model as an example. Open a new terminal window, run the following command to pull [`codeqwen:latest`](https://ollama.com/library/codeqwen).
-```eval_rst
-.. tabs::
-   .. tab:: Linux
-      .. code-block:: bash
+- For **Linux users**:
+  ```bash
 export no_proxy=localhost,127.0.0.1
 ./ollama pull codeqwen:latest
+  ```
-   .. tab:: Windows
+- For **Windows users**:
 Please run the following command in Miniforge Prompt.
-      .. code-block:: cmd
+  ```cmd
 set no_proxy=localhost,127.0.0.1
 ollama pull codeqwen:latest
 ```
-.. seealso::
-   Besides CodeQWen, there are other coding models you might want to explore, such as Magicoder, Wizardcoder, Codellama, Codegemma, Starcoder, Starcoder2, and etc. You can find these models in the `Ollama model library <https://ollama.com/library>`_. Simply search for the model, pull it in a similar manner, and give it a try.
+> [!NOTE]
+> Besides CodeQWen, there are other coding models you might want to explore, such as Magicoder, Wizardcoder, Codellama, Codegemma, Starcoder, Starcoder2, and etc. You can find these models in the [`Ollama model library`](https://ollama.com/library). Simply search for the model, pull it in a similar manner, and give it a try.
 #### 2.2 Prepare the Model and Pre-load
@@ -72,8 +62,8 @@ Start by creating a file named `Modelfile` with the following content:
 FROM codeqwen:latest
 PARAMETER num_ctx 4096
 ```
 Next, use the following commands in the terminal (Linux) or Miniforge Prompt (Windows) to create a new model in Ollama named `codeqwen:latest-continue`:
 ```bash
 ollama create codeqwen:latest-continue -f Modelfile
@@ -87,8 +77,6 @@ Finally, preload the new model by executing the following command in a new termi
 ollama run codeqwen:latest-continue
 ```
 ### 3. Install `Continue` Extension
 Search for `Continue` in the VSCode `Extensions Marketplace` and install it just like any other extension.

deepspeed_autotp_fastapi_quickstart.md

@@ -1,10 +1,10 @@
 # Run IPEX-LLM serving on Multiple Intel GPUs using DeepSpeed AutoTP and FastApi
-This example demonstrates how to run IPEX-LLM serving on multiple [Intel GPUs](../README.md) by leveraging DeepSpeed AutoTP.
+This example demonstrates how to run IPEX-LLM serving on multiple [Intel GPUs](../../../python/llm/example/GPU/README.md) by leveraging DeepSpeed AutoTP.
 ## Requirements
-To run this example with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information. For this particular example, you will need at least two GPUs on your machine.
+To run this example with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../python/llm/example/GPU/README.md#requirements) for more information. For this particular example, you will need at least two GPUs on your machine.
 ## Example
@@ -24,7 +24,8 @@ pip install mpi4py fastapi uvicorn
 conda install -c conda-forge -y gperftools=2.10 # to enable tcmalloc
 ```
-> **Important**: IPEX 2.1.10+xpu requires Intel® oneAPI Base Toolkit's version == 2024.0. Please make sure you have installed the correct version.
+> [!IMPORTANT]
+> IPEX 2.1.10+xpu requires Intel® oneAPI Base Toolkit's version == 2024.0. Please make sure you have installed the correct version.
 ### 2. Run tensor parallel inference on multiple GPUs
@@ -35,7 +36,6 @@ We provide example usage for `Llama-2-7b-chat-hf` model running on Arc A770
 Run Llama-2-7b-chat-hf on two Intel Arc A770:
 ```bash
 # Before run this script, you should adjust the YOUR_REPO_ID_OR_MODEL_PATH in last line
 # If you want to change server port, you can set port parameter in last line
@@ -52,7 +52,8 @@ If you successfully run the serving, you can get output like this:
 [0] INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
 ```
-> **Note**: You could change `NUM_GPUS` to the number of GPUs you have on your machine. And you could also specify other low bit optimizations through `--low-bit`.
+> [!NOTE]
+> You could change `NUM_GPUS` to the number of GPUs you have on your machine. And you could also specify other low bit optimizations through `--low-bit`.
 ### 3. Sample Input and Output
@@ -83,7 +84,8 @@ And you should get output like this:
 ```
-**Important**: The first token latency is much larger than rest token latency, you could use [our benchmark tool](https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/dev/benchmark/README.md) to obtain more details about first and rest token latency.
+> [!IMPORTANT]
+> The first token latency is much larger than rest token latency, you could use [our benchmark tool](../../../python/llm/dev/benchmark/README.md) to obtain more details about first and rest token latency.
 ### 4. Benchmark with wrk

dify_quickstart.md

@@ -6,7 +6,7 @@
 *See the demo of a RAG workflow in Dify running LLaMA2-7B on Intel A770 GPU below.*
-<video src="https://llm-assets.readthedocs.io/en/latest/_images/dify-rag-small.mp4" width="100%" controls></video>
+[![Demo video](https://llm-assets.readthedocs.io/en/latest/_images/dify-rag-small.png)](https://llm-assets.readthedocs.io/en/latest/_images/dify-rag-small.mp4)
 ## Quickstart
@@ -99,13 +99,8 @@ NEXT_PUBLIC_PUBLIC_API_PREFIX=http://localhost:5001/api
 NEXT_PUBLIC_SENTRY_DSN=
 ```
-```eval_rst
-.. note::
-   If you encounter connection problems, you may run `export no_proxy=localhost,127.0.0.1` before starting API servcie, Worker service and frontend.
-```
+> [!NOTE]
+> If you encounter connection problems, you may run `export no_proxy=localhost,127.0.0.1` before starting API servcie, Worker service and frontend.
 ### 3. How to Use `Dify`

fastchat_quickstart.md

@@ -20,7 +20,6 @@ To add GPU support for FastChat, you may install **`ipex-llm`** as follows:
 ```bash
 pip install --pre --upgrade ipex-llm[xpu,serving] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
 ```
 ## 2. Start the service
@@ -61,7 +60,7 @@ export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "xpu"
 ```
-We have also provided an option `--load-low-bit-model` to load models that have been converted and saved into disk using the `save_low_bit` interface as introduced in this [document](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/KeyFeatures/hugging_face_format.html#save-load).
+We have also provided an option `--load-low-bit-model` to load models that have been converted and saved into disk using the `save_low_bit` interface as introduced in this [document](../Overview/KeyFeatures/hugging_face_format.md#save--load).
 Check the following examples:
@@ -72,7 +71,7 @@ python -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path /Low/Bit/Model/
 #### For self-speculative decoding example:
-You can use IPEX-LLM to run `self-speculative decoding` example. Refer to [here](https://github.com/intel-analytics/ipex-llm/tree/c9fac8c26bf1e1e8f7376fa9a62b32951dd9e85d/python/llm/example/GPU/Speculative-Decoding) for more details on intel MAX GPUs. Refer to [here](https://github.com/intel-analytics/ipex-llm/tree/c9fac8c26bf1e1e8f7376fa9a62b32951dd9e85d/python/llm/example/GPU/Speculative-Decoding) for more details on intel CPUs.
+You can use IPEX-LLM to run `self-speculative decoding` example. Refer to [here](../../../python/llm/example/GPU/Speculative-Decoding) for more details on intel MAX GPUs. Refer to [here](../../../python/llm/example/CPU/Speculative-Decoding) for more details on intel CPUs.
 ```bash
 # Available low_bit format only including bf16 on CPU.
@@ -102,7 +101,7 @@ For a full list of accepted arguments, you can refer to the main method of the `
 #### IPEX-LLM vLLM worker
-We also provide the `vllm_worker` which uses the [vLLM](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/vLLM-Serving) engine for better hardware utilization.
+We also provide the `vllm_worker` which uses the vLLM engine (on [CPU](../../../python/llm/example/CPU/vLLM-Serving) / [GPU](../../../python/llm/example/GPU/vLLM-Serving)) for better hardware utilization.
 To run using the `vLLM_worker`, we don't need to change model name, just simply uses the following command: