Update part of Quickstart guide in mddocs (2/2) (#11376)
* axolotl_quickstart.md
* benchmark_quickstart.md
* bigdl_llm_migration.md
* chatchat_quickstart.md
* continue_quickstart.md
* deepspeed_autotp_fastapi_quickstart.md
* dify_quickstart.md
* fastchat_quickstart.md
* adjust tab style
* fix link
* fix link
* add video preview
* Small fixes
* Small fix

Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>
Parent: 8c9f877171
Commit: 9a3a21e4fc
8 changed files with 154 additions and 193 deletions
axolotl_quickstart.md

@@ -4,7 +4,7 @@
 See the demo of finetuning LLaMA2-7B on Intel Arc GPU below.

-<video src="https://llm-assets.readthedocs.io/en/latest/_images/axolotl-qlora-linux-arc.mp4" width="100%" controls></video>
+[](https://llm-assets.readthedocs.io/en/latest/_images/axolotl-qlora-linux-arc.mp4)

 ## Quickstart

@@ -12,13 +12,13 @@ See the demo of finetuning LLaMA2-7B on Intel Arc GPU below.
 IPEX-LLM's support for [Axolotl v0.4.0](https://github.com/OpenAccess-AI-Collective/axolotl/tree/v0.4.0) is only available for Linux systems. We recommend Ubuntu 20.04 or later (Ubuntu 22.04 is preferred).

-Visit the [Install IPEX-LLM on Linux with Intel GPU](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html), follow [Install Intel GPU Driver](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#install-intel-gpu-driver) and [Install oneAPI](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#install-oneapi) to install GPU driver and Intel® oneAPI Base Toolkit 2024.0.
+Visit the [Install IPEX-LLM on Linux with Intel GPU](./install_linux_gpu.md), follow [Install Intel GPU Driver](./install_linux_gpu.md#install-gpu-driver) and [Install oneAPI](./install_linux_gpu.md#install-oneapi) to install GPU driver and Intel® oneAPI Base Toolkit 2024.0.

 ### 1. Install IPEX-LLM for Axolotl

 Create a new conda env, and install `ipex-llm[xpu]`.

-```cmd
+```bash
 conda create -n axolotl python=3.11
 conda activate axolotl
 # install ipex-llm

@@ -27,7 +27,7 @@ pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-exte
 Install [axolotl v0.4.0](https://github.com/OpenAccess-AI-Collective/axolotl/tree/v0.4.0) from git.

-```cmd
+```bash
 # install axolotl v0.4.0
 git clone -b v0.4.0 https://github.com/OpenAccess-AI-Collective/axolotl
 cd axolotl

@@ -62,46 +62,37 @@ For more technical details, please refer to [Llama 2](https://arxiv.org/abs/2307
 By default, Axolotl will automatically download models and datasets from Huggingface. Please ensure you have logged in to Huggingface.

-```cmd
+```bash
 huggingface-cli login
 ```

 If you prefer offline models and datasets, please download [Llama-2-7B](https://huggingface.co/meta-llama/Llama-2-7b) and [alpaca_2k_test](https://huggingface.co/datasets/mhenrichsen/alpaca_2k_test). Then, set `HF_HUB_OFFLINE=1` to avoid connecting to Huggingface.

-```cmd
+```bash
 export HF_HUB_OFFLINE=1
 ```

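For the offline setup above, the model and dataset can be fetched ahead of time with `huggingface-cli download` (a minimal sketch, assuming a recent `huggingface_hub` that ships this command; the two repo ids are the ones linked above):

```bash
# fetch the model and dataset into the local Hugging Face cache before going offline
huggingface-cli download meta-llama/Llama-2-7b
huggingface-cli download mhenrichsen/alpaca_2k_test --repo-type dataset
export HF_HUB_OFFLINE=1
```
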
 #### 2.2 Set Environment Variables

-```eval_rst
-.. note::
-
-This is a required step on for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI.
-```
+> [!NOTE]
+> This is a required step for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI.

 Configure oneAPI variables by running the following command:

-```eval_rst
-.. tabs::
-
-.. tab:: Linux
-
-.. code-block:: bash
-
+```bash
 source /opt/intel/oneapi/setvars.sh

 ```

 Configure accelerate to avoid training with CPU. You can download a default `default_config.yaml` with `use_cpu: false`.

-```cmd
+```bash
 mkdir -p ~/.cache/huggingface/accelerate/
 wget -O ~/.cache/huggingface/accelerate/default_config.yaml https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/default_config.yaml
 ```

 As an alternative, you can configure accelerate based on your requirements.

-```cmd
+```bash
 accelerate config
 ```

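A quick way to confirm the downloaded accelerate config keeps training off the CPU (a sketch using the cache path from the commands above):

```bash
# should print "use_cpu: false"
grep use_cpu ~/.cache/huggingface/accelerate/default_config.yaml
```
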
@@ -113,7 +104,7 @@ After finishing accelerate config, check if `use_cpu` is disabled (i.e., `use_cp
 Prepare `lora.yml` for Axolotl LoRA finetune. You can download a template from github.

-```cmd
+```bash
 wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/lora.yml
 ```

@@ -143,13 +134,13 @@ lora_fan_in_fan_out:
 Launch LoRA training with the following command.

-```cmd
+```bash
 accelerate launch finetune.py lora.yml
 ```

 In Axolotl v0.4.0, you can use `train.py` instead of `-m axolotl.cli.train` or `finetune.py`.

-```cmd
+```bash
 accelerate launch train.py lora.yml
 ```

@@ -157,7 +148,7 @@ accelerate launch train.py lora.yml
 Prepare `qlora.yml` for QLoRA finetune. You can download a template from github.

-```cmd
+```bash
 wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/qlora.yml
 ```

@@ -188,13 +179,13 @@ lora_fan_in_fan_out:
 Launch QLoRA training with the following command.

-```cmd
+```bash
 accelerate launch finetune.py qlora.yml
 ```

 In Axolotl v0.4.0, you can use `train.py` instead of `-m axolotl.cli.train` or `finetune.py`.

-```cmd
+```bash
 accelerate launch train.py qlora.yml
 ```

@@ -206,7 +197,7 @@ Warning: this section will install axolotl main ([796a085](https://github.com/Op
 Axolotl main has lots of new dependencies. Please set up a new conda env for this version.

-```cmd
+```bash
 conda create -n llm python=3.11
 conda activate llm
 # install axolotl main

@@ -229,7 +220,7 @@ Based on [axolotl Llama-3 QLoRA example](https://github.com/OpenAccess-AI-Collec
 Prepare `llama3-qlora.yml` for QLoRA finetune. You can download a template from github.

-```cmd
+```bash
 wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/llama3-qlora.yml
 ```

@@ -262,19 +253,19 @@ lora_target_linear: true
 lora_fan_in_fan_out:
 ```

-```cmd
+```bash
 accelerate launch finetune.py llama3-qlora.yml
 ```

 You can also use `train.py` instead of `-m axolotl.cli.train` or `finetune.py`.

-```cmd
+```bash
 accelerate launch train.py llama3-qlora.yml
 ```

 Expected output

-```cmd
+```bash
 {'loss': 0.237, 'learning_rate': 1.2254711850265387e-06, 'epoch': 3.77}
 {'loss': 0.6068, 'learning_rate': 1.1692453482951115e-06, 'epoch': 3.77}
 {'loss': 0.2926, 'learning_rate': 1.1143322458989303e-06, 'epoch': 3.78}

@@ -291,24 +282,24 @@ Expected output
 ## Troubleshooting

-#### TypeError: PosixPath
+### TypeError: PosixPath

 Error message: `TypeError: argument of type 'PosixPath' is not iterable`

 This issue is related to [axolotl #1544](https://github.com/OpenAccess-AI-Collective/axolotl/issues/1544). It can be fixed by downgrading datasets to 2.15.0.

-```cmd
+```bash
 pip install datasets==2.15.0
 ```

-#### RuntimeError: out of device memory
+### RuntimeError: out of device memory

 Error message: `RuntimeError: Allocation is out of device memory on current platform.`

 This issue is caused by running out of GPU memory. Please reduce `lora_r` or `micro_batch_size` in `qlora.yml` or `lora.yml`, or reduce the data used in training.

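As an illustration of that advice, a sketch of such an adjustment (the values below are placeholders, not recommendations; tune them for your GPU memory):

```bash
# shrink the LoRA rank and per-device batch size in the QLoRA config (placeholder values)
sed -i 's/^lora_r:.*/lora_r: 8/' qlora.yml
sed -i 's/^micro_batch_size:.*/micro_batch_size: 1/' qlora.yml
```
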
-#### OSError: libmkl_intel_lp64.so.2
+### OSError: libmkl_intel_lp64.so.2

 Error message: `OSError: libmkl_intel_lp64.so.2: cannot open shared object file: No such file or directory`

-oneAPI environment is not correctly set. Please refer to [Set Environment Variables](#set-environment-variables).
+oneAPI environment is not correctly set. Please refer to [Set Environment Variables](#22-set-environment-variables).

benchmark_quickstart.md

@@ -4,7 +4,7 @@ We can perform benchmarking for IPEX-LLM on Intel CPUs and GPUs using the benchm
 ## Prepare The Environment

-You can refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install.html) to install IPEX-LLM in your environment. The following dependencies are also needed to run the benchmark scripts.
+You can refer to [here](../Overview/install.md) to install IPEX-LLM in your environment. The following dependencies are also needed to run the benchmark scripts.

 ```
 pip install pandas

@@ -65,110 +65,99 @@ Some parameters in the yaml file that you can configure:
 - `task`: There are three tasks: `continuation`, `QA` and `summarize`. `continuation` refers to writing additional content based on prompt. `QA` refers to answering questions based on prompt. `summarize` refers to summarizing the prompt.

-```eval_rst
-.. note::
-
-If you want to benchmark the performance without warmup, you can set ``warm_up: 0`` and ``num_trials: 1`` in ``config.yaml``, and run each single model and in_out_pair separately.
-```
+> [!NOTE]
+> If you want to benchmark the performance without warmup, you can set ``warm_up: 0`` and ``num_trials: 1`` in ``config.yaml``, and run each single model and in_out_pair separately.

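Following the note above, a sketch of switching `config.yaml` to a single no-warmup run (this assumes `warm_up` and `num_trials` are top-level keys in the benchmark config):

```bash
# benchmark once without warmup (assumed top-level keys in config.yaml)
sed -i 's/^warm_up:.*/warm_up: 0/' config.yaml
sed -i 's/^num_trials:.*/num_trials: 1/' config.yaml
```
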
 ## Run on Windows

-Please refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#runtime-configuration) to configure oneAPI environment variables.
+Please refer to [here](../Overview/install_gpu.md#runtime-configuration) to configure oneAPI environment variables. Choose the corresponding commands based on your device.

-```eval_rst
-.. tabs::
-
-.. tab:: Intel iGPU
-
-.. code-block:: bash
-
+- For **Intel iGPU**:
+
+```bash
 set SYCL_CACHE_PERSISTENT=1
 set BIGDL_LLM_XMX_DISABLED=1

 python run.py
+```

-.. tab:: Intel Arc™ A300-Series or Pro A60
-
-.. code-block:: bash
-
+- For **Intel Arc™ A300-Series or Pro A60**:
+
+```bash
 set SYCL_CACHE_PERSISTENT=1
 python run.py
+```

-.. tab:: Other Intel dGPU Series
-
-.. code-block:: bash
-
+- For **Other Intel dGPU Series**:
+
+```bash
 # e.g. Arc™ A770
 python run.py

 ```

 ## Run on Linux

-```eval_rst
-.. tabs::
+Please choose the corresponding commands based on your device.

-.. tab:: Intel Arc™ A-Series and Intel Data Center GPU Flex
+- For **Intel Arc™ A-Series and Intel Data Center GPU Flex**:

 For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series, we recommend:

-.. code-block:: bash
+```bash

 ./run-arc.sh
+```

-.. tab:: Intel iGPU
+- For **Intel iGPU**:

 For Intel iGPU, we recommend:

-.. code-block:: bash
+```bash

 ./run-igpu.sh
+```

-.. tab:: Intel Data Center GPU Max
+- For **Intel Data Center GPU Max**:

 Please note that you need to run ``conda install -c conda-forge -y gperftools=2.10`` before running the benchmark script on Intel Data Center GPU Max Series.

-.. code-block:: bash
+```bash

 ./run-max-gpu.sh
+```

-.. tab:: Intel SPR
+- For **Intel SPR**:

 For Intel SPR machine, we recommend:

-.. code-block:: bash
+```bash

 ./run-spr.sh
+```

 The script uses a default numactl strategy. If you want to customize it, please use ``lscpu`` or ``numactl -H`` to check how CPU indices are assigned to NUMA nodes, and make sure the run command is bound to only one socket.

-.. tab:: Intel HBM
+- For **Intel HBM**:

 For Intel HBM machine, we recommend:

-.. code-block:: bash
+```bash

 ./run-hbm.sh
+```

 The script uses a default numactl strategy. If you want to customize it, please use ``numactl -H`` to check how the HBM node and CPU indices are assigned.

 For example:

-.. code-block:: bash
+```bash

 node 0 1 2 3
 0: 10 21 13 23
 1: 21 10 23 13
 2: 13 23 10 23
 3: 23 13 23 10
+```

 Here the HBM node is the node whose distance from the checked node is 13; for example, node 2 is node 0's HBM node.

 And make sure the run command is bound to only one socket.

-```

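For the HBM layout above, a sketch of a customized binding (the node numbers come from the sample ``numactl -H`` output; substitute the ones reported on your machine):

```bash
# bind the run to socket 0's CPUs and its HBM node (node 2 in the sample output above)
numactl -N 0 -m 2 ./run-hbm.sh
```
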
 ## Result

 After the benchmarking is completed, you can obtain a CSV result file under the current folder. You can mainly look at the results of columns `1st token avg latency (ms)` and `2+ avg latency (ms/token)` for the benchmark results. You can also check whether the column `actual input/output tokens` is consistent with the column `input/output tokens` and whether the parameters you specified in `config.yaml` have been successfully applied in the benchmarking.

bigdl_llm_migration.md

@@ -4,10 +4,9 @@ This guide helps you migrate your `bigdl-llm` application to use `ipex-llm`.
 ## Upgrade `bigdl-llm` package to `ipex-llm`

-```eval_rst
-.. note::
-
-This step assumes you have already installed `bigdl-llm`.
-```
+> [!NOTE]
+> This step assumes you have already installed `bigdl-llm`.

 You need to uninstall `bigdl-llm` and install `ipex-llm`. With your `bigdl-llm` conda environment activated, execute the following command according to your device type and location:

 ### For CPU

@@ -19,20 +18,17 @@ pip install --pre --upgrade ipex-llm[all] # for cpu
 ### For GPU
 Choose either US or CN website for `extra-index-url`:

-```eval_rst
-.. tabs::
-
-.. tab:: US
-
-.. code-block:: cmd
-
+- For **US**:
+
+```bash
 pip uninstall -y bigdl-llm
 pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+```

-.. tab:: CN
-
-.. code-block:: cmd
-
+- For **CN**:
+
+```bash
 pip uninstall -y bigdl-llm
 pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/
 ```

chatchat_quickstart.md

@@ -10,11 +10,12 @@
 <td align="center" width="50%">简体中文</td>
 </tr>
 <tr>
-<td><video src="https://llm-assets.readthedocs.io/en/latest/_images/langchain-chatchat-en.mp4" width="100%" controls></video></td>
-<td><video src="https://llm-assets.readthedocs.io/en/latest/_images/langchain-chatchat-cn.mp4" width="100%" controls></video></td>
+<td><a href="https://llm-assets.readthedocs.io/en/latest/_images/langchain-chatchat-en.mp4"><img src="https://llm-assets.readthedocs.io/en/latest/_images/langchain-chatchat-en.png"/></a></td>
+<td><a href="https://llm-assets.readthedocs.io/en/latest/_images/langchain-chatchat-cn.mp4"><img src="https://llm-assets.readthedocs.io/en/latest/_images/langchain-chatchat-cn.png"/></a></td>
 </tr>
 </table>

+> [!NOTE]
 > You can change the UI language in the left-side menu. We currently support **English** and **简体中文** (see video demos below).

 ## Langchain-Chatchat Architecture

@@ -66,7 +67,7 @@ You can now click `Dialogue` on the left-side menu to return to the chat UI. The
 <br/>

-For more information about how to use Langchain-Chatchat, refer to Official Quickstart guide in [English](./README_en.md#), [Chinese](./README_chs.md#), or the [Wiki](https://github.com/chatchat-space/Langchain-Chatchat/wiki/).
+For more information about how to use Langchain-Chatchat, refer to Official Quickstart guide in [English](https://github.com/chatchat-space/Langchain-Chatchat/blob/master/README_en.md#), [Chinese](https://github.com/chatchat-space/Langchain-Chatchat/blob/master/README.md#), or the [Wiki](https://github.com/chatchat-space/Langchain-Chatchat/wiki/).

 ### Trouble Shooting & Tips

continue_quickstart.md

@@ -5,30 +5,25 @@
 Below is a demo of using `Continue` with [CodeQWen1.5-7B](https://huggingface.co/Qwen/CodeQwen1.5-7B-Chat) running on Intel A770 GPU. This demo illustrates how a programmer used `Continue` to find a solution for [Kaggle's _Titanic_ challenge](https://www.kaggle.com/competitions/titanic/), which involves asking `Continue` to complete the code for model fitting, evaluation, hyperparameter tuning, feature engineering, and explaining generated code.

-<video src="https://llm-assets.readthedocs.io/en/latest/_images/continue_demo_ollama_backend_arc.mp4" width="100%" controls></video>
+[](https://llm-assets.readthedocs.io/en/latest/_images/continue_demo_ollama_backend_arc.mp4)

 ## Quickstart

-This guide walks you through setting up and running **Continue** within _Visual Studio Code_, empowered by local large language models served via [Ollama](./ollama_quickstart.html) with `ipex-llm` optimizations.
+This guide walks you through setting up and running **Continue** within _Visual Studio Code_, empowered by local large language models served via [Ollama](./ollama_quickstart.md) with `ipex-llm` optimizations.

 ### 1. Install and Run Ollama Serve

-Visit [Run Ollama with IPEX-LLM on Intel GPU](./ollama_quickstart.html), and follow the steps 1) [Install IPEX-LLM for Ollama](./ollama_quickstart.html#install-ipex-llm-for-ollama), 2) [Initialize Ollama](./ollama_quickstart.html#initialize-ollama) 3) [Run Ollama Serve](./ollama_quickstart.html#run-ollama-serve) to install, init and start the Ollama Service.
+Visit [Run Ollama with IPEX-LLM on Intel GPU](./ollama_quickstart.md), and follow the steps 1) [Install IPEX-LLM for Ollama](./ollama_quickstart.md#1-install-ipex-llm-for-ollama), 2) [Initialize Ollama](./ollama_quickstart.md#2-initialize-ollama), 3) [Run Ollama Serve](./ollama_quickstart.md#3-run-ollama-serve) to install, initialize and start the Ollama Service.

-```eval_rst
-.. important::
-
-If the `Continue` plugin is not installed on the same machine where Ollama is running (which means `Continue` needs to connect to a remote Ollama service), you must configure the Ollama service to accept connections from any IP address. To achieve this, set or export the environment variable `OLLAMA_HOST=0.0.0.0` before executing the command `ollama serve`.
-
-.. tip::
-
-If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance before executing `ollama serve`:
-
-.. code-block:: bash
-
-export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
-```
+> [!IMPORTANT]
+> If the `Continue` plugin is not installed on the same machine where Ollama is running (which means `Continue` needs to connect to a remote Ollama service), you must configure the Ollama service to accept connections from any IP address. To achieve this, set or export the environment variable `OLLAMA_HOST=0.0.0.0` before executing the command `ollama serve`.
+
+> [!TIP]
+> If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionally set the following environment variable for optimal performance before executing `ollama serve`:
+>
+> ```bash
+> export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+> ```

 ### 2. Pull and Prepare the Model

@@ -36,30 +31,25 @@ Visit [Run Ollama with IPEX-LLM on Intel GPU](./ollama_quickstart.html), and fol
 Now we need to pull a model for coding. Here we use the [CodeQWen1.5-7B](https://huggingface.co/Qwen/CodeQwen1.5-7B-Chat) model as an example. Open a new terminal window and run the following command to pull [`codeqwen:latest`](https://ollama.com/library/codeqwen).

-```eval_rst
-.. tabs::
-
-.. tab:: Linux
-
-.. code-block:: bash
-
+- For **Linux users**:
+
+```bash
 export no_proxy=localhost,127.0.0.1
 ./ollama pull codeqwen:latest
+```

-.. tab:: Windows
+- For **Windows users**:

 Please run the following command in Miniforge Prompt.

-.. code-block:: cmd
+```cmd

 set no_proxy=localhost,127.0.0.1
 ollama pull codeqwen:latest

-.. seealso::
-
-Besides CodeQWen, there are other coding models you might want to explore, such as Magicoder, Wizardcoder, Codellama, Codegemma, Starcoder, Starcoder2, and etc. You can find these models in the `Ollama model library <https://ollama.com/library>`_. Simply search for the model, pull it in a similar manner, and give it a try.
 ```

+> [!NOTE]
+> Besides CodeQWen, there are other coding models you might want to explore, such as Magicoder, Wizardcoder, Codellama, Codegemma, Starcoder, Starcoder2, etc. You can find these models in the [`Ollama model library`](https://ollama.com/library). Simply search for the model, pull it in a similar manner, and give it a try.

 #### 2.2 Prepare the Model and Pre-load

@@ -72,8 +62,8 @@ Start by creating a file named `Modelfile` with the following content:
 FROM codeqwen:latest
 PARAMETER num_ctx 4096
 ```
-Next, use the following commands in the terminal (Linux) or Miniforge Prompt (Windows) to create a new model in Ollama named `codeqwen:latest-continue`:

+Next, use the following commands in the terminal (Linux) or Miniforge Prompt (Windows) to create a new model in Ollama named `codeqwen:latest-continue`:

 ```bash
 ollama create codeqwen:latest-continue -f Modelfile

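To double-check that the new model was created before preloading it, listing the local models should now show `codeqwen:latest-continue` (on Linux, prefix with `./` as in the pull step above):

```bash
# list locally available Ollama models; codeqwen:latest-continue should appear
ollama list
```
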
@@ -87,8 +77,6 @@ Finally, preload the new model by executing the following command in a new termi
 ollama run codeqwen:latest-continue
 ```

-
-
 ### 3. Install `Continue` Extension

 Search for `Continue` in the VSCode `Extensions Marketplace` and install it just like any other extension.

deepspeed_autotp_fastapi_quickstart.md

@@ -1,10 +1,10 @@
 # Run IPEX-LLM serving on Multiple Intel GPUs using DeepSpeed AutoTP and FastApi

-This example demonstrates how to run IPEX-LLM serving on multiple [Intel GPUs](../README.md) by leveraging DeepSpeed AutoTP.
+This example demonstrates how to run IPEX-LLM serving on multiple [Intel GPUs](../../../python/llm/example/GPU/README.md) by leveraging DeepSpeed AutoTP.

 ## Requirements

-To run this example with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information. For this particular example, you will need at least two GPUs on your machine.
+To run this example with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../python/llm/example/GPU/README.md#requirements) for more information. For this particular example, you will need at least two GPUs on your machine.

 ## Example

@@ -24,7 +24,8 @@ pip install mpi4py fastapi uvicorn
 conda install -c conda-forge -y gperftools=2.10 # to enable tcmalloc
 ```

-> **Important**: IPEX 2.1.10+xpu requires Intel® oneAPI Base Toolkit's version == 2024.0. Please make sure you have installed the correct version.
+> [!IMPORTANT]
+> IPEX 2.1.10+xpu requires Intel® oneAPI Base Toolkit's version == 2024.0. Please make sure you have installed the correct version.

 ### 2. Run tensor parallel inference on multiple GPUs

@@ -35,7 +36,6 @@ We provide example usage for `Llama-2-7b-chat-hf` model running on Arc A770
 Run Llama-2-7b-chat-hf on two Intel Arc A770:

 ```bash
-
 # Before running this script, you should adjust YOUR_REPO_ID_OR_MODEL_PATH in the last line
 # If you want to change the server port, you can set the port parameter in the last line

@@ -52,7 +52,8 @@ If you successfully run the serving, you can get output like this:
 [0] INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
 ```

-> **Note**: You could change `NUM_GPUS` to the number of GPUs you have on your machine. And you could also specify other low bit optimizations through `--low-bit`.
+> [!NOTE]
+> You could change `NUM_GPUS` to the number of GPUs you have on your machine. And you could also specify other low-bit optimizations through `--low-bit`.

 ### 3. Sample Input and Output

@@ -83,7 +84,8 @@ And you should get output like this:

 ```

-**Important**: The first token latency is much larger than rest token latency, you could use [our benchmark tool](https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/dev/benchmark/README.md) to obtain more details about first and rest token latency.
+> [!IMPORTANT]
+> The first token latency is much larger than the rest token latency; you could use [our benchmark tool](../../../python/llm/dev/benchmark/README.md) to obtain more details about first and rest token latency.

 ### 4. Benchmark with wrk

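For the wrk benchmark step above, a run against the service started earlier might look like the sketch below; the thread count, connection count, duration, and request path are placeholders, so point wrk at the endpoint and payload you actually serve:

```bash
# hypothetical load test: 4 threads, 16 connections, 60 seconds against the serving port
wrk -t4 -c16 -d60s http://localhost:8000/
```
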
dify_quickstart.md

@@ -6,7 +6,7 @@
 *See the demo of a RAG workflow in Dify running LLaMA2-7B on Intel A770 GPU below.*

-<video src="https://llm-assets.readthedocs.io/en/latest/_images/dify-rag-small.mp4" width="100%" controls></video>
+[](https://llm-assets.readthedocs.io/en/latest/_images/dify-rag-small.mp4)

 ## Quickstart

@@ -99,13 +99,8 @@ NEXT_PUBLIC_PUBLIC_API_PREFIX=http://localhost:5001/api
 NEXT_PUBLIC_SENTRY_DSN=
 ```

-```eval_rst
-.. note::
-
-If you encounter connection problems, you may run `export no_proxy=localhost,127.0.0.1` before starting API servcie, Worker service and frontend.
-```
+> [!NOTE]
+> If you encounter connection problems, you may run `export no_proxy=localhost,127.0.0.1` before starting the API service, Worker service and frontend.

 ### 3. How to Use `Dify`

fastchat_quickstart.md

@@ -20,7 +20,6 @@ To add GPU support for FastChat, you may install **`ipex-llm`** as follows:
 ```bash
 pip install --pre --upgrade ipex-llm[xpu,serving] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-
 ```

 ## 2. Start the service

@@ -61,7 +60,7 @@ export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "xpu"
 ```

-We have also provided an option `--load-low-bit-model` to load models that have been converted and saved into disk using the `save_low_bit` interface as introduced in this [document](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/KeyFeatures/hugging_face_format.html#save-load).
+We have also provided an option `--load-low-bit-model` to load models that have been converted and saved to disk using the `save_low_bit` interface as introduced in this [document](../Overview/KeyFeatures/hugging_face_format.md#save--load).

 Check the following examples:

@@ -72,7 +71,7 @@ python -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path /Low/Bit/Model/
 #### For self-speculative decoding example:

-You can use IPEX-LLM to run `self-speculative decoding` example. Refer to [here](https://github.com/intel-analytics/ipex-llm/tree/c9fac8c26bf1e1e8f7376fa9a62b32951dd9e85d/python/llm/example/GPU/Speculative-Decoding) for more details on intel MAX GPUs. Refer to [here](https://github.com/intel-analytics/ipex-llm/tree/c9fac8c26bf1e1e8f7376fa9a62b32951dd9e85d/python/llm/example/GPU/Speculative-Decoding) for more details on intel CPUs.
+You can use IPEX-LLM to run the `self-speculative decoding` example. Refer to [here](../../../python/llm/example/GPU/Speculative-Decoding) for more details on Intel MAX GPUs. Refer to [here](../../../python/llm/example/CPU/Speculative-Decoding) for more details on Intel CPUs.

 ```bash
 # Available low_bit format only including bf16 on CPU.

@@ -102,7 +101,7 @@ For a full list of accepted arguments, you can refer to the main method of the `
 #### IPEX-LLM vLLM worker

-We also provide the `vllm_worker` which uses the [vLLM](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/vLLM-Serving) engine for better hardware utilization.
+We also provide the `vllm_worker` which uses the vLLM engine (on [CPU](../../../python/llm/example/CPU/vLLM-Serving) / [GPU](../../../python/llm/example/GPU/vLLM-Serving)) for better hardware utilization.

 To run using the `vLLM_worker`, we don't need to change the model name; simply use the following command: