Update installation guide for pipeline parallel inference (#11224)

* Update installation guide for pipeline parallel inference

* Small fix

* further fix

* Small fix

* Small fix

* Update based on comments

* Small fix

* Small fix

* Small fix
Yuwen Hu 2024-06-05 17:54:29 +08:00 committed by GitHub
parent ed67435491
commit af96579c76
3 changed files with 45 additions and 28 deletions


@@ -1,4 +1,4 @@
# Serve IPEX-LLM on Multiple Intel GPUs in Multi-Stage Pipeline Parallel Fashion
This example demonstrates how to run IPEX-LLM serving on multiple [Intel GPUs](../README.md) with Pipeline Parallel.


@@ -1,55 +1,70 @@
# Run IPEX-LLM on Multiple Intel GPUs in Pipeline Parallel Fashion
This example demonstrates how to run an IPEX-LLM optimized low-bit model vertically partitioned across multiple [Intel GPUs](../README.md); it is intended for Linux users.
## Requirements
To run this example with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information. For this particular example, you will need at least two GPUs on your machine.
> [!NOTE]
> To run IPEX-LLM on multiple Intel GPUs in pipeline parallel fashion, you will need to install **Intel® oneAPI Base Toolkit 2024.1**, which can be done through an offline installer:
> ```bash
> wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/fdc7a2bc-b7a8-47eb-8876-de6201297144/l_BaseKit_p_2024.1.0.596_offline.sh
>
> sudo sh ./l_BaseKit_p_2024.1.0.596_offline.sh
> ```
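> As an optional sanity check (an addition to this guide; it assumes `sycl-ls` from the Base Toolkit is on your `PATH` once the environment is sourced), you can list the devices the runtime can see. Each Intel GPU should appear as a `level_zero` device:
> ```bash
> # source the oneAPI environment, then enumerate SYCL devices
> source /opt/intel/oneapi/setvars.sh
> sycl-ls
> ```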
## Example: Run pipeline parallel inference on multiple GPUs
### 1. Installation
```bash
conda create -n llm python=3.11
conda activate llm

# install IPEX-LLM built for Intel XPU
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
# install the matching torch / IPEX / oneCCL builds
pip install torch==2.1.0.post2 torchvision==0.16.0.post2 torchaudio==2.1.0.post2 intel-extension-for-pytorch==2.1.30+xpu oneccl_bind_pt==2.1.300+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
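To confirm that the pinned versions were resolved as expected (an optional check, not part of the original steps):

```bash
# expect ipex-llm together with torch 2.1.0.post2, intel-extension-for-pytorch 2.1.30+xpu and oneccl-bind-pt 2.1.300+xpu
pip list | grep -iE "ipex|torch|oneccl"
```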
### 2. Configure oneAPI environment variables
```bash
source /opt/intel/oneapi/setvars.sh
```
> [!NOTE]
> Please make sure you configure the environment variables for **Intel® oneAPI Base Toolkit 2024.1**.
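With the environment configured, one quick way to verify that PyTorch can reach all of your GPUs (a sketch based on the installation above; importing `intel_extension_for_pytorch` registers the `torch.xpu` backend) is:

```bash
# should print the number of Intel GPUs available for pipeline parallelism, e.g. 2
python -c "import torch, intel_extension_for_pytorch; print(torch.xpu.device_count())"
```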
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
```
</details>
<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export ENABLE_SDP_FUSION=1
```
> [!NOTE]
> Please note that `libtcmalloc.so` can be installed via `conda install -c conda-forge -y gperftools=2.10`.
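You can also confirm that the library actually exists at the preloaded path beforehand (an optional check, assuming a conda environment is active):

```bash
ls ${CONDA_PREFIX}/lib/libtcmalloc.so
```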
</details>
### 4. Running examples
```bash
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT --gpu-num GPU_NUM
```
@@ -61,7 +76,7 @@ Arguments info:
- `--gpu-num GPU_NUM`: argument defining the number of GPUs to use. It defaults to `2`.
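For instance, a run against the Llama-2 checkpoint shown in the sample output below might look like the following (the prompt and prediction length here are illustrative, not prescriptive):

```bash
python ./generate.py --repo-id-or-model-path meta-llama/Llama-2-7b-chat-hf --prompt "What is AI?" --n-predict 32 --gpu-num 2
```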
#### Sample Output
##### [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
```log
Inference time: xxxx s
-------------------- Prompt --------------------
```


@@ -7,12 +7,14 @@ This folder contains examples of running IPEX-LLM on Intel GPU:
- [LLM-Finetuning](LLM-Finetuning): running ***finetuning*** (such as LoRA, QLoRA, QA-LoRA, etc.) using IPEX-LLM on Intel GPUs
- [vLLM-Serving](vLLM-Serving): running ***vLLM*** serving framework on Intel GPUs (with IPEX-LLM low-bit optimized models)
- [Deepspeed-AutoTP](Deepspeed-AutoTP): running distributed inference using ***DeepSpeed AutoTP*** (with IPEX-LLM low-bit optimized models) on Intel GPUs
- [Deepspeed-AutoTP-FastAPI](Deepspeed-AutoTP-FastAPI): running distributed inference using ***DeepSpeed AutoTP*** and serving with ***FastAPI*** (with IPEX-LLM low-bit optimized models) on Intel GPUs
- [Pipeline-Parallel-Inference](Pipeline-Parallel-Inference): running IPEX-LLM optimized low-bit model vertically partitioned on multiple Intel GPUs
- [Pipeline-Parallel-FastAPI](Pipeline-Parallel-FastAPI): running IPEX-LLM serving with ***FastAPI*** on multiple Intel GPUs in pipeline parallel fashion
- [LangChain](LangChain): running ***LangChain*** applications on IPEX-LLM
- [PyTorch-Models](PyTorch-Models): running any PyTorch model on IPEX-LLM (with "one-line code change")
- [Speculative-Decoding](Speculative-Decoding): running any ***Hugging Face Transformers*** model with ***self-speculative decoding*** on Intel GPUs
- [ModelScope-Models](ModelScope-Models): running ***ModelScope*** models with IPEX-LLM on Intel GPUs
- [Long-Context](Long-Context): running **long-context** generation with IPEX-LLM on Intel Arc™ A770 Graphics
## System Support