Run IPEX-LLM on Multiple Intel GPUs in Pipeline Parallel Fashion
This example demonstrates how to run an IPEX-LLM optimized low-bit model, vertically partitioned across multiple Intel GPUs, on Linux.
Requirements
To run this example with IPEX-LLM on Intel GPUs, there are some recommended hardware and environment requirements for your machine; please refer to here for more information. For this particular example, you will need at least two GPUs on your machine.
Note
To run IPEX-LLM on multiple Intel GPUs in pipeline parallel fashion, you will need to install Intel® oneAPI Base Toolkit 2024.1, which can be done through the offline installer:
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/fdc7a2bc-b7a8-47eb-8876-de6201297144/l_BaseKit_p_2024.1.0.596_offline.sh
sudo sh ./l_BaseKit_p_2024.1.0.596_offline.sh
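If you prefer an unattended install, Intel's offline installers also support a silent mode. The flags below follow Intel's commonly documented pattern, but they are an assumption here; verify them against sudo sh ./l_BaseKit_p_2024.1.0.596_offline.sh --help before relying on them:
# illustrative silent install; --eula accept acknowledges the license non-interactively
sudo sh ./l_BaseKit_p_2024.1.0.596_offline.sh -a --silent --eula accept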
Example: Run pipeline parallel inference on multiple GPUs
1. Installation
conda create -n llm python=3.11
conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
pip install torch==2.1.0.post2 torchvision==0.16.0.post2 torchaudio==2.1.0.post2 intel-extension-for-pytorch==2.1.30+xpu oneccl_bind_pt==2.1.300+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
2. Configure oneAPI environment variables
source /opt/intel/oneapi/setvars.sh
Note
Please make sure the environment variables you configure are for Intel® oneAPI Base Toolkit version 2024.1.
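Once the environment variables are configured, a quick sanity check (illustrative, not part of the original guide) is to confirm that your GPUs are visible both to the SYCL runtime and to PyTorch:
# sycl-ls ships with the oneAPI Base Toolkit and lists all SYCL-visible devices
sycl-ls
# importing intel_extension_for_pytorch registers the XPU backend with torch
python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.xpu.device_count())"
For this example, torch.xpu.device_count() should report at least 2.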
3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
For Intel Data Center GPU Max Series
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export ENABLE_SDP_FUSION=1
Note
Please note that libtcmalloc.so can be installed by conda install -c conda-forge -y gperftools=2.10.
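Since LD_PRELOAD above points at ${CONDA_PREFIX}/lib/libtcmalloc.so, you can confirm the library landed where it is expected after installing gperftools (a simple illustrative check):
# should print the path rather than "No such file or directory"
ls ${CONDA_PREFIX}/lib/libtcmalloc.so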
4. Running examples
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT --gpu-num GPU_NUM
Arguments info:
- --repo-id-or-model-path REPO_ID_OR_MODEL_PATH: argument defining the huggingface repo id for the Llama2 model (e.g. meta-llama/Llama-2-7b-chat-hf and meta-llama/Llama-2-13b-chat-hf) to be downloaded, or the path to the huggingface checkpoint folder. It defaults to 'meta-llama/Llama-2-7b-chat-hf'.
- --prompt PROMPT: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to 'What is AI?'.
- --n-predict N_PREDICT: argument defining the max number of tokens to predict. It defaults to 32.
- --gpu-num GPU_NUM: argument defining the number of GPUs to use. It defaults to 2. An example invocation with these defaults spelled out is shown below.
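For example, to run the default Llama2 7B model across two GPUs with the default prompt, the following invocation simply spells out the documented defaults:
# equivalent to running generate.py with no arguments
python ./generate.py --repo-id-or-model-path 'meta-llama/Llama-2-7b-chat-hf' --prompt 'What is AI?' --n-predict 32 --gpu-num 2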
Sample Output
meta-llama/Llama-2-7b-chat-hf
Inference time: xxxx s
-------------------- Prompt --------------------
<s>[INST] <<SYS>>
<</SYS>>
What is AI? [/INST]
-------------------- Output --------------------
[INST] <<SYS>>
<</SYS>>
What is AI? [/INST]  Artificial intelligence (AI) is the broader field of research and development aimed at creating machines that can perform tasks that typically require human intelligence,