### Run Tensor-Parallel IPEX-LLM Transformers INT4 Inference with Deepspeed
#### 1. Install Dependencies
Install the necessary packages (Python 3.9 is used as our test environment):
```bash
bash install.sh
```
The first step in the script installs oneCCL (a wrapper for Intel MPI) to enable distributed communication between DeepSpeed instances; it can be skipped if Intel MPI/oneCCL/oneAPI has already been set up on your machine. Please refer to [oneCCL](https://github.com/oneapi-src/oneCCL) if you run into any issue during installation or import.
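
If you are unsure whether the environment is already prepared, a quick sanity check along these lines can help before skipping that step (the `setvars.sh` path matches the one sourced in `run.sh`; the Python module name is an assumption and may differ in your setup):

```bash
# sanity check before skipping the oneCCL install step (paths/module names are assumptions)
source /opt/intel/oneccl/env/setvars.sh && mpirun --version
python -c "import oneccl_bindings_for_pytorch" \
    || echo "oneCCL bindings for PyTorch are not importable"
```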
#### 2. Initialize Deepspeed Distributed Context
As shown in the example code `deepspeed_autotp.py`, you can construct the parallel model with the Python API:
```python
# Load a Hugging Face Transformers model
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(...)

# Parallelize the model with DeepSpeed
import deepspeed

model = deepspeed.init_inference(
    model,               # an AutoModel from Transformers
    mp_size=world_size,  # instance (process) count
    dtype=torch.float16,
    replace_method="auto")
```
The returned model is then converted into a DeepSpeed `InferenceEngine` instance.
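
The snippet above assumes `world_size` and `local_rank` are already defined. A minimal sketch of obtaining them, assuming the launcher passes the standard environment variables and that the `ccl` backend is used for CPU runs with oneCCL (both are assumptions, not part of the original example):

```python
import os
import deepspeed

# rank and world size are typically provided by the launcher via environment variables
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

# initialize the distributed context; 'ccl' is an assumed backend for CPU + oneCCL
deepspeed.init_distributed(dist_backend='ccl')
```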
#### 3. Optimize Model with IPEX-LLM Low Bit
The distributed model managed by DeepSpeed can be further optimized with the IPEX-LLM low-bit Python API, e.g. `sym_int4`:
```python
# Apply IPEX-LLM INT4 optimizations on the Transformers model
from ipex_llm import optimize_model

model = optimize_model(model.module.to('cpu'), low_bit='sym_int4')
model = model.to(f'cpu:{local_rank}')  # move the partial model to the local rank
```
An IPEX-LLM optimized Transformers model is then returned, which can subsequently serve inference in parallel through the native APIs.
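
As a usage illustration, here is a minimal generation sketch with the optimized model (the tokenizer checkpoint, prompt, and generation length are placeholders, and `local_rank` comes from the distributed context as above):

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(...)  # placeholder: same checkpoint as the model
inputs = tokenizer("Once upon a time", return_tensors="pt")

with torch.inference_mode():
    output_ids = model.generate(inputs.input_ids, max_new_tokens=32)

if local_rank == 0:  # avoid duplicated output across the parallel instances
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```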
#### 4. Start Python Code
You can try DeepSpeed with IPEX-LLM by running:
```bash
bash run.sh
```
If you want to run your own application, note the **necessary configurations in the script**, which can also be ported to your custom DeepSpeed application:
```bash
# run.sh
source ipex-llm-init                      # set up the IPEX-LLM runtime environment
unset OMP_NUM_THREADS                     # deepspeed will set it for each instance automatically
source /opt/intel/oneccl/env/setvars.sh   # set up the oneCCL/Intel MPI environment
......
export FI_PROVIDER=tcp                    # use the TCP libfabric provider
export CCL_ATL_TRANSPORT=ofi              # use OFI as the oneCCL transport
export CCL_PROCESS_LAUNCHER=none          # processes are launched by deepspeed, not by oneCCL
```
Please set the above configurations before running `deepspeed` to ensure correct parallel communication and good performance.
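
For a custom application, the same environment setup can be reused before the launch command. A minimal sketch, where the script name and arguments are placeholders and the exact launch flags used in `run.sh` may differ:

```bash
# environment setup as in run.sh
source ipex-llm-init
unset OMP_NUM_THREADS
source /opt/intel/oneccl/env/setvars.sh
export FI_PROVIDER=tcp
export CCL_ATL_TRANSPORT=ofi
export CCL_PROCESS_LAUNCHER=none

# launch your own script with the deepspeed launcher (placeholder name and arguments)
deepspeed your_app.py --your-args
```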