### Run Tensor-Parallel BigDL Transformers INT4 Inference with Deepspeed
#### 1. Install Dependencies
Install the necessary packages (Python 3.9 is our test environment):
```bash
bash install.sh
```
#### 2. Initialize Deepspeed Distributed Context
As shown in the example code `deepspeed_autotp.py`, you can construct the parallel model with the Python API:
```python
# Load the model with Hugging Face Transformers
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(...)

# Parallelize the model with DeepSpeed AutoTP
import torch
import deepspeed
model = deepspeed.init_inference(
    model,                 # a Transformers AutoModel
    mp_size=world_size,    # instance (process) count
    dtype=torch.float16,
    replace_method="auto",
)
```
The returned model is then wrapped as a DeepSpeed `InferenceEngine`.
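
For reference, here is a minimal sketch of how `world_size` and `local_rank` might be obtained in `deepspeed_autotp.py`; the environment variable names are assumptions based on the conventions of common distributed launchers, so check the actual example code for the exact mechanism:
```python
import os

# Sketch only: this assumes the launcher (e.g. the deepspeed command or mpirun)
# exports these variables for each worker process.
world_size = int(os.environ.get("WORLD_SIZE", "1"))
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
```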
#### 3. Optimize Model with BigDL-LLM Low Bit
The distributed model managed by DeepSpeed can be further optimized with the BigDL-LLM low-bit Python API, e.g. `sym_int4`:
```python
# Apply BigDL-LLM INT4 optimizations on the Transformers model
from bigdl.llm import optimize_model
model = optimize_model(model.module.to('cpu'), low_bit='sym_int4')
model = model.to(f'cpu:{local_rank}')  # move the partial model to the local rank
```
A BigDL-LLM optimized Transformers model is returned, which can then serve inference in parallel through the native Transformers APIs.
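
For illustration, here is a minimal sketch of running generation on the optimized model; the prompt, tokenizer usage, and `max_new_tokens` value are placeholders rather than part of the original example:
```python
import torch
from transformers import AutoTokenizer

# Placeholder: use the same model path passed to from_pretrained above.
tokenizer = AutoTokenizer.from_pretrained(...)
inputs = tokenizer("Once upon a time", return_tensors="pt")
with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```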
#### 4. Start Python Code
You can try DeepSpeed with BigDL-LLM by running:
```bash
bash run.sh
```
If you want to run your own application, note the **necessary configurations in the script**, which can also be ported to your custom DeepSpeed application:
```bash
# run.sh
source bigdl-nano-init
unset OMP_NUM_THREADS # deepspeed will set it for each instance automatically
source /opt/intel/oneccl/env/setvars.sh
......
export FI_PROVIDER=tcp
export CCL_ATL_TRANSPORT=ofi
export CCL_PROCESS_LAUNCHER=none
```
Please set the above configurations before running `deepspeed` to ensure correct parallel communication and high performance.
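
As an illustration of porting these settings, a hypothetical launch script for a custom application might look like the following; the final `deepspeed` command line (arguments, host file, etc.) depends on your own application and cluster setup:
```bash
# hypothetical my_run.sh -- sketch only, adapt to your application
source bigdl-nano-init
unset OMP_NUM_THREADS                     # deepspeed sets it for each instance
source /opt/intel/oneccl/env/setvars.sh
export FI_PROVIDER=tcp
export CCL_ATL_TRANSPORT=ofi
export CCL_PROCESS_LAUNCHER=none

deepspeed my_app.py                       # placeholder entry point
```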