
Run Tensor-Parallel BigDL Transformers INT4 Inference with Deepspeed

1. Install Dependencies

Install the necessary packages (Python 3.9 is used as our test environment):

bash install.sh
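To confirm that the environment is ready, a quick sanity check can be run afterwards. This is only a sketch and not part of install.sh; it assumes the script installs torch, transformers, deepspeed and bigdl-llm:

# Sanity-check sketch (assumption: install.sh installs these packages)
import torch
import transformers
import deepspeed
import bigdl.llm

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("deepspeed:", deepspeed.__version__)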

2. Initialize Deepspeed Distributed Context

As shown in the example code deepspeed_autotp.py, you can construct a parallel model with the Python API:

# Load a HuggingFace Transformers model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(...)


# Parallelize the model with deepspeed
import torch
import deepspeed

model = deepspeed.init_inference(
    model,               # an AutoModel of Transformers
    mp_size=world_size,  # instance (process) count
    dtype=torch.float16,
    replace_method="auto")

Then, the returned model is converted into a deepspeed InferenceEngine type.
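Note that world_size (and the local_rank used below) are not defined by deepspeed itself. A minimal sketch of how they can be derived, assuming the launcher (e.g. deepspeed or mpirun invoked from run.sh) sets the standard WORLD_SIZE and LOCAL_RANK environment variables:

# Minimal sketch: derive the distributed context from launcher-set environment
# variables (WORLD_SIZE / LOCAL_RANK are assumptions about your launcher)
import os

world_size = int(os.getenv("WORLD_SIZE", "1"))  # total number of instances (processes)
local_rank = int(os.getenv("LOCAL_RANK", "0"))  # rank of this process on the local node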

3. Optimize Model with BigDL-LLM Low Bit

The distributed model managed by deepspeed can be further optimized with the BigDL-LLM low-bit Python API, e.g. sym_int4:

# Apply BigDL-LLM INT4 optimizations on transformers
from bigdl.llm import optimize_model

model = optimize_model(model.module.to('cpu'), low_bit='sym_int4')
model = model.to(f'cpu:{local_rank}') # move partial model to local rank

Then, a BigDL-LLM optimized transformers model is returned, which can subsequently serve inference in parallel through the native transformers APIs.
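For example, generation works the same way as with a non-distributed transformers model. A minimal sketch (the model_path and prompt here are hypothetical, and only rank 0 prints to avoid duplicated output):

# Minimal generation sketch; model_path and the prompt are hypothetical
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)
inputs = tokenizer("What is AI?", return_tensors="pt").to(f'cpu:{local_rank}')

with torch.inference_mode():
    output = model.generate(inputs.input_ids, max_new_tokens=32)

if local_rank == 0:  # print from a single rank only
    print(tokenizer.decode(output[0], skip_special_tokens=True))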

4. Start Python Code

You can try deepspeed with BigDL-LLM by running:

bash run.sh

If you want to run your own application, the script contains the necessary configurations, which can also be ported to your custom deepspeed application:

# run.sh
source bigdl-nano-init                   # set performance-related environment variables
unset OMP_NUM_THREADS                    # deepspeed will set it for each instance automatically
source /opt/intel/oneccl/env/setvars.sh  # set up the oneCCL environment
......
export FI_PROVIDER=tcp                   # use the libfabric TCP provider
export CCL_ATL_TRANSPORT=ofi             # use OFI (libfabric) as the oneCCL transport
export CCL_PROCESS_LAUNCHER=none         # processes are launched externally, not by oneCCL

Please set the above configurations before running deepspeed to ensure correct parallel communication and good performance.