### Run Tensor-Parallel BigDL Transformers INT4 Inference with Deepspeed

#### 1. Install Dependencies

Install the necessary packages (Python 3.9 is the test environment used here):

```bash
bash install.sh
```
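For reference, `install.sh` pins the exact packages and versions; at a high level the environment needs BigDL-LLM together with deepspeed. A rough, illustrative sketch of that kind of setup (the extras and versions here are assumptions, `install.sh` is the authoritative source):

```bash
# Illustrative only -- see install.sh for the exact, tested package versions.
pip install --pre --upgrade bigdl-llm[all]   # BigDL-LLM with transformers support
pip install deepspeed                        # provides deepspeed.init_inference and the launcher
```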
#### 2. Initialize Deepspeed Distributed Context

As shown in the example code `deepspeed_autotp.py`, you can construct the parallel model with the Python API:

```python
# Load the HuggingFace Transformers model
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(...)

# Parallelize the model with deepspeed
import deepspeed

model = deepspeed.init_inference(
    model,                # a Transformers AutoModel
    mp_size=world_size,   # instance (process) count
    dtype=torch.float16,
    replace_method="auto")
```
The returned model is then converted into a deepspeed `InferenceEngine`.
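The snippet above references `world_size`, and step 3 below uses `local_rank`. When the script is started through the deepspeed launcher (as `run.sh` does), these are typically read from the environment variables the launcher exports for every process. A minimal sketch, assuming the standard `WORLD_SIZE`/`LOCAL_RANK` variables are present:

```python
import os

# The deepspeed launcher exports these per process; fall back to a
# single-process default when running the script directly.
world_size = int(os.environ.get("WORLD_SIZE", "1"))
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
```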
#### 3. Optimize Model with BigDL-LLM Low Bit

The distributed model managed by deepspeed can be further optimized with the BigDL-LLM low-bit Python API, e.g. `sym_int4`:
```python
# Apply BigDL-LLM INT4 optimizations to the transformers model
from bigdl.llm import optimize_model

model = optimize_model(model.module.to('cpu'), low_bit='sym_int4')
model = model.to(f'cpu:{local_rank}')  # move the partial model to the local rank
```
A BigDL-LLM optimized transformers model is then returned, which can subsequently serve in parallel with the native APIs.
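For example, each rank can then call the usual transformers generation API on the optimized model; the tokenizer checkpoint, prompt, and decoding parameters below are placeholders only:

```python
import torch
from transformers import AutoTokenizer

# Use the same checkpoint path as the model loaded in step 2.
tokenizer = AutoTokenizer.from_pretrained(...)
inputs = tokenizer("Once upon a time", return_tensors="pt")

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```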
#### 4. Start Python Code

You can try deepspeed with BigDL-LLM by running:
```bash
bash run.sh
```
If you want to run your own application, note the **necessary configurations in the script**, which can also be ported to your custom deepspeed application:
```bash
# run.sh
source bigdl-nano-init
unset OMP_NUM_THREADS # deepspeed will set it for each instance automatically
source /opt/intel/oneccl/env/setvars.sh
......
export FI_PROVIDER=tcp
export CCL_ATL_TRANSPORT=ofi
export CCL_PROCESS_LAUNCHER=none
```
Please set the above configurations before running `deepspeed` to ensure correct parallel communication and good performance.
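After these variables are set, a custom script is handed to the `deepspeed` launcher in the same way `deepspeed_autotp.py` is in `run.sh`. A minimal, illustrative invocation (the script name and its arguments are placeholders; check `run.sh` for the exact launcher flags used on CPU):

```bash
# Illustrative only -- source the environment configuration above first.
deepspeed your_deepspeed_app.py --your-args
```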