Finetuning LLAMA Using QLoRA (experimental support)
This example demonstrates how to finetune a llama2-7b model using IPEX-LLM 4-bit optimizations on Intel CPUs.
Distributed Training Guide
- Single node with single socket: simple example or alpaca example
- Single node with multiple sockets
- Multiple nodes with multiple sockets
Example: Finetune llama2-7b using QLoRA
This example is ported from bnb-4bit-training.
1. Install
conda create -n llm python=3.11
conda activate llm
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
pip install transformers==4.36.0
pip install peft==0.10.0
pip install datasets
pip install bitsandbytes scipy
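As an optional sanity check after installation (a minimal sketch, not part of the example itself), the snippet below just imports the packages pinned above and prints their versions, which should match the pins.
# Optional sanity check: confirm the pinned packages are importable.
import ipex_llm                                     # import only; provides the 4-bit optimizations
import transformers, peft, datasets, bitsandbytes

print("transformers:", transformers.__version__)    # expect 4.36.0
print("peft:", peft.__version__)                    # expect 0.10.0
print("datasets:", datasets.__version__)
print("bitsandbytes:", bitsandbytes.__version__)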
2. Finetune model
If the machine does not have enough memory, you can try setting use_gradient_checkpointing=True here. Gradient checkpointing reduces memory usage, but it slows training by roughly 20%.
We recommend a micro_batch_size of 8 for better performance with 48 cores in this example; you can refer to this guide for more details. The sketch after the run command below shows where these two options typically fit.
Remember to source ipex-llm-init before you start finetuning; it can accelerate the job.
source ipex-llm-init -t
python ./qlora_finetuning_cpu.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --dataset DATASET
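For reference, the sketch below illustrates where micro_batch_size and use_gradient_checkpointing typically appear in a plain PEFT/Transformers QLoRA-style setup. It is an assumption-laden simplification, not the actual contents of qlora_finetuning_cpu.py (which uses ipex-llm's own QLoRA utilities); the model path and the global batch size are placeholders.
import transformers
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

micro_batch_size = 8        # value recommended above for a 48-core machine
batch_size = 128            # assumed global batch size; adjust to your setup

model = AutoModelForCausalLM.from_pretrained("REPO_ID_OR_MODEL_PATH")

# Flip use_gradient_checkpointing to True if memory is tight; expect ~20% slower training.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=False)
model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=32,
                                         target_modules=["q_proj", "v_proj"]))

training_args = transformers.TrainingArguments(
    per_device_train_batch_size=micro_batch_size,                # the micro batch
    gradient_accumulation_steps=batch_size // micro_batch_size,  # preserves the global batch size
    output_dir="./outputs",
)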
Sample Output
{'loss': 2.0251, 'learning_rate': 0.0002, 'epoch': 0.02}
{'loss': 1.2389, 'learning_rate': 0.00017777777777777779, 'epoch': 0.03}
{'loss': 1.032, 'learning_rate': 0.00015555555555555556, 'epoch': 0.05}
{'loss': 0.9141, 'learning_rate': 0.00013333333333333334, 'epoch': 0.06}
{'loss': 0.8505, 'learning_rate': 0.00011111111111111112, 'epoch': 0.08}
{'loss': 0.8713, 'learning_rate': 8.888888888888889e-05, 'epoch': 0.09}
{'loss': 0.8635, 'learning_rate': 6.666666666666667e-05, 'epoch': 0.11}
{'loss': 0.8853, 'learning_rate': 4.4444444444444447e-05, 'epoch': 0.12}
{'loss': 0.859, 'learning_rate': 2.2222222222222223e-05, 'epoch': 0.14}
{'loss': 0.8608, 'learning_rate': 0.0, 'epoch': 0.15}
{'train_runtime': xxxx, 'train_samples_per_second': xxxx, 'train_steps_per_second': xxxx, 'train_loss': 1.0400420665740966, 'epoch': 0.15}
100%|███████████████████████████████████████████████████████████████████████████████████| 200/200 [07:16<00:00,  2.18s/it]
TrainOutput(global_step=200, training_loss=1.0400420665740966, metrics={'train_runtime': xxxx, 'train_samples_per_second': xxxx, 'train_steps_per_second': xxxx, 'train_loss': 1.0400420665740966, 'epoch': 0.15})
3. Merge the adapter into the original model
Use export_merged_model.py to merge the adapter into the base model:
python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --adapter_path ./outputs/checkpoint-200 --output_path ./outputs/checkpoint-200-merged
Then you can use ./outputs/checkpoint-200-merged as a normal Hugging Face Transformers model for inference.
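For illustration, loading the merged checkpoint looks like any other Transformers workflow (a minimal sketch; the prompt and generation settings are placeholders, not taken from this example):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

merged_path = "./outputs/checkpoint-200-merged"
tokenizer = AutoTokenizer.from_pretrained(merged_path)
model = AutoModelForCausalLM.from_pretrained(merged_path, torch_dtype=torch.bfloat16)

inputs = tokenizer("What is AI?", return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
If you want low-bit CPU inference instead, the merged folder can likewise be loaded with ipex-llm's AutoModelForCausalLM and load_in_4bit=True, as in the other ipex-llm examples.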