Finetuning LLAMA Using QLoRA (experimental support)

This example demonstrates how to finetune a llama2-7b model using BigDL-LLM 4-bit optimizations on Intel CPUs.

Distributed Training Guide

  1. Single node with single socket: simple example or alpaca example
  2. Single node with multiple sockets
  3. Multiple nodes with multiple sockets

Example: Finetune llama2-7b using QLoRA

This example is ported from bnb-4bit-training.

1. Install

conda create -n llm python=3.9
conda activate llm
pip install --pre --upgrade bigdl-llm[all]
pip install transformers==4.34.0
pip install peft==0.5.0
pip install datasets
pip install accelerate==0.23.0
pip install bitsandbytes scipy
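
Because several packages above are pinned to exact versions, it can help to confirm the environment before starting a long run. The following is a minimal sketch, not part of this example's scripts; the helper name find_mismatches is hypothetical, and the pinned versions are copied from the install commands above.

```python
from importlib import metadata

# Versions pinned by the install commands above.
PINNED = {
    "transformers": "4.34.0",
    "peft": "0.5.0",
    "accelerate": "0.23.0",
}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for packages that are missing
    or installed at a different version. `found` is None when missing."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    problems = find_mismatches(PINNED)
    for name, (expected, found) in problems.items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
    if not problems:
        print("All pinned packages match.")
```

Run it inside the activated llm environment; any reported mismatch can be fixed by re-running the corresponding pip install line.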

2. Finetune model

If the machine does not have enough memory, try setting use_gradient_checkpointing=True here. Gradient checkpointing may improve memory efficiency, but it slows training by approximately 20%. For better performance with 48 cores in this example, we recommend a micro_batch_size of 8; refer to this guide for more details. Remember to source bigdl-llm-init before you start finetuning, as it can accelerate the job.

source bigdl-llm-init -t
python ./qlora_finetuning_cpu.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --dataset DATASET
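
The two tuning notes above can be made concrete with a little arithmetic. This is a rough sketch under stated assumptions: it assumes the common convention that a global batch is split into micro-batches through gradient accumulation, it uses a hypothetical global batch size of 128, and it reads "slows training by approximately 20%" as 20% more wall-clock time. None of these values are taken from the scripts in this directory.

```python
def gradient_accumulation_steps(batch_size: int, micro_batch_size: int) -> int:
    """Number of micro-batches accumulated before each optimizer step."""
    if micro_batch_size <= 0 or batch_size % micro_batch_size != 0:
        raise ValueError("batch_size must be a positive multiple of micro_batch_size")
    return batch_size // micro_batch_size


def estimated_runtime(seconds: float, slowdown: float = 0.20) -> float:
    """Estimated wall-clock time with gradient checkpointing enabled,
    assuming the slowdown means `slowdown` extra time (an assumption)."""
    return seconds * (1.0 + slowdown)


if __name__ == "__main__":
    # Hypothetical global batch of 128 with the recommended micro_batch_size of 8.
    print("accumulation steps:", gradient_accumulation_steps(128, 8))
    # A 436 s run (07:16) would take roughly 20% longer with checkpointing.
    print("estimated seconds:", round(estimated_runtime(436), 1))
```

A smaller micro_batch_size lowers peak memory at the cost of more accumulation steps, which is the same trade-off direction as enabling gradient checkpointing.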

Sample Output

{'loss': 2.0251, 'learning_rate': 0.0002, 'epoch': 0.02}
{'loss': 1.2389, 'learning_rate': 0.00017777777777777779, 'epoch': 0.03}
{'loss': 1.032, 'learning_rate': 0.00015555555555555556, 'epoch': 0.05}
{'loss': 0.9141, 'learning_rate': 0.00013333333333333334, 'epoch': 0.06}
{'loss': 0.8505, 'learning_rate': 0.00011111111111111112, 'epoch': 0.08}
{'loss': 0.8713, 'learning_rate': 8.888888888888889e-05, 'epoch': 0.09}
{'loss': 0.8635, 'learning_rate': 6.666666666666667e-05, 'epoch': 0.11}
{'loss': 0.8853, 'learning_rate': 4.4444444444444447e-05, 'epoch': 0.12}
{'loss': 0.859, 'learning_rate': 2.2222222222222223e-05, 'epoch': 0.14}
{'loss': 0.8608, 'learning_rate': 0.0, 'epoch': 0.15}
{'train_runtime': xxxx, 'train_samples_per_second': xxxx, 'train_steps_per_second': xxxx, 'train_loss': 1.0400420665740966, 'epoch': 0.15}
100%|███████████████████████████████████████████████████████████████████████████████████| 200/200 [07:16<00:00,  2.18s/it]
TrainOutput(global_step=200, training_loss=1.0400420665740966, metrics={'train_runtime': xxxx, 'train_samples_per_second': xxxx, 'train_steps_per_second': xxxx, 'train_loss': 1.0400420665740966, 'epoch': 0.15})
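
The sample output can be sanity-checked with a short script. The sketch below parses the ten logged entries, compares their mean loss with the reported train_loss, and compares the logged learning rates with a linear warmup-then-decay schedule. The schedule parameters (peak 2e-4, 20 warmup steps, 200 total steps, one log entry every 20 steps) are assumptions inferred from the output above, not values read from the training script.

```python
import ast

# The ten per-step entries from the sample output above.
LOG_LINES = """
{'loss': 2.0251, 'learning_rate': 0.0002, 'epoch': 0.02}
{'loss': 1.2389, 'learning_rate': 0.00017777777777777779, 'epoch': 0.03}
{'loss': 1.032, 'learning_rate': 0.00015555555555555556, 'epoch': 0.05}
{'loss': 0.9141, 'learning_rate': 0.00013333333333333334, 'epoch': 0.06}
{'loss': 0.8505, 'learning_rate': 0.00011111111111111112, 'epoch': 0.08}
{'loss': 0.8713, 'learning_rate': 8.888888888888889e-05, 'epoch': 0.09}
{'loss': 0.8635, 'learning_rate': 6.666666666666667e-05, 'epoch': 0.11}
{'loss': 0.8853, 'learning_rate': 4.4444444444444447e-05, 'epoch': 0.12}
{'loss': 0.859, 'learning_rate': 2.2222222222222223e-05, 'epoch': 0.14}
{'loss': 0.8608, 'learning_rate': 0.0, 'epoch': 0.15}
"""


def parse_entries(text):
    """Turn each non-empty log line (a Python dict literal) into a dict."""
    return [ast.literal_eval(line) for line in text.strip().splitlines()]


def linear_lr(step, peak=2e-4, warmup=20, total=200):
    """Linear warmup to `peak`, then linear decay to zero (assumed schedule)."""
    if step < warmup:
        return peak * step / warmup
    return peak * (total - step) / (total - warmup)


entries = parse_entries(LOG_LINES)
mean_loss = sum(e["loss"] for e in entries) / len(entries)

# Assumed logging interval: one entry every 20 steps, so entry i is step 20 * (i + 1).
max_lr_error = max(
    abs(e["learning_rate"] - linear_lr(20 * (i + 1))) for i, e in enumerate(entries)
)

if __name__ == "__main__":
    print(f"mean of logged losses: {mean_loss:.5f}")
    print(f"largest learning-rate deviation from the assumed schedule: {max_lr_error:.2e}")
    print(f"seconds per step: {(7 * 60 + 16) / 200:.2f}")
```

The mean of the logged losses agrees with the reported train_loss to about four decimal places, and 07:16 over 200 steps matches the 2.18s/it shown in the progress bar.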

3. Merge the adapter into the original model

Use export_merged_model.py to merge the adapter into the original model:

python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --adapter_path ./outputs/checkpoint-200 --output_path ./outputs/checkpoint-200-merged
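
For intuition, merging a LoRA adapter folds the low-rank update into each adapted weight matrix: W_merged = W + (lora_alpha / r) * B * A. The sketch below shows that arithmetic on a tiny matrix in plain Python. It illustrates the standard LoRA formula only; the internals of export_merged_model.py are not shown here, and the function names and example values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return weight + (lora_alpha / r) * (lora_b @ lora_a).

    weight is (out, in), lora_a is (r, in), lora_b is (out, r).
    """
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Tiny example with rank r = 1 (hypothetical values).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]    # shape (1, 2)
B = [[0.5], [0.0]]  # shape (2, 1)
merged = merge_lora(W, A, B, lora_alpha=2, r=1)

if __name__ == "__main__":
    print(merged)
```

Once the update is folded in, the adapter is no longer needed at inference time, since the merged matrix produces the same output as the base weight plus the scaled adapter path.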

After merging, you can use ./outputs/checkpoint-200-merged as a normal Hugging Face Transformers model for inference.
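
A minimal inference sketch with the merged checkpoint, using the standard Hugging Face Transformers API, might look like the following. It assumes the tokenizer files were saved alongside the merged weights; if they were not, point AutoTokenizer at the original model path instead. The prompt, the helper names, and max_new_tokens are illustrative choices.

```python
import os

MERGED_DIR = "./outputs/checkpoint-200-merged"


def looks_like_hf_model_dir(path):
    """Cheap check that `path` holds a saved model (save_pretrained writes config.json)."""
    return os.path.isfile(os.path.join(path, "config.json"))


def generate(prompt, model_dir=MERGED_DIR, max_new_tokens=32):
    """Load the merged model and return generated text for `prompt`."""
    # Imported lazily so the directory check above works without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    if looks_like_hf_model_dir(MERGED_DIR):
        print(generate("What is QLoRA?"))
    else:
        print(f"No merged model found at {MERGED_DIR}; run the merge step first.")
```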