diff --git a/python/llm/example/CPU/QLoRA-FineTuning/README.md b/python/llm/example/CPU/QLoRA-FineTuning/README.md
new file mode 100644
index 00000000..5acd255b
--- /dev/null
+++ b/python/llm/example/CPU/QLoRA-FineTuning/README.md
@@ -0,0 +1,76 @@
+# Finetuning LLAMA Using QLoRA (experimental support)
+
+This example demonstrates how to finetune a llama2-7b model using BigDL-LLM 4-bit optimizations on [Intel CPUs](../README.md).
+
+
+## Example: Finetune llama2-7b using QLoRA
+
+This example is ported from [bnb-4bit-training](https://colab.research.google.com/drive/1VoYNfYDKcKRQRor98Zbf2-9VQTtGJ24k).
+
+### 1. Install
+
+```bash
+conda create -n llm python=3.9
+conda activate llm
+pip install --pre --upgrade bigdl-llm[all]
+pip install transformers==4.34.0
+pip install peft==0.5.0
+pip install datasets
+```
+
+### 2. Finetune model
+
+```
+python ./qlora_finetuning_cpu.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --dataset DATASET
+```
+
+#### Sample Output
+```log
+{'loss': 2.5668, 'learning_rate': 0.0002, 'epoch': 0.03}
+{'loss': 1.6988, 'learning_rate': 0.00017777777777777779, 'epoch': 0.06}
+{'loss': 1.3073, 'learning_rate': 0.00015555555555555556, 'epoch': 0.1}
+{'loss': 1.3495, 'learning_rate': 0.00013333333333333334, 'epoch': 0.13}
+{'loss': 1.1746, 'learning_rate': 0.00011111111111111112, 'epoch': 0.16}
+{'loss': 1.0794, 'learning_rate': 8.888888888888889e-05, 'epoch': 0.19}
+{'loss': 1.2214, 'learning_rate': 6.666666666666667e-05, 'epoch': 0.22}
+{'loss': 1.1698, 'learning_rate': 4.4444444444444447e-05, 'epoch': 0.26}
+{'loss': 1.2044, 'learning_rate': 2.2222222222222223e-05, 'epoch': 0.29}
+{'loss': 1.1516, 'learning_rate': 0.0, 'epoch': 0.32}
+{'train_runtime': 474.3254, 'train_samples_per_second': 1.687, 'train_steps_per_second': 0.422, 'train_loss': 1.3923714351654053, 'epoch': 0.32}
+100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [07:54<00:00,  2.37s/it]
+TrainOutput(global_step=200, training_loss=1.3923714351654053, metrics={'train_runtime': 474.3254, 'train_samples_per_second': 1.687, 'train_steps_per_second': 0.422, 'train_loss': 1.3923714351654053, 'epoch': 0.32})
+```
+
+### 3. Merge the adapter into the original model
+Use [export_merged_model.py](https://github.com/intel-analytics/BigDL/blob/main/python/llm/example/GPU/QLoRA-FineTuning/export_merged_model.py) to merge the adapter into the base model.
+```
+python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --adapter_path ./outputs/checkpoint-200 --output_path ./outputs/checkpoint-200-merged
+```
+
+Then you can use `./outputs/checkpoint-200-merged` as a normal Hugging Face Transformers model for inference.
+
+### 4. Use BigDL-LLM to verify the fine-tuning effect
+Train for more steps, then verify the fine-tuning effect with an input sentence that follows the `['quote'] ->: [?]` pattern of the training data, for example `“QLoRA fine-tuning using BigDL-LLM 4bit optimizations on Intel CPU is Efficient and convenient” ->: `.
+To run inference, follow the BigDL-LLM llama2 example ([link](https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/llama2)) and update `LLAMA2_PROMPT_FORMAT = "{prompt}"` there.
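+Alternatively, you can do a quick check of the merged checkpoint directly from Python. The snippet below is a minimal sketch, not part of the example scripts: it assumes the merged model from step 3 sits at `./outputs/checkpoint-200-merged` and loads it with BigDL-LLM's `load_in_4bit=True` option.
+
+```python
+import torch
+from transformers import LlamaTokenizer
+from bigdl.llm.transformers import AutoModelForCausalLM
+
+# Hypothetical quick check; path produced by export_merged_model.py in step 3
+merged_path = "./outputs/checkpoint-200-merged"
+
+# Load the merged model with BigDL-LLM low-bit optimizations
+model = AutoModelForCausalLM.from_pretrained(merged_path, load_in_4bit=True)
+tokenizer = LlamaTokenizer.from_pretrained(merged_path)
+
+# Prompt follows the "<quote> ->: <tags>" pattern used during fine-tuning
+prompt = "“QLoRA fine-tuning using BigDL-LLM 4bit optimizations on Intel CPU is Efficient and convenient” ->: "
+input_ids = tokenizer(prompt, return_tensors="pt").input_ids
+with torch.inference_mode():
+    output = model.generate(input_ids, max_new_tokens=20)
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+```
+
+To reproduce the sample outputs below, run the example's `generate.py`: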
+```bash
+python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt "“QLoRA fine-tuning using BigDL-LLM 4bit optimizations on Intel CPU is Efficient and convenient” ->:" --n-predict 20
+```
+
+#### Sample Output
+Base model output
+```log
+Inference time: 1.7017452716827393 s
+-------------------- Prompt --------------------
+“QLoRA fine-tuning using BigDL-LLM 4bit optimizations on Intel CPU is Efficient and convenient” ->:
+-------------------- Output --------------------
+“QLoRA fine-tuning using BigDL-LLM 4bit optimizations on Intel CPU is Efficient and convenient” ->: 💻 Fine-tuning a language model on a powerful device like an Intel CPU
+```
+Merged model output
+```log
+Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
+Inference time: 2.864234209060669 s
+-------------------- Prompt --------------------
+“QLoRA fine-tuning using BigDL-LLM 4bit optimizations on Intel CPU is Efficient and convenient” ->:
+-------------------- Output --------------------
+“QLoRA fine-tuning using BigDL-LLM 4bit optimizations on Intel CPU is Efficient and convenient” ->: ['bigdl'] ['deep-learning'] ['distributed-computing'] ['intel'] ['optimization'] ['training'] ['training-speed']
+```
\ No newline at end of file
diff --git a/python/llm/example/CPU/QLoRA-FineTuning/qlora_finetuning_cpu.py b/python/llm/example/CPU/QLoRA-FineTuning/qlora_finetuning_cpu.py
new file mode 100644
index 00000000..2863251e
--- /dev/null
+++ b/python/llm/example/CPU/QLoRA-FineTuning/qlora_finetuning_cpu.py
@@ -0,0 +1,86 @@
+#
+# Copyright 2016 The BigDL Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import torch
+import os
+
+import transformers
+from transformers import LlamaTokenizer
+
+from peft import LoraConfig
+from bigdl.llm.transformers.qlora import get_peft_model, prepare_model_for_kbit_training
+from bigdl.llm.transformers import AutoModelForCausalLM
+from datasets import load_dataset
+import argparse
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description='Finetune a Llama2 model with QLoRA using BigDL-LLM 4-bit optimizations')
+    parser.add_argument('--repo-id-or-model-path', type=str, default="meta-llama/Llama-2-7b-hf",
+                        help='The huggingface repo id for the Llama2 (e.g. 
`meta-llama/Llama-2-7b-hf` and `meta-llama/Llama-2-13b-chat-hf`) to be downloaded'
+                             ', or the path to the huggingface checkpoint folder')
+    parser.add_argument('--dataset', type=str, default="Abirate/english_quotes")
+
+    args = parser.parse_args()
+    model_path = args.repo_id_or_model_path
+    dataset_path = args.dataset
+    tokenizer = LlamaTokenizer.from_pretrained(model_path, trust_remote_code=True)
+
+    data = load_dataset(dataset_path)
+    # build the training text: each sample becomes "<quote> ->: <tags>"
+    def merge(row):
+        row['prediction'] = row['quote'] + ' ->: ' + str(row['tags'])
+        return row
+    data['train'] = data['train'].map(merge)
+    data = data.map(lambda samples: tokenizer(samples["prediction"]), batched=True)
+    # load the base model in 4-bit (sym_int4) with BigDL-LLM, keeping lm_head unquantized
+    model = AutoModelForCausalLM.from_pretrained(model_path,
+                                                 load_in_low_bit="sym_int4",
+                                                 optimize_model=False,
+                                                 torch_dtype=torch.float16,
+                                                 modules_to_not_convert=["lm_head"], )
+    model = model.to('cpu')
+    # prepare the low-bit model for training and enable input gradients for PEFT
+    model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=False)
+    model.enable_input_require_grads()
+    # LoRA adapter configuration applied to the attention projections
+    config = LoraConfig(
+        r=8,
+        lora_alpha=32,
+        target_modules=["q_proj", "k_proj", "v_proj"],
+        lora_dropout=0.05,
+        bias="none",
+        task_type="CAUSAL_LM"
+    )
+    model = get_peft_model(model, config)
+    # Llama has no pad token by default; pad with token id 0 on the left
+    tokenizer.pad_token_id = 0
+    tokenizer.padding_side = "left"
+    trainer = transformers.Trainer(
+        model=model,
+        train_dataset=data["train"],
+        args=transformers.TrainingArguments(
+            per_device_train_batch_size=4,
+            gradient_accumulation_steps=1,
+            warmup_steps=20,
+            max_steps=200,
+            learning_rate=2e-4,
+            save_steps=100,
+            bf16=True,
+            logging_steps=20,
+            output_dir="outputs",
+            optim="adamw_hf",  # paged_adamw_8bit is not supported yet
+            # gradient_checkpointing=True, # can further reduce memory but slower
+        ),
+        data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
+    )
+    model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
+    result = trainer.train()
+    print(result)
diff --git a/python/llm/example/CPU/README.md b/python/llm/example/CPU/README.md
index b6be45cd..f9fe7769 100644
--- a/python/llm/example/CPU/README.md
+++ b/python/llm/example/CPU/README.md
@@ -7,6 +7,7 @@ This folder contains examples of running BigDL-LLM on Intel CPU:
 - [Native-Models](Native-Models): converting & running LLM in `llama`/`chatglm`/`bloom`/`gptneox`/`starcoder` model family using native (cpp) implementation
 - [LangChain](LangChain): running LangChain applications on BigDL-LLM
 - [Applications](Applications): running Transformers applications on BigDl-LLM
+- [QLoRA-FineTuning](QLoRA-FineTuning): running QLoRA finetuning using BigDL-LLM on Intel CPUs
 
 ## System Support
 **Hardware**: