# Finetune (QLoRA)
We also support finetuning LLMs (large language models) using QLoRA with IPEX-LLM 4-bit optimizations on Intel GPUs.
> **Note**
>
> Currently, only Hugging Face Transformers models are supported for QLoRA finetuning.
To help you better understand the finetuning process, here we use the model `Llama-2-7b-hf` as an example.

Make sure you have prepared your environment by following the instructions here.
> **Note**
>
> If you are using an older version of `ipex-llm` (specifically, older than `2.5.0b20240104`), you need to manually add `import intel_extension_for_pytorch as ipex` at the beginning of your code.
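For those older versions only, the very top of your script would look like this minimal sketch:

```python
# Only needed for ipex-llm versions older than 2.5.0b20240104
import intel_extension_for_pytorch as ipex
```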
First, load the model using the transformers-style API and move it to the Intel GPU with `to('xpu')`. We specify `load_in_low_bit="nf4"` here to apply 4-bit NormalFloat optimization. According to the QLoRA paper, using `"nf4"` could yield better model quality than `"int4"`.
```python
import torch
from ipex_llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                             load_in_low_bit="nf4",
                                             optimize_model=False,
                                             torch_dtype=torch.float16,
                                             modules_to_not_convert=["lm_head"])
model = model.to('xpu')
```
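The finetuning data pipeline also needs a tokenizer. This is not IPEX-LLM specific, but a minimal sketch using the standard Hugging Face `AutoTokenizer` is shown here for completeness; the padding choice is an assumption and should match your own data preparation:

```python
from transformers import AutoTokenizer

# Hypothetical tokenizer setup for Llama-2-7b-hf; adapt it to your own data pipeline
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token  # Llama-2 has no pad token by default
```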
Then, we have to apply some preprocessing to the model to prepare it for training.
```python
from ipex_llm.transformers.qlora import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
```
Next, we can obtain a PEFT model from the optimized model and a configuration object containing the LoRA parameters, as follows:
```python
from ipex_llm.transformers.qlora import get_peft_model
from peft import LoraConfig

config = LoraConfig(r=8,
                    lora_alpha=32,
                    target_modules=["q_proj", "k_proj", "v_proj"],
                    lora_dropout=0.05,
                    bias="none",
                    task_type="CAUSAL_LM")
model = get_peft_model(model, config)
```
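Since the result is a PEFT-compatible model, you can sanity-check that only the LoRA adapter weights are trainable using PEFT's built-in helper:

```python
# Prints the number of trainable (LoRA) parameters vs. the total parameter count
model.print_trainable_parameters()
```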
> **Important**
>
> Instead of `from peft import prepare_model_for_kbit_training, get_peft_model` as we did for regular QLoRA using bitsandbytes and CUDA, we import them from `ipex_llm.transformers.qlora` here to get an IPEX-LLM compatible PEFT model. The rest is just the same as the regular LoRA finetuning process using `peft`.
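For example, the training itself can be driven by the standard `transformers.Trainer`. The sketch below is illustrative only: it assumes you already have a tokenized dataset `train_data` and the `tokenizer` from earlier (neither is defined by this guide), and the hyperparameters are arbitrary placeholders:

```python
import transformers

trainer = transformers.Trainer(
    model=model,
    train_dataset=train_data,  # assumption: a tokenized dataset prepared beforehand
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=20,
        max_steps=200,
        learning_rate=2e-4,
        logging_steps=20,
        output_dir="outputs",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # avoid warnings with gradient checkpointing; re-enable for inference
trainer.train()
```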
> **Tip**
>
> See the complete examples here.