	IPEX-LLM transformers-style API
Hugging Face transformers AutoModel
You can apply IPEX-LLM optimizations to any Hugging Face Transformers model by using the standard AutoModel APIs.
Note
Here we take ipex_llm.transformers.AutoModelForCausalLM as an example. The class methods described below are the same for the other classes, including ipex_llm.transformers.AutoModel, AutoModelForSpeechSeq2Seq, AutoModelForSeq2SeqLM, AutoModelForSequenceClassification, AutoModelForMaskedLM, AutoModelForQuestionAnswering, AutoModelForNextSentencePrediction, AutoModelForMultipleChoice and AutoModelForTokenClassification.
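For instance, switching an existing Transformers loading call to IPEX-LLM typically only requires changing the import; a minimal sketch is shown below (the model id is a placeholder for illustration):

    # Before: from transformers import AutoModelForCausalLM
    from ipex_llm.transformers import AutoModelForCausalLM

    # "meta-llama/Llama-2-7b-hf" is only an illustrative placeholder model id.
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",
        load_in_4bit=True,   # quantize linear weights to symmetric int4 while loading
    )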
class ipex_llm.transformers.AutoModelForCausalLM
classmethod from_pretrained(*args, **kwargs)
Load a model from a directory or the HF Hub. With the load_in_4bit or load_in_low_bit parameter, the weights of the model's linear layers can be loaded in a low-bit format, such as int4, int5 and int8.
Several new arguments are added to extend Hugging Face's from_pretrained method as follows:
Parameters:
- load_in_4bit: boolean value, True means loading linear weights to symmetric int 4 if the model is a regular fp16/bf16/fp32 model, and to asymmetric int 4 if the model is a GPTQ model. Default to be False.
- load_in_low_bit: str value, options are 'sym_int4', 'asym_int4', 'sym_int5', 'asym_int5', 'sym_int8', 'nf3', 'nf4', 'fp4', 'fp8', 'fp8_e4m3', 'fp8_e5m2', 'fp6', 'gguf_iq2_xxs', 'gguf_iq2_xs', 'gguf_iq1_s', 'gguf_q4k_m', 'gguf_q4k_s', 'fp16', 'bf16' and 'fp6_k'. 'sym_int4' means symmetric int 4, 'asym_int4' means asymmetric int 4, 'nf4' means 4-bit NormalFloat, etc. Relevant low-bit optimizations will be applied to the model.
- optimize_model: boolean value, whether to further optimize the low-bit llm model. Default to be True.
- modules_to_not_convert: list of str value, modules (nn.Module) that are skipped when conducting model optimizations. Default to be None.
- speculative: boolean value, whether to use speculative decoding. Default to be False.
- cpu_embedding: whether to replace the Embedding layer; may need to be set to True when running IPEX-LLM on GPU. Default to be False.
- imatrix: str value, filename of an importance matrix pretrained on specific datasets, for use with the improved quantization methods recently added to llama.cpp.
- model_hub: str value, options are 'huggingface' and 'modelscope', specifying the model hub. Default to be 'huggingface'.
- embedding_qtype: str value, options are 'q2_k' and 'q4_k' now. Default to be None. Relevant low-bit optimizations will be applied to the nn.Embedding layer.
- mixed_precision: boolean value, whether to use mixed precision quantization. Default to be False. If set to True, sym_int8 will be used for lm_head when load_in_low_bit is 'sym_int4' or 'asym_int4'.
- pipeline_parallel_stages: int value, the number of GPUs allocated for pipeline parallelism. Default to be 1. Please set pipeline_parallel_stages > 1 to run pipeline parallel inference on multiple GPUs.

Returns: A model instance.
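A minimal sketch of the two loading styles follows; the model path and the chosen low-bit format are placeholders, and whether a given checkpoint needs extra arguments such as trust_remote_code depends on the model:

    from ipex_llm.transformers import AutoModelForCausalLM
    from transformers import AutoTokenizer

    model_path = "path/to/model"  # placeholder: an HF Hub id or a local directory

    # Option 1: boolean shortcut for symmetric int4
    model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)

    # Option 2: pick an explicit low-bit format
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        load_in_low_bit="nf4",   # 4-bit NormalFloat
        optimize_model=True,
    )

    # The tokenizer is loaded with the standard Hugging Face API
    tokenizer = AutoTokenizer.from_pretrained(model_path)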
 
classmethod from_gguf(fpath, optimize_model=True, cpu_embedding=False, low_bit="sym_int4")
Load a GGUF model and tokenizer and convert them to an IPEX-LLM model and a Hugging Face tokenizer.
Parameters:
- fpath: Path to the GGUF model file.
- optimize_model: whether to further optimize the llm model, defaults to True.
- cpu_embedding: whether to replace the Embedding layer; may need to be set to True when running IPEX-LLM on GPU, defaults to False.

Returns: An optimized IPEX-LLM model and a Hugging Face tokenizer.
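A minimal sketch of GGUF loading is shown below; the .gguf file path is a placeholder:

    from ipex_llm.transformers import AutoModelForCausalLM

    # "models/llama-2-7b.Q4_0.gguf" is a placeholder path to a GGUF checkpoint.
    model, tokenizer = AutoModelForCausalLM.from_gguf(
        "models/llama-2-7b.Q4_0.gguf",
        optimize_model=True,
        cpu_embedding=False,
        low_bit="sym_int4",
    )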
 
classmethod load_convert(q_k, optimize_model, *args, **kwargs)
classmethod load_low_bit(pretrained_model_name_or_path, *model_args, **kwargs)
Load a low bit optimized model (including INT4, INT5 and INT8) from a saved ckpt.
Parameters:
- pretrained_model_name_or_path: str value, path to load the optimized model ckpt.
- optimize_model: boolean value, whether to further optimize the low-bit llm model. Default to be True.

Returns: A model instance.
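A minimal round-trip sketch is shown below, assuming the model was previously quantized with from_pretrained and saved with the optimized model's save_low_bit method; both paths are placeholders:

    from ipex_llm.transformers import AutoModelForCausalLM

    save_dir = "llama2-7b-sym-int4-ckpt"  # placeholder directory for the saved low-bit ckpt

    # One-time conversion: quantize while loading, then save the low-bit checkpoint
    model = AutoModelForCausalLM.from_pretrained("path/to/original/model", load_in_4bit=True)
    model.save_low_bit(save_dir)

    # Later runs: reload the already-quantized checkpoint directly
    model = AutoModelForCausalLM.load_low_bit(save_dir, optimize_model=True)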