# Self-Speculative Decoding

### Speculative Decoding in Practice

In [speculative](https://arxiv.org/abs/2302.01318) [decoding](https://arxiv.org/abs/2211.17192), a small (draft) model quickly generates multiple draft tokens, which are then verified in parallel by the large (target) model. While speculative decoding can effectively speed up the target model, ***in practice it is difficult to maintain or even obtain a proper draft model***, especially when the target model is finetuned with customized data.
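To make the idea concrete, here is a minimal sketch of one greedy draft-and-verify step. This is a simplified illustration, not IPEX-LLM code; `draft_model`, `target_model`, `tokens`, and `k` are hypothetical placeholders.

```python
import torch

def speculative_step(draft_model, target_model, tokens, k=4):
    """One greedy draft-and-verify step (illustrative sketch only).

    `draft_model` and `target_model` are assumed to be callables that map a
    1-D tensor of token ids to next-token logits of shape (seq_len, vocab).
    """
    # 1. The small draft model proposes k tokens autoregressively (cheap).
    draft = tokens.clone()
    for _ in range(k):
        next_tok = draft_model(draft)[-1].argmax(dim=-1, keepdim=True)
        draft = torch.cat([draft, next_tok])

    # 2. The large target model scores every proposed position in a single
    #    parallel forward pass, instead of k sequential decoding steps.
    target_pred = target_model(draft).argmax(dim=-1)

    # 3. Accept draft tokens until the first disagreement, then substitute the
    #    target's own token there; at least one new token is always produced.
    n = tokens.shape[0]
    accepted = tokens
    for i in range(k):
        if target_pred[n + i - 1] == draft[n + i]:
            accepted = torch.cat([accepted, draft[n + i].unsqueeze(0)])
        else:
            accepted = torch.cat([accepted, target_pred[n + i - 1].unsqueeze(0)])
            break
    return accepted
```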
### Self-Speculative Decoding

Built on top of the concept of “[self-speculative decoding](https://arxiv.org/abs/2309.08168)”, IPEX-LLM can now accelerate the original FP16 or BF16 model ***without the need for a separate draft model or model finetuning***; instead, it automatically converts the original model to INT4 and uses the INT4 model as the draft model behind the scenes. In practice, this brings ***~30% speedup*** to FP16 and BF16 LLM inference latency on Intel GPU and CPU, respectively.
### Using IPEX-LLM Self-Speculative Decoding

Please refer to the IPEX-LLM self-speculative decoding code snippet below, as well as the detailed [GPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/Speculative-Decoding) and [CPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/Speculative-Decoding) examples in the project repo.
```python
import torch
from ipex_llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_path,
                                             optimize_model=True,
                                             torch_dtype=torch.float16,   # use torch.bfloat16 on CPU
                                             load_in_low_bit="fp16",      # use "bf16" on CPU
                                             speculative=True,            # enable self-speculative decoding
                                             trust_remote_code=True,
                                             use_cache=True)

output = model.generate(input_ids,
                        max_new_tokens=args.n_predict,
                        do_sample=False)
```
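In the snippet above, `model_path`, `input_ids`, and `args.n_predict` are defined by the surrounding example script. Purely as an illustrative sketch (the prompt and token budget below are arbitrary choices; on Intel GPU, the linked examples additionally move the model and inputs to the `xpu` device), the inputs could be prepared with a standard Hugging Face tokenizer:

```python
from transformers import AutoTokenizer

# Hypothetical input preparation for the snippet above; `model_path` is the
# same checkpoint path that was passed to AutoModelForCausalLM.from_pretrained.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
prompt = "Once upon a time, there existed a little girl"   # example prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Greedy generation; the INT4 draft model is used behind the scenes because
# the model was loaded with speculative=True.
output = model.generate(input_ids,
                        max_new_tokens=32,   # arbitrary budget for this sketch
                        do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```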