# Self-Speculative Decoding

### Speculative Decoding in Practice

In [speculative](https://arxiv.org/abs/2302.01318) [decoding](https://arxiv.org/abs/2211.17192), a small (draft) model quickly generates multiple draft tokens, which are then verified in parallel by the large (target) model. While speculative decoding can effectively speed up the target model, ***in practice it is difficult to maintain or even obtain a proper draft model***, especially when the target model is finetuned with customized data.
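To make the idea concrete, here is a minimal sketch of one greedy draft-and-verify step. This is a simplified illustration, not IPEX-LLM code; `draft_model`, `target_model`, `tokens`, and `k` are hypothetical placeholders.

```python
import torch

def speculative_step(draft_model, target_model, tokens, k=4):
    """One greedy draft-and-verify step (illustrative sketch only).

    `draft_model` and `target_model` are assumed to be callables that map a
    1-D tensor of token ids to next-token logits of shape (seq_len, vocab).
    """
    # 1. The small draft model proposes k tokens autoregressively (cheap).
    draft = tokens.clone()
    for _ in range(k):
        next_tok = draft_model(draft)[-1].argmax(dim=-1, keepdim=True)
        draft = torch.cat([draft, next_tok])

    # 2. The large target model scores every proposed position in a single
    #    parallel forward pass, instead of k sequential decoding steps.
    target_pred = target_model(draft).argmax(dim=-1)

    # 3. Accept draft tokens until the first disagreement, then substitute the
    #    target's own token there; at least one new token is always produced.
    n = tokens.shape[0]
    accepted = tokens
    for i in range(k):
        if target_pred[n + i - 1] == draft[n + i]:
            accepted = torch.cat([accepted, draft[n + i].unsqueeze(0)])
        else:
            accepted = torch.cat([accepted, target_pred[n + i - 1].unsqueeze(0)])
            break
    return accepted
```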
### Self-Speculative Decoding

Built on top of the concept of “[self-speculative decoding](https://arxiv.org/abs/2309.08168)”, IPEX-LLM can now accelerate the original FP16 or BF16 model ***without the need for a separate draft model or model finetuning***; instead, it automatically converts the original model to INT4 and uses the INT4 model as the draft model behind the scenes. In practice, this brings ***~30% speedup*** to FP16 and BF16 LLM inference latency on Intel GPU and CPU, respectively.
### Using IPEX-LLM Self-Speculative Decoding

Please refer to the IPEX-LLM self-speculative decoding code snippet below, as well as the detailed [GPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/Speculative-Decoding) and [CPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/Speculative-Decoding) examples in the project repo.
```python
import torch
from ipex_llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_path,
                                             optimize_model=True,
                                             torch_dtype=torch.float16,   # use torch.bfloat16 on CPU
                                             load_in_low_bit="fp16",      # use "bf16" on CPU
                                             speculative=True,            # enable self-speculative decoding
                                             trust_remote_code=True,
                                             use_cache=True)

output = model.generate(input_ids,
                        max_new_tokens=args.n_predict,
                        do_sample=False)
```
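In the snippet above, `model_path`, `input_ids`, and `args.n_predict` are defined by the surrounding example script. Purely as an illustrative sketch (the prompt and token budget below are arbitrary choices; on Intel GPU, the linked examples additionally move the model and inputs to the `xpu` device), the inputs could be prepared with a standard Hugging Face tokenizer:

```python
from transformers import AutoTokenizer

# Hypothetical input preparation for the snippet above; `model_path` is the
# same checkpoint path that was passed to AutoModelForCausalLM.from_pretrained.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
prompt = "Once upon a time, there existed a little girl"   # example prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Greedy generation; the INT4 draft model is used behind the scenes because
# the model was loaded with speculative=True.
output = model.generate(input_ids,
                        max_new_tokens=32,   # arbitrary budget for this sketch
                        do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```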