From 93146b943351f873077aac77067106650839ab06 Mon Sep 17 00:00:00 2001 From: Jiao Wang Date: Wed, 29 May 2024 13:15:27 -0700 Subject: [PATCH] Reconstruct Speculative Decoding example directory (#11136) * update * update * update --- .../{Eagle => EAGLE}/README.md | 4 +-- .../data/mt_bench/question.jsonl | 0 .../evaluation/gen_ea_answer_llama2chat.py | 2 +- .../{Eagle => EAGLE}/evaluation/speed.py | 0 .../{Eagle => EAGLE}/requirements.txt | 0 .../CPU/Speculative-Decoding/README.md | 17 +++-------- .../Self-Speculation/README.md | 15 ++++++++++ .../baichuan2/README.md | 2 +- .../modeling_baichuan.ipex | 0 .../tokenization_baichuan.ipex | 0 .../baichuan2/speculative.py | 0 .../{ => Self-Speculation}/chatglm3/README.md | 2 +- .../chatglm3/speculative.py | 0 .../{ => Self-Speculation}/llama2/README.md | 2 +- .../llama2/speculative.py | 0 .../{ => Self-Speculation}/llama3/README.md | 2 +- .../llama3/speculative.py | 0 .../{ => Self-Speculation}/mistral/README.md | 2 +- .../mistral/speculative.py | 0 .../{ => Self-Speculation}/mixtral/README.md | 2 +- .../mixtral/speculative.py | 0 .../{ => Self-Speculation}/qwen/README.md | 2 +- .../qwen/speculative.py | 0 .../starcoder/README.md | 2 +- .../starcoder/speculative.py | 0 .../{ => Self-Speculation}/vicuna/README.md | 2 +- .../vicuna/speculative.py | 0 .../{ => Self-Speculation}/ziya/README.md | 2 +- .../ziya/speculative.py | 0 .../{Eagle => EAGLE}/README.md | 12 ++++++-- .../data/mt_bench/question.jsonl | 0 .../evaluation/gen_ea_answer_llama2chat.py | 2 +- .../{Eagle => EAGLE}/evaluation/speed.py | 0 .../{Eagle => EAGLE}/requirements.txt | 0 .../GPU/Speculative-Decoding/README.md | 28 +++---------------- .../Self-Speculation/README.md | 26 +++++++++++++++++ .../baichuan2/README.md | 2 +- .../baichuan2/speculative.py | 0 .../{ => Self-Speculation}/chatglm3/README.md | 2 +- .../chatglm3/speculative.py | 0 .../{ => Self-Speculation}/gpt-j/README.md | 2 +- .../gpt-j/speculative.py | 0 .../{ => Self-Speculation}/llama2/README.md | 2 +- .../llama2/speculative.py | 0 .../{ => Self-Speculation}/mistral/README.md | 2 +- .../mistral/speculative.py | 0 .../{ => Self-Speculation}/qwen/README.md | 2 +- .../qwen/speculative.py | 0 48 files changed, 79 insertions(+), 59 deletions(-) rename python/llm/example/CPU/Speculative-Decoding/{Eagle => EAGLE}/README.md (91%) rename python/llm/example/CPU/Speculative-Decoding/{Eagle => EAGLE}/data/mt_bench/question.jsonl (100%) rename python/llm/example/CPU/Speculative-Decoding/{Eagle => EAGLE}/evaluation/gen_ea_answer_llama2chat.py (99%) rename python/llm/example/CPU/Speculative-Decoding/{Eagle => EAGLE}/evaluation/speed.py (100%) rename python/llm/example/CPU/Speculative-Decoding/{Eagle => EAGLE}/requirements.txt (100%) create mode 100644 python/llm/example/CPU/Speculative-Decoding/Self-Speculation/README.md rename python/llm/example/CPU/Speculative-Decoding/{ => Self-Speculation}/baichuan2/README.md (98%) rename python/llm/example/CPU/Speculative-Decoding/{ => Self-Speculation}/baichuan2/baichaun2_7b_opt_ipex/modeling_baichuan.ipex (100%) rename python/llm/example/CPU/Speculative-Decoding/{ => Self-Speculation}/baichuan2/baichaun2_7b_opt_ipex/tokenization_baichuan.ipex (100%) rename python/llm/example/CPU/Speculative-Decoding/{ => Self-Speculation}/baichuan2/speculative.py (100%) rename python/llm/example/CPU/Speculative-Decoding/{ => Self-Speculation}/chatglm3/README.md (96%) rename python/llm/example/CPU/Speculative-Decoding/{ => Self-Speculation}/chatglm3/speculative.py (100%) rename 
python/llm/example/CPU/Speculative-Decoding/{ => Self-Speculation}/llama2/README.md (98%) rename python/llm/example/CPU/Speculative-Decoding/{ => Self-Speculation}/llama2/speculative.py (100%) rename python/llm/example/CPU/Speculative-Decoding/{ => Self-Speculation}/llama3/README.md (97%) rename python/llm/example/CPU/Speculative-Decoding/{ => Self-Speculation}/llama3/speculative.py (100%) rename python/llm/example/CPU/Speculative-Decoding/{ => Self-Speculation}/mistral/README.md (98%) rename python/llm/example/CPU/Speculative-Decoding/{ => Self-Speculation}/mistral/speculative.py (100%) rename python/llm/example/CPU/Speculative-Decoding/{ => Self-Speculation}/mixtral/README.md (98%) rename python/llm/example/CPU/Speculative-Decoding/{ => Self-Speculation}/mixtral/speculative.py (100%) rename python/llm/example/CPU/Speculative-Decoding/{ => Self-Speculation}/qwen/README.md (99%) rename python/llm/example/CPU/Speculative-Decoding/{ => Self-Speculation}/qwen/speculative.py (100%) rename python/llm/example/CPU/Speculative-Decoding/{ => Self-Speculation}/starcoder/README.md (93%) rename python/llm/example/CPU/Speculative-Decoding/{ => Self-Speculation}/starcoder/speculative.py (100%) rename python/llm/example/CPU/Speculative-Decoding/{ => Self-Speculation}/vicuna/README.md (98%) rename python/llm/example/CPU/Speculative-Decoding/{ => Self-Speculation}/vicuna/speculative.py (100%) rename python/llm/example/CPU/Speculative-Decoding/{ => Self-Speculation}/ziya/README.md (92%) rename python/llm/example/CPU/Speculative-Decoding/{ => Self-Speculation}/ziya/speculative.py (100%) rename python/llm/example/GPU/Speculative-Decoding/{Eagle => EAGLE}/README.md (80%) rename python/llm/example/GPU/Speculative-Decoding/{Eagle => EAGLE}/data/mt_bench/question.jsonl (100%) rename python/llm/example/GPU/Speculative-Decoding/{Eagle => EAGLE}/evaluation/gen_ea_answer_llama2chat.py (99%) rename python/llm/example/GPU/Speculative-Decoding/{Eagle => EAGLE}/evaluation/speed.py (100%) rename python/llm/example/GPU/Speculative-Decoding/{Eagle => EAGLE}/requirements.txt (100%) create mode 100644 python/llm/example/GPU/Speculative-Decoding/Self-Speculation/README.md rename python/llm/example/GPU/Speculative-Decoding/{ => Self-Speculation}/baichuan2/README.md (97%) rename python/llm/example/GPU/Speculative-Decoding/{ => Self-Speculation}/baichuan2/speculative.py (100%) rename python/llm/example/GPU/Speculative-Decoding/{ => Self-Speculation}/chatglm3/README.md (96%) rename python/llm/example/GPU/Speculative-Decoding/{ => Self-Speculation}/chatglm3/speculative.py (100%) rename python/llm/example/GPU/Speculative-Decoding/{ => Self-Speculation}/gpt-j/README.md (97%) rename python/llm/example/GPU/Speculative-Decoding/{ => Self-Speculation}/gpt-j/speculative.py (100%) rename python/llm/example/GPU/Speculative-Decoding/{ => Self-Speculation}/llama2/README.md (98%) rename python/llm/example/GPU/Speculative-Decoding/{ => Self-Speculation}/llama2/speculative.py (100%) rename python/llm/example/GPU/Speculative-Decoding/{ => Self-Speculation}/mistral/README.md (98%) rename python/llm/example/GPU/Speculative-Decoding/{ => Self-Speculation}/mistral/speculative.py (100%) rename python/llm/example/GPU/Speculative-Decoding/{ => Self-Speculation}/qwen/README.md (97%) rename python/llm/example/GPU/Speculative-Decoding/{ => Self-Speculation}/qwen/speculative.py (100%) diff --git a/python/llm/example/CPU/Speculative-Decoding/Eagle/README.md b/python/llm/example/CPU/Speculative-Decoding/EAGLE/README.md similarity index 91% rename from 
python/llm/example/CPU/Speculative-Decoding/Eagle/README.md rename to python/llm/example/CPU/Speculative-Decoding/EAGLE/README.md index 9720c1df..f51c9ac3 100644 --- a/python/llm/example/CPU/Speculative-Decoding/Eagle/README.md +++ b/python/llm/example/CPU/Speculative-Decoding/EAGLE/README.md @@ -1,8 +1,8 @@ -# Eagle - Speculative Sampling using IPEX-LLM on Intel CPUs +# EAGLE - Speculative Sampling using IPEX-LLM on Intel CPUs In this directory, you will find the examples on how IPEX-LLM accelerate inference with speculative sampling using EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a speculative sampling method that improves text generation speed) on Intel CPUs. See [here](https://arxiv.org/abs/2401.15077) to view the paper and [here](https://github.com/SafeAILab/EAGLE) for more info on EAGLE code. ## Requirements -To run these examples with IPEX-LLM, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information. +To run these examples with IPEX-LLM, we have some recommended requirements for your machine, please refer to [here](../../README.md#system-support) for more information. Make sure you have installed `ipex-llm` before: ## Example - EAGLE Speculative Sampling with IPEX-LLM on MT-bench In this example, we run inference for a Llama2 model to showcase the speed of EAGLE with IPEX-LLM on MT-bench data on Intel CPUs. diff --git a/python/llm/example/CPU/Speculative-Decoding/Eagle/data/mt_bench/question.jsonl b/python/llm/example/CPU/Speculative-Decoding/EAGLE/data/mt_bench/question.jsonl similarity index 100% rename from python/llm/example/CPU/Speculative-Decoding/Eagle/data/mt_bench/question.jsonl rename to python/llm/example/CPU/Speculative-Decoding/EAGLE/data/mt_bench/question.jsonl diff --git a/python/llm/example/CPU/Speculative-Decoding/Eagle/evaluation/gen_ea_answer_llama2chat.py b/python/llm/example/CPU/Speculative-Decoding/EAGLE/evaluation/gen_ea_answer_llama2chat.py similarity index 99% rename from python/llm/example/CPU/Speculative-Decoding/Eagle/evaluation/gen_ea_answer_llama2chat.py rename to python/llm/example/CPU/Speculative-Decoding/EAGLE/evaluation/gen_ea_answer_llama2chat.py index 2e1c30bd..ec4b930f 100755 --- a/python/llm/example/CPU/Speculative-Decoding/Eagle/evaluation/gen_ea_answer_llama2chat.py +++ b/python/llm/example/CPU/Speculative-Decoding/EAGLE/evaluation/gen_ea_answer_llama2chat.py @@ -199,7 +199,7 @@ def get_model_answers( if enable_ipex_llm: # single line of change to enable ipex-llm - model = optimize_model(model, optimize_llm=False) + model = optimize_model(model, low_bit='sym_int4', optimize_llm=False) tokenizer = model.get_tokenizer() diff --git a/python/llm/example/CPU/Speculative-Decoding/Eagle/evaluation/speed.py b/python/llm/example/CPU/Speculative-Decoding/EAGLE/evaluation/speed.py similarity index 100% rename from python/llm/example/CPU/Speculative-Decoding/Eagle/evaluation/speed.py rename to python/llm/example/CPU/Speculative-Decoding/EAGLE/evaluation/speed.py diff --git a/python/llm/example/CPU/Speculative-Decoding/Eagle/requirements.txt b/python/llm/example/CPU/Speculative-Decoding/EAGLE/requirements.txt similarity index 100% rename from python/llm/example/CPU/Speculative-Decoding/Eagle/requirements.txt rename to python/llm/example/CPU/Speculative-Decoding/EAGLE/requirements.txt diff --git a/python/llm/example/CPU/Speculative-Decoding/README.md b/python/llm/example/CPU/Speculative-Decoding/README.md index 8d603d2a..eff869b5 100644 --- 
a/python/llm/example/CPU/Speculative-Decoding/README.md +++ b/python/llm/example/CPU/Speculative-Decoding/README.md @@ -1,15 +1,6 @@ -# Self-Speculative Decoding for Large Language Model BF16 Inference using IPEX-LLM on Intel CPUs -You can use IPEX-LLM to run BF16 inference for any Huggingface Transformer model with ***self-speculative decoding*** on Intel CPUs. This directory contains example scripts to help you quickly get started to run some popular open-source models using self-speculative decoding. Each model has its own dedicated folder, where you can find detailed instructions on how to install and run it. +# Speculative-Decoding Examples on Intel CPU -## Verified Hardware Platforms +This folder contains examples of running speculative decoding with IPEX-LLM on Intel CPU: -- Intel Xeon SPR server - -## Recommended Requirements -To run these examples with IPEX-LLM, we have some recommended requirements for your machine, please refer to [here](../README.md#system-support) for more information. Make sure you have installed `ipex-llm` before: - -```bash -pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu -``` - -Moreover, install IPEX 2.1.0, which can be done through `pip install intel_extension_for_pytorch==2.1.0`. +- [Self-Speculation](Self-Speculation): running BF16 inference for any Huggingface Transformer model with ***self-speculative decoding*** using IPEX-LLM on Intel CPUs +- [EAGLE](EAGLE): running speculative sampling using ***EAGLE*** (Extrapolation Algorithm for Greater Language-model Efficiency) with IPEX-LLM on Intel CPUs diff --git a/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/README.md b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/README.md new file mode 100644 index 00000000..f2abd8f8 --- /dev/null +++ b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/README.md @@ -0,0 +1,15 @@ +# Self-Speculative Decoding for Large Language Model BF16 Inference using IPEX-LLM on Intel CPUs +You can use IPEX-LLM to run BF16 inference for any Huggingface Transformer model with ***self-speculative decoding*** on Intel CPUs. This directory contains example scripts to help you quickly get started to run some popular open-source models using self-speculative decoding. Each model has its own dedicated folder, where you can find detailed instructions on how to install and run it. + +## Verified Hardware Platforms + +- Intel Xeon SPR server + +## Recommended Requirements +To run these examples with IPEX-LLM, we have some recommended requirements for your machine, please refer to [here](../../README.md#system-support) for more information. Make sure you have installed `ipex-llm` before: + +```bash +pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu +``` + +Moreover, install IPEX 2.1.0, which can be done through `pip install intel_extension_for_pytorch==2.1.0`.
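For orientation, the per-model `speculative.py` scripts that this commit moves under `Self-Speculation/` all follow the same basic pattern; a minimal sketch is shown below. It assumes the `ipex_llm.transformers` AutoModel wrapper and the `load_in_low_bit="bf16"` / `speculative=True` loading flags used by those examples; the model path and prompt are placeholders, so check the README in the model folder you actually run for the exact arguments.

```python
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM  # IPEX-LLM drop-in replacement for HF AutoModel

model_path = "meta-llama/Llama-2-7b-chat-hf"  # placeholder; any supported model folder works

# Load the model in BF16 with self-speculative decoding enabled
# (these flags follow the per-model examples under Self-Speculation/ and are assumptions here).
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             optimize_model=True,
                                             torch_dtype=torch.bfloat16,
                                             load_in_low_bit="bf16",
                                             speculative=True,
                                             trust_remote_code=True,
                                             use_cache=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

prompt = "What is speculative decoding?"  # placeholder prompt
with torch.inference_mode():
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    # The draft-and-verify loop runs inside generate(); no separate draft model is required.
    output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```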
diff --git a/python/llm/example/CPU/Speculative-Decoding/baichuan2/README.md b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/baichuan2/README.md similarity index 98% rename from python/llm/example/CPU/Speculative-Decoding/baichuan2/README.md rename to python/llm/example/CPU/Speculative-Decoding/Self-Speculation/baichuan2/README.md index 95a1320c..ece64c34 100644 --- a/python/llm/example/CPU/Speculative-Decoding/baichuan2/README.md +++ b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/baichuan2/README.md @@ -5,7 +5,7 @@ In this directory, you will find examples on how you could run Baichuan2 BF16 in To run these examples with IPEX-LLM on Intel CPUs, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information. ## Example: Predict Tokens using `generate()` API -In the example [speculative.py](./speculative.py), we show a basic use case for a Baichuan2 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs. +In the example [speculative.py](speculative.py), we show a basic use case for a Baichuan2 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs. ### 1. Install We suggest using conda to manage environment: ```bash diff --git a/python/llm/example/CPU/Speculative-Decoding/baichuan2/baichaun2_7b_opt_ipex/modeling_baichuan.ipex b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/baichuan2/baichaun2_7b_opt_ipex/modeling_baichuan.ipex similarity index 100% rename from python/llm/example/CPU/Speculative-Decoding/baichuan2/baichaun2_7b_opt_ipex/modeling_baichuan.ipex rename to python/llm/example/CPU/Speculative-Decoding/Self-Speculation/baichuan2/baichaun2_7b_opt_ipex/modeling_baichuan.ipex diff --git a/python/llm/example/CPU/Speculative-Decoding/baichuan2/baichaun2_7b_opt_ipex/tokenization_baichuan.ipex b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/baichuan2/baichaun2_7b_opt_ipex/tokenization_baichuan.ipex similarity index 100% rename from python/llm/example/CPU/Speculative-Decoding/baichuan2/baichaun2_7b_opt_ipex/tokenization_baichuan.ipex rename to python/llm/example/CPU/Speculative-Decoding/Self-Speculation/baichuan2/baichaun2_7b_opt_ipex/tokenization_baichuan.ipex diff --git a/python/llm/example/CPU/Speculative-Decoding/baichuan2/speculative.py b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/baichuan2/speculative.py similarity index 100% rename from python/llm/example/CPU/Speculative-Decoding/baichuan2/speculative.py rename to python/llm/example/CPU/Speculative-Decoding/Self-Speculation/baichuan2/speculative.py diff --git a/python/llm/example/CPU/Speculative-Decoding/chatglm3/README.md b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/chatglm3/README.md similarity index 96% rename from python/llm/example/CPU/Speculative-Decoding/chatglm3/README.md rename to python/llm/example/CPU/Speculative-Decoding/Self-Speculation/chatglm3/README.md index 9dfe58fd..6574232f 100644 --- a/python/llm/example/CPU/Speculative-Decoding/chatglm3/README.md +++ b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/chatglm3/README.md @@ -3,7 +3,7 @@ In this directory, you will find examples on how you could run ChatGLM3 BF16 inf ## Example: Predict Tokens using `generate()` API -In the example [speculative.py](./speculative.py), we show a basic use case for a ChatGLM3 model to predict the next N tokens using 
`generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs. +In the example [speculative.py](speculative.py), we show a basic use case for a ChatGLM3 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs. ### 1. Install We suggest using conda to manage environment: ```bash diff --git a/python/llm/example/CPU/Speculative-Decoding/chatglm3/speculative.py b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/chatglm3/speculative.py similarity index 100% rename from python/llm/example/CPU/Speculative-Decoding/chatglm3/speculative.py rename to python/llm/example/CPU/Speculative-Decoding/Self-Speculation/chatglm3/speculative.py diff --git a/python/llm/example/CPU/Speculative-Decoding/llama2/README.md b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/llama2/README.md similarity index 98% rename from python/llm/example/CPU/Speculative-Decoding/llama2/README.md rename to python/llm/example/CPU/Speculative-Decoding/Self-Speculation/llama2/README.md index 418b59e5..e5702c12 100644 --- a/python/llm/example/CPU/Speculative-Decoding/llama2/README.md +++ b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/llama2/README.md @@ -5,7 +5,7 @@ In this directory, you will find examples on how you could run LLaMA2 BF16 infer To run these examples with IPEX-LLM on Intel CPUs, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information. ## Example: Predict Tokens using `generate()` API -In the example [speculative.py](./speculative.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs. +In the example [speculative.py](speculative.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs. ### 1. Install We suggest using conda to manage environment: ```bash diff --git a/python/llm/example/CPU/Speculative-Decoding/llama2/speculative.py b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/llama2/speculative.py similarity index 100% rename from python/llm/example/CPU/Speculative-Decoding/llama2/speculative.py rename to python/llm/example/CPU/Speculative-Decoding/Self-Speculation/llama2/speculative.py diff --git a/python/llm/example/CPU/Speculative-Decoding/llama3/README.md b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/llama3/README.md similarity index 97% rename from python/llm/example/CPU/Speculative-Decoding/llama3/README.md rename to python/llm/example/CPU/Speculative-Decoding/Self-Speculation/llama3/README.md index 0e0a83bb..84a0df2b 100644 --- a/python/llm/example/CPU/Speculative-Decoding/llama3/README.md +++ b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/llama3/README.md @@ -8,7 +8,7 @@ To run these examples with IPEX-LLM on Intel CPUs, we have some recommended requ ## Example: Predict Tokens using `generate()` API -In the example [speculative.py](./speculative.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs. +In the example [speculative.py](speculative.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs. ### 1. 
Install diff --git a/python/llm/example/CPU/Speculative-Decoding/llama3/speculative.py b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/llama3/speculative.py similarity index 100% rename from python/llm/example/CPU/Speculative-Decoding/llama3/speculative.py rename to python/llm/example/CPU/Speculative-Decoding/Self-Speculation/llama3/speculative.py diff --git a/python/llm/example/CPU/Speculative-Decoding/mistral/README.md b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/mistral/README.md similarity index 98% rename from python/llm/example/CPU/Speculative-Decoding/mistral/README.md rename to python/llm/example/CPU/Speculative-Decoding/Self-Speculation/mistral/README.md index 5cb56942..b58007a6 100644 --- a/python/llm/example/CPU/Speculative-Decoding/mistral/README.md +++ b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/mistral/README.md @@ -5,7 +5,7 @@ In this directory, you will find examples on how you could run Mistral BF16 infe To run these examples with IPEX-LLM on Intel CPUs, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information. ## Example: Predict Tokens using `generate()` API -In the example [speculative.py](./speculative.py), we show a basic use case for a Baichuan2 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs. +In the example [speculative.py](speculative.py), we show a basic use case for a Baichuan2 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs. ### 1. Install We suggest using conda to manage environment: ```bash diff --git a/python/llm/example/CPU/Speculative-Decoding/mistral/speculative.py b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/mistral/speculative.py similarity index 100% rename from python/llm/example/CPU/Speculative-Decoding/mistral/speculative.py rename to python/llm/example/CPU/Speculative-Decoding/Self-Speculation/mistral/speculative.py diff --git a/python/llm/example/CPU/Speculative-Decoding/mixtral/README.md b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/mixtral/README.md similarity index 98% rename from python/llm/example/CPU/Speculative-Decoding/mixtral/README.md rename to python/llm/example/CPU/Speculative-Decoding/Self-Speculation/mixtral/README.md index fa1ccd3b..ad0950a6 100644 --- a/python/llm/example/CPU/Speculative-Decoding/mixtral/README.md +++ b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/mixtral/README.md @@ -5,7 +5,7 @@ In this directory, you will find examples on how you could run Mixtral BF16 infe To run these examples with IPEX-LLM on Intel CPUs, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information. ## Example: Predict Tokens using `generate()` API -In the example [speculative.py](./speculative.py), we show a basic use case for a Baichuan2 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs. +In the example [speculative.py](speculative.py), we show a basic use case for a Baichuan2 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs. ### 1. 
Install We suggest using conda to manage environment: ```bash diff --git a/python/llm/example/CPU/Speculative-Decoding/mixtral/speculative.py b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/mixtral/speculative.py similarity index 100% rename from python/llm/example/CPU/Speculative-Decoding/mixtral/speculative.py rename to python/llm/example/CPU/Speculative-Decoding/Self-Speculation/mixtral/speculative.py diff --git a/python/llm/example/CPU/Speculative-Decoding/qwen/README.md b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/qwen/README.md similarity index 99% rename from python/llm/example/CPU/Speculative-Decoding/qwen/README.md rename to python/llm/example/CPU/Speculative-Decoding/Self-Speculation/qwen/README.md index f6582640..5983a57d 100644 --- a/python/llm/example/CPU/Speculative-Decoding/qwen/README.md +++ b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/qwen/README.md @@ -3,7 +3,7 @@ In this directory, you will find examples on how you could run Qwen BF16 inferne self-speculative decoding using IPEX-LLM on Intel CPUs. For illustration purposes, we utilize the [Qwen/Qwen-7B-Chat](https://huggingface.co/Qwen/Qwen-7B-Chat) and [Qwen/Qwen-14B-Chat](https://huggingface.co/Qwen/Qwen-14B-Chat) and [Qwen/Qwen-72B-Chat](https://huggingface.co/Qwen/Qwen-72B-Chat) as reference Qwen models. ## Example: Predict Tokens using `generate()` API -In the example [speculative.py](./speculative.py), we show a basic use case for a Qwen model to +In the example [speculative.py](speculative.py), we show a basic use case for a Qwen model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs. ### 1. Install We suggest using conda to manage environment: diff --git a/python/llm/example/CPU/Speculative-Decoding/qwen/speculative.py b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/qwen/speculative.py similarity index 100% rename from python/llm/example/CPU/Speculative-Decoding/qwen/speculative.py rename to python/llm/example/CPU/Speculative-Decoding/Self-Speculation/qwen/speculative.py diff --git a/python/llm/example/CPU/Speculative-Decoding/starcoder/README.md b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/starcoder/README.md similarity index 93% rename from python/llm/example/CPU/Speculative-Decoding/starcoder/README.md rename to python/llm/example/CPU/Speculative-Decoding/Self-Speculation/starcoder/README.md index d061bdad..6c1cabdc 100644 --- a/python/llm/example/CPU/Speculative-Decoding/starcoder/README.md +++ b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/starcoder/README.md @@ -5,7 +5,7 @@ In this directory, you will find examples on how you could run Starcoder BF16 in To run these examples with IPEX-LLM on Intel CPUs, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information. ## Example: Predict Tokens using `generate()` API -In the example [speculative.py](./speculative.py), we show a basic use case for a Starcoder model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs. +In the example [speculative.py](speculative.py), we show a basic use case for a Starcoder model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs. ### 1. 
Install We suggest using conda to manage environment: ```bash diff --git a/python/llm/example/CPU/Speculative-Decoding/starcoder/speculative.py b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/starcoder/speculative.py similarity index 100% rename from python/llm/example/CPU/Speculative-Decoding/starcoder/speculative.py rename to python/llm/example/CPU/Speculative-Decoding/Self-Speculation/starcoder/speculative.py diff --git a/python/llm/example/CPU/Speculative-Decoding/vicuna/README.md b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/vicuna/README.md similarity index 98% rename from python/llm/example/CPU/Speculative-Decoding/vicuna/README.md rename to python/llm/example/CPU/Speculative-Decoding/Self-Speculation/vicuna/README.md index c97e6baa..687483a7 100644 --- a/python/llm/example/CPU/Speculative-Decoding/vicuna/README.md +++ b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/vicuna/README.md @@ -5,7 +5,7 @@ In this directory, you will find examples on how you could run Vicuna BF16 infer To run these examples with IPEX-LLM on Intel CPUs, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information. ## Example: Predict Tokens using `generate()` API -In the example [speculative.py](./speculative.py), we show a basic use case for a Vicuna model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs. +In the example [speculative.py](speculative.py), we show a basic use case for a Vicuna model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs. ### 1. Install We suggest using conda to manage environment: ```bash diff --git a/python/llm/example/CPU/Speculative-Decoding/vicuna/speculative.py b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/vicuna/speculative.py similarity index 100% rename from python/llm/example/CPU/Speculative-Decoding/vicuna/speculative.py rename to python/llm/example/CPU/Speculative-Decoding/Self-Speculation/vicuna/speculative.py diff --git a/python/llm/example/CPU/Speculative-Decoding/ziya/README.md b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/ziya/README.md similarity index 92% rename from python/llm/example/CPU/Speculative-Decoding/ziya/README.md rename to python/llm/example/CPU/Speculative-Decoding/Self-Speculation/ziya/README.md index 6fabf672..9849d8d2 100644 --- a/python/llm/example/CPU/Speculative-Decoding/ziya/README.md +++ b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/ziya/README.md @@ -5,7 +5,7 @@ In this directory, you will find examples on how you could run Ziya BF16 inferen To run the example with IPEX-LLM on Intel CPUs, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information. ## Example: Predict Tokens using `generate()` API -In the example [speculative.py](./speculative.py), we show a basic use case for a Ziya model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs. +In the example [speculative.py](speculative.py), we show a basic use case for a Ziya model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs. ### 1. 
Install We suggest using conda to manage environment: ```bash diff --git a/python/llm/example/CPU/Speculative-Decoding/ziya/speculative.py b/python/llm/example/CPU/Speculative-Decoding/Self-Speculation/ziya/speculative.py similarity index 100% rename from python/llm/example/CPU/Speculative-Decoding/ziya/speculative.py rename to python/llm/example/CPU/Speculative-Decoding/Self-Speculation/ziya/speculative.py diff --git a/python/llm/example/GPU/Speculative-Decoding/Eagle/README.md b/python/llm/example/GPU/Speculative-Decoding/EAGLE/README.md similarity index 80% rename from python/llm/example/GPU/Speculative-Decoding/Eagle/README.md rename to python/llm/example/GPU/Speculative-Decoding/EAGLE/README.md index 16c238e7..611c86c6 100644 --- a/python/llm/example/GPU/Speculative-Decoding/Eagle/README.md +++ b/python/llm/example/GPU/Speculative-Decoding/EAGLE/README.md @@ -1,8 +1,16 @@ -# Eagle - Speculative Sampling using IPEX-LLM on Intel GPUs +# EAGLE - Speculative Sampling using IPEX-LLM on Intel GPUs In this directory, you will find the examples on how IPEX-LLM accelerate inference with speculative sampling using EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a speculative sampling method that improves text generation speed) on Intel GPUs. See [here](https://arxiv.org/abs/2401.15077) to view the paper and [here](https://github.com/SafeAILab/EAGLE) for more info on EAGLE code. ## Requirements -To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information. +To apply Intel GPU acceleration, there are several steps for tools installation and environment preparation. See the [GPU installation guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html) for more details. + +Step 1, only Linux system is supported now, Ubuntu 22.04 is preferred. + +Step 2, please refer to our [driver installation](https://dgpu-docs.intel.com/driver/installation.html) for general purpose GPU capabilities. > **Note**: IPEX 2.1.10+xpu requires Intel GPU Driver version >= stable_775_20_20231219. + +Step 3, you also need to download and install [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html). OneMKL and DPC++ compiler are needed, others are optional. +> **Note**: IPEX 2.1.10+xpu requires Intel® oneAPI Base Toolkit's version == 2024.0.
### Verified Hardware Platforms diff --git a/python/llm/example/GPU/Speculative-Decoding/Eagle/data/mt_bench/question.jsonl b/python/llm/example/GPU/Speculative-Decoding/EAGLE/data/mt_bench/question.jsonl similarity index 100% rename from python/llm/example/GPU/Speculative-Decoding/Eagle/data/mt_bench/question.jsonl rename to python/llm/example/GPU/Speculative-Decoding/EAGLE/data/mt_bench/question.jsonl diff --git a/python/llm/example/GPU/Speculative-Decoding/Eagle/evaluation/gen_ea_answer_llama2chat.py b/python/llm/example/GPU/Speculative-Decoding/EAGLE/evaluation/gen_ea_answer_llama2chat.py similarity index 99% rename from python/llm/example/GPU/Speculative-Decoding/Eagle/evaluation/gen_ea_answer_llama2chat.py rename to python/llm/example/GPU/Speculative-Decoding/EAGLE/evaluation/gen_ea_answer_llama2chat.py index 93e7b1a6..4b461652 100644 --- a/python/llm/example/GPU/Speculative-Decoding/Eagle/evaluation/gen_ea_answer_llama2chat.py +++ b/python/llm/example/GPU/Speculative-Decoding/EAGLE/evaluation/gen_ea_answer_llama2chat.py @@ -211,7 +211,7 @@ def get_model_answers( ) if enable_ipex_llm: # single line of change to enable ipex-llm - model = optimize_model(model, optimize_llm=False) + model = optimize_model(model, low_bit='sym_int4', optimize_llm=False) model.to("xpu") tokenizer = model.get_tokenizer() diff --git a/python/llm/example/GPU/Speculative-Decoding/Eagle/evaluation/speed.py b/python/llm/example/GPU/Speculative-Decoding/EAGLE/evaluation/speed.py similarity index 100% rename from python/llm/example/GPU/Speculative-Decoding/Eagle/evaluation/speed.py rename to python/llm/example/GPU/Speculative-Decoding/EAGLE/evaluation/speed.py diff --git a/python/llm/example/GPU/Speculative-Decoding/Eagle/requirements.txt b/python/llm/example/GPU/Speculative-Decoding/EAGLE/requirements.txt similarity index 100% rename from python/llm/example/GPU/Speculative-Decoding/Eagle/requirements.txt rename to python/llm/example/GPU/Speculative-Decoding/EAGLE/requirements.txt diff --git a/python/llm/example/GPU/Speculative-Decoding/README.md b/python/llm/example/GPU/Speculative-Decoding/README.md index bb003532..240ea446 100644 --- a/python/llm/example/GPU/Speculative-Decoding/README.md +++ b/python/llm/example/GPU/Speculative-Decoding/README.md @@ -1,26 +1,6 @@ -# Self-Speculative Decoding for Large Language Model FP16 Inference using IPEX-LLM on Intel GPUs -You can use IPEX-LLM to run FP16 inference for any Huggingface Transformer model with ***self-speculative decoding*** on Intel GPUs. This directory contains example scripts to help you quickly get started to run some popular open-source models using self-speculative decoding. Each model has its own dedicated folder, where you can find detailed instructions on how to install and run it. +# Speculative-Decoding Examples on Intel GPU -## Verified Hardware Platforms +This folder contains examples of running speculative decoding with IPEX-LLM on Intel GPU: -- Intel Data Center GPU Max Series - -## Recommended Requirements -To apply Intel GPU acceleration, there’re several steps for tools installation and environment preparation. See the [GPU installation guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html) for mode details. - -Step 1, only Linux system is supported now, Ubuntu 22.04 is prefered. - -Step 2, please refer to our [driver installation](https://dgpu-docs.intel.com/driver/installation.html) for general purpose GPU capabilities.
-> **Note**: IPEX 2.1.10+xpu requires Intel GPU Driver version >= stable_775_20_20231219. - -Step 3, you also need to download and install [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html). OneMKL and DPC++ compiler are needed, others are optional. -> **Note**: IPEX 2.1.10+xpu requires Intel® oneAPI Base Toolkit's version == 2024.0. - -## Best Known Configuration on Linux - -For optimal performance on Intel Data Center GPU Max Series, it is recommended to set several environment variables. -```bash -export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so -export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 -export ENABLE_SDP_FUSION=1 -``` +- [Self-Speculation](Self-Speculation): running FP16 inference for any Huggingface Transformer model with ***self-speculative decoding*** using IPEX-LLM on Intel GPUs +- [EAGLE](EAGLE): running speculative sampling using ***EAGLE*** (Extrapolation Algorithm for Greater Language-model Efficiency) with IPEX-LLM on Intel GPUs diff --git a/python/llm/example/GPU/Speculative-Decoding/Self-Speculation/README.md b/python/llm/example/GPU/Speculative-Decoding/Self-Speculation/README.md new file mode 100644 index 00000000..a49326f5 --- /dev/null +++ b/python/llm/example/GPU/Speculative-Decoding/Self-Speculation/README.md @@ -0,0 +1,26 @@ +# Self-Speculative Decoding for Large Language Model FP16 Inference using IPEX-LLM on Intel GPUs +You can use IPEX-LLM to run FP16 inference for any Huggingface Transformer model with ***self-speculative decoding*** on Intel GPUs. This directory contains example scripts to help you quickly get started to run some popular open-source models using self-speculative decoding. Each model has its own dedicated folder, where you can find detailed instructions on how to install and run it. + +## Verified Hardware Platforms + +- Intel Data Center GPU Max Series + +## Recommended Requirements +To apply Intel GPU acceleration, there are several steps for tools installation and environment preparation. See the [GPU installation guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html) for more details. + +Step 1, only Linux system is supported now, Ubuntu 22.04 is preferred. + +Step 2, please refer to our [driver installation](https://dgpu-docs.intel.com/driver/installation.html) for general purpose GPU capabilities. +> **Note**: IPEX 2.1.10+xpu requires Intel GPU Driver version >= stable_775_20_20231219. + +Step 3, you also need to download and install [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html). OneMKL and DPC++ compiler are needed, others are optional. +> **Note**: IPEX 2.1.10+xpu requires Intel® oneAPI Base Toolkit's version == 2024.0. + +## Best Known Configuration on Linux + +For optimal performance on Intel Data Center GPU Max Series, it is recommended to set several environment variables.
+```bash +export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so +export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 +export ENABLE_SDP_FUSION=1 +``` diff --git a/python/llm/example/GPU/Speculative-Decoding/baichuan2/README.md b/python/llm/example/GPU/Speculative-Decoding/Self-Speculation/baichuan2/README.md similarity index 97% rename from python/llm/example/GPU/Speculative-Decoding/baichuan2/README.md rename to python/llm/example/GPU/Speculative-Decoding/Self-Speculation/baichuan2/README.md index 2f9fd573..dd869e70 100644 --- a/python/llm/example/GPU/Speculative-Decoding/baichuan2/README.md +++ b/python/llm/example/GPU/Speculative-Decoding/Self-Speculation/baichuan2/README.md @@ -5,7 +5,7 @@ In this directory, you will find examples on how you could run Baichuan2 FP16 in To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information. ## Example: Predict Tokens using `generate()` API -In the example [speculative.py](./speculative.py), we show a basic use case for a Baichuan2 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel GPUs. +In the example [speculative.py](speculative.py), we show a basic use case for a Baichuan2 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel GPUs. ### 1. Install We suggest using conda to manage environment: ```bash diff --git a/python/llm/example/GPU/Speculative-Decoding/baichuan2/speculative.py b/python/llm/example/GPU/Speculative-Decoding/Self-Speculation/baichuan2/speculative.py similarity index 100% rename from python/llm/example/GPU/Speculative-Decoding/baichuan2/speculative.py rename to python/llm/example/GPU/Speculative-Decoding/Self-Speculation/baichuan2/speculative.py diff --git a/python/llm/example/GPU/Speculative-Decoding/chatglm3/README.md b/python/llm/example/GPU/Speculative-Decoding/Self-Speculation/chatglm3/README.md similarity index 96% rename from python/llm/example/GPU/Speculative-Decoding/chatglm3/README.md rename to python/llm/example/GPU/Speculative-Decoding/Self-Speculation/chatglm3/README.md index 8766bf3d..ceba7d32 100644 --- a/python/llm/example/GPU/Speculative-Decoding/chatglm3/README.md +++ b/python/llm/example/GPU/Speculative-Decoding/Self-Speculation/chatglm3/README.md @@ -5,7 +5,7 @@ In this directory, you will find examples on how you could run ChatGLM3 FP16 inf To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information. ## Example: Predict Tokens using `generate()` API -In the example [speculative.py](./speculative.py), we show a basic use case for a ChatGLM3 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel GPUs. +In the example [speculative.py](speculative.py), we show a basic use case for a ChatGLM3 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel GPUs. ### 1. 
Install We suggest using conda to manage environment: ```bash diff --git a/python/llm/example/GPU/Speculative-Decoding/chatglm3/speculative.py b/python/llm/example/GPU/Speculative-Decoding/Self-Speculation/chatglm3/speculative.py similarity index 100% rename from python/llm/example/GPU/Speculative-Decoding/chatglm3/speculative.py rename to python/llm/example/GPU/Speculative-Decoding/Self-Speculation/chatglm3/speculative.py diff --git a/python/llm/example/GPU/Speculative-Decoding/gpt-j/README.md b/python/llm/example/GPU/Speculative-Decoding/Self-Speculation/gpt-j/README.md similarity index 97% rename from python/llm/example/GPU/Speculative-Decoding/gpt-j/README.md rename to python/llm/example/GPU/Speculative-Decoding/Self-Speculation/gpt-j/README.md index 9f82533a..059fdc42 100644 --- a/python/llm/example/GPU/Speculative-Decoding/gpt-j/README.md +++ b/python/llm/example/GPU/Speculative-Decoding/Self-Speculation/gpt-j/README.md @@ -5,7 +5,7 @@ In this directory, you will find examples on how you could run GPT-J FP16 infern To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information. ## Example: Predict Tokens using `generate()` API -In the example [speculative.py](./speculative.py), we show a basic use case for a GPT-J model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel GPUs. +In the example [speculative.py](speculative.py), we show a basic use case for a GPT-J model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel GPUs. ### 1. Install We suggest using conda to manage environment: ```bash diff --git a/python/llm/example/GPU/Speculative-Decoding/gpt-j/speculative.py b/python/llm/example/GPU/Speculative-Decoding/Self-Speculation/gpt-j/speculative.py similarity index 100% rename from python/llm/example/GPU/Speculative-Decoding/gpt-j/speculative.py rename to python/llm/example/GPU/Speculative-Decoding/Self-Speculation/gpt-j/speculative.py diff --git a/python/llm/example/GPU/Speculative-Decoding/llama2/README.md b/python/llm/example/GPU/Speculative-Decoding/Self-Speculation/llama2/README.md similarity index 98% rename from python/llm/example/GPU/Speculative-Decoding/llama2/README.md rename to python/llm/example/GPU/Speculative-Decoding/Self-Speculation/llama2/README.md index 38668c96..31a1641d 100644 --- a/python/llm/example/GPU/Speculative-Decoding/llama2/README.md +++ b/python/llm/example/GPU/Speculative-Decoding/Self-Speculation/llama2/README.md @@ -5,7 +5,7 @@ In this directory, you will find examples on how you could run LLaMA2 FP16 infer To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information. ## Example: Predict Tokens using `generate()` API -In the example [speculative.py](./speculative.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel GPUs. +In the example [speculative.py](speculative.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel GPUs. ### 1. 
Install We suggest using conda to manage environment: ```bash diff --git a/python/llm/example/GPU/Speculative-Decoding/llama2/speculative.py b/python/llm/example/GPU/Speculative-Decoding/Self-Speculation/llama2/speculative.py similarity index 100% rename from python/llm/example/GPU/Speculative-Decoding/llama2/speculative.py rename to python/llm/example/GPU/Speculative-Decoding/Self-Speculation/llama2/speculative.py diff --git a/python/llm/example/GPU/Speculative-Decoding/mistral/README.md b/python/llm/example/GPU/Speculative-Decoding/Self-Speculation/mistral/README.md similarity index 98% rename from python/llm/example/GPU/Speculative-Decoding/mistral/README.md rename to python/llm/example/GPU/Speculative-Decoding/Self-Speculation/mistral/README.md index 12fbeb41..1b50d5fb 100644 --- a/python/llm/example/GPU/Speculative-Decoding/mistral/README.md +++ b/python/llm/example/GPU/Speculative-Decoding/Self-Speculation/mistral/README.md @@ -5,7 +5,7 @@ In this directory, you will find examples on how you could run Mistral FP16 infe To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information. ## Example: Predict Tokens using `generate()` API -In the example [speculative.py](./speculative.py), we show a basic use case for a Mistral model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel GPUs. +In the example [speculative.py](speculative.py), we show a basic use case for a Mistral model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel GPUs. ### 1. Install We suggest using conda to manage environment: ```bash diff --git a/python/llm/example/GPU/Speculative-Decoding/mistral/speculative.py b/python/llm/example/GPU/Speculative-Decoding/Self-Speculation/mistral/speculative.py similarity index 100% rename from python/llm/example/GPU/Speculative-Decoding/mistral/speculative.py rename to python/llm/example/GPU/Speculative-Decoding/Self-Speculation/mistral/speculative.py diff --git a/python/llm/example/GPU/Speculative-Decoding/qwen/README.md b/python/llm/example/GPU/Speculative-Decoding/Self-Speculation/qwen/README.md similarity index 97% rename from python/llm/example/GPU/Speculative-Decoding/qwen/README.md rename to python/llm/example/GPU/Speculative-Decoding/Self-Speculation/qwen/README.md index 515aaf7b..1fabe0d4 100644 --- a/python/llm/example/GPU/Speculative-Decoding/qwen/README.md +++ b/python/llm/example/GPU/Speculative-Decoding/Self-Speculation/qwen/README.md @@ -5,7 +5,7 @@ In this directory, you will find examples on how you could run Qwen FP16 inferne To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information. ## Example: Predict Tokens using `generate()` API -In the example [speculative.py](./speculative.py), we show a basic use case for a Qwen model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel GPUs. +In the example [speculative.py](speculative.py), we show a basic use case for a Qwen model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel GPUs. ### 1. 
Install We suggest using conda to manage environment: ```bash diff --git a/python/llm/example/GPU/Speculative-Decoding/qwen/speculative.py b/python/llm/example/GPU/Speculative-Decoding/Self-Speculation/qwen/speculative.py similarity index 100% rename from python/llm/example/GPU/Speculative-Decoding/qwen/speculative.py rename to python/llm/example/GPU/Speculative-Decoding/Self-Speculation/qwen/speculative.py
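Looking across the EAGLE evaluation diffs above, the only functional change is the extra `low_bit='sym_int4'` argument passed to `optimize_model`. A minimal, self-contained sketch of that enablement pattern is shown below; the model name and dtype are placeholder assumptions, the real `gen_ea_answer_llama2chat.py` scripts also construct the EAGLE draft head around the base model, and the `.to("xpu")` line applies only to the GPU variant.

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device (GPU variant only)
from transformers import AutoModelForCausalLM
from ipex_llm import optimize_model

# Placeholder base model; the evaluation scripts load the EAGLE checkpoint around it.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                             torch_dtype=torch.float16,
                                             trust_remote_code=True)

# Single line of change from the diffs above: quantize the base weights to sym_int4
# while leaving the rest of the EAGLE pipeline untouched (optimize_llm=False).
model = optimize_model(model, low_bit='sym_int4', optimize_llm=False)
model = model.to("xpu")  # the CPU examples simply omit this device move
```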