Reconstruct Speculative Decoding example directory (#11136)

* update

* update

* update
Jiao Wang 2024-05-29 13:15:27 -07:00 committed by GitHub
parent 2299698b45
commit 93146b9433
48 changed files with 79 additions and 59 deletions

View file

@ -1,8 +1,8 @@
# Eagle - Speculative Sampling using IPEX-LLM on Intel CPUs
# EAGLE - Speculative Sampling using IPEX-LLM on Intel CPUs
In this directory, you will find examples of how IPEX-LLM accelerates inference with speculative sampling using EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a speculative sampling method that improves text generation speed, on Intel CPUs. See [here](https://arxiv.org/abs/2401.15077) for the paper and [here](https://github.com/SafeAILab/EAGLE) for more information on the EAGLE code.
## Requirements
To run these examples with IPEX-LLM, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
To run these examples with IPEX-LLM, we have some recommended requirements for your machine; please refer to [here](../../README.md#system-support) for more information. Make sure you have installed `ipex-llm` before:
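Presumably this is the same CPU install command used in the top-level Speculative-Decoding README:

```bash
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
```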
## Example - EAGLE Speculative Sampling with IPEX-LLM on MT-bench
In this example, we run inference for a Llama2 model to showcase the speed of EAGLE with IPEX-LLM on MT-bench data on Intel CPUs.
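A hypothetical invocation sketch is shown below; the script name and flags are assumptions modeled on EAGLE's upstream evaluation scripts and on the `enable_ipex_llm` switch visible in the code change that follows, so check the scripts shipped in this directory for the exact interface.

```bash
# Assumed script and flag names; verify against the files in this directory.
python evaluation/gen_ea_answer_llama2chat.py \
    --ea-model-path yuhuili/EAGLE-llama2-chat-7B \
    --base-model-path meta-llama/Llama-2-7b-chat-hf \
    --enable-ipex-llm
```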

View file

@ -199,7 +199,7 @@ def get_model_answers(
if enable_ipex_llm:
# single line of change to enable ipex-llm
model = optimize_model(model, optimize_llm=False)
model = optimize_model(model, low_bit='sym_int4', optimize_llm=False)
tokenizer = model.get_tokenizer()
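For context, below is a minimal sketch of the pattern this one-line change relies on. The plain `transformers` loader and the model id are assumptions for illustration only; the example itself builds an EAGLE model object that exposes `get_tokenizer()`, as seen above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from ipex_llm import optimize_model  # IPEX-LLM's one-line optimization entry point

model_path = "meta-llama/Llama-2-7b-chat-hf"  # hypothetical model id, for illustration only
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Same call as in the change above: quantize weights to symmetric INT4 and
# skip the extra LLM-specific optimizations (optimize_llm=False).
model = optimize_model(model, low_bit='sym_int4', optimize_llm=False)
```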

View file

@ -1,15 +1,6 @@
# Self-Speculative Decoding for Large Language Model BF16 Inference using IPEX-LLM on Intel CPUs
You can use IPEX-LLM to run BF16 inference for any Huggingface Transformer model with ***self-speculative decoding*** on Intel CPUs. This directory contains example scripts to help you quickly get started to run some popular open-source models using self-speculative decoding. Each model has its own dedicated folder, where you can find detailed instructions on how to install and run it.
# Speculative-Decoding Examples on Intel CPU
## Verified Hardware Platforms
This folder contains examples of running Speculative Decoding with IPEX-LLM on Intel CPU:
- Intel Xeon SPR server
## Recommended Requirements
To run these examples with IPEX-LLM, we have some recommended requirements for your machine; please refer to [here](../README.md#system-support) for more information. Make sure you have installed `ipex-llm` before:
```bash
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
```
Moreover, install IPEX 2.1.0, which can be done through `pip install intel_extension_for_pytorch==2.1.0`.
- [Self-Speculation](Self-Speculation): running BF16 inference for Huggingface Transformer models with ***self-speculative decoding*** using IPEX-LLM on Intel CPUs
- [EAGLE](EAGLE): running speculative sampling using ***EAGLE*** (Extrapolation Algorithm for Greater Language-model Efficiency) with IPEX-LLM on Intel CPUs

View file

@ -0,0 +1,15 @@
# Self-Speculative Decoding for Large Language Model BF16 Inference using IPEX-LLM on Intel CPUs
You can use IPEX-LLM to run BF16 inference for any Huggingface Transformer model with ***self-speculative decoding*** on Intel CPUs. This directory contains example scripts to help you quickly get started to run some popular open-source models using self-speculative decoding. Each model has its own dedicated folder, where you can find detailed instructions on how to install and run it.
## Verified Hardware Platforms
- Intel Xeon SPR server
## Recommended Requirements
To run these examples with IPEX-LLM, we have some recommended requirements for your machine; please refer to [here](../../README.md#system-support) for more information. Make sure you have installed `ipex-llm` before:
```bash
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
```
Moreover, install IPEX 2.1.0, which can be done through `pip install intel_extension_for_pytorch==2.1.0`.
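Each per-model folder ships a `speculative.py`; a minimal sketch of the common pattern is given below. The keyword arguments, in particular `speculative=True` and `load_in_low_bit="bf16"`, are assumptions based on IPEX-LLM's self-speculative decoding examples, and the model id is hypothetical; treat the per-model scripts as authoritative.

```python
import torch
from ipex_llm.transformers import AutoModelForCausalLM  # IPEX-LLM's drop-in replacement
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-chat-hf"  # hypothetical model id, for illustration only

# Load the model in BF16 and turn on self-speculative decoding (assumed flags).
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             torch_dtype=torch.bfloat16,
                                             load_in_low_bit="bf16",
                                             speculative=True,
                                             use_cache=True,
                                             trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

inputs = tokenizer("What is AI?", return_tensors="pt")
with torch.inference_mode():
    # The drafted-and-verified tokens come back through the ordinary generate() API.
    output = model.generate(inputs.input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```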

View file

@ -5,7 +5,7 @@ In this directory, you will find examples on how you could run Baichuan2 BF16 in
To run these examples with IPEX-LLM on Intel CPUs, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
## Example: Predict Tokens using `generate()` API
In the example [speculative.py](./speculative.py), we show a basic use case for a Baichuan2 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs.
In the example [speculative.py](speculative.py), we show a basic use case for a Baichuan2 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs.
### 1. Install
We suggest using conda to manage environment:
```bash

View file

@ -3,7 +3,7 @@ In this directory, you will find examples on how you could run ChatGLM3 BF16 inf
## Example: Predict Tokens using `generate()` API
In the example [speculative.py](./speculative.py), we show a basic use case for a ChatGLM3 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs.
In the example [speculative.py](speculative.py), we show a basic use case for a ChatGLM3 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs.
### 1. Install
We suggest using conda to manage environment:
```bash

View file

@ -5,7 +5,7 @@ In this directory, you will find examples on how you could run LLaMA2 BF16 infer
To run these examples with IPEX-LLM on Intel CPUs, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
## Example: Predict Tokens using `generate()` API
In the example [speculative.py](./speculative.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs.
In the example [speculative.py](speculative.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs.
### 1. Install
We suggest using conda to manage environment:
```bash

View file

@ -8,7 +8,7 @@ To run these examples with IPEX-LLM on Intel CPUs, we have some recommended requ
## Example: Predict Tokens using `generate()` API
In the example [speculative.py](./speculative.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs.
In the example [speculative.py](speculative.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs.
### 1. Install

View file

@ -5,7 +5,7 @@ In this directory, you will find examples on how you could run Mistral BF16 infe
To run these examples with IPEX-LLM on Intel CPUs, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
## Example: Predict Tokens using `generate()` API
In the example [speculative.py](./speculative.py), we show a basic use case for a Mistral model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs.
In the example [speculative.py](speculative.py), we show a basic use case for a Mistral model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs.
### 1. Install
We suggest using conda to manage environment:
```bash

View file

@ -5,7 +5,7 @@ In this directory, you will find examples on how you could run Mixtral BF16 infe
To run these examples with IPEX-LLM on Intel CPUs, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
## Example: Predict Tokens using `generate()` API
In the example [speculative.py](./speculative.py), we show a basic use case for a Mixtral model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs.
In the example [speculative.py](speculative.py), we show a basic use case for a Mixtral model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs.
### 1. Install
We suggest using conda to manage environment:
```bash

View file

@ -3,7 +3,7 @@ In this directory, you will find examples on how you could run Qwen BF16 inferne
self-speculative decoding using IPEX-LLM on Intel CPUs. For illustration purposes, we utilize [Qwen/Qwen-7B-Chat](https://huggingface.co/Qwen/Qwen-7B-Chat), [Qwen/Qwen-14B-Chat](https://huggingface.co/Qwen/Qwen-14B-Chat), and [Qwen/Qwen-72B-Chat](https://huggingface.co/Qwen/Qwen-72B-Chat) as reference Qwen models.
## Example: Predict Tokens using `generate()` API
In the example [speculative.py](./speculative.py), we show a basic use case for a Qwen model to
In the example [speculative.py](speculative.py), we show a basic use case for a Qwen model to
predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs.
### 1. Install
We suggest using conda to manage environment:

View file

@ -5,7 +5,7 @@ In this directory, you will find examples on how you could run Starcoder BF16 in
To run these examples with IPEX-LLM on Intel CPUs, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
## Example: Predict Tokens using `generate()` API
In the example [speculative.py](./speculative.py), we show a basic use case for a Starcoder model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs.
In the example [speculative.py](speculative.py), we show a basic use case for a Starcoder model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs.
### 1. Install
We suggest using conda to manage environment:
```bash

View file

@ -5,7 +5,7 @@ In this directory, you will find examples on how you could run Vicuna BF16 infer
To run these examples with IPEX-LLM on Intel CPUs, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
## Example: Predict Tokens using `generate()` API
In the example [speculative.py](./speculative.py), we show a basic use case for a Vicuna model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs.
In the example [speculative.py](speculative.py), we show a basic use case for a Vicuna model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs.
### 1. Install
We suggest using conda to manage environment:
```bash

View file

@ -5,7 +5,7 @@ In this directory, you will find examples on how you could run Ziya BF16 inferen
To run the example with IPEX-LLM on Intel CPUs, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
## Example: Predict Tokens using `generate()` API
In the example [speculative.py](./speculative.py), we show a basic use case for a Ziya model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs.
In the example [speculative.py](speculative.py), we show a basic use case for a Ziya model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs.
### 1. Install
We suggest using conda to manage environment:
```bash

View file

@ -1,8 +1,16 @@
# Eagle - Speculative Sampling using IPEX-LLM on Intel GPUs
# EAGLE - Speculative Sampling using IPEX-LLM on Intel GPUs
In this directory, you will find examples of how IPEX-LLM accelerates inference with speculative sampling using EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a speculative sampling method that improves text generation speed, on Intel GPUs. See [here](https://arxiv.org/abs/2401.15077) for the paper and [here](https://github.com/SafeAILab/EAGLE) for more information on the EAGLE code.
## Requirements
To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine; please refer to [here](../../../README.md#requirements) for more information.
To apply Intel GPU acceleration, there are several steps for tools installation and environment preparation. See the [GPU installation guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html) for more details.
Step 1, only Linux systems are supported for now; Ubuntu 22.04 is preferred.
Step 2, please refer to our [driver installation](https://dgpu-docs.intel.com/driver/installation.html) for general purpose GPU capabilities.
> **Note**: IPEX 2.1.10+xpu requires Intel GPU Driver version >= stable_775_20_20231219.
Step 3, you also need to download and install [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html). OneMKL and the DPC++ compiler are needed; the others are optional.
> **Note**: IPEX 2.1.10+xpu requires Intel® oneAPI Base Toolkit's version == 2024.0.
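Putting the steps above together, a typical environment setup might look like the sketch below. The `ipex-llm[xpu]` extra and the wheel index URL are assumptions based on the linked GPU installation guide, and the oneAPI path assumes the default install location.

```bash
# Activate oneAPI (default install location assumed).
source /opt/intel/oneapi/setvars.sh

# Install ipex-llm with Intel GPU (XPU) support; index URL assumed from the GPU installation guide.
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```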
### Verified Hardware Platforms

View file

@ -211,7 +211,7 @@ def get_model_answers(
)
if enable_ipex_llm:
# single line of change to enable ipex-llm
model = optimize_model(model, optimize_llm=False)
model = optimize_model(model, low_bit='sym_int4', optimize_llm=False)
model.to("xpu")
tokenizer = model.get_tokenizer()

View file

@ -1,26 +1,6 @@
# Self-Speculative Decoding for Large Language Model FP16 Inference using IPEX-LLM on Intel GPUs
You can use IPEX-LLM to run FP16 inference for any Huggingface Transformer model with ***self-speculative decoding*** on Intel GPUs. This directory contains example scripts to help you quickly get started to run some popular open-source models using self-speculative decoding. Each model has its own dedicated folder, where you can find detailed instructions on how to install and run it.
# Speculative-Decoding Examples on Intel GPU
## Verified Hardware Platforms
This folder contains examples of running Speculative Decoding with IPEX-LLM on Intel GPU:
- Intel Data Center GPU Max Series
## Recommended Requirements
To apply Intel GPU acceleration, there are several steps for tools installation and environment preparation. See the [GPU installation guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html) for more details.
Step 1, only Linux systems are supported for now; Ubuntu 22.04 is preferred.
Step 2, please refer to our [driver installation](https://dgpu-docs.intel.com/driver/installation.html) for general purpose GPU capabilities.
> **Note**: IPEX 2.1.10+xpu requires Intel GPU Driver version >= stable_775_20_20231219.
Step 3, you also need to download and install [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html). OneMKL and the DPC++ compiler are needed; the others are optional.
> **Note**: IPEX 2.1.10+xpu requires Intel® oneAPI Base Toolkit's version == 2024.0.
## Best Known Configuration on Linux
For optimal performance on Intel Data Center GPU Max Series, it is recommended to set several environment variables.
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
- [Self-Speculation](Self-Speculation): running FP16 inference for Huggingface Transformer models with ***self-speculative decoding*** using IPEX-LLM on Intel GPUs
- [EAGLE](EAGLE): running speculative sampling using ***EAGLE*** (Extrapolation Algorithm for Greater Language-model Efficiency) with IPEX-LLM on Intel GPUs

View file

@ -0,0 +1,26 @@
# Self-Speculative Decoding for Large Language Model FP16 Inference using IPEX-LLM on Intel GPUs
You can use IPEX-LLM to run FP16 inference for any Huggingface Transformer model with ***self-speculative decoding*** on Intel GPUs. This directory contains example scripts to help you quickly get started to run some popular open-source models using self-speculative decoding. Each model has its own dedicated folder, where you can find detailed instructions on how to install and run it.
## Verified Hardware Platforms
- Intel Data Center GPU Max Series
## Recommended Requirements
To apply Intel GPU acceleration, there are several steps for tools installation and environment preparation. See the [GPU installation guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html) for more details.
Step 1, only Linux systems are supported for now; Ubuntu 22.04 is preferred.
Step 2, please refer to our [driver installation](https://dgpu-docs.intel.com/driver/installation.html) for general purpose GPU capabilities.
> **Note**: IPEX 2.1.10+xpu requires Intel GPU Driver version >= stable_775_20_20231219.
Step 3, you also need to download and install [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html). OneMKL and the DPC++ compiler are needed; the others are optional.
> **Note**: IPEX 2.1.10+xpu requires Intel® oneAPI Base Toolkit's version == 2024.0.
## Best Known Configuration on Linux
For optimal performance on Intel Data Center GPU Max Series, it is recommended to set several environment variables.
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
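As on the CPU side, each model folder provides a `speculative.py`. The GPU scripts typically load the model in FP16 and place it on the XPU device; a brief sketch follows, where `speculative=True` and `load_in_low_bit="fp16"` are assumed keyword names and the model id is hypothetical, so the per-model scripts remain authoritative.

```python
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401, registers the 'xpu' device
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-chat-hf"  # hypothetical model id, for illustration only

# Load in FP16 with self-speculative decoding enabled (assumed flags), then move to the Intel GPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             optimize_model=True,
                                             torch_dtype=torch.half,
                                             load_in_low_bit="fp16",
                                             speculative=True,
                                             use_cache=True,
                                             trust_remote_code=True).to("xpu")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

inputs = tokenizer("What is AI?", return_tensors="pt").to("xpu")
output = model.generate(inputs.input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```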

View file

@ -5,7 +5,7 @@ In this directory, you will find examples on how you could run Baichuan2 FP16 in
To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
## Example: Predict Tokens using `generate()` API
In the example [speculative.py](./speculative.py), we show a basic use case for a Baichuan2 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel GPUs.
In the example [speculative.py](speculative.py), we show a basic use case for a Baichuan2 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel GPUs.
### 1. Install
We suggest using conda to manage environment:
```bash

View file

@ -5,7 +5,7 @@ In this directory, you will find examples on how you could run ChatGLM3 FP16 inf
To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
## Example: Predict Tokens using `generate()` API
In the example [speculative.py](./speculative.py), we show a basic use case for a ChatGLM3 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel GPUs.
In the example [speculative.py](speculative.py), we show a basic use case for a ChatGLM3 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel GPUs.
### 1. Install
We suggest using conda to manage environment:
```bash

View file

@ -5,7 +5,7 @@ In this directory, you will find examples on how you could run GPT-J FP16 infern
To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
## Example: Predict Tokens using `generate()` API
In the example [speculative.py](./speculative.py), we show a basic use case for a GPT-J model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel GPUs.
In the example [speculative.py](speculative.py), we show a basic use case for a GPT-J model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel GPUs.
### 1. Install
We suggest using conda to manage environment:
```bash

View file

@ -5,7 +5,7 @@ In this directory, you will find examples on how you could run LLaMA2 FP16 infer
To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
## Example: Predict Tokens using `generate()` API
In the example [speculative.py](./speculative.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel GPUs.
In the example [speculative.py](speculative.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel GPUs.
### 1. Install
We suggest using conda to manage environment:
```bash

View file

@ -5,7 +5,7 @@ In this directory, you will find examples on how you could run Mistral FP16 infe
To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
## Example: Predict Tokens using `generate()` API
In the example [speculative.py](./speculative.py), we show a basic use case for a Mistral model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel GPUs.
In the example [speculative.py](speculative.py), we show a basic use case for a Mistral model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel GPUs.
### 1. Install
We suggest using conda to manage environment:
```bash

View file

@ -5,7 +5,7 @@ In this directory, you will find examples on how you could run Qwen FP16 inferne
To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
## Example: Predict Tokens using `generate()` API
In the example [speculative.py](./speculative.py), we show a basic use case for a Qwen model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel GPUs.
In the example [speculative.py](speculative.py), we show a basic use case for a Qwen model to predict the next N tokens using `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel GPUs.
### 1. Install
We suggest using conda to manage environment:
```bash