Eagle - Speculative Sampling using IPEX-LLM on Intel CPUs
In this directory, you will find examples of how IPEX-LLM accelerates inference with speculative sampling on Intel CPUs using EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a speculative sampling method that improves text generation speed. See here to view the paper and here for more information on the EAGLE code.
Requirements
To run these examples with IPEX-LLM, we have some recommended requirements for your machine; please refer to here for more information.
Example - EAGLE Speculative Sampling with IPEX-LLM on MT-bench
In this example, we run inference for a Llama2 model to showcase the speed of EAGLE with IPEX-LLM on MT-bench data on Intel CPUs.
1. Install
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to here.
After installing conda, create a Python environment for IPEX-LLM:
conda create -n llm python=3.11 # recommend to use Python 3.11
conda activate llm
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
pip install intel_extension_for_pytorch==2.1.0
pip install -r requirements.txt
pip install eagle-llm
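To quickly verify the installation, you can optionally check that the key packages import without errors. This is a minimal sanity check, not part of the official setup, and it assumes the ipex_llm module name used by current ipex-llm releases:
# verify that torch, IPEX and IPEX-LLM can be imported in the new environment
python -c "import torch; import intel_extension_for_pytorch as ipex; import ipex_llm; print(torch.__version__, ipex.__version__)"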
2. Configure IPEX-LLM environment variables for Linux
Note
Skip this step if you are running on Windows.
# set IPEX-LLM env variables
source ipex-llm-init
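If you want to confirm what the script configured, you can inspect the resulting shell environment. The variable names below are only illustrative; the exact set depends on your ipex-llm version and machine:
# show performance-related variables typically touched by ipex-llm-init (e.g. OMP settings, allocator preload)
env | grep -iE 'omp|ld_preload|malloc'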
3. Run the example
You can test the speed of EAGLE speculative sampling with IPEX-LLM on MT-bench using the following command.
python -m evaluation.gen_ea_answer_llama2chat \
    --ea-model-path [path of EAGLE weight] \
    --base-model-path [path of the original model] \
    --enable-ipex-llm
Please refer to here for the complete list of available EAGLE weights.
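As a concrete illustration, assuming the EAGLE-llama2-chat-7B draft weights and the Llama-2-7b-chat-hf base model have been downloaded locally (the paths below are placeholders; adjust them to your own layout):
# example invocation with placeholder local paths
python -m evaluation.gen_ea_answer_llama2chat \
    --ea-model-path ./models/EAGLE-llama2-chat-7B \
    --base-model-path ./models/Llama-2-7b-chat-hf \
    --enable-ipex-llm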
The above command will generate a .jsonl file that records the generation results and wall time. Then, you can use evaluation/speed.py to calculate the speed.
python -m evaluation.speed \
    --base-model-path [path of the original model] \
    --jsonl-file [pathname of the .jsonl file]
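For example, if the previous step produced a results file named llama2-chat-7b-ea.jsonl (a hypothetical name; use the actual path of the file generated above):
# example invocation with placeholder paths
python -m evaluation.speed \
    --base-model-path ./models/Llama-2-7b-chat-hf \
    --jsonl-file ./llama2-chat-7b-ea.jsonl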