
Ziya

In this directory, you will find examples of how to run Ziya BF16 inference with self-speculative decoding using BigDL-LLM on Intel CPUs. For illustration purposes, we utilize IDEA-CCNL/Ziya-Coding-34B-v1.0 as the reference Ziya model.

0. Requirements

To run the example with BigDL-LLM on Intel CPUs, we have some recommended requirements for your machine; please refer to here for more information.

Example: Predict Tokens using generate() API

In the example speculative.py, we show a basic use case for a Ziya model to predict the next N tokens using the generate() API, with BigDL-LLM speculative decoding optimizations on Intel CPUs.
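
At a high level, the example loads the model in BF16 with BigDL-LLM's self-speculative decoding enabled and then calls generate() as usual. The sketch below illustrates this flow; it is a simplified approximation, and the exact keyword arguments and prompt handling in speculative.py may differ.

import torch
from transformers import AutoTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM

model_path = "IDEA-CCNL/Ziya-Coding-34B-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Load the model in BF16 and enable self-speculative decoding;
# the kwargs shown here are illustrative.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             optimize_model=True,
                                             torch_dtype=torch.bfloat16,
                                             load_in_low_bit="bf16",
                                             speculative=True,
                                             trust_remote_code=True,
                                             use_cache=True)

# Ziya chat prompt format: "<human>: \n{query}\n<bot>: \n"
prompt = "<human>: \nWrite a quick sort function\n<bot>: \n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))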

1. Install

We suggest using conda to manage the environment:

conda create -n llm python=3.9
conda activate llm
pip install --pre --upgrade bigdl-llm[all]
pip install intel_extension_for_pytorch==2.1.0
pip install transformers==4.35.2

2. Configure high-performing processor environment variables

source bigdl-llm-init -t
export OMP_NUM_THREADS=48 # change 48 to the number of cores on one processor socket

3. Run

We recommend using numactl to bind the program to a specified processor socket:

numactl -C 0-47 -m 0 python ./speculative.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT

For example, -C 0-47 binds the Python program to cores 0-47 of a 48-core socket.

Arguments info:

  • --repo-id-or-model-path REPO_ID_OR_MODEL_PATH: argument defining the huggingface repo id for the Ziya model (e.g. IDEA-CCNL/Ziya-Coding-34B-v1.0) to be downloaded, or the path to the huggingface checkpoint folder. It defaults to IDEA-CCNL/Ziya-Coding-34B-v1.0.
  • --prompt PROMPT: argument defining the prompt to be inferred (with the integrated prompt format for chat). A default prompt is provided.
  • --n-predict N_PREDICT: argument defining the max number of tokens to predict. It defaults to 128.
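
For instance, a complete invocation on a 48-core socket using the default prompt might look like the following (the core range is illustrative; match it to your machine):

numactl -C 0-47 -m 0 python ./speculative.py --repo-id-or-model-path IDEA-CCNL/Ziya-Coding-34B-v1.0 --n-predict 128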

Sample Output

IDEA-CCNL/Ziya-Coding-34B-v1.0

<human>: 
写一段快速排序
<bot>: 
def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quick_sort(left) + middle + quick_sort(right)
Tokens generated 100
E2E Generation time xx.xxxxs
First token latency xx.xxxxs