# Starcoder
In this directory, you will find examples of how to run Starcoder BF16 inference with self-speculative decoding using BigDL-LLM on [Intel CPUs](../README.md). For illustration purposes, we use [bigcode/starcoder](https://huggingface.co/bigcode/starcoder) and [bigcode/tiny_starcoder_py](https://huggingface.co/bigcode/tiny_starcoder_py) as reference Starcoder models.
## 0. Requirements
To run these examples with BigDL-LLM on Intel CPUs, there are some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
## Example: Predict Tokens using `generate()` API
In the example [speculative.py](./speculative.py), we show a basic use case for a Starcoder model to predict the next N tokens using the `generate()` API, with BigDL-LLM speculative decoding optimizations on Intel CPUs.
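For orientation, below is a minimal sketch of that pattern, assuming BigDL-LLM's Transformers-style `AutoModelForCausalLM` and a `speculative=True` flag to enable self-speculative decoding; the exact keyword arguments are assumptions based on this example, not a verbatim copy of [speculative.py](./speculative.py):

```python
import torch
from transformers import AutoTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM  # BigDL-LLM drop-in replacement

model_path = "bigcode/starcoder"  # or a local checkpoint folder

# Load the model in BF16 with BigDL-LLM optimizations; `speculative=True`
# is assumed to enable self-speculative decoding, where a low-bit draft of
# the same model proposes tokens that the BF16 model then verifies.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    optimize_model=True,
    torch_dtype=torch.bfloat16,
    load_in_low_bit="bf16",
    speculative=True,
    trust_remote_code=True,
    use_cache=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# A standard Hugging Face `generate()` call; the speculative decoding
# optimizations are applied transparently by the optimized model.
input_ids = tokenizer("def quick_sort(arr):", return_tensors="pt").input_ids
with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```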
### 1. Install
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
conda activate llm
pip install --pre --upgrade bigdl-llm[all]
pip install intel_extension_for_pytorch==2.1.0
pip install transformers==4.31.0
```
### 2. Configure high-performing processor environment variables
```bash
source bigdl-llm-init -t
export OMP_NUM_THREADS=48 # change 48 to the number of cores of one processor socket
```
### 3. Run
We recommend using `numactl` to bind the program to a specified processor socket:
```bash
numactl -C 0-47 -m 0 python ./speculative.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
For example, `-C 0-47` binds the Python program to cores 0-47 of a 48-core socket, and `-m 0` allocates its memory on NUMA node 0.
Arguments info:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the Hugging Face repo id for the Starcoder model (e.g. `bigcode/starcoder` or `bigcode/tiny_starcoder_py`) to be downloaded, or the path to the Hugging Face checkpoint folder. It defaults to `bigcode/starcoder`.
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). A default prompt is provided.
- `--n-predict N_PREDICT`: argument defining the maximum number of tokens to predict. It defaults to `128`. A sketch of how these arguments feed into the `generate()` call is shown after this list.
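For illustration, the following self-contained sketch shows one plausible way these arguments could be parsed and timed; the metric names mirror the sample output below, but the timing logic and the hard-coded default prompt are assumptions rather than the exact code of [speculative.py](./speculative.py):

```python
import argparse
import time

import torch
from transformers import AutoTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM

parser = argparse.ArgumentParser(description="Starcoder self-speculative decoding example")
parser.add_argument("--repo-id-or-model-path", type=str, default="bigcode/starcoder")
# Default prompt assumed from the sample output below.
parser.add_argument("--prompt", type=str, default="def dfs_print_Fibonacci_sequence(n):")
parser.add_argument("--n-predict", type=int, default=128)
args = parser.parse_args()

# Same loading pattern as the sketch above, driven by the CLI argument.
model = AutoModelForCausalLM.from_pretrained(
    args.repo_id_or_model_path, optimize_model=True, torch_dtype=torch.bfloat16,
    load_in_low_bit="bf16", speculative=True, trust_remote_code=True, use_cache=True)
tokenizer = AutoTokenizer.from_pretrained(args.repo_id_or_model_path, trust_remote_code=True)

input_ids = tokenizer(args.prompt, return_tensors="pt").input_ids
start = time.perf_counter()
with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=args.n_predict, do_sample=False)
elapsed = time.perf_counter() - start

# End-to-end wall-clock time; the real script also reports first-token
# latency, which requires internal counters not reproduced here.
print(tokenizer.decode(output[0], skip_special_tokens=True))
print(f"Tokens generated {output.shape[1] - input_ids.shape[1]}")
print(f"E2E Generation time {elapsed:.4f}s")
```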
#### Sample Output
#### [bigcode/starcoder](https://huggingface.co/bigcode/starcoder)
```log
def dfs_print_Fibonacci_sequence(n):
    if n == 0:
        return
    elif n == 1:
        print(0)
        return
    elif n == 2:
        print(0)
        print(1)
        return
    else:
        print(0)
        print(1)
        dfs_print_Fibonacci_sequence(n-2)
        print(dfs_Fibonacci_sequence(n-1))
def dfs_Fibonacci_sequence(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return dfs_Fibonacci_sequence
Tokens generated 128
E2E Generation time xx.xxxxs
First token latency xx.xxxxs
```
#### [bigcode/tiny_starcoder_py](https://huggingface.co/bigcode/tiny_starcoder_py)
```log
def dfs_print_Fibonacci_sequence(n):
    if n == 0:
        return
    print(n)
    for i in range(2, n):
        print(dfs_print_Fibonacci_sequence(i))
def dfs_print_Fibonacci_sequence_2(n):
    if n == 0:
        return
    print(n)
    for i in range(2, n):
        print(dfs_print_Fibonacci_sequence_2(i))
def dfs_print_Fibonacci_sequence_3(n):
    if n == 0:
        return
    print(n)
    for i in
Tokens generated 128
E2E Generation time xx.xxxxs
First token latency xx.xxxxs
```