# Starcoder
In this directory, you will find examples of how to run Starcoder BF16 inference with self-speculative decoding using IPEX-LLM on [Intel CPUs](../README.md). For illustration purposes, we utilize [bigcode/starcoder](https://huggingface.co/bigcode/starcoder) and [bigcode/tiny_starcoder_py](https://huggingface.co/bigcode/tiny_starcoder_py) as reference Starcoder models.
## 0. Requirements
To run these examples with IPEX-LLM on Intel CPUs, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
## Example: Predict Tokens using `generate()` API
In the example [speculative.py](./speculative.py), we show a basic use case for a Starcoder model to predict the next N tokens using the `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs.
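The core of the workflow follows the ipex-llm `transformers`-style API. Below is a minimal sketch of that flow, assuming the `speculative=True` loading option; it is an illustration only, and the prompt format and argument handling in the actual [speculative.py](./speculative.py) may differ:

```python
import torch
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "bigcode/starcoder"  # or a local checkpoint folder

# Load the model in BF16 with IPEX-LLM optimizations; `speculative=True`
# enables self-speculative decoding, where a low-precision draft of the
# same model proposes tokens that the BF16 model then verifies.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    optimize_model=True,
    torch_dtype=torch.bfloat16,
    load_in_low_bit="bf16",
    speculative=True,
    trust_remote_code=True,
    use_cache=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

prompt = "def print_fibonacci_sequence(n):"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```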
### 1. Install
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.11
conda activate llm
pip install --pre --upgrade ipex-llm[all]
pip install intel_extension_for_pytorch==2.1.0
pip install transformers==4.31.0
```
### 2. Configure high-performing processor environment variables
```bash
source ipex-llm-init -t
export OMP_NUM_THREADS=48 # change 48 to the number of cores on one processor socket
```
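If you want to confirm that the thread settings took effect before running, a quick check from Python (standard `os` and PyTorch calls, not part of the example script) is:

```python
import os
import torch

# OMP_NUM_THREADS as exported above; PyTorch uses it for its intra-op thread pool.
print("OMP_NUM_THREADS:", os.environ.get("OMP_NUM_THREADS"))
print("torch intra-op threads:", torch.get_num_threads())
```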
### 3. Run
We recommend using `numactl` to bind the program to a specific processor socket:
```bash
numactl -C 0-47 -m 0 python ./speculative.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
For example, `-C 0-47` binds the Python program to cores 0-47 of a 48-core socket.
Arguments info:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the Hugging Face repo id of the Starcoder model (e.g. `bigcode/starcoder` or `bigcode/tiny_starcoder_py`) to be downloaded, or the path to a Hugging Face checkpoint folder. It defaults to `bigcode/starcoder`.
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). A default prompt is provided.
- `--n-predict N_PREDICT`: argument defining the maximum number of tokens to predict. It defaults to `128`.
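For reference, these options map to a straightforward `argparse` setup. The following is a hypothetical sketch of how such a CLI could be defined; the actual definitions (including the built-in default prompt) live in [speculative.py](./speculative.py) and may differ in wording:

```python
import argparse

parser = argparse.ArgumentParser(
    description="Predict tokens with a Starcoder model using IPEX-LLM speculative decoding")
parser.add_argument("--repo-id-or-model-path", type=str, default="bigcode/starcoder",
                    help="Hugging Face repo id or local checkpoint folder of the Starcoder model")
parser.add_argument("--prompt", type=str, default=None,
                    help="Prompt to run inference on; a built-in default is used when omitted")
parser.add_argument("--n-predict", type=int, default=128,
                    help="Maximum number of tokens to predict")
args = parser.parse_args()
```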
#### Sample Output
#### [bigcode/starcoder](https://huggingface.co/bigcode/starcoder)
```log
def dfs_print_Fibonacci_sequence(n):
    if n == 0:
        return
    elif n == 1:
        print(0)
        return
    elif n == 2:
        print(0)
        print(1)
        return
    else:
        print(0)
        print(1)
        dfs_print_Fibonacci_sequence(n-2)
        print(dfs_Fibonacci_sequence(n-1))

def dfs_Fibonacci_sequence(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return dfs_Fibonacci_sequence
Tokens generated 128
E2E Generation time xx.xxxxs
First token latency xx.xxxxs
```
#### [bigcode/tiny_starcoder_py](https://huggingface.co/bigcode/tiny_starcoder_py)
```log
def dfs_print_Fibonacci_sequence(n):
    if n == 0:
        return
    print(n)
    for i in range(2, n):
        print(dfs_print_Fibonacci_sequence(i))


def dfs_print_Fibonacci_sequence_2(n):
    if n == 0:
        return
    print(n)
    for i in range(2, n):
        print(dfs_print_Fibonacci_sequence_2(i))


def dfs_print_Fibonacci_sequence_3(n):
    if n == 0:
        return
    print(n)
    for i in
Tokens generated 128
E2E Generation time xx.xxxxs
First token latency xx.xxxxs
```