# Ziya

In this directory, you will find examples of how to run Ziya BF16 inference with self-speculative decoding using IPEX-LLM on [Intel CPUs](../README.md). For illustration purposes, we use [IDEA-CCNL/Ziya-Coding-34B-v1.0](https://huggingface.co/IDEA-CCNL/Ziya-Coding-34B-v1.0) as the reference Ziya model.

## 0. Requirements

To run the example with IPEX-LLM on Intel CPUs, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.

## Example: Predict Tokens using `generate()` API

In the example [speculative.py](./speculative.py), we show a basic use case for a Ziya model to predict the next N tokens using the `generate()` API, with IPEX-LLM speculative decoding optimizations on Intel CPUs.

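To give a sense of what the script does, here is a minimal sketch of loading a Ziya model in BF16 with self-speculative decoding enabled and then calling `generate()`. It is not the actual [speculative.py](./speculative.py); the IPEX-LLM loading flags shown (`load_in_low_bit="bf16"`, `speculative=True`) are assumptions, so refer to the script itself for the exact code.

```python
# A minimal, illustrative sketch -- not the actual speculative.py.
# The flags `load_in_low_bit="bf16"` and `speculative=True` are assumptions
# about the IPEX-LLM API; see speculative.py for the exact code.
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "IDEA-CCNL/Ziya-Coding-34B-v1.0"

# Load the model in BF16 and (assumed) enable IPEX-LLM self-speculative decoding.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    optimize_model=True,
    torch_dtype=torch.bfloat16,
    load_in_low_bit="bf16",   # assumed flag: keep weights in BF16
    speculative=True,         # assumed flag: turn on self-speculative decoding
    trust_remote_code=True,
    use_cache=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

prompt = "写一段快速排序"  # the real script additionally wraps this in Ziya's chat prompt format
with torch.inference_mode():
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    # Once the model is loaded as above, generate() runs with speculative decoding.
    output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```
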
### 1. Install

We suggest using conda to manage the environment:

```bash
conda create -n llm python=3.11
conda activate llm
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
pip install intel_extension_for_pytorch==2.1.0
pip install transformers==4.35.2
```

### 2. Configure high-performing processor environment variables

```bash
source ipex-llm-init -t
export OMP_NUM_THREADS=48 # change 48 to the number of cores of one processor socket
```

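The value `48` above matches a 48-core socket. If you are unsure how many cores one socket on your machine has, you can check first; this relies only on standard `lscpu` output:

```bash
# Print cores per socket and the socket count;
# set OMP_NUM_THREADS to the "Core(s) per socket" value.
lscpu | grep -E "Core\(s\) per socket|Socket\(s\)"
```
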
### 3. Run

We recommend using `numactl` to bind the program to a specific processor socket:

```bash
numactl -C 0-47 -m 0 python ./speculative.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```

For example, `-C 0-47` binds the Python program to cores 0-47 of a 48-core socket.

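If you are not sure which core ids and memory node belong to which socket, `numactl` can print the NUMA topology, which tells you what to pass to `-C` and `-m`:

```bash
# Show NUMA nodes, the CPU ids on each node, and per-node memory.
numactl --hardware
```
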
Arguments info:

- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Ziya model (e.g. `IDEA-CCNL/Ziya-Coding-34B-v1.0`) to be downloaded, or the path to the huggingface checkpoint folder. It defaults to `IDEA-CCNL/Ziya-Coding-34B-v1.0`.
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). A default prompt is provided.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `128`.

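For example, to run with the default model and the prompt used in the sample output below (adjust the core list to your machine):

```bash
numactl -C 0-47 -m 0 python ./speculative.py \
  --repo-id-or-model-path IDEA-CCNL/Ziya-Coding-34B-v1.0 \
  --prompt "写一段快速排序" \
  --n-predict 128
```
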
#### Sample Output
#### [IDEA-CCNL/Ziya-Coding-34B-v1.0](https://huggingface.co/IDEA-CCNL/Ziya-Coding-34B-v1.0)

```log
<human>:
写一段快速排序
<bot>:
def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quick_sort(left) + middle + quick_sort(right)
Tokens generated 100
E2E Generation time xx.xxxxs
First token latency xx.xxxxs
```