# Perplexity

Perplexity (PPL) is one of the most common metrics for evaluating language models. This benchmark implementation is adapted from the `transformers/perplexity` recipe and `benchmark_patch_llm.py`.
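For reference, perplexity over a tokenized sequence $x_1, \dots, x_N$ is the exponentiated average negative log-likelihood the model assigns to each token given its context (lower is better):

$$
\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p_\theta\left(x_i \mid x_{<i}\right)\right)
$$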
## Environment Preparation

```bash
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
pip install datasets
```
The following step is required on Linux for APT- or offline-installed oneAPI. Skip it for pip-installed oneAPI.

```bash
source /opt/intel/oneapi/setvars.sh
```
Please set `IPEX_LLM_LAST_LM_HEAD=0` to disable the `last_lm_head` optimization.

```bash
export IPEX_LLM_LAST_LM_HEAD=0
```
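If it is more convenient, the same switch can also be applied from Python instead of the shell. This is only an alternative sketch, not an additional required step:

```python
import os

# Equivalent to `export IPEX_LLM_LAST_LM_HEAD=0`; set it before the model is
# loaded through ipex-llm so the flag is seen when optimizations are applied.
os.environ["IPEX_LLM_LAST_LM_HEAD"] = "0"
```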
## PPL Evaluation

### 1. Run on Wikitext

An example to run perplexity on wikitext:

```bash
python run_wikitext.py --model_path meta-llama/Meta-Llama-3-8B --dataset path=wikitext,name=wikitext-2-raw-v1 --precision sym_int4 --device xpu --stride 512 --max_length 4096
```
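The `--stride` and `--max_length` arguments control the stride-based sliding-window evaluation described in the `transformers/perplexity` recipe that this benchmark is adapted from. The sketch below illustrates that recipe with plain `transformers` and the `datasets` library; it is illustrative only, since `run_wikitext.py` additionally loads the model through ipex-llm with the requested `--precision` and `--device`, and its exact aggregation may differ.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # illustrative; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# Concatenate the raw test split and tokenize it once.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_length, stride = 4096, 512
seq_len = encodings.input_ids.size(1)

nlls = []
prev_end = 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end  # only score tokens not covered by the previous window
    input_ids = encodings.input_ids[:, begin:end]
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # mask the overlapping prefix from the loss

    with torch.no_grad():
        nlls.append(model(input_ids, labels=target_ids).loss)

    prev_end = end
    if end == seq_len:
        break

# Unweighted mean over windows, as in the original transformers recipe.
ppl = torch.exp(torch.stack(nlls).mean())
print(f"perplexity: {ppl.item():.4f}")
```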
### 2. Run on THUDM/LongBench dataset

An example to run perplexity on chatglm3-6b using the default Chinese datasets ("multifieldqa_zh", "dureader", "vcsum", "lsht", "passage_retrieval_zh"):

```bash
python run_longbench.py --model_path THUDM/chatglm3-6b --precisions float16 sym_int4 --device xpu --language zh
```
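For context, the default Chinese subsets correspond to configurations of the `THUDM/LongBench` dataset on the Hugging Face Hub, which `run_longbench.py` downloads for you. A hedged sketch of loading them directly with `datasets` (depending on your `datasets` version, the dataset's loading script may require `trust_remote_code=True`):

```python
from datasets import load_dataset

# The five default Chinese subsets used by `--language zh`.
zh_subsets = ["multifieldqa_zh", "dureader", "vcsum", "lsht", "passage_retrieval_zh"]

for name in zh_subsets:
    ds = load_dataset("THUDM/LongBench", name, split="test")  # may need trust_remote_code=True
    print(name, len(ds))
```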
Notes:
- If you want to test model perplexity on a few selected datasets from the LongBench dataset, please use the format below: `--datasets narrativeqa qasper ...`
- The `language` argument will only take effect if `datasets` is `None`. The choices for this argument are `en`, `zh` and `all`, which stand for all the English datasets, all the Chinese datasets, and all the datasets respectively.
- If you want to test perplexity on pre-downloaded datasets, please specify the `<path/to/dataset>` in the `dataset_path` argument in your command.
- You can run `python make_table.py <input_dir>` to summarize the results.