# Perplexity
Perplexity (PPL) is one of the most common metrics for evaluating language models. This benchmark implementation is adapted from the Hugging Face `transformers` perplexity guide and `benchmark_patch_llm.py`.
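Concretely, for a tokenized sequence $X = (x_1, \ldots, x_N)$, perplexity is the exponentiated average negative log-likelihood of each token given its preceding context; lower values mean the model assigns higher probability to the evaluation text:

$$
\mathrm{PPL}(X) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log p_\theta(x_i \mid x_{<i})\right)
$$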
## Run on Wikitext
Download the WikiText-2 raw dataset (`wikitext-2-raw-v1`), unzip it, and use the test split `wiki.test.raw` for evaluation.
```bash
python run_wikitext.py --model_path meta-llama/Meta-Llama-3-8B/ --data_path wikitext-2-raw-v1/wikitext-2-raw/wiki.test.raw --precision sym_int4 --use-cache --device xpu

# Run with stride
python run_wikitext.py --model_path meta-llama/Meta-Llama-3-8B/ --data_path wikitext-2-raw-v1/wikitext-2-raw/wiki.test.raw --precision fp16 --device xpu --stride 512
```
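With a stride, the text is scored with a sliding window: each forward pass sees up to the model's context length, but only the tokens that advanced past the previous window contribute to the loss, so long documents can be evaluated without truncation. The snippet below is a minimal sketch of that sliding-window procedure using plain `transformers` (as in the perplexity guide it is adapted from); the model id, file path, context length, and stride value are placeholders, and the actual `run_wikitext.py` may differ in details.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model id, device, and data path for illustration only.
model_id = "meta-llama/Meta-Llama-3-8B"
device = "cpu"
model = AutoModelForCausalLM.from_pretrained(model_id).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)

with open("wiki.test.raw", encoding="utf-8") as f:
    encodings = tokenizer(f.read(), return_tensors="pt")

max_length = 2048   # tokens fed to the model per forward pass
stride = 512        # how far the window advances each step
seq_len = encodings.input_ids.size(1)

nlls = []
prev_end = 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end                 # tokens not scored in a previous window
    input_ids = encodings.input_ids[:, begin:end].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100          # ignore context-only tokens in the loss

    with torch.no_grad():
        # .loss is the mean negative log-likelihood over the unmasked targets
        nlls.append(model(input_ids, labels=target_ids).loss)

    prev_end = end
    if end == seq_len:
        break

print(f"perplexity: {torch.exp(torch.stack(nlls).mean()).item():.4f}")
```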
## Run on THUDM/LongBench dataset
```bash
python run.py --model_path <path/to/model> --precisions sym_int4 fp8 --device xpu --datasets dataset_names --dataset_path <path/to/dataset> --language en
```
For example, to run perplexity on Llama-2-7B using the default English datasets:
```bash
python run.py --model_path meta-llama/Llama-2-7b-chat-hf --precisions float16 sym_int4 --device xpu --language en
```
Notes:
- If you want to test model perplexity on a few selected datasets from the LongBench dataset, please use the format below: `--datasets narrativeqa qasper ...`
- The `language` argument will only take effect if `datasets` is `None`. The choices for this argument are `en`, `zh`, `all`, which stand for all the English datasets, all the Chinese datasets, and all the datasets respectively during testing.
- If you want to test perplexity on pre-downloaded datasets, please specify the `<path/to/dataset>` in the `dataset_path` argument in your command.
- You can run `python make_table.py <input_dir>` to summarize the results.