ipex-llm/python/llm/dev/benchmark/perplexity/README.md

# Perplexity
Perplexity (PPL) is one of the most common metrics for evaluating language models. This benchmark implementation is adapted from [transformers/perplexity](https://huggingface.co/docs/transformers/perplexity#perplexity-of-fixed-length-models) and [benchmark_patch_llm.py](https://github.com/insuhan/hyper-attn/blob/main/benchmark_patch_llm.py).
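For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below only illustrates that definition; it is not taken from `run.py`, and the names `perplexity` and `token_log_probs` are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over the given tokens.

    token_log_probs: natural-log probabilities the model assigned to each
    observed token (hypothetical input, not the format used by run.py).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token is as uncertain as a
# uniform choice among four options, so its perplexity is approximately 4.
uniform_log_probs = [math.log(0.25)] * 8
print(perplexity(uniform_log_probs))
```

Lower values mean the model assigns higher probability to the reference text, which is why comparing PPL across precisions can show how much quality a quantized model loses.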
## HOW TO RUN
```bash
python run.py --model_path <path/to/model> --precisions sym_int4 fp4 mixed_fp4 sym_int8 fp8_e5m2 fp8_e4m3 mixed_fp8 --device xpu --datasets <dataset_names> --dataset_path <path/to/dataset> --language en
```
A more specific example that measures perplexity on Llama2-7B using the default English datasets:
```bash
python run.py --model_path meta-llama/Llama-2-7b-chat-hf --precisions float16 sym_int4 --device xpu --language en
```
> Note: Only the `THUDM/LongBench` [dataset](https://github.com/THUDM/LongBench) is currently supported.
- To test model perplexity on a few selected datasets from `LongBench`, list them in the format below.
```bash
--datasets narrativeqa qasper ...
```
- The `language` argument only takes effect if `datasets` is `None`. Its choices are `en`, `zh`, and `all`, which select all the English datasets, all the Chinese datasets, and all the datasets, respectively.
- To test perplexity on pre-downloaded datasets, pass their location (`<path/to/dataset>`) through the `dataset_path` argument in your command.
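The precedence between `datasets` and `language` described above can be sketched as follows. This is an assumption-laden illustration, not the code in `run.py`: the function name `resolve_datasets` is hypothetical, and the two lists are small illustrative subsets of LongBench rather than the defaults the script actually uses.

```python
# Illustrative subsets only; the real default lists may differ.
EN_DATASETS = ["narrativeqa", "qasper"]
ZH_DATASETS = ["multifieldqa_zh", "dureader"]


def resolve_datasets(datasets=None, language="en"):
    """Pick the datasets to evaluate (hypothetical helper).

    An explicit `datasets` list wins; `language` is consulted only when
    `datasets` is None, mirroring the behaviour described in this README.
    """
    if datasets is not None:
        return list(datasets)
    groups = {
        "en": EN_DATASETS,
        "zh": ZH_DATASETS,
        "all": EN_DATASETS + ZH_DATASETS,
    }
    if language not in groups:
        raise ValueError(f"language must be one of {sorted(groups)}")
    return list(groups[language])


# `language` is ignored here because `datasets` is given explicitly.
print(resolve_datasets(["qasper"], language="zh"))
# With no explicit list, `language` selects the default group.
print(resolve_datasets(language="all"))
```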
## Summarize the results
```bash
python make_table.py <input_dir>
```