# Perplexity
Perplexity (PPL) is one of the most common metrics for evaluating language models. This benchmark implementation is adapted from `transformers/perplexity` and `benchmark_patch_llm.py`.
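For reference, PPL is the exponential of the average negative log-likelihood a model assigns to a token sequence. The sketch below only illustrates that core computation in the style of the transformers/perplexity recipe (it uses the small public `gpt2` checkpoint for convenience); it is not part of `run.py`.

```python
# Minimal sketch of the perplexity computation: PPL = exp(mean negative log-likelihood).
# The checkpoint and input text are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # small public checkpoint, used only to show the math
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

text = "Perplexity measures how well a language model predicts a sample of text."
encodings = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids makes the model return the average
    # next-token cross-entropy loss over the sequence.
    outputs = model(encodings.input_ids, labels=encodings.input_ids)

# Perplexity is the exponential of the mean negative log-likelihood.
ppl = torch.exp(outputs.loss)
print(f"PPL: {ppl.item():.2f}")
```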
## HOW TO RUN
```bash
python run.py --model_path <path/to/model> --precisions sym_int4 fp4 mixed_fp4 sym_int8 fp8_e5m2 fp8_e4m3 mixed_fp8 --device xpu --datasets dataset_names --dataset_path <path/to/dataset> --language en
```
A more specific example to run perplexity on Llama2-7B using the default English datasets:
```bash
python run.py --model_path meta-llama/Llama-2-7b-chat-hf --precisions float16 sym_int4 --device xpu --language en
```
Note: We currently only support the `THUDM/LongBench` dataset.

- If you want to test model perplexity on a few selected datasets from the LongBench dataset, please use the format `--datasets narrativeqa qasper ...`
- The `language` argument will only take effect if `datasets` is `None`. The choices for this argument are `en`, `zh`, and `all`, which stand for all the English datasets, all the Chinese datasets, and all the datasets respectively. The selection rule is sketched after this list.
- If you want to test perplexity on pre-downloaded datasets, please specify `<path/to/dataset>` in the `dataset_path` argument in your command.
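To make the precedence between `--datasets` and `--language` concrete, here is a hypothetical sketch of the selection rule described above; the helper name and the dataset lists are placeholders, not the actual contents of `run.py`.

```python
# Hypothetical sketch of the dataset-selection rule: an explicit --datasets
# list always wins, and --language is only consulted when no datasets are given.
EN_DATASETS = ["narrativeqa", "qasper"]        # placeholder English subsets
ZH_DATASETS = ["multifieldqa_zh", "dureader"]  # placeholder Chinese subsets

def resolve_datasets(datasets=None, language="en"):
    """Return the LongBench subsets to evaluate."""
    if datasets:                      # e.g. ["narrativeqa", "qasper"]
        return list(datasets)
    if language == "en":
        return EN_DATASETS
    if language == "zh":
        return ZH_DATASETS
    if language == "all":
        return EN_DATASETS + ZH_DATASETS
    raise ValueError(f"Unsupported language: {language}")

print(resolve_datasets(datasets=None, language="all"))
```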
## Summarize the results
```bash
python make_table.py <input_dir>
```