Harness Evaluation

Harness evaluation lets users easily measure model accuracy on various datasets. Here we have enabled harness evaluation with BigDL-LLM under the Open LLM Leaderboard settings. Before running, make sure bigdl-llm is installed.

Install Harness

pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@b281b09

Run

Run python run_llb.py to start an evaluation. run_llb.py combines some of the arguments of main.py to make evaluations easier. The mapping of arguments is defined as a dict in llb.py.
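To illustrate what such an argument mapping could look like, here is a minimal sketch. It is not the contents of llb.py: the dict name, the helper, and the harness task names (arc_challenge, hendrycksTest-*, truthfulqa_mc) are assumptions for illustration. The few-shot counts are the ones used by the Open LLM Leaderboard (25 for ARC, 10 for HellaSwag, 5 for MMLU, 0 for TruthfulQA).

```python
# Hypothetical mapping: short task name -> (harness task name, few-shot count).
# The harness task names are assumptions; check them against the installed
# lm-evaluation-harness version before relying on them.
LEADERBOARD_TASKS = {
    "arc": ("arc_challenge", 25),
    "hellaswag": ("hellaswag", 10),
    "mmlu": ("hendrycksTest-*", 5),
    "truthfulqa": ("truthfulqa_mc", 0),
}


def resolve_tasks(names: list[str]) -> list[tuple[str, int]]:
    """Translate short task names into (harness task, num_fewshot) pairs."""
    unknown = [n for n in names if n not in LEADERBOARD_TASKS]
    if unknown:
        raise ValueError(f"unknown task(s): {', '.join(unknown)}")
    return [LEADERBOARD_TASKS[n] for n in names]


print(resolve_tasks(["hellaswag", "arc"]))
```

Keeping the mapping in a single dict means the command line only needs the short names, and the leaderboard settings live in one place.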

Evaluation on CPU

python run_llb.py --model bigdl-llm --pretrained /path/to/model --precision nf3 sym_int4 nf4 --device cpu --tasks hellaswag arc mmlu truthfulqa --batch 1 --no_cache

Evaluation on Intel GPU

python run_llb.py --model bigdl-llm --pretrained /path/to/model --precision nf3 sym_int4 nf4 --device xpu --tasks hellaswag arc mmlu truthfulqa --batch 1 --no_cache

Evaluation on multiple Intel GPUs

python run_multi_llb.py --model bigdl-llm --pretrained /path/to/model --precision nf3 sym_int4 nf4 --device xpu:0,2,3 --tasks hellaswag arc mmlu truthfulqa --batch 1 --no_cache

In the example above, the script forks three processes, one per XPU (devices 0, 2, and 3), to execute the tasks.
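As a rough sketch of how a device string such as xpu:0,2,3 can be expanded into one device per process, consider the helper below. The function name split_devices is hypothetical and this is not the code in run_multi_llb.py; how the script divides the tasks and precisions among the processes is not shown here.

```python
def split_devices(device: str) -> list[str]:
    """Expand 'xpu:0,2,3' into ['xpu:0', 'xpu:2', 'xpu:3'].

    A device string without an index list (for example 'cpu') is
    returned unchanged as a single-element list.
    """
    if ":" not in device:
        return [device]
    kind, _, ids = device.partition(":")
    # Ignore empty entries such as a trailing comma.
    return [f"{kind}:{i.strip()}" for i in ids.split(",") if i.strip()]


# One entry per worker process that would be started.
print(split_devices("xpu:0,2,3"))
```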

Results

We follow the Open LLM Leaderboard when recording metrics: acc_norm for hellaswag and arc_challenge, mc2 for truthful_qa, and acc for mmlu. mmlu consists of 57 subtasks, so users may need to average the subtask acc values manually to get the final result.
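The manual averaging for mmlu can be sketched as follows. This assumes the per-task results are available as a dict of the form {task_name: {metric: value}} and that the mmlu subtasks share the name prefix hendrycksTest-; both are assumptions about the harness output, and the function name and sample values are made up for illustration. The mean is unweighted, so every subtask counts equally regardless of its size.

```python
from statistics import mean


def average_mmlu_acc(results: dict[str, dict[str, float]],
                     prefix: str = "hendrycksTest-") -> float:
    """Return the unweighted mean of `acc` over all subtasks matching prefix."""
    accs = [metrics["acc"] for task, metrics in results.items()
            if task.startswith(prefix)]
    if not accs:
        raise ValueError(f"no tasks found with prefix {prefix!r}")
    return mean(accs)


# Made-up values for illustration only; a real run has 57 mmlu subtasks.
example = {
    "hendrycksTest-abstract_algebra": {"acc": 0.25},
    "hendrycksTest-anatomy": {"acc": 0.5},
    "hellaswag": {"acc_norm": 0.75},
}
print(average_mmlu_acc(example))  # 0.375
```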