bigdl_llm.py	fix optimize_model not working (#9995 )	2024-01-25 16:39:05 +08:00
harness_csv_to_html.py	In harness-evaluation workflow, add statistical tables (#10118 )	2024-02-08 19:01:05 +08:00
harness_to_leaderboard.py	Enable fp8e5 harness (#9761 )	2023-12-22 16:59:48 +08:00
make_table_and_csv.py	In harness-evaluation workflow, add statistical tables (#10118 )	2024-02-08 19:01:05 +08:00
make_table_results.py	In harness-evaluation workflow, add statistical tables (#10118 )	2024-02-08 19:01:05 +08:00
README.md	harness tests on pvc multiple xpus (#9908 )	2024-01-23 13:20:37 +08:00
run_llb.py	Add harness nightly (#9552 )	2023-12-01 14:16:35 +08:00
run_multi_llb.py	harness tests on pvc multiple xpus (#9908 )	2024-01-23 13:20:37 +08:00

README.md

Harness Evaluation

Harness evaluation lets users easily measure accuracy on various datasets. Here we have enabled harness evaluation with BigDL-LLM under the Open LLM Leaderboard settings. Before running, make sure bigdl-llm is installed.

Install Harness

git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout e81d3cc
pip install -e .

Run

Run python run_llb.py. The run_llb.py script combines some of the arguments in main.py to make evaluations easier. The mapping of arguments is defined as a dict in llb.py.
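
As a rough illustration of what such a mapping dict could look like, the sketch below expands short task names into harness task names and few-shot counts. The names TASK_MAP and expand_tasks are hypothetical and are not taken from the script. The few-shot counts follow the Open LLM Leaderboard settings (25 for ARC, 10 for HellaSwag, 5 for MMLU, 0 for TruthfulQA), and the harness task names are assumptions about the pinned harness version.

```python
# Hypothetical mapping from short task names to harness settings.
# This is NOT the actual dict used by the script; the harness task
# names on the right-hand side are assumptions.
TASK_MAP = {
    "hellaswag": {"tasks": "hellaswag", "num_fewshot": 10},
    "arc": {"tasks": "arc_challenge", "num_fewshot": 25},
    "truthfulqa": {"tasks": "truthfulqa_mc", "num_fewshot": 0},
    "mmlu": {"tasks": "hendrycksTest-*", "num_fewshot": 5},
}


def expand_tasks(short_names):
    """Return one harness argument dict per requested short task name."""
    unknown = [name for name in short_names if name not in TASK_MAP]
    if unknown:
        raise ValueError(f"unsupported tasks: {unknown}")
    # Copy each entry so callers cannot mutate the shared mapping.
    return [dict(TASK_MAP[name]) for name in short_names]
```

Keeping the mapping in a single dict means that supporting another task only requires a new entry, not new argument-parsing logic.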

Evaluation on CPU

python run_llb.py --model bigdl-llm --pretrained /path/to/model --precision nf3 sym_int4 nf4 --device cpu --tasks hellaswag arc mmlu truthfulqa --batch 1 --no_cache

Evaluation on Intel GPU

python run_llb.py --model bigdl-llm --pretrained /path/to/model --precision nf3 sym_int4 nf4 --device xpu --tasks hellaswag arc mmlu truthfulqa --batch 1 --no_cache

Evaluation using multiple Intel GPUs

python run_multi_llb.py --model bigdl-llm --pretrained /path/to/model --precision nf3 sym_int4 nf4 --device xpu:0,2,3 --tasks hellaswag arc mmlu truthfulqa --batch 1 --no_cache

In the example above, the script forks three processes, one for each XPU listed in --device (0, 2, and 3), to execute the tasks.
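
The one-process-per-device idea can be sketched as follows. This is a minimal illustration, not the actual code of run_multi_llb.py: parse_devices, worker, and launch are hypothetical names, and the worker only prints a message where a real one would run the evaluation.

```python
import multiprocessing as mp


def parse_devices(device_arg):
    """Split a device string such as 'xpu:0,2,3' into one entry per device."""
    kind, sep, ids = device_arg.partition(":")
    if not sep:
        return [kind]  # plain 'xpu' or 'cpu' without indices
    return [f"{kind}:{i.strip()}" for i in ids.split(",") if i.strip()]


def worker(device):
    # Placeholder: a real worker would run the evaluation on this device.
    print(f"evaluating on {device}")


def launch(device_arg):
    """Start one process per device and wait for all of them to finish."""
    procs = [
        mp.Process(target=worker, args=(device,))
        for device in parse_devices(device_arg)
    ]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    return len(procs)


if __name__ == "__main__":
    launch("xpu:0,2,3")  # starts three processes: xpu:0, xpu:2, xpu:3
```

How the real script divides the precision and task combinations among the processes is not shown here.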

Results

We follow the Open LLM Leaderboard when recording metrics: acc_norm for hellaswag and arc_challenge, mc2 for truthful_qa, and acc for mmlu. mmlu consists of 57 subtasks, so users may need to average the subtask scores manually to get the final result.
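
The manual averaging for mmlu could be done with a small helper like the one below. It is a sketch under stated assumptions: the results are assumed to be a dict keyed by subtask name, the subtask names are assumed to share the prefix hendrycksTest-, and the final score is taken as a simple unweighted mean of the per-subtask acc values. The function name average_mmlu and the numbers in the example are illustrative only, not real results.

```python
from statistics import mean


def average_mmlu(results, prefix="hendrycksTest-", metric="acc"):
    """Return the unweighted mean of `metric` over all MMLU subtasks.

    `results` maps a task name to a dict of metric values. The prefix
    is an assumption about how the harness names MMLU subtasks.
    """
    scores = [
        metrics[metric]
        for task, metrics in results.items()
        if task.startswith(prefix)
    ]
    if not scores:
        raise ValueError(f"no tasks found with prefix {prefix!r}")
    return mean(scores)


# Illustrative values only, not measured results.
example = {
    "hendrycksTest-anatomy": {"acc": 0.25},
    "hendrycksTest-astronomy": {"acc": 0.75},
    "hellaswag": {"acc_norm": 0.80},  # ignored: not an MMLU subtask
}
mmlu_score = average_mmlu(example)
```

With a full run, the dict would hold all 57 subtasks and the helper would return their mean accuracy.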