ipex-llm/python/llm/dev/benchmark/harness

Harness Evaluation

Harness evaluation allows users to easily measure accuracy on various datasets. Here we have enabled harness evaluation with BigDL-LLM under the Open LLM Leaderboard settings. Before running, make sure bigdl-llm is installed.

Install Harness

git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout e81d3cc
pip install -e .

Run

Run python run_llb.py. The script combines some of the arguments in main.py to make evaluations easier. The mapping of arguments is defined as a dict in llb.py.

Evaluation on CPU

python run_llb.py --model bigdl-llm --pretrained /path/to/model --precision nf3 sym_int4 nf4 --device cpu --tasks hellaswag arc mmlu truthfulqa --batch 1 --no_cache

Evaluation on Intel GPU

python run_llb.py --model bigdl-llm --pretrained /path/to/model --precision nf3 sym_int4 nf4 --device xpu --tasks hellaswag arc mmlu truthfulqa --batch 1 --no_cache
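The two commands above differ only in the --device value. As a minimal sketch for scripting several runs, the helper below assembles the same command line from Python. The function name build_command and its parameters are hypothetical and not part of run_llb.py; only the flags shown in the commands above are used.

```python
import shlex


def build_command(pretrained, device, precisions, tasks, batch=1, no_cache=True):
    """Assemble the run_llb.py command line for one evaluation run.

    Hypothetical helper: it only mirrors the flags shown in this README.
    """
    # Only the two devices shown in this README are accepted here.
    if device not in ("cpu", "xpu"):
        raise ValueError(f"unsupported device: {device!r}")
    argv = [
        "python", "run_llb.py",
        "--model", "bigdl-llm",
        "--pretrained", pretrained,
        "--precision", *precisions,
        "--device", device,
        "--tasks", *tasks,
        "--batch", str(batch),
    ]
    if no_cache:
        argv.append("--no_cache")
    return argv


cmd = build_command(
    "/path/to/model",
    "xpu",
    ["nf3", "sym_int4", "nf4"],
    ["hellaswag", "arc", "mmlu", "truthfulqa"],
)
# Print a shell-ready string; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Returning a list rather than a string keeps the arguments safe to hand to subprocess.run without shell quoting issues.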

Results

We follow the Open LLM Leaderboard when recording metrics: acc_norm for hellaswag and arc_challenge, mc2 for truthful_qa, and acc for mmlu. mmlu has 57 subtasks, so users may need to average the subtask results manually to get the final score.
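The manual mmlu averaging can be sketched as follows. This is an illustration under assumptions, not the behavior of any script in this directory: it assumes the results are available as a dict mapping task names to metric dicts, and that mmlu subtasks share the prefix hendrycksTest-. Both should be checked against the actual harness output. The scores in the example are made-up toy values, and the unweighted mean over subtasks is itself an assumption about how the final score is aggregated.

```python
from statistics import mean

# Assumption: mmlu subtasks are reported under this task-name prefix.
MMLU_PREFIX = "hendrycksTest-"


def average_mmlu_acc(results, prefix=MMLU_PREFIX, metric="acc"):
    """Return the unweighted mean of `metric` over all tasks starting with `prefix`."""
    scores = [
        metrics[metric]
        for task, metrics in results.items()
        if task.startswith(prefix)
    ]
    if not scores:
        raise ValueError(f"no tasks found with prefix {prefix!r}")
    return mean(scores)


# Toy values for illustration only, not real evaluation results.
example_results = {
    "hendrycksTest-abstract_algebra": {"acc": 0.25},
    "hendrycksTest-anatomy": {"acc": 0.5},
    "hellaswag": {"acc_norm": 0.75},  # ignored: not an mmlu subtask
}
print(average_mmlu_acc(example_results))
```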