# Harness Evaluation

Harness evaluation allows users to easily get accuracy on various datasets. Here we have enabled harness evaluation with IPEX-LLM under Open LLM Leaderboard settings. Before running, make sure to have ipex-llm installed.

## Install Harness

```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout b281b09
pip install -e .
```

## Run

Run `python run_llb.py`. `run_llb.py` combines some arguments in `main.py` to make evaluation easier. The mapping of arguments is defined as a dict in `llb.py`.
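For illustration only, such a mapping could look roughly like the sketch below; the variable name `task_map`, the exact keys, and the task identifiers are assumptions here rather than the real definitions in `llb.py` (the few-shot counts shown follow the usual Open LLM Leaderboard settings).

```python
# Hypothetical sketch of a leaderboard-style task mapping onto
# lm-evaluation-harness arguments; the actual dict in the repo may use
# different names, keys, and settings.
task_map = {
    "hellaswag":  {"tasks": "hellaswag",        "num_fewshot": 10},
    "arc":        {"tasks": "arc_challenge",    "num_fewshot": 25},
    "truthfulqa": {"tasks": "truthfulqa_mc",    "num_fewshot": 0},
    "mmlu":       {"tasks": "hendrycksTest-*",  "num_fewshot": 5},
}
```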

### Evaluation on CPU

```bash
python run_llb.py --model ipex-llm --pretrained /path/to/model --precision nf3 sym_int4 nf4 --device cpu --tasks hellaswag arc mmlu truthfulqa --batch 1 --no_cache
```

### Evaluation on Intel GPU

```bash
python run_llb.py --model ipex-llm --pretrained /path/to/model --precision nf3 sym_int4 nf4 --device xpu --tasks hellaswag arc mmlu truthfulqa --batch 1 --no_cache
```

### Evaluation using multiple Intel GPUs

```bash
python run_multi_llb.py --model ipex-llm --pretrained /path/to/model --precision nf3 sym_int4 nf4 --device xpu:0,2,3 --tasks hellaswag arc mmlu truthfulqa --batch 1 --no_cache
```

Taking the example above, the script will fork 3 processes, one for each xpu (`xpu:0`, `xpu:2` and `xpu:3`), to execute the tasks.
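The following is a minimal sketch of this fork-per-device pattern using Python's `multiprocessing`; it is not the actual `run_multi_llb.py` code, and the helper `run_on_device` and the round-robin task split are illustrative assumptions.

```python
# Minimal sketch (not the real run_multi_llb.py logic): fork one worker
# process per requested xpu and split the task list among them.
from multiprocessing import Process

def run_on_device(device, tasks):
    # Hypothetical helper: run the single-device evaluation for the
    # given subset of tasks on `device`.
    print(f"evaluating {tasks} on {device}")

if __name__ == "__main__":
    devices = ["xpu:0", "xpu:2", "xpu:3"]                 # parsed from --device
    tasks = ["hellaswag", "arc", "mmlu", "truthfulqa"]    # parsed from --tasks
    # Distribute tasks round-robin across the available devices.
    groups = {d: tasks[i::len(devices)] for i, d in enumerate(devices)}
    procs = [Process(target=run_on_device, args=(d, g)) for d, g in groups.items()]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```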

## Results

We follow the Open LLM Leaderboard to record our metrics: `acc_norm` for `hellaswag` and `arc_challenge`, `mc2` for `truthful_qa`, and `acc` for `mmlu`. For `mmlu`, there are 57 subtasks, which means users may need to average them manually to get the final result.
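As an illustration of that manual averaging, the sketch below assumes the harness wrote a results JSON whose `results` field maps each task name to its metrics; the file path and the `hendrycksTest-` prefix used to select the mmlu subtasks are assumptions based on the harness version pinned above and may differ in your output.

```python
# Minimal sketch, assuming a harness-style results JSON where "results"
# maps each task name to its metrics (exact layout may differ).
import json

with open("results.json") as f:  # hypothetical path to your results file
    results = json.load(f)["results"]

# Average the `acc` metric over the mmlu (hendrycksTest-*) subtasks.
mmlu_accs = [m["acc"] for task, m in results.items()
             if task.startswith("hendrycksTest-")]
print(f"mmlu average acc over {len(mmlu_accs)} subtasks: "
      f"{sum(mmlu_accs) / len(mmlu_accs):.4f}")
```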

## Summarize the results

```bash
python make_table.py <input_dir>
```