
Harness Evaluation

Harness evaluation allows users to easily measure accuracy on various datasets. Here we have enabled harness evaluation with BigDL-LLM under the Open LLM Leaderboard settings. Before running, make sure bigdl-llm is installed.

Install Harness

git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout e81d3cc
pip install -e .
git apply ../bigdl-llm.patch
cd ..

Run

Run python llb.py. llb.py combines some of the arguments of main.py to make evaluations easier. The mapping of arguments is defined as a dict in llb.py.
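To illustrate what such an argument mapping could look like, here is a minimal sketch. It is not the actual dict from llb.py: the names TASK_MAP and expand_runs are hypothetical, and the harness task names and few-shot counts are assumptions based on the Open LLM Leaderboard settings and the task naming used by older lm-evaluation-harness releases.

```python
# Hypothetical mapping from the short task names accepted on the command line
# to harness task names and few-shot counts. The real dict in llb.py may differ.
TASK_MAP = {
    "hellaswag": {"task": "hellaswag", "num_fewshot": 10},
    "arc": {"task": "arc_challenge", "num_fewshot": 25},
    "mmlu": {"task": "hendrycksTest-*", "num_fewshot": 5},
    "truthfulqa": {"task": "truthfulqa_mc", "num_fewshot": 0},
}


def expand_runs(tasks, precisions):
    """Return one (precision, harness_task, num_fewshot) tuple per evaluation run."""
    runs = []
    for precision in precisions:
        for name in tasks:
            entry = TASK_MAP[name]  # raises KeyError for an unknown short name
            runs.append((precision, entry["task"], entry["num_fewshot"]))
    return runs
```

With a mapping of this kind, a single invocation with several precisions and tasks expands into one harness run per (precision, task) pair, which is why the commands below can list multiple values for --precision and --tasks.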

Evaluation on CPU

python llb.py --model bigdl-llm --pretrained /path/to/model --precision nf3 int4 nf4 --device cpu --tasks hellaswag arc mmlu truthfulqa --output_dir results/output

Evaluation on Intel GPU

python llb.py --model bigdl-llm --pretrained /path/to/model --precision nf3 int4 nf4 --device xpu --tasks hellaswag arc mmlu truthfulqa --output_dir results/output

Results

We follow the Open LLM Leaderboard in recording our metrics: acc_norm for hellaswag and arc_challenge, mc2 for truthful_qa, and acc for mmlu. mmlu consists of 57 subtasks, so users may need to average the per-subtask results manually to get the final score.
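The manual averaging for mmlu can be sketched as follows. This is only an illustration under stated assumptions: it assumes the results are available as a dict mapping each subtask name to a dict of metrics containing an "acc" key, that mmlu subtasks share the prefix "hendrycksTest-" (the naming used by older harness releases), and that a simple unweighted mean over subtasks is what is wanted. The function name average_mmlu_acc is hypothetical.

```python
from statistics import mean


def average_mmlu_acc(results, prefix="hendrycksTest-"):
    """Return the unweighted mean of "acc" over all subtasks whose name starts with prefix.

    `results` is assumed to map task names to metric dicts, for example
    {"hendrycksTest-anatomy": {"acc": 0.41}, ...}.
    """
    accs = [
        metrics["acc"]
        for task, metrics in results.items()
        if task.startswith(prefix)
    ]
    if not accs:
        raise ValueError(f"no tasks found with prefix {prefix!r}")
    return mean(accs)
```

If the harness output is stored as a JSON file, the dict can be loaded with json.load and the per-task section passed to this function; the exact file layout depends on the harness version and should be checked against the actual output.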