* modify output_path as a directory * schedule nightly at 21 on Friday * add tasks and models for nightly * add accuracy regression * comment out if to test * mixed fp4 * for test * add missing delimiter * remove comma * fixed golden results * add mixed 4 golden result * add more options * add mistral results * get golden result of stable lm * move nightly scripts and results to test folder * add license * add fp8 stable lm golden * run on all available devices * trigger only when ready for review * fix new line * update golden * add mistral |
||
|---|---|---|
| .. | ||
| bigdl_llm.py | ||
| harness_to_leaderboard.py | ||
| README.md | ||
| run_llb.py | ||
Harness Evalution
Harness evalution allows users to eaisly get accuracy on various datasets. Here we have enabled harness evalution with BigDL-LLM under Open LLM Leaderboard settings. Before running, make sure to have bigdl-llm installed.
Install Harness
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout e81d3cc
pip install -e .
Run
run python run_llb.py. run_llb.py combines some arguments in main.py to make evalutions easier. The mapping of arguments is defined as a dict in llb.py.
Evaluation on CPU
python run_llb.py --model bigdl-llm --pretrained /path/to/model --precision nf3 sym_int4 nf4 --device cpu --tasks hellaswag arc mmlu truthfulqa --batch 1 --no_cache
Evaluation on Intel GPU
python run_llb.py --model bigdl-llm --pretrained /path/to/model --precision nf3 sym_int4 nf4 --device xpu --tasks hellaswag arc mmlu truthfulqa --batch 1 --no_cache
Results
We follow Open LLM Leaderboard to record our metrics, acc_norm for hellaswag and arc_challenge, mc2 for truthful_qa and acc for mmlu. For mmlu, there are 57 subtasks which means users may need to average them manually to get final result.