Harness Evaluation
Harness evaluation allows users to easily measure accuracy on various datasets. Here we have enabled harness evaluation with BigDL-LLM under Open LLM Leaderboard settings. Before running, make sure you have bigdl-llm installed.
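For example, a typical CPU installation looks like the line below; this is only one common form, so check the BigDL-LLM installation guide for the exact command matching your device and precision.
pip install --pre --upgrade bigdl-llm[all]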
Install Harness
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout e81d3cc
pip install -e .
git apply ../bigdl-llm.patch
cd ..
Run
Run python llb.py. llb.py combines some arguments in main.py to make evaluations easier. The mapping of arguments is defined as a dict in llb.py, roughly as sketched below.
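As a rough illustration only (the actual keys, values, and variable names are defined in llb.py and may differ), the task mapping could look something like this:
# Hypothetical sketch of the argument-mapping dict in llb.py;
# it translates the short task names accepted by llb.py into the
# task names understood by lm-evaluation-harness's main.py.
task_to_harness_tasks = {
    "hellaswag": ["hellaswag"],
    "arc": ["arc_challenge"],
    "truthfulqa": ["truthfulqa_mc"],
    "mmlu": ["hendrycksTest-*"],  # expands to the 57 MMLU subtasks
}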
Evaluation on CPU
python llb.py --model bigdl-llm --pretrained /path/to/model --precision nf3 int4 nf4 --device cpu --tasks hellaswag arc mmlu truthfulqa --output_dir results/output
Evaluation on Intel GPU
python llb.py --model bigdl-llm --pretrained /path/to/model --precision nf3 int4 nf4 --device xpu --tasks hellaswag arc mmlu truthfulqa --output_dir results/output
Results
We follow the Open LLM Leaderboard to record our metrics: acc_norm for hellaswag and arc_challenge, mc2 for truthful_qa, and acc for mmlu. For mmlu, there are 57 subtasks, which means users may need to average them manually to get the final result.
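As a minimal sketch of that manual averaging, assuming the harness wrote the MMLU subtask scores into a results JSON under the output directory (the file name, subtask prefix, and key layout are assumptions here and may differ from your actual output):
import json

# Hypothetical post-processing: average acc over the 57 MMLU subtasks.
# Assumes a harness-style results file of the form
# {"results": {"hendrycksTest-abstract_algebra": {"acc": ...}, ...}};
# adjust the path and key names to match your actual output.
with open("results/output/mmlu.json") as f:
    results = json.load(f)["results"]

accs = [v["acc"] for k, v in results.items() if k.startswith("hendrycksTest-")]
print(f"MMLU average acc over {len(accs)} subtasks: {sum(accs) / len(accs):.4f}")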