# Harness Evaluation
Harness evaluation allows users to easily obtain accuracy scores on various datasets. Here we have enabled harness evaluation with IPEX-LLM under Open LLM Leaderboard settings. Before running, make sure you have `ipex-llm` installed.
## Install Harness
```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout b281b09
pip install -e .
```
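As a quick sanity check that the install worked, you can try enumerating the registered tasks (`ALL_TASKS` is the task registry in harness code of this vintage; the attribute may differ in other checkouts):

```python
# Sanity check: import the harness and list a few registered tasks.
# `ALL_TASKS` exists in v0.3-era lm-evaluation-harness; adjust if your
# checkout exposes the task registry differently.
import lm_eval.tasks

print(f"{len(lm_eval.tasks.ALL_TASKS)} tasks registered")
print(lm_eval.tasks.ALL_TASKS[:5])
```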
## Run
Run `python run_llb.py`. `run_llb.py` combines some arguments from `main.py` to make evaluations easier. The mapping of arguments is defined as a dict in `llb.py`.
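For orientation, the mapping is conceptually a small dict translating the leaderboard-style names used here into the harness's own task names. The sketch below is illustrative only; the actual keys and values live in `llb.py` and may differ:

```python
# Illustrative sketch only -- not the actual contents of llb.py.
# Leaderboard-style task names map to lm-evaluation-harness task names.
task_to_harness = {
    "hellaswag": "hellaswag",
    "arc": "arc_challenge",
    "truthfulqa": "truthfulqa_mc",
    "mmlu": "hendrycksTest-*",  # expands to the 57 MMLU subtasks
}
```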
### Evaluation on CPU
```bash
export IPEX_LLM_LAST_LM_HEAD=0
python run_llb.py --model ipex-llm --pretrained /path/to/model --precision nf3 sym_int4 nf4 --device cpu --tasks hellaswag arc mmlu truthfulqa --batch 1 --no_cache
```
### Evaluation on Intel GPU
```bash
export IPEX_LLM_LAST_LM_HEAD=0
python run_llb.py --model ipex-llm --pretrained /path/to/model --precision nf3 sym_int4 nf4 --device xpu --tasks hellaswag arc mmlu truthfulqa --batch 1 --no_cache
```
### Evaluation using multiple Intel GPUs
```bash
export IPEX_LLM_LAST_LM_HEAD=0
python run_multi_llb.py --model ipex-llm --pretrained /path/to/model --precision nf3 sym_int4 nf4 --device xpu:0,2,3 --tasks hellaswag arc mmlu truthfulqa --batch 1 --no_cache
```
In the example above, the script forks three processes, one per XPU (devices 0, 2, and 3), to execute the tasks.
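Conceptually, the per-device forking follows the pattern sketched below: one worker process per selected XPU, each executing the task list. This is an illustration of the pattern, not the actual `run_multi_llb.py` implementation:

```python
# Illustrative sketch of the fork-per-device pattern (not the actual
# run_multi_llb.py code). One worker process per selected XPU.
import multiprocessing as mp

def worker(device, tasks):
    # The real script would run the harness evaluation on this device.
    print(f"evaluating {tasks} on {device}")

if __name__ == "__main__":
    devices = ["xpu:0", "xpu:2", "xpu:3"]
    tasks = ["hellaswag", "arc", "mmlu", "truthfulqa"]
    procs = [mp.Process(target=worker, args=(d, tasks)) for d in devices]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```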
## Results
We follow the Open LLM Leaderboard in recording our metrics: `acc_norm` for `hellaswag` and `arc_challenge`, `mc2` for `truthful_qa`, and `acc` for `mmlu`. Note that `mmlu` consists of 57 subtasks, so users may need to average their scores manually to get the final result.
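A minimal sketch of that manual averaging, assuming a harness-style results JSON in which the MMLU subtasks use the v0.3-era `hendrycksTest-*` naming and report an `acc` field (adjust both to match your actual output files):

```python
# Average the 57 MMLU subtask accuracies from a harness results JSON.
# Assumes v0.3-era subtask names ("hendrycksTest-*") and an "acc" field;
# adjust to match your actual output format.
import json

with open("results.json") as f:
    results = json.load(f)["results"]

accs = [m["acc"] for task, m in results.items() if task.startswith("hendrycksTest-")]
print(f"MMLU average over {len(accs)} subtasks: {sum(accs) / len(accs):.4f}")
```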
### Summarize the results
```bash
python make_table.py <input_dir>
```
## Known Issues
### 1. Detected model is a low-bit (sym_int4) model, Please use load_low_bit to load this model
Harness evaluation is meant for unquantized models; the model is converted to the target precision only by passing the `precision` argument. If you load an already-quantized model, you may encounter the following error:
```
********************************Usage Error********************************
Detected model is a low-bit(sym int4) model, Please use load_low_bit to load this model.
```
However, you can replace the following code in this line:
```python
AutoModelForCausalLM.from_pretrained = partial(AutoModelForCausalLM.from_pretrained, **self.bigdl_llm_kwargs)
```
with the following code to load low-bit models:
```python
class ModifiedAutoModelForCausalLM(AutoModelForCausalLM):
    @classmethod
    def load_low_bit(cls, *args, **kwargs):
        # Drop the from_pretrained-style kwargs before delegating.
        for k in ['load_in_low_bit', 'device_map', 'max_memory',
                  'load_in_8bit', 'load_in_4bit']:
            kwargs.pop(k, None)
        return super().load_low_bit(*args, **kwargs)

AutoModelForCausalLM.from_pretrained = partial(ModifiedAutoModelForCausalLM.load_low_bit, **self.bigdl_llm_kwargs)
```
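The `pop` calls strip the keyword arguments that are specific to `from_pretrained` (presumably not accepted by `load_low_bit`), and `partial` re-binds the remaining `bigdl_llm_kwargs` just as the original line does.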
### 2. please pass the argument trust_remote_code=True to allow custom code to be run
`lm-evaluation-harness` doesn't pass `trust_remote_code=True` to `datasets`, which may cause errors similar to the following:
```
RuntimeError: Job config of task=winogrande, precision=sym_int4 failed.
Error Message: The repository for winogrande contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/winogrande.
Please pass the argument `trust_remote_code=True` to allow custom code to be run.
```
As a workaround, you have to manually add `datasets.config.HF_DATASETS_TRUST_REMOTE_CODE = True` in your installed `datasets` package directory (from PyPI).
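Alternatively, a sketch of flipping the same switch at runtime, before any task loads, instead of editing the installed package (recent `datasets` releases expose this module-level flag, mirroring the `HF_DATASETS_TRUST_REMOTE_CODE` environment variable; older versions may not define it):

```python
# Set the trust flag at runtime before any dataset is loaded, instead of
# editing the installed package. Recent `datasets` releases define this
# attribute (it mirrors the HF_DATASETS_TRUST_REMOTE_CODE env var);
# older versions may not.
import datasets.config

datasets.config.HF_DATASETS_TRUST_REMOTE_CODE = True
```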