Harness Evaluation

Harness evaluation allows users to easily get accuracy scores on various datasets. Here we have enabled harness evaluation with IPEX-LLM under Open LLM Leaderboard settings. Before running, make sure ipex-llm is installed.

Install Harness

git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout b281b09
pip install -e .

Run

Run python run_llb.py. run_llb.py combines some arguments from main.py to make evaluation easier. The mapping of arguments is defined as a dict in llb.py.
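For illustration, such a mapping can look like the following sketch (the names and entries here are assumptions for illustration, not the actual contents of llb.py):

# Illustrative sketch of a leaderboard-name -> harness-task mapping.
# The actual dict lives in llb.py; these entries are assumptions.
task_map = {
    "hellaswag": "hellaswag",
    "arc": "arc_challenge",        # leaderboard "arc" maps to the challenge split
    "truthfulqa": "truthfulqa_mc",
    "mmlu": "hendrycksTest-*",     # expands to the 57 MMLU subtasks
}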

Evaluation on CPU

export IPEX_LLM_LAST_LM_HEAD=0

python run_llb.py --model ipex-llm --pretrained /path/to/model --precision nf3 sym_int4 nf4 --device cpu --tasks hellaswag arc mmlu truthfulqa --batch 1 --no_cache

Evaluation on Intel GPU

export IPEX_LLM_LAST_LM_HEAD=0

python run_llb.py --model ipex-llm --pretrained /path/to/model --precision nf3 sym_int4 nf4 --device xpu --tasks hellaswag arc mmlu truthfulqa --batch 1 --no_cache

Evaluation using multiple Intel GPUs

export IPEX_LLM_LAST_LM_HEAD=0

python run_multi_llb.py --model ipex-llm --pretrained /path/to/model --precision nf3 sym_int4 nf4 --device xpu:0,2,3 --tasks hellaswag arc mmlu truthfulqa --batch 1 --no_cache

Taking the example above, the script will fork three processes, one for each XPU (xpu:0, xpu:2 and xpu:3), to execute the tasks.
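A minimal sketch of this one-process-per-device pattern, assuming a hypothetical run_tasks helper that wraps the same single-device evaluation path as run_llb.py:

# One process per XPU; run_tasks is a hypothetical stand-in for the
# single-device evaluation entry point.
from multiprocessing import Process

def run_tasks(device, pretrained, precisions, tasks):
    ...  # evaluate `tasks` at each precision on `device`

devices = ["xpu:0", "xpu:2", "xpu:3"]
processes = [
    Process(target=run_tasks,
            args=(d, "/path/to/model",
                  ["nf3", "sym_int4", "nf4"],
                  ["hellaswag", "arc", "mmlu", "truthfulqa"]))
    for d in devices
]
for p in processes:
    p.start()
for p in processes:
    p.join()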

Results

We follow the Open LLM Leaderboard in recording our metrics: acc_norm for hellaswag and arc_challenge, mc2 for truthful_qa, and acc for mmlu. For mmlu there are 57 subtasks, which means users may need to average them manually to get the final result.
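For example, a small script in the spirit of the following sketch can do the averaging (it assumes harness-style JSON results where each MMLU subtask reports an acc field; the paths and task prefix are assumptions, so adapt them to your output layout):

# Average the 57 MMLU subtask accuracies into a single score.
# Assumes harness-style result JSON under results/; paths are assumptions.
import json
from glob import glob

accs = []
for path in glob("results/**/*.json", recursive=True):
    with open(path) as f:
        results = json.load(f).get("results", {})
    accs.extend(m["acc"] for t, m in results.items()
                if "hendrycksTest" in t and "acc" in m)

if accs:
    print(f"mmlu average acc over {len(accs)} subtasks: {sum(accs) / len(accs):.4f}")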

Summarize the results

python make_table.py <input_dir>

Known Issues

1. Detected model is a low-bit (sym int4) model, please use load_low_bit to load this model

Harness evaluation is meant for unquantized models: a model is converted to the target precision only by passing the precision argument. If you load an already-quantized (low-bit) model, you may encounter the following error:

********************************Usage Error********************************
Detected model is a low-bit(sym int4) model, Please use load_low_bit to load this model.

However, you can replace the following code at this line

AutoModelForCausalLM.from_pretrained = partial(AutoModelForCausalLM.from_pretrained, **self.bigdl_llm_kwargs)

with the following code to load low-bit models:

class ModifiedAutoModelForCausalLM(AutoModelForCausalLM):
    @classmethod
    def load_low_bit(cls, *args, **kwargs):
        # Drop from_pretrained-only arguments that load_low_bit does not accept.
        for k in ['load_in_low_bit', 'device_map', 'max_memory', 'load_in_8bit', 'load_in_4bit']:
            kwargs.pop(k, None)
        return super().load_low_bit(*args, **kwargs)

AutoModelForCausalLM.from_pretrained = partial(ModifiedAutoModelForCausalLM.load_low_bit, **self.bigdl_llm_kwargs)
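With this replacement, --pretrained should point to a model already saved in low-bit format, and it will be loaded via load_low_bit instead of being converted from full precision on the fly.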

2. Please pass the argument trust_remote_code=True to allow custom code to be run

lm-evaluation-harness doesn't pass trust_remote_code=True to datasets. This may cause errors similar to the following:

RuntimeError: Job config of task=winogrande, precision=sym_int4 failed.
Error Message: The repository for winogrande contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/winogrande.
Please pass the argument `trust_remote_code=True` to allow custom code to be run.

As a workaround, you can manually set datasets.config.HF_DATASETS_TRUST_REMOTE_CODE = True in the datasets package installed in your environment.
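Alternatively, the same flag can be set at runtime before any dataset is loaded, instead of editing the installed package (this assumes your installed datasets version exposes this config attribute):

# Set the trust-remote-code flag in-process instead of patching the package.
# Assumes the installed datasets version exposes this config attribute.
import datasets
datasets.config.HF_DATASETS_TRUST_REMOTE_CODE = True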