# Harness Evaluation
[Harness evaluation](https://github.com/EleutherAI/lm-evaluation-harness) allows users to easily evaluate accuracy on various datasets. Here we have enabled harness evaluation with IPEX-LLM under
[Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) settings.
Before running, make sure to have [ipex-llm](../../../README.md) installed.
## Install Harness
```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout b281b09
pip install -e .
```
## Run
Run `python run_llb.py`. `run_llb.py` combines several arguments from `main.py` to make evaluation easier. The mapping of arguments is defined as a dict in [`llb.py`](llb.py); a hypothetical sketch of the idea follows.
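The snippet below is a hypothetical illustration only (the real names and entries live in [`llb.py`](llb.py)); it shows how short leaderboard-style names passed via `--tasks` could expand to the task names that `lm-evaluation-harness` expects.
```python
# Hypothetical sketch of a --tasks mapping; see llb.py for the actual dict.
task_map = {
    "hellaswag": ["hellaswag"],
    "arc": ["arc_challenge"],
    "truthfulqa": ["truthful_qa_mc"],
    "mmlu": ["hendrycksTest-abstract_algebra", "..."],  # the 57 MMLU subtasks
}
```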
### Evaluation on CPU
```bash
export IPEX_LLM_LAST_LM_HEAD=0
python run_llb.py --model ipex-llm --pretrained /path/to/model --precision nf3 sym_int4 nf4 --device cpu --tasks hellaswag arc mmlu truthfulqa --batch 1 --no_cache
```
### Evaluation on Intel GPU
```bash
export IPEX_LLM_LAST_LM_HEAD=0
python run_llb.py --model ipex-llm --pretrained /path/to/model --precision nf3 sym_int4 nf4 --device xpu --tasks hellaswag arc mmlu truthfulqa --batch 1 --no_cache
```
### Evaluation using multiple Intel GPU
```bash
export IPEX_LLM_LAST_LM_HEAD=0
python run_multi_llb.py --model ipex-llm --pretrained /path/to/model --precision nf3 sym_int4 nf4 --device xpu:0,2,3 --tasks hellaswag arc mmlu truthfulqa --batch 1 --no_cache
```
Taking the example above, the script will fork three processes, one per XPU device, to execute the tasks, as sketched below.
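The actual parallelization lives in `run_multi_llb.py`; the snippet below is only a minimal sketch of the fork-per-device idea (the `evaluate_on_device` helper and the round-robin split are hypothetical, and the real partitioning of work may differ), showing one worker process per selected XPU index.
```python
from multiprocessing import Process

def evaluate_on_device(device, tasks):
    # Hypothetical worker: the real script builds an lm-eval evaluator here.
    print(f"evaluating {tasks} on {device}")

if __name__ == "__main__":
    devices = ["xpu:0", "xpu:2", "xpu:3"]               # from --device xpu:0,2,3
    tasks = ["hellaswag", "arc", "mmlu", "truthfulqa"]
    # Round-robin split of the work over the selected devices.
    shards = {d: tasks[i::len(devices)] for i, d in enumerate(devices)}
    procs = [Process(target=evaluate_on_device, args=(d, shards[d])) for d in devices]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```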
## Results
We follow the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) to record our metrics: `acc_norm` for `hellaswag` and `arc_challenge`, `mc2` for `truthful_qa`, and `acc` for `mmlu`. `mmlu` consists of 57 subtasks, so users may need to average them manually to get the final result; a sketch of how to do this follows.
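The script below is a minimal sketch for averaging the MMLU subtask accuracies by hand. It assumes a harness-style result JSON whose `results` dict maps subtask names (prefixed with `hendrycksTest`) to metric dicts containing `acc`; the file path is hypothetical, so adjust the path and key names to match your actual output.
```python
import json

# Hypothetical result path; point this at the JSON produced by your run.
with open("results.json") as f:
    results = json.load(f)["results"]

# Collect the per-subtask accuracies and average them.
mmlu_accs = [m["acc"] for task, m in results.items() if task.startswith("hendrycksTest")]
print(f"MMLU average acc over {len(mmlu_accs)} subtasks: {sum(mmlu_accs) / len(mmlu_accs):.4f}")
```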
## Summarize the results
```bash
python make_table.py <input_dir>
```
## Known Issues
### 1. Detected model is a low-bit(sym int4) model, Please use `load_low_bit` to load this model
Harness evaluation is intended for unquantized models; the `--precision` argument converts the model to the target precision during evaluation. If you load an already-quantized (low-bit) model instead, you may encounter the following error:
```bash
********************************Usage Error********************************
Detected model is a low-bit(sym int4) model, Please use load_low_bit to load this model.
```
To work around this, you can replace the following code at [this line](https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/dev/benchmark/harness/ipexllm.py#L52)
```python
AutoModelForCausalLM.from_pretrained = partial(AutoModelForCausalLM.from_pretrained,**self.bigdl_llm_kwargs)
```
with the following code to load low-bit models:
```python
class ModifiedAutoModelForCausalLM(AutoModelForCausalLM):
    @classmethod
    def load_low_bit(cls, *args, **kwargs):
        # Drop the quantization arguments that load_low_bit does not accept
        for k in ['load_in_low_bit', 'device_map', 'max_memory', 'load_in_8bit', 'load_in_4bit']:
            kwargs.pop(k, None)
        return super().load_low_bit(*args, **kwargs)

AutoModelForCausalLM.from_pretrained = partial(ModifiedAutoModelForCausalLM.load_low_bit, **self.bigdl_llm_kwargs)
```
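The popped keys are the quantization options that `from_pretrained` normally consumes when converting a full-precision model; a model saved in low-bit format is already quantized, so they are simply dropped before calling `load_low_bit`.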
### 2. Please pass the argument `trust_remote_code=True` to allow custom code to be run
`lm-evaluation-harness` doesn't pass `trust_remote_code=True` to `datasets`. This may cause errors similar to the following:
```
RuntimeError: Job config of task=winogrande, precision=sym_int4 failed.
Error Message: The repository for winogrande contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/winogrande.
please pass the argument trust_remote_code=True to allow custom code to be run.
```
Please refer to:
- [trust_remote_code error in simple evaluate for hellaswag · Issue #2222 · EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/issues/2222)
- [Setting trust_remote_code to True for HuggingFace datasets compatibility by veekaybee · Pull Request #1467 · EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/pull/1467#issuecomment-1964282427)
- [Security features from the Hugging Face datasets library · Issue #1135 · EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/issues/1135#issuecomment-1961928695)

You have to manually set `datasets.config.HF_DATASETS_TRUST_REMOTE_CODE = True` in the `datasets` package installed in your Python environment.
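Depending on your installed `datasets` version, setting the same flag programmatically before any dataset is loaded (for example near the top of `run_llb.py`) may work as an alternative to editing the installed package:
```python
import datasets

# Allow datasets that ship custom loading code (e.g. winogrande) to run it.
datasets.config.HF_DATASETS_TRUST_REMOTE_CODE = True
```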