# LongBench Benchmark Test
LongBench is the first benchmark for bilingual, multitask, and comprehensive assessment of long context understanding capabilities of large language models. This benchmark implementation is adapted from [THUDM/LongBench](https://github.com/THUDM/LongBench) and [SnapKV/experiments/LongBench](https://github.com/FasterDecoding/SnapKV/tree/main/experiments/LongBench).
## Environment Preparation
Before running, make sure to have [ipex-llm](../../../../../README.md) installed.
```bash
pip install omegaconf
pip install datasets
pip install jieba
pip install fuzzywuzzy
pip install rouge
```
### Load Data
You can download and load the LongBench data through the Hugging Face datasets ([🤗 HF Repo](https://huggingface.co/datasets/THUDM/LongBench)):
```python
from datasets import load_dataset

datasets = ["narrativeqa", "qasper", "multifieldqa_en", "multifieldqa_zh", "hotpotqa", "2wikimqa", "musique", \
            "dureader", "gov_report", "qmsum", "multi_news", "vcsum", "trec", "triviaqa", "samsum", "lsht", \
            "passage_count", "passage_retrieval_en", "passage_retrieval_zh", "lcc", "repobench-p"]

for dataset in datasets:
    data = load_dataset('THUDM/LongBench', dataset, split='test')          # standard version
    data = load_dataset('THUDM/LongBench', f"{dataset}_e", split='test')   # 'e' version of the dataset
```
## Config
### `config.yaml`
The config YAML file has the following format:
```yaml
# The names of the models you want to test
model_name:
  # - "mistral-7B-instruct-v0.2"
  - "llama2-7b-chat-4k"
  # - "chatglm4-9b"
  # - "qwen2-7b-instruct"

# Whether to test the full KV cache
full_kv: True
# Whether to apply model optimization
optimize_model: True
# dtype of the model
dtype: 'fp16'
# Low-bit format of the model
low_bit: 'sym_int4'
# Whether to use the 'e' version of the datasets
e: False

# The compress-kv configs you want to test
compress_kv:
  - "ablation_c512_w32_k7_maxpool"
  - "ablation_c1024_w32_k7_maxpool"

# The datasets you want to test
datasets:
  - "multi_news"
  - "qasper"
  - "hotpotqa"
  - "trec"
  - "passage_count"
  - "lcc"
  # - "multifieldqa_zh"
  # - "dureader"
  # - "vcsum"
  # - "lsht"
  # - "passage_retrieval_zh"
```
### The `config` dir
Several JSON files are saved in the `config` dir. They can be divided into three kinds: about models, about datasets, and about compress-kv.
#### About models
- `model2path.json`: This file saves the paths to the models.

- `model2maxlen.json`: This file saves the maximum prompt length for each model.
#### About datasets
- `dataset2maxlen.json`: The maximum output length of the model for each dataset.

- `dataset2prompt.json`: The prompt format for each dataset.
#### About compress-kv
The remaining JSON files are compress-kv test configurations.
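As an illustration, the sketch below shows how the model and dataset maps described above might be read together. The relative paths, the helper function, and the example keys are assumptions rather than the shipped contents of the `config` dir, and the exact way `pred.py` consumes these files may differ.

```python
# A minimal sketch of reading the config maps described above.
# The relative paths and the example keys are assumptions, not the
# actual shipped contents of the `config` dir.
import json

def load_json(path):
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

model2path = load_json("config/model2path.json")          # model name -> model path
model2maxlen = load_json("config/model2maxlen.json")      # model name -> max prompt length
dataset2maxlen = load_json("config/dataset2maxlen.json")  # dataset name -> max output length
dataset2prompt = load_json("config/dataset2prompt.json")  # dataset name -> prompt format

model_name = "llama2-7b-chat-4k"  # one of the names listed in config.yaml
print(model2path[model_name], model2maxlen[model_name])
print(dataset2prompt["multi_news"], dataset2maxlen["multi_news"])
```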
## Run
There are two Python files for users to call:

1. Configure `config.yaml` and run `pred.py` to obtain the model outputs under the `pred/` folder corresponding to the model name.

2. Run the evaluation code `eval.py` to get the evaluation results on all datasets in `result.json` (a quick way to inspect it is sketched below).
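The snippet assumes `result.json` is a flat mapping from dataset name to score; that layout is an assumption, and the actual structure written by `eval.py` may differ.

```python
# A hypothetical peek at the evaluation output; the flat {dataset: score}
# layout of result.json is an assumption.
import json

with open("result.json", "r", encoding="utf-8") as f:
    results = json.load(f)

for dataset, score in sorted(results.items()):
    print(f"{dataset}: {score}")
```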
> [!NOTE]
>
> To test the models and get the scores in one go, please run `test_and_eval.sh`.
## Citation
```bibtex
@article{bai2023longbench,
  title={LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding},
  author={Bai, Yushi and Lv, Xin and Zhang, Jiajie and Lyu, Hongchang and Tang, Jiankai and Huang, Zhidian and Du, Zhengxiao and Liu, Xiao and Zeng, Aohan and Hou, Lei and Dong, Yuxiao and Tang, Jie and Li, Juanzi},
  journal={arXiv preprint arXiv:2308.14508},
  year={2023}
}
```