# LongBench Benchmark Test
LongBench is the first benchmark for bilingual, multitask, and comprehensive assessment of long context understanding capabilities of large language models. This benchmark implementation is adapted from THUDM/LongBench and SnapKV/experiments/LongBench.
## Environment Preparation

Before running, make sure you have ipex-llm installed. Then install the following dependencies:

```bash
pip install omegaconf
pip install datasets
pip install jieba
pip install fuzzywuzzy
pip install rouge
```
## Load Data

You can download and load the LongBench data through the Hugging Face `datasets` library ([🤗 HF Repo](https://huggingface.co/datasets/THUDM/LongBench)):

```python
from datasets import load_dataset

datasets = ["narrativeqa", "qasper", "multifieldqa_en", "multifieldqa_zh", "hotpotqa", "2wikimqa", "musique", \
            "dureader", "gov_report", "qmsum", "multi_news", "vcsum", "trec", "triviaqa", "samsum", "lsht", \
            "passage_count", "passage_retrieval_en", "passage_retrieval_zh", "lcc", "repobench-p"]
for dataset in datasets:
    data = load_dataset('THUDM/LongBench', dataset, split='test')             # standard version
    data_e = load_dataset('THUDM/LongBench', f"{dataset}_e", split='test')    # "_e" (LongBench-E) version
```
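As a quick sanity check (not part of the benchmark scripts), you can peek at one loaded split. The `input` and `context` field names below are assumed from the upstream LongBench dataset:

```python
from datasets import load_dataset

# Minimal sketch: inspect a single LongBench sample (field names assumed from upstream LongBench).
sample = load_dataset('THUDM/LongBench', 'multi_news', split='test')[0]
print(sample.keys())             # all available fields of the record
print(sample["input"][:200])     # the question / instruction part (assumed field)
print(len(sample["context"]))    # character length of the long context (assumed field)
```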
## Config

### `config.yaml`

The config YAML file has the following format:
```yaml
# The names of the models you want to test
model_name:
  # - "mistral-7B-instruct-v0.2"
  - "llama2-7b-chat-4k"
  # - "chatglm4-9b"
  # - "qwen2-7b-instruct"
# Whether to test the full-KV (no compression) baseline
full_kv: True
# Whether to apply model optimization
optimize_model: True
# dtype of the model
dtype: 'fp16'
# Low-bit format of the model
low_bit: 'sym_int4'
# Whether to use the 'e' version of the datasets
e: False
# The compress-kv configs you want to test
compress_kv:
  - "ablation_c512_w32_k7_maxpool"
  - "ablation_c1024_w32_k7_maxpool"
# The datasets you want to test
datasets:
  - "multi_news"
  - "qasper"
  - "hotpotqa"
  - "trec"
  - "passage_count"
  - "lcc"
  # - "multifieldqa_zh"
  # - "dureader"
  # - "vcsum"
  # - "lsht"
  # - "passage_retrieval_zh"
```
### The `config` dir

Several JSON files are saved in the `config` dir. They fall into three kinds: about models, about datasets, and about compress-kv.

#### About models

- `model2path.json`: saves the path to each model.
- `model2maxlen.json`: saves the max prompt length for each model.

#### About datasets

- `dataset2maxlen.json`: the max output length of the models for each dataset.
- `dataset2prompt.json`: the prompt format for each dataset.

#### About compress-kv

The remaining JSON files are compress-kv test configurations.
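For illustration, here is a hypothetical snippet showing how these JSON files could be combined for one model/dataset pair. The key names and the `{context}`/`{input}` placeholders are assumptions based on the upstream LongBench setup, not a copy of `pred.py`:

```python
import json

# Hypothetical use of the config JSONs (key names assumed from the upstream LongBench setup).
model2maxlen = json.load(open("config/model2maxlen.json"))
dataset2prompt = json.load(open("config/dataset2prompt.json"))
dataset2maxlen = json.load(open("config/dataset2maxlen.json"))

model, dataset = "llama2-7b-chat-4k", "multi_news"
max_prompt_len = model2maxlen[model]          # max prompt length for this model
max_new_tokens = dataset2maxlen[dataset]      # max output length for this dataset
prompt_template = dataset2prompt[dataset]     # e.g. contains "{context}" and "{input}" placeholders

# A sample would then be formatted roughly like:
# prompt = prompt_template.format(context=sample["context"], input=sample["input"])
```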
## Run

There are two Python files for users to run.

- Configure `config.yaml` and run `pred.py`; the model outputs are saved under the `pred/` folder corresponding to each model name.
- Run the evaluation code `eval.py` to get the evaluation results on all datasets in `result.json`.
### Note

To test the models and get the scores in one go, please run `test_and_eval.sh`.
## Citation

```bibtex
@article{bai2023longbench,
  title={LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding},
  author={Bai, Yushi and Lv, Xin and Zhang, Jiajie and Lyu, Hongchang and Tang, Jiankai and Huang, Zhidian and Du, Zhengxiao and Liu, Xiao and Zeng, Aohan and Hou, Lei and Dong, Yuxiao and Tang, Jie and Li, Juanzi},
  journal={arXiv preprint arXiv:2308.14508},
  year={2023}
}
```