# LongBench Benchmark Test
LongBench is the first benchmark for bilingual, multitask, and comprehensive assessment of long context understanding capabilities of large language models. This benchmark implementation is adapted from THUDM/LongBench and SnapKV/experiments/LongBench.
## Environment Preparation

Before running, make sure you have ipex-llm installed. Then install the following dependencies:

```bash
pip install omegaconf
pip install datasets
pip install jieba
pip install fuzzywuzzy
pip install rouge
```
## Load Data

You can download and load the LongBench data through the Hugging Face `datasets` library ([🤗 HF Repo](https://huggingface.co/datasets/THUDM/LongBench)):

```python
from datasets import load_dataset

datasets = ["narrativeqa", "qasper", "multifieldqa_en", "multifieldqa_zh", "hotpotqa", "2wikimqa", "musique", \
            "dureader", "gov_report", "qmsum", "multi_news", "vcsum", "trec", "triviaqa", "samsum", "lsht", \
            "passage_count", "passage_retrieval_en", "passage_retrieval_zh", "lcc", "repobench-p"]
for dataset in datasets:
    data = load_dataset('THUDM/LongBench', dataset, split='test')             # standard version
    data_e = load_dataset('THUDM/LongBench', f"{dataset}_e", split='test')    # "_e" (LongBench-E) version
```
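As a quick sanity check (not part of the benchmark scripts), you can peek at one loaded split. The `input` and `context` field names below are assumed from the upstream LongBench dataset:

```python
from datasets import load_dataset

# Minimal sketch: inspect a single LongBench sample (field names assumed from upstream LongBench).
sample = load_dataset('THUDM/LongBench', 'multi_news', split='test')[0]
print(sample.keys())             # all available fields of the record
print(sample["input"][:200])     # the question / instruction part (assumed field)
print(len(sample["context"]))    # character length of the long context (assumed field)
```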
## Config

### `config.yaml`

The config YAML file has the following format:
```yaml
# The names of the models you want to test
model_name:
  # - "mistral-7B-instruct-v0.2"
  - "llama2-7b-chat-4k"
  # - "chatglm4-9b"
  # - "qwen2-7b-instruct"
# Whether to test the full-KV (no compression) baseline
full_kv: True
# Whether to apply model optimization
optimize_model: True
# dtype of the model
dtype: 'fp16'
# Low-bit format of the model
low_bit: 'sym_int4'
# Whether to use the 'e' version of the datasets
e: False
# The compress-kv configs you want to test
compress_kv:
  - "ablation_c512_w32_k7_maxpool"
  - "ablation_c1024_w32_k7_maxpool"
# The datasets you want to test
datasets:
  - "multi_news"
  - "qasper"
  - "hotpotqa"
  - "trec"
  - "passage_count"
  - "lcc"
  # - "multifieldqa_zh"
  # - "dureader"
  # - "vcsum"
  # - "lsht"
  # - "passage_retrieval_zh"
```
### The `config` dir

Several JSON files are saved in the `config` dir. They fall into three kinds: about models, about datasets, and about compress-kv.

#### About models

- `model2path.json`: saves the path to each model.
- `model2maxlen.json`: saves the max prompt length for each model.

#### About datasets

- `dataset2maxlen.json`: the max output length of the models for each dataset.
- `dataset2prompt.json`: the prompt format for each dataset.

#### About compress-kv

The remaining JSON files are compress-kv test configurations.
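For illustration, here is a hypothetical snippet showing how these JSON files could be combined for one model/dataset pair. The key names and the `{context}`/`{input}` placeholders are assumptions based on the upstream LongBench setup, not a copy of `pred.py`:

```python
import json

# Hypothetical use of the config JSONs (key names assumed from the upstream LongBench setup).
model2maxlen = json.load(open("config/model2maxlen.json"))
dataset2prompt = json.load(open("config/dataset2prompt.json"))
dataset2maxlen = json.load(open("config/dataset2maxlen.json"))

model, dataset = "llama2-7b-chat-4k", "multi_news"
max_prompt_len = model2maxlen[model]          # max prompt length for this model
max_new_tokens = dataset2maxlen[dataset]      # max output length for this dataset
prompt_template = dataset2prompt[dataset]     # e.g. contains "{context}" and "{input}" placeholders

# A sample would then be formatted roughly like:
# prompt = prompt_template.format(context=sample["context"], input=sample["input"])
```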
## Run

There are two Python files for users to run.

- Configure `config.yaml` and run `pred.py`; the model outputs are saved under the `pred/` folder corresponding to each model name.
- Run the evaluation code `eval.py` to get the evaluation results on all datasets in `result.json`.
### Note

To test the models and get the scores in one go, please run `test_and_eval.sh`.
## Citation

```bibtex
@article{bai2023longbench,
  title={LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding},
  author={Bai, Yushi and Lv, Xin and Zhang, Jiajie and Lyu, Hongchang and Tang, Jiankai and Huang, Zhidian and Du, Zhengxiao and Liu, Xiao and Zeng, Aohan and Hou, Lei and Dong, Yuxiao and Tang, Jie and Li, Juanzi},
  journal={arXiv preprint arXiv:2308.14508},
  year={2023}
}
```