Perplexity
Perplexity (PPL) is one of the most common metrics for evaluating language models. This benchmark implementation is adapted from transformers/perplexity and benchmark_patch_llm.py.
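For readers new to the metric, here is a minimal sketch of the standard definition: perplexity is the exponential of the average negative log-likelihood per token. The function name perplexity and its input format (a list of natural-log token probabilities) are chosen for illustration only and are not taken from the benchmark scripts.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that assigns probability 0.25 to every token is as uncertain as a
# uniform choice among 4 options, so its perplexity is approximately 4.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))
```

Lower values mean the model assigns higher probability to the reference text.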
Run on Wikitext
pip install datasets
An example to run perplexity on wikitext:
python run_wikitext.py --model_path meta-llama/Meta-Llama-3-8B --dataset path=wikitext,name=wikitext-2-raw-v1 --precision sym_int4 --device xpu --stride 512 --max_length 4096
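The --stride and --max_length options suggest the strided sliding-window evaluation described in the transformers perplexity guide, from which this benchmark is adapted. The sketch below shows how such windows are typically laid out. It assumes run_wikitext.py follows that scheme, and the helper name sliding_windows is hypothetical rather than part of the script.

```python
def sliding_windows(seq_len, max_length, stride):
    """Yield (begin, end, target_len) for each evaluation window.

    target_len is the number of tokens at the end of the window that are
    scored; the earlier tokens in the window serve only as context.
    """
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        target_len = end - prev_end  # tokens not scored by an earlier window
        yield begin, end, target_len
        prev_end = end
        if end == seq_len:
            break

# Small numbers for readability; the command above uses 4096 and 512.
for window in sliding_windows(seq_len=10, max_length=4, stride=2):
    print(window)
```

Each token is scored exactly once. Under this scheme, a smaller stride gives later tokens more context at the cost of more forward passes.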
Run on THUDM/LongBench dataset
pip install datasets
An example to run perplexity on chatglm3-6b using the default Chinese datasets ("multifieldqa_zh", "dureader", "vcsum", "lsht", "passage_retrieval_zh"):
python run_longbench.py --model_path THUDM/chatglm3-6b --precisions float16 sym_int4 --device xpu --language zh
Notes:
- If you want to test model perplexity on a few selected datasets from the LongBench dataset, please use the format below.
  --datasets narrativeqa qasper ...
- The language argument will only take effect if datasets is None. The choices for this argument are en, zh, all, which stand for all the English datasets, all the Chinese datasets and all the datasets respectively during testing.
- If you want to test perplexity on pre-downloaded datasets, please specify the <path/to/dataset> in the dataset_path argument in your command.
- You can run python make_table.py <input_dir> to summarize the results.
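The precedence between the datasets and language arguments described in the notes can be sketched as follows. This is an illustration of the stated rule, not the code in run_longbench.py: the helper resolve_datasets is hypothetical, the Chinese list is copied from the example above, and the English list is deliberately limited to the two names mentioned in the notes.

```python
# Illustrative lists only. The zh names come from the chatglm3-6b example;
# the en list is an assumption and NOT the full set of English datasets.
ZH_DATASETS = ["multifieldqa_zh", "dureader", "vcsum", "lsht", "passage_retrieval_zh"]
EN_DATASETS = ["narrativeqa", "qasper"]

def resolve_datasets(datasets, language):
    """Return the dataset names to evaluate (hypothetical helper)."""
    if datasets is not None:
        return list(datasets)  # explicit selection wins; language is ignored
    if language == "en":
        return list(EN_DATASETS)
    if language == "zh":
        return list(ZH_DATASETS)
    if language == "all":
        return EN_DATASETS + ZH_DATASETS
    raise ValueError(f"unknown language: {language!r}")

print(resolve_datasets(["qasper"], "zh"))  # language has no effect here
print(resolve_datasets(None, "zh"))        # falls back to the zh datasets
```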