# Perplexity

Perplexity (PPL) is one of the most common metrics for evaluating language models. This benchmark implementation is adapted from `transformers/perplexity` and `benchmark_patch_llm.py`.
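As a quick reminder of what the scripts compute, perplexity is the exponentiated average negative log-likelihood the model assigns to each token. A minimal sketch (the `perplexity` helper is illustrative, not part of this benchmark's code):

```python
import math

def perplexity(token_logprobs):
    """token_logprobs: natural-log probability the model assigned each token."""
    nll = -sum(token_logprobs) / len(token_logprobs)  # average negative log-likelihood
    return math.exp(nll)

# A model that assigns probability 0.25 to every token has perplexity 4.
print(round(perplexity([math.log(0.25)] * 10), 6))  # → 4.0
```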

## Environment Preparation

```bash
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
pip install datasets
```

The following step is required on Linux when oneAPI was installed via APT or the offline installer. Skip it if oneAPI was installed via pip.

```bash
source /opt/intel/oneapi/setvars.sh
```

## PPL Evaluation

### 1. Run on Wikitext

An example command to run perplexity on wikitext:

```bash
python run_wikitext.py --model_path meta-llama/Meta-Llama-3-8B --dataset path=wikitext,name=wikitext-2-raw-v1 --precision sym_int4 --device xpu --stride 512 --max_length 4096
```
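
The `--stride` and `--max_length` options control a sliding-window evaluation: each window covers up to `max_length` tokens, advances by `stride`, and only the tokens not already scored by the previous window contribute to the loss. A self-contained sketch of that bookkeeping (the `sliding_ppl`/`window_spans` helpers are illustrative stand-ins; in `run_wikitext.py` a model forward pass produces the per-window loss):

```python
import math

def window_spans(seq_len, max_length, stride):
    """Yield (begin, end, trg_len): each window scores only trg_len new tokens."""
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        yield begin, end, end - prev_end
        prev_end = end
        if end == seq_len:
            break

def sliding_ppl(token_logprobs, max_length, stride):
    total_nll, total_tokens = 0.0, 0
    for begin, end, trg_len in window_spans(len(token_logprobs), max_length, stride):
        # A real run would score tokens[begin:end] with the model; here we
        # just sum the log-probs of the trg_len newly scored tokens.
        total_nll += -sum(token_logprobs[end - trg_len:end])
        total_tokens += trg_len
    return math.exp(total_nll / total_tokens)

# Uniform half-probability tokens give perplexity 2, regardless of windowing.
logprobs = [math.log(0.5)] * 100
print(round(sliding_ppl(logprobs, max_length=16, stride=8), 6))  # → 2.0
```

A smaller stride re-reads more context per window (better-conditioned estimates, slower); `stride == max_length` scores each token with the least context.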

### 2. Run on THUDM/LongBench dataset

An example command to run perplexity on chatglm3-6b using the default Chinese datasets ("multifieldqa_zh", "dureader", "vcsum", "lsht", "passage_retrieval_zh"):

```bash
python run_longbench.py --model_path THUDM/chatglm3-6b --precisions float16 sym_int4 --device xpu --language zh
```

Notes:

- If you want to test model perplexity on a few selected datasets from LongBench, pass them in the format `--datasets narrativeqa qasper ...`.
- The `language` argument only takes effect when `datasets` is not specified. Its choices are `en`, `zh` and `all`, which select all the English datasets, all the Chinese datasets, or all datasets, respectively.
- If you want to test perplexity on pre-downloaded datasets, specify the `<path/to/dataset>` with the `dataset_path` argument in your command.
- You can run `python make_table.py <input_dir>` to summarize the results.