ipex-llm/python/llm/dev/benchmark/whisper
Cheen Hau, 俊豪 947b1e27b7 Add readme for Whisper Test (#9944)
2024-01-22 15:11:33 +08:00

Whisper Test

The Whisper Test lets users evaluate the performance and accuracy of Whisper speech-to-text models. Accuracy is measured on the LibriSpeech dataset using the Word Error Rate (WER) metric. Before running the test, make sure bigdl-llm is installed.

Install Dependencies

pip install datasets evaluate soundfile librosa jiwer

Run

python run_whisper.py --model_path /path/to/model --data_type other --device cpu

The LibriSpeech dataset contains 'clean' and 'other' splits. Use --data_type to select the split to evaluate; the default is other. Use --device to select the device the test runs on. To run on an Intel GPU, set it to xpu, and refer to the GPU installation guide for installation details and optimal configuration.

Note

If you get the error message ConnectionError: Couldn't reach http://www.openslr.org/resources/12/test-other.tar.gz (error 403), you can use a local dataset instead, as described in the next section.

Using a local dataset

By default, the LibriSpeech dataset is downloaded at runtime from the Hugging Face Hub. To use a local dataset instead, set the following environment variable before running the evaluation script:

export LIBRISPEECH_DATASET_PATH=/path/to/dataset_folder

Make sure the local dataset folder contains 'dev-other.tar.gz', 'test-other.tar.gz', and 'train-other-500.tar.gz'. The files can be downloaded from http://www.openslr.org/resources/12/
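Before starting a long evaluation run, it can help to confirm that the folder really contains the three archives listed above. The following is a minimal sketch, not part of the test scripts: the helper name missing_librispeech_archives is hypothetical, and the archive list simply mirrors the file names given in this README for the 'other' split.

```python
import os
from pathlib import Path

# Archive names taken from this README; the helper below is a hypothetical
# convenience check and is not part of run_whisper.py.
REQUIRED_ARCHIVES = (
    "dev-other.tar.gz",
    "test-other.tar.gz",
    "train-other-500.tar.gz",
)


def missing_librispeech_archives(folder):
    """Return the required archive names that are not present in `folder`."""
    root = Path(folder)
    return [name for name in REQUIRED_ARCHIVES if not (root / name).is_file()]


if __name__ == "__main__":
    dataset_path = os.environ.get("LIBRISPEECH_DATASET_PATH")
    if dataset_path is None:
        print("LIBRISPEECH_DATASET_PATH is not set; the dataset will be downloaded at runtime.")
    else:
        missing = missing_librispeech_archives(dataset_path)
        print("Missing archives:", missing if missing else "none")
```

If the check reports missing archives, download them from the URL above before running the evaluation script.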

Printed metrics

Three metrics are printed:

  • Realtime Factor (RTF): the total prediction time divided by the total duration of the speech samples. Lower is better; values below 1 mean faster than real time.
  • Realtime X (RTX): the inverse of RTF. Higher is better.
  • Word Error Rate (WER): the average number of errors per reference word. Lower is better.
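To make the three definitions concrete, here is a small self-contained sketch that computes them from toy data. It is an illustration under stated assumptions, not the code used by run_whisper.py: the function names word_errors and summarize are hypothetical, WER is computed as a plain word-level edit distance summed over all samples, and the timings are made-up numbers. The actual script may compute WER differently, for example through the jiwer package listed in the dependencies.

```python
def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # At the start of iteration i, prev[j] holds the edit distance
    # between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1], len(ref)


def summarize(references, hypotheses, prediction_seconds, audio_seconds):
    """Compute RTF, RTX, and WER for a batch of transcriptions."""
    errors = words = 0
    for ref, hyp in zip(references, hypotheses):
        e, n = word_errors(ref, hyp)
        errors += e
        words += n
    rtf = prediction_seconds / audio_seconds  # prediction time over audio duration
    return {"RTF": rtf, "RTX": 1.0 / rtf, "WER": errors / words}


# Toy example: 2 word errors over 8 reference words, 2 s of compute for 8 s of audio.
metrics = summarize(
    references=["the cat sat on the mat", "hello world"],
    hypotheses=["the cat sit on mat", "hello world"],
    prediction_seconds=2.0,
    audio_seconds=8.0,
)
print(metrics)  # {'RTF': 0.25, 'RTX': 4.0, 'WER': 0.25}
```

In this toy run the model transcribes four times faster than real time (RTX of 4.0), and one in four reference words is wrong (WER of 0.25).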