llm: benchmark tool for transformers int4 (separate 1st token and rest) (#8460)

* add benchmark utils
* fix
* fix bug and add readme
* hidden latency data
2023-07-06 09:49:52 +08:00
commit 64b38e1dc8
parent 77808fa124
2 changed files with 4710 additions and 0 deletions
--- a/python/llm/dev/benchmark/README.md
+++ b/python/llm/dev/benchmark/README.md
@@ -0,0 +1,32 @@
+# Benchmark tool for transformers int4 (separate 1st token and rest)
+
+`benchmark_util.py` provides a simple benchmark tool for transformers int4 models. It reports the time spent on the 1st token separately from the average time per token for the rest.
+
+## Usage
+Copy `benchmark_util.py` into your benchmark directory, then wrap your transformers int4 model with `BenchmarkWrapper` (`model = BenchmarkWrapper(model)`).
+The following example uses `chatglm-6b`:
+```python
+import torch
+import os
+from bigdl.llm.transformers import AutoModel
+from transformers import AutoTokenizer
+import time
+import numpy as np
+from benchmark_util import BenchmarkWrapper
+
+model_path = 'THUDM/chatglm-6b'
+model = AutoModel.from_pretrained(model_path, trust_remote_code=True, load_in_4bit=True)
+model = BenchmarkWrapper(model)
+tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+prompt = "今天睡不着怎么办"  # "What should I do if I can't sleep today?"
+
+with torch.inference_mode():
+    input_ids = tokenizer.encode(prompt, return_tensors="pt")
+    output = model.generate(input_ids, do_sample=False, max_new_tokens=32)
+    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
+```
+The output will look like:
+```bash
+=========First token cost xx.xxxxs=========
+=========Last token cost average xx.xxxxs (31 tokens in all)=========
+```
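Note (not part of the commit): the README describes reporting the first token's cost separately from the average over the remaining tokens. The sketch below shows one way such a split could be computed. It is not the `BenchmarkWrapper` implementation. `split_first_and_rest`, `time_steps`, and `fake_step` are hypothetical names used only for illustration, and `fake_step` merely sleeps to stand in for a decoding step.

```python
import time
from typing import Callable, List, Tuple


def split_first_and_rest(latencies: List[float]) -> Tuple[float, float]:
    """Return (first-step latency, mean latency of the remaining steps)."""
    if not latencies:
        raise ValueError("need at least one latency measurement")
    rest = latencies[1:]
    # With a single step there are no remaining tokens to average over.
    rest_avg = sum(rest) / len(rest) if rest else 0.0
    return latencies[0], rest_avg


def time_steps(step: Callable[[int], None], num_steps: int) -> List[float]:
    """Time each call to `step` with a monotonic high-resolution clock."""
    latencies = []
    for i in range(num_steps):
        start = time.perf_counter()
        step(i)
        latencies.append(time.perf_counter() - start)
    return latencies


def fake_step(i: int) -> None:
    # Hypothetical stand-in for one decoding step: the first call is slower,
    # mimicking prompt processing before the first token is produced.
    time.sleep(0.05 if i == 0 else 0.001)


latencies = time_steps(fake_step, num_steps=8)
first, rest_avg = split_first_and_rest(latencies)
print(f"First token cost {first:.4f}s")
print(f"Remaining tokens cost average {rest_avg:.4f}s ({len(latencies) - 1} tokens in all)")
```

Keeping the two numbers apart matters because the first step typically includes processing the whole prompt, so folding it into a single average would hide the per-token cost of the later steps.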
--- a/python/llm/dev/benchmark/benchmark_util.py
+++ b/python/llm/dev/benchmark/benchmark_util.py