# Benchmark tool for transformers int4 (separate 1st token and rest)
`benchmark_util.py` provides a simple benchmark tool for transformers int4 models that measures 1st token latency and the average latency of the remaining tokens on CPU.
`gpu_benchmark_util.py` provides the same benchmark tool for transformers int4 models on GPU.
## CPU Usage
Just put `benchmark_util.py` into your benchmark directory, and then wrap your transformers int4 model with `BenchmarkWrapper` (`model = BenchmarkWrapper(model)`).
Take `chatglm-6b` as an example:
```python
import torch
import os
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer
import time
import numpy as np
from benchmark_util import BenchmarkWrapper
model_path = 'THUDM/chatglm-6b'
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, load_in_4bit=True)
model = BenchmarkWrapper(model)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
prompt = "今天睡不着怎么办"
with torch.inference_mode():
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output = model.generate(input_ids, do_sample=False, max_new_tokens=32)
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
```
The output will look like:
```bash
=========First token cost xx.xxxxs=========
=========Last token cost average xx.xxxxs (31 tokens in all)=========
```
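If you want to turn these latencies into throughput numbers, a quick conversion is shown below. This is a minimal sketch: `first_token_s` and `rest_token_avg_s` are hypothetical placeholder values that you would copy from the printed output above.
```python
# A minimal sketch: convert the printed latencies into throughput.
# The values below are hypothetical placeholders taken from the output above.
first_token_s = 0.5421     # "First token cost" in seconds
rest_token_avg_s = 0.0678  # "Last token cost average" in seconds
new_tokens = 32            # max_new_tokens used in the benchmark

decode_throughput = 1.0 / rest_token_avg_s                     # tokens/s after the 1st token
total_time = first_token_s + rest_token_avg_s * (new_tokens - 1)
print(f"Decode throughput: {decode_throughput:.2f} tokens/s")
print(f"End-to-end time for {new_tokens} tokens: {total_time:.4f}s")
```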
## GPU Usage
Just put `gpu_benchmark_util.py` into your benchmark directory, and then wrap your transformers int4 model with `BenchmarkWrapper` (`model = BenchmarkWrapper(model)`).
Take `chatglm-6b` as an example:
```python
import torch
import os
import intel_extension_for_pytorch as ipex
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer
import time
import numpy as np
from gpu_benchmark_util import BenchmarkWrapper
model_path = 'THUDM/chatglm-6b'
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, load_in_4bit=True)
model = model.half().to('xpu')
model = BenchmarkWrapper(model)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
prompt = "今天睡不着怎么办"
with torch.inference_mode():
    # Warm up twice, since IPEX is used
    for i in range(2):
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
        output = model.generate(input_ids, do_sample=False, max_new_tokens=32)
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
    # Collect performance data now
    for i in range(5):
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
        output = model.generate(input_ids, do_sample=False, max_new_tokens=32)
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
```
The output will look like:
```bash
=========First token cost xx.xxxxs=========
=========Last token cost average xx.xxxxs (31 tokens in all)=========
```
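If you also want an end-to-end wall-clock number alongside the per-token figures, the sketch below shows one way to time `generate` manually on XPU. This is a hedged sketch rather than part of the tool: it assumes `model`, `tokenizer` and `prompt` are set up as in the example above, and that your IPEX build exposes `torch.xpu.synchronize()`, which is needed so the timer does not stop before the device actually finishes.
```python
import time
import torch

# A minimal sketch of manual end-to-end timing on XPU (not part of gpu_benchmark_util).
# Assumes `model`, `tokenizer` and `prompt` are set up as in the example above,
# and that this IPEX build exposes torch.xpu.synchronize().
with torch.inference_mode():
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
    torch.xpu.synchronize()        # make sure prior device work has finished
    st = time.perf_counter()
    output = model.generate(input_ids, do_sample=False, max_new_tokens=32)
    torch.xpu.synchronize()        # wait for generation to complete on device
    end = time.perf_counter()
    print(f"End-to-end generation time: {end - st:.4f}s")
```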