Yuwen Hu 001c13243e [LLM] Add support for `low_low_bit` benchmark on Windows GPU (#10167 ) * Add support for low_low_bit performance test on Windows GPU * Small fix * Small fix * Save memory during converting model process * Drop the results for first time when loading in low bit on mtl igpu for better performance * Small fix		2024-02-21 10:51:52 +08:00
all-in-one	[LLM] Add support for `low_low_bit` benchmark on Windows GPU (#10167 )	2024-02-21 10:51:52 +08:00
ceval	Add Ceval workflow and modify the result printing (#10140 )	2024-02-19 17:06:53 +08:00
harness	Modify harness evaluation workflow (#10174 )	2024-02-20 18:55:43 +08:00
perplexity	LLM: Update ppl tests (#10092 )	2024-02-06 17:31:48 +08:00
whisper	Add readme for Whisper Test (#9944 )	2024-01-22 15:11:33 +08:00
benchmark_util.py	hide detail memory for each token in benchmark_utils.py (#10037 )	2024-01-30 16:04:17 +08:00
README.md	LLM: add avg token latency information and benchmark guide of autotp (#9940 )	2024-01-19 15:09:57 +08:00

README.md

Benchmark tool for transformers int4 (separate 1st token and rest)

benchmark_util.py provides a simple benchmark tool for transformers int4 models. It measures the latency of the 1st token separately from the average latency of the remaining tokens, on both CPU and GPU.

CPU Usage

Copy benchmark_util.py into your benchmark directory, then wrap your transformers int4 model with BenchmarkWrapper (model = BenchmarkWrapper(model)). Taking chatglm-6b as an example:

import torch
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer
from benchmark_util import BenchmarkWrapper

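model_path = 'THUDM/chatglm-6b'
# Load the model with int4 optimization, then wrap it so that generate() reports timings
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, load_in_4bit=True)
model = BenchmarkWrapper(model, do_print=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
prompt = "今天睡不着怎么办"  # "What should I do if I can't sleep tonight?"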
 
with torch.inference_mode():
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output = model.generate(input_ids, do_sample=False, max_new_tokens=32)
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)

The output looks like this:

=========First token cost xx.xxxxs=========
=========Last token cost average xx.xxxxs (31 tokens in all)=========
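
To illustrate what "separate 1st token and rest" means, here is a minimal sketch of the timing idea. It is not the code in benchmark_util.py: `time_first_and_rest`, `fake_step`, and the `clock` parameter are hypothetical names used only for illustration, and the sleep durations are made up. The sketch also shows why a run with max_new_tokens=32 can report 31 tokens in the second line: assuming generation does not stop early, one token counts as the first and the other 31 are averaged.

```python
import time
from typing import Callable, List, Tuple


def time_first_and_rest(
    step: Callable[[int], None],
    num_tokens: int,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Time `num_tokens` calls to `step`; return (first_cost, rest_cost_mean) in seconds."""
    if num_tokens < 1:
        raise ValueError("num_tokens must be at least 1")
    costs: List[float] = []
    for i in range(num_tokens):
        start = clock()
        step(i)  # stands in for one decoding step of a real model
        costs.append(clock() - start)
    rest = costs[1:]
    # The first step is reported on its own; the remaining steps are averaged.
    rest_cost_mean = sum(rest) / len(rest) if rest else 0.0
    return costs[0], rest_cost_mean


def fake_step(i: int) -> None:
    # Hypothetical stand-in: the first step (prompt processing) is slower than the rest.
    time.sleep(0.02 if i == 0 else 0.002)


first, rest_mean = time_first_and_rest(fake_step, num_tokens=8)
print(f"first token: {first:.4f}s, rest average: {rest_mean:.4f}s ({8 - 1} tokens in all)")
```

Reporting the two numbers separately matters because the first step has to process the whole prompt, so folding it into a single average would hide the per-token cost of the remaining steps.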

GPU Usage

Inference on single GPU

Copy benchmark_util.py into your benchmark directory, then wrap your transformers int4 model with BenchmarkWrapper (model = BenchmarkWrapper(model)) after moving it to the GPU. Taking chatglm-6b as an example:

import torch
import intel_extension_for_pytorch as ipex
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer
from benchmark_util import BenchmarkWrapper

model_path = 'THUDM/chatglm-6b'
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, load_in_4bit=True)
# Move the model to the Intel GPU ('xpu') before wrapping it
model = model.to('xpu')
model = BenchmarkWrapper(model, do_print=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
prompt = "今天睡不着怎么办"  # "What should I do if I can't sleep tonight?"
 
with torch.inference_mode():
    # warm up twice, since IPEX is used
    for i in range(2):
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
        output = model.generate(input_ids, do_sample=False, max_new_tokens=32)
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
    # collect performance data now
    for i in range(5):
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
        output = model.generate(input_ids, do_sample=False, max_new_tokens=32)
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
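
The loop above prints one pair of numbers per run, and the two warm-up runs should not be counted. A minimal sketch of that bookkeeping follows, assuming you have already collected one (first token cost, rest token average) pair per run yourself, for example by copying the printed values. `summarize_runs` is a hypothetical helper, not part of benchmark_util.py, and the numbers are made up.

```python
from statistics import mean
from typing import Sequence, Tuple


def summarize_runs(
    runs: Sequence[Tuple[float, float]], warmup: int = 2
) -> Tuple[float, float]:
    """Drop the first `warmup` runs, then average the remaining
    (first_cost, rest_cost_mean) pairs, all in seconds."""
    measured = list(runs[warmup:])
    if not measured:
        raise ValueError("no runs left after discarding warm-up runs")
    avg_first = mean(first for first, _ in measured)
    avg_rest = mean(rest for _, rest in measured)
    return avg_first, avg_rest


# Made-up numbers for illustration only: 2 warm-up runs followed by 5 measured runs.
runs = [
    (2.0, 0.5), (1.0, 0.25),            # warm-up runs, discarded
    (0.5, 0.125), (0.5, 0.125), (0.5, 0.125), (0.5, 0.125), (0.5, 0.125),
]
avg_first, avg_rest = summarize_runs(runs)
print(f"avg first token: {avg_first:.4f}s, avg rest token: {avg_rest:.4f}s")
```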

Inference on multiple GPUs

Similarly, copy benchmark_util.py into your benchmark directory, then wrap your optimized model with BenchmarkWrapper (model = BenchmarkWrapper(model)). For example, applying the following patch to the DeepSpeed AutoTP example code is enough to measure 1st token and rest token performance:

 import torch
 import transformers
 import deepspeed
+from benchmark_util import BenchmarkWrapper
 
 def get_int_from_env(env_keys, default):
     """Returns the first positive env value found in the `env_keys` list or the default."""
@@ -98,6 +99,7 @@ if __name__ == '__main__':
     init_distributed()
 
     print(model)
+    model = BenchmarkWrapper(model, do_print=True)
 
     # Load tokenizer
     tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

Sample Output

The output looks like this:

=========First token cost xx.xxxxs=========
=========Last token cost average xx.xxxxs (31 tokens in all)=========
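
When the benchmark runs from a script, it can be convenient to pull the numbers out of the captured log. The sketch below assumes the log lines match the sample output above exactly (same wording, seconds as the unit). `parse_benchmark_output` and the two regular expressions are hypothetical helpers, not part of benchmark_util.py, and they would need adjusting if the printed format differs.

```python
import re
from typing import Optional, Tuple

# Patterns modelled on the sample output shown above (an assumption about the format).
FIRST_RE = re.compile(r"=+First token cost (\d+(?:\.\d+)?)\s*s=+")
REST_RE = re.compile(
    r"=+Last token cost average (\d+(?:\.\d+)?)\s*s \((\d+) tokens in all\)=+"
)


def parse_benchmark_output(text: str) -> Optional[Tuple[float, float, int]]:
    """Return (first_cost_s, rest_cost_mean_s, rest_token_count), or None if not found."""
    first = FIRST_RE.search(text)
    rest = REST_RE.search(text)
    if first is None or rest is None:
        return None
    return float(first.group(1)), float(rest.group(1)), int(rest.group(2))


# Made-up values in the documented format, for illustration only.
sample = (
    "=========First token cost 1.2500s=========\n"
    "=========Last token cost average 0.0625s (31 tokens in all)=========\n"
)
print(parse_benchmark_output(sample))
```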