Ruonan Wang e9aa2bd890 LLM: reduce GPU 1st token latency and update example (#8763)

* reduce 1st token latency

* update example

* fix

* fix style

* update readme of gpu benchmark

2023-08-16 18:01:23 +08:00


Benchmark tool for transformers int4 (separate 1st token and rest)

benchmark_util.py provides a simple benchmark tool for transformers int4 models. It reports the latency of the 1st token separately from the average latency of the remaining tokens, on CPU.

gpu_benchmark_util.py provides the same benchmark tool for transformers int4 models running on GPU.
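To make the split concrete, here is a minimal sketch of timing the first token separately from the average of the remaining tokens. It is not the BenchmarkWrapper implementation: `time_generation` and `fake_step` are hypothetical names, and the sleep-based step only stands in for a model forward pass.

```python
import time


def time_generation(step_fn, max_new_tokens):
    """Time the first call to step_fn separately from the remaining calls.

    Returns (first_token_seconds, average_rest_seconds, rest_token_count).
    """
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens must be at least 1")

    # First token: typically dominated by processing the whole prompt
    start = time.perf_counter()
    step_fn(0)
    first_cost = time.perf_counter() - start

    # Remaining tokens: timed one by one, then averaged
    rest_costs = []
    for index in range(1, max_new_tokens):
        start = time.perf_counter()
        step_fn(index)
        rest_costs.append(time.perf_counter() - start)

    rest_avg = sum(rest_costs) / len(rest_costs) if rest_costs else 0.0
    return first_cost, rest_avg, len(rest_costs)


def fake_step(index):
    # Hypothetical stand-in for a forward pass: the first step is slower
    time.sleep(0.02 if index == 0 else 0.001)


first, rest_avg, rest_count = time_generation(fake_step, max_new_tokens=32)
print(f"=========First token cost {first:.4f}s=========")
print(f"=========Last token cost average {rest_avg:.4f}s ({rest_count} tokens in all)=========")
```

With max_new_tokens=32, one token counts as the first token and the other 31 are averaged, which is where the "(31 tokens in all)" figure in the sample output of the CPU Usage section comes from.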

CPU Usage

Copy benchmark_util.py into your benchmark directory, then wrap your transformers int4 model with BenchmarkWrapper (model = BenchmarkWrapper(model)). Taking chatglm-6b as an example:

import torch
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer
from benchmark_util import BenchmarkWrapper

model_path = 'THUDM/chatglm-6b'
# Load the model with int4 optimizations
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, load_in_4bit=True)
# Wrap the model so that generate() reports 1st token and rest token latency
model = BenchmarkWrapper(model)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Prompt in Chinese: "What should I do if I can't sleep tonight?"
prompt = "今天睡不着怎么办"

with torch.inference_mode():
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    # Greedy decoding of up to 32 new tokens
    output = model.generate(input_ids, do_sample=False, max_new_tokens=32)
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)

The output will look like this:

=========First token cost xx.xxxxs=========
=========Last token cost average xx.xxxxs (31 tokens in all)=========
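When collecting results from several runs, the two summary lines can be extracted from captured stdout with a regular expression. The following is only a sketch, assuming the line format shown above is unchanged; `parse_benchmark_output` is a hypothetical helper that is not part of benchmark_util.py, and the numbers in the sample string are placeholders, not measured results.

```python
import re

# Patterns matching the two summary lines printed by the benchmark
FIRST_RE = re.compile(r"=+First token cost (\d+\.\d+)s=+")
REST_RE = re.compile(
    r"=+Last token cost average (\d+\.\d+)s \((\d+) tokens in all\)=+"
)


def parse_benchmark_output(text):
    """Extract latencies (in seconds) and the rest-token count from stdout text."""
    first = FIRST_RE.search(text)
    rest = REST_RE.search(text)
    if first is None or rest is None:
        raise ValueError("benchmark summary lines not found")
    return {
        "first_token_s": float(first.group(1)),
        "rest_token_avg_s": float(rest.group(1)),
        "rest_token_count": int(rest.group(2)),
    }


# Placeholder values for illustration only, not measured results
sample = (
    "=========First token cost 1.2345s=========\n"
    "=========Last token cost average 0.0678s (31 tokens in all)=========\n"
)
result = parse_benchmark_output(sample)
print(result)
```

Returning a dictionary keeps the parsed values easy to append to a CSV file or a list of per-run records.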

GPU Usage