parent 86b81c09d9
commit b7bc1023fb
3 changed files with 315 additions and 1 deletions

@@ -23,6 +23,7 @@ RUN apt-get update && \
    # For Qwen series models support
    pip install transformers_stream_generator einops tiktoken

COPY ./vllm_online_benchmark.py        /llm/
COPY ./vllm_offline_inference.py       /llm/
COPY ./payload-1024.lua                /llm/
COPY ./start-vllm-service.sh           /llm/
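
The new `COPY` line bakes `vllm_online_benchmark.py` into the image at `/llm/`, so it only becomes available after the image is rebuilt. A minimal sketch of the two usual ways to pick it up, where the image tag and container name are hypothetical placeholders:

```bash
# Rebuild the serving image so the new benchmark script is included.
docker build -t ipex-llm-serving-xpu:local .

# Or copy the script into an already running container without rebuilding.
docker cp vllm_online_benchmark.py my-vllm-container:/llm/
```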

@@ -67,7 +67,33 @@ We have included multiple example files in `/llm/`:

We can benchmark the api_server to get an estimate of TPS (transactions per second). To do so, you need to start the service first according to the instructions in this [section](https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/vLLM-Serving/README.md#service).
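
Before running either benchmark below, it can help to confirm that the service actually answers requests. A minimal sketch, assuming the server listens on the default port 8000 (the same endpoint the `wrk` script further down targets) and that the model name matches whatever the service was started with:

```bash
# Sanity check: one small completion request against the OpenAI-compatible API.
# The model name here is an assumption; use the one your service is serving.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen1.5-14B-Chat", "prompt": "Hello", "max_tokens": 16}'
```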

###### Online benchmark through benchmark_util

After starting the vLLM service, send requests through `vllm_online_benchmark.py`:

```bash
python vllm_online_benchmark.py $model_name $max_seqs
```
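
For instance, with the model and concurrency shown in the sample output below (the second argument appears to set the number of concurrent requests):

```bash
# Example invocation; adjust to the model your service is serving.
python vllm_online_benchmark.py Qwen1.5-14B-Chat 12
```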

And it will produce output like the following:

```bash
model_name: Qwen1.5-14B-Chat
max_seq: 12
Warm Up: 100%|█████████████████████████████████████████████████████| 24/24 [01:36<00:00,  4.03s/req]
Benchmarking: 100%|████████████████████████████████████████████████| 60/60 [04:03<00:00,  4.05s/req]
Total time for 60 requests with 12 concurrent requests: xxx seconds.
Average response time: xxx
Token throughput: xxx

Average first token latency: xxx milliseconds.
P90 first token latency: xxx milliseconds.
P95 first token latency: xxx milliseconds.

Average next token latency: xxx milliseconds.
P90 next token latency: xxx milliseconds.
P95 next token latency: xxx milliseconds.
```

###### Online benchmark through wrk
In container, do the following:
1. Modify `/llm/payload-1024.lua` so that the "model" attribute is correct (see the sketch after the script below). By default, we use a prompt that is roughly 1024 tokens long; you can change it if needed.
2. Start the benchmark with `wrk` using the script below:

@@ -77,8 +103,8 @@ cd /llm
# You can change -t and -c to control the concurrency.
# By default, we use 12 connections to benchmark the service.
wrk -t12 -c12 -d15m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h
```
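
Step 1 above asks you to point the `"model"` attribute in `/llm/payload-1024.lua` at the model your service is serving. A minimal sketch of doing that non-interactively, assuming the Lua script embeds the request body as a JSON string with a `"model"` field (the actual layout of the file shipped in the image may differ):

```bash
# Hypothetical one-liner: replace whatever model name the payload currently
# uses with the model the vLLM service was started with.
sed -i 's/"model": *"[^"]*"/"model": "Qwen1.5-14B-Chat"/' /llm/payload-1024.lua
```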

#### Offline benchmark through benchmark_vllm_throughput.py

We have included the benchmark_throughput script provided by `vllm` in our image as `/llm/benchmark_vllm_throughput.py`. To use the benchmark_throughput script, you will need to download the test dataset through:

287  docker/llm/serving/xpu/docker/vllm_online_benchmark.py  Normal file
File diff suppressed because one or more lines are too long