parent 86b81c09d9
commit b7bc1023fb
3 changed files with 315 additions and 1 deletion

@@ -23,6 +23,7 @@ RUN apt-get update && \
    # For Qwen series models support
    pip install transformers_stream_generator einops tiktoken

COPY ./vllm_online_benchmark.py /llm/
COPY ./vllm_offline_inference.py /llm/
COPY ./payload-1024.lua /llm/
COPY ./start-vllm-service.sh /llm/

@@ -67,7 +67,33 @@ We have included multiple example files in `/llm/`:

We can benchmark the api_server to get an estimation about TPS (transactions per second). To do so, you need to start the service first according to the instructions in this [section](https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/vLLM-Serving/README.md#service).
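
Before benchmarking, it can help to confirm that the api_server is reachable and to check the exact model name it serves. The snippet below is a minimal sketch (not part of the image), assuming the service listens on the default `localhost:8000` endpoint used elsewhere in this guide and exposes the OpenAI-compatible `/v1/models` route:

```python
# Minimal sanity check: list the models served by the api_server.
# Assumes the default localhost:8000 endpoint used elsewhere in this guide.
import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])
```

The printed id is the value to pass as `$model_name` below and to use for the `"model"` attribute in `payload-1024.lua`.
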
###### Online benchmark through benchmark_util

After starting the vLLM service, send requests through `vllm_online_benchmark.py`:

```bash
python vllm_online_benchmark.py $model_name $max_seqs
```

For example, running it with `Qwen1.5-14B-Chat` served and `max_seqs=12` produces output like the following:

```bash
model_name: Qwen1.5-14B-Chat
max_seq: 12
Warm Up: 100%|█████████████████████████████████████████████████████| 24/24 [01:36<00:00, 4.03s/req]
Benchmarking: 100%|████████████████████████████████████████████████| 60/60 [04:03<00:00, 4.05s/req]
Total time for 60 requests with 12 concurrent requests: xxx seconds.
Average response time: xxx
Token throughput: xxx

Average first token latency: xxx milliseconds.
P90 first token latency: xxx milliseconds.
P95 first token latency: xxx milliseconds.

Average next token latency: xxx milliseconds.
P90 next token latency: xxx milliseconds.
P95 next token latency: xxx milliseconds.
```
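
For orientation, the sketch below shows one way metrics of this kind can be derived from an OpenAI-compatible streaming `/v1/completions` endpoint. It is not the bundled `vllm_online_benchmark.py` (the full 287-line script added by this commit); the endpoint, model name, prompt, and request counts are placeholder assumptions, and each streamed chunk is treated as roughly one token:

```python
# Illustrative sketch only, NOT the bundled /llm/vllm_online_benchmark.py.
# Assumes an OpenAI-compatible /v1/completions endpoint on localhost:8000
# with streaming enabled; model name, prompt, and counts are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

API_URL = "http://localhost:8000/v1/completions"
MODEL = "Qwen1.5-14B-Chat"            # replace with your served model name
PROMPT = "Tell me about Intel GPUs."  # placeholder; the real script uses much longer prompts
NUM_REQUESTS, CONCURRENCY = 60, 12


def one_request():
    """Stream one completion and record the arrival time of every chunk."""
    body = {"model": MODEL, "prompt": PROMPT, "max_tokens": 128, "stream": True}
    start = time.perf_counter()
    chunk_times = []
    with requests.post(API_URL, json=body, stream=True, timeout=3600) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # Each SSE "data:" line carries roughly one generated token.
            if line and line != b"data: [DONE]":
                chunk_times.append(time.perf_counter())
    return start, chunk_times


def pct(values, p):
    """Nearest-rank percentile, good enough for a rough report."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(len(ordered) * p / 100))]


wall_start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(lambda _: one_request(), range(NUM_REQUESTS)))
total = time.perf_counter() - wall_start

first = [(c[0] - s) * 1000 for s, c in results if c]                            # ms to first token
nxt = [(c[-1] - c[0]) / (len(c) - 1) * 1000 for _, c in results if len(c) > 1]  # ms per later token
tokens = sum(len(c) for _, c in results)

print(f"Total time for {NUM_REQUESTS} requests with {CONCURRENCY} concurrent requests: {total:.2f} seconds.")
print(f"Token throughput: {tokens / total:.2f}")
print(f"Average first token latency: {sum(first) / len(first):.2f} milliseconds.")
print(f"P90 first token latency: {pct(first, 90):.2f} milliseconds.")
print(f"P95 first token latency: {pct(first, 95):.2f} milliseconds.")
print(f"Average next token latency: {sum(nxt) / len(nxt):.2f} milliseconds.")
print(f"P90 next token latency: {pct(nxt, 90):.2f} milliseconds.")
print(f"P95 next token latency: {pct(nxt, 95):.2f} milliseconds.")
```

Counting streamed chunks only approximates token counts; a tokenizer-based count would be more precise.
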
###### Online benchmark through wrk

In the container, do the following:
1. Modify `/llm/payload-1024.lua` so that the "model" attribute is correct. By default, we use a prompt that is roughly 1024 tokens long; you can change it if needed. (A quick way to verify the model name before a long run is sketched after the `wrk` command below.)
2. Start the benchmark with `wrk` using the script below:

@@ -77,8 +103,8 @@ cd /llm
# You can change -t and -c to control the concurrency.
# By default, we use 12 connections to benchmark the service.
wrk -t12 -c12 -d15m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h
```
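
Before committing to a 15-minute `wrk` run, you can verify that the `"model"` value written into `payload-1024.lua` is accepted by the server. A minimal sketch, assuming the default `localhost:8000` endpoint; the model name and the short prompt are placeholders (the real payload uses a roughly 1024-token prompt):

```python
# One-off request to confirm the "model" attribute before running wrk.
import requests

body = {
    "model": "Qwen1.5-14B-Chat",     # must match the "model" attribute in payload-1024.lua
    "prompt": "San Francisco is a",  # placeholder; payload-1024.lua uses a ~1024-token prompt
    "max_tokens": 32,
}
resp = requests.post("http://localhost:8000/v1/completions", json=body, timeout=300)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```

If the model name is wrong, the server returns an error instead of a completion, which is much cheaper to discover here than partway through the `wrk` run.
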
#### Offline benchmark through benchmark_vllm_throughput.py

We have included the benchmark_throughput script provided by `vllm` in our image as `/llm/benchmark_vllm_throughput.py`. To use the benchmark_throughput script, you will need to download the test dataset through:

287 docker/llm/serving/xpu/docker/vllm_online_benchmark.py (new file)
File diff suppressed because one or more lines are too long