parent 86b81c09d9
commit b7bc1023fb
3 changed files with 315 additions and 1 deletions

@@ -23,6 +23,7 @@ RUN apt-get update && \
    # For Qwen series models support
    pip install transformers_stream_generator einops tiktoken

COPY ./vllm_online_benchmark.py        /llm/
COPY ./vllm_offline_inference.py       /llm/
COPY ./payload-1024.lua                /llm/
COPY ./start-vllm-service.sh           /llm/
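
The new `COPY` line bakes `vllm_online_benchmark.py` into the image at `/llm/`, so it only becomes available after the image is rebuilt. A minimal sketch of the two usual ways to pick it up, where the image tag and container name are hypothetical placeholders:

```bash
# Rebuild the serving image so the new benchmark script is included.
docker build -t ipex-llm-serving-xpu:local .

# Or copy the script into an already running container without rebuilding.
docker cp vllm_online_benchmark.py my-vllm-container:/llm/
```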

@@ -67,7 +67,33 @@ We have included multiple example files in `/llm/`:

We can benchmark the api_server to get an estimate of TPS (transactions per second). To do so, you need to start the service first according to the instructions in this [section](https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/vLLM-Serving/README.md#service).
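
Before running either benchmark below, it can help to confirm that the service actually answers requests. A minimal sketch, assuming the server listens on the default port 8000 (the same endpoint the `wrk` script further down targets) and that the model name matches whatever the service was started with:

```bash
# Sanity check: one small completion request against the OpenAI-compatible API.
# The model name here is an assumption; use the one your service is serving.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen1.5-14B-Chat", "prompt": "Hello", "max_tokens": 16}'
```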

###### Online benchmark through benchmark_util

After starting the vLLM service, send requests through `vllm_online_benchmark.py`:

```bash
python vllm_online_benchmark.py $model_name $max_seqs
```
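
For instance, with the model and concurrency shown in the sample output below (the second argument appears to set the number of concurrent requests):

```bash
# Example invocation; adjust to the model your service is serving.
python vllm_online_benchmark.py Qwen1.5-14B-Chat 12
```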

And it will produce output like the following:

```bash
model_name: Qwen1.5-14B-Chat
max_seq: 12
Warm Up: 100%|█████████████████████████████████████████████████████| 24/24 [01:36<00:00,  4.03s/req]
Benchmarking: 100%|████████████████████████████████████████████████| 60/60 [04:03<00:00,  4.05s/req]
Total time for 60 requests with 12 concurrent requests: xxx seconds.
Average response time: xxx
Token throughput: xxx

Average first token latency: xxx milliseconds.
P90 first token latency: xxx milliseconds.
P95 first token latency: xxx milliseconds.

Average next token latency: xxx milliseconds.
P90 next token latency: xxx milliseconds.
P95 next token latency: xxx milliseconds.
```

###### Online benchmark through wrk
In container, do the following:
1. Modify `/llm/payload-1024.lua` so that the "model" attribute is correct (see the sketch after the script below). By default, we use a prompt that is roughly 1024 tokens long; you can change it if needed.
2. Start the benchmark with `wrk` using the script below:

@@ -77,8 +103,8 @@ cd /llm
# You can change -t and -c to control the concurrency.
# By default, we use 12 connections to benchmark the service.
wrk -t12 -c12 -d15m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h
```
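
Step 1 above asks you to point the `"model"` attribute in `/llm/payload-1024.lua` at the model your service is serving. A minimal sketch of doing that non-interactively, assuming the Lua script embeds the request body as a JSON string with a `"model"` field (the actual layout of the file shipped in the image may differ):

```bash
# Hypothetical one-liner: replace whatever model name the payload currently
# uses with the model the vLLM service was started with.
sed -i 's/"model": *"[^"]*"/"model": "Qwen1.5-14B-Chat"/' /llm/payload-1024.lua
```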

#### Offline benchmark through benchmark_vllm_throughput.py

We have included the benchmark_throughput script provided by `vllm` in our image as `/llm/benchmark_vllm_throughput.py`. To use the benchmark_throughput script, you will need to download the test dataset through:

287  docker/llm/serving/xpu/docker/vllm_online_benchmark.py  Normal file
File diff suppressed because one or more lines are too long