LLM: update llm latency benchmark. (#8922)

parent 7897eb4b51
commit 3d2efe9608

3 changed files with 13 additions and 5 deletions
README.md
@@ -1,7 +1,7 @@
# All in One Benchmark Test
The all-in-one benchmark test allows users to run all the benchmarks and record the results in a CSV. Users can provide models and related information in `config.yaml`.

-Before running, make sure to have [bigdl-llm](../../../README.md) installed.
+Before running, make sure to have [bigdl-llm](../../../README.md) and [bigdl-nano](../../../../nano/README.md) installed.

## Config
The config YAML file has the following format
@@ -28,4 +28,10 @@ test_api:
Run `python run.py`; this will output the results to `results.csv`.

+For SPR performance, run `bash run-spr.sh`.
-For ARC performance, run `bash run-arc.sh`
+> **Note**
+>
+> In `run-spr.sh`, we set the optimal environment variables via `source bigdl-nano-init -c`, where `-c` stands for disabling jemalloc. Enabling jemalloc may lead to increased latency after multiple trials.
+>
+> The value of `OMP_NUM_THREADS` should be the same as the number of CPU cores specified by `numactl -C`.

+For ARC performance, run `bash run-arc.sh`.
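To make the note above concrete, here is a minimal sketch of keeping the two settings in sync; the 32-core pinning below is a made-up example, not a value from this commit.

```bash
# Hypothetical 32-core run: OMP_NUM_THREADS must equal the number of
# cores pinned by `numactl -C` (cores 0-31 = 32 cores).
export OMP_NUM_THREADS=32
numactl -C 0-31 -m 0 python run.py
```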
run-spr.sh
@@ -1,5 +1,7 @@
#!/bin/bash

+source bigdl-nano-init -c
+export OMP_NUM_THREADS=48
export TRANSFORMERS_OFFLINE=1
-
+# set the following parameters according to the actual specs of the test machine
numactl -C 0-47 -m 0 python $(dirname "$0")/run.py
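As a hedged companion to the comment above, one way to check the actual specs before editing those parameters is `lscpu`; the grep filter below is just one convenient way to read its output.

```bash
# Report CPU, core-per-socket, socket, and NUMA counts so that
# OMP_NUM_THREADS and `numactl -C` can be set to match this machine
# (run-spr.sh above assumes 48 cores on NUMA node 0).
lscpu | grep -E '^(CPU\(s\)|Core\(s\) per socket|Socket\(s\)|NUMA node)'
```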
run.py
@@ -116,8 +116,8 @@ def run_transformer_int4(repo_id,
        model = AutoModel.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True, torch_dtype='auto')
        tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    else:
-        model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
-        tokenizer = AutoTokenizer.from_pretrained(model_path)
+        model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True)
+        tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    end = time.perf_counter()
    print(">> loading of model costs {}s".format(end - st))