IPEX-LLM Benchmarking
=====================

We can benchmark IPEX-LLM on Intel CPUs and GPUs using the benchmark scripts we provide.
Prepare The Environment
-----------------------

You can refer to here to install IPEX-LLM in your environment. The following dependencies are also needed to run the benchmark scripts:

.. code-block:: bash

   pip install pandas
   pip install omegaconf
Prepare The Scripts
-------------------

Navigate to your local workspace and then download IPEX-LLM from GitHub. Modify the ``config.yaml`` under the ``all-in-one`` folder for your own benchmark configurations.

.. code-block:: bash

   cd your/local/workspace
   git clone https://github.com/intel-analytics/ipex-llm.git
   cd ipex-llm/python/llm/dev/benchmark/all-in-one/
Configure YAML File
-------------------

.. code-block:: yaml

   repo_id:
     - 'meta-llama/Llama-2-7b-chat-hf'
   local_model_hub: '/mnt/disk1/models'
   warm_up: 1
   num_trials: 3
   low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4)
   batch_size: 1 # default to 1
   in_out_pairs:
     - '32-32'
     - '1024-128'
     - '2048-256'
   test_api:
     - "transformer_int4_gpu"
   cpu_embedding: False
Some parameters in the YAML file that you can configure:

- ``repo_id``: The name of the model and its organization.
- ``local_model_hub``: The folder path where the models are stored on your machine.
- ``low_bit``: The low-bit precision you want to convert the model to for benchmarking.
- ``batch_size``: The number of samples on which the model makes predictions in one forward pass.
- ``in_out_pairs``: Input sequence length and output sequence length combined by '-'.
- ``test_api``: Use different test functions on different machines:

  - ``transformer_int4_gpu``: on Intel GPU for Linux
  - ``transformer_int4_gpu_win``: on Intel GPU for Windows
  - ``transformer_int4``: on Intel CPU

- ``cpu_embedding``: Whether to put embedding on CPU (currently only available for Windows GPU-related ``test_api``).
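Each ``in_out_pairs`` entry encodes the input sequence length and output sequence length separated by ``-``. As a minimal sketch (the helper name below is ours for illustration, not part of the benchmark script), such a string can be split like this:

.. code-block:: python

   def parse_in_out_pair(pair: str) -> tuple[int, int]:
       """Split an 'in-out' string such as '1024-128' into
       (input_seq_len, output_seq_len). Hypothetical helper for
       illustration; the actual script may parse differently."""
       in_len, out_len = pair.split("-")
       return int(in_len), int(out_len)

   # Example: the pairs from the sample config above
   for pair in ["32-32", "1024-128", "2048-256"]:
       in_len, out_len = parse_in_out_pair(pair)
       print(f"{pair}: input={in_len} tokens, output={out_len} tokens")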
Run on Windows
--------------

Please refer to here to configure oneAPI environment variables.
.. tabs::

   .. tab:: Intel iGPU

      .. code-block:: bash

         set SYCL_CACHE_PERSISTENT=1
         set BIGDL_LLM_XMX_DISABLED=1
         python run.py

   .. tab:: Intel Arc™ A300-Series or Pro A60

      .. code-block:: bash

         set SYCL_CACHE_PERSISTENT=1
         python run.py

   .. tab:: Other Intel dGPU Series

      .. code-block:: bash

         # e.g. Arc™ A770
         python run.py
Run on Linux
------------
.. tabs::

   .. tab:: Intel Arc™ A-Series and Intel Data Center GPU Flex

      For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series, we recommend:

      .. code-block:: bash

         ./run-arc.sh

   .. tab:: Intel Data Center GPU Max

      Please note that you need to run ``conda install -c conda-forge -y gperftools=2.10`` before running the benchmark script on Intel Data Center GPU Max Series.

      .. code-block:: bash

         ./run-max-gpu.sh

   .. tab:: Intel SPR

      For an Intel SPR machine, we recommend:

      .. code-block:: bash

         ./run-spr.sh

      The script uses a default ``numactl`` strategy. If you want to customize it, please use ``lscpu`` or ``numactl -H`` to check how CPU indices are assigned to NUMA nodes, and make sure the run command is bound to only one socket.
   .. tab:: Intel HBM

      For an Intel HBM machine, we recommend:

      .. code-block:: bash

         ./run-hbm.sh

      The script uses a default ``numactl`` strategy. If you want to customize it, please use ``numactl -H`` to check how the indices of the HBM nodes and CPUs are assigned.

      For example, given the following node distance matrix:

      .. code-block:: bash

         node   0   1   2   3
            0:  10  21  13  23
            1:  21  10  23  13
            2:  13  23  10  23
            3:  23  13  23  10

      Here, the HBM node is the node whose distance from the checked node is 13; for example, node 2 is node 0's HBM node. Also make sure the run command is bound to only one socket.
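The distance-matrix reading above can be sketched in Python. This is illustration only: the matrix values are the example ones, and the assumption that a distance of 13 marks a CPU node's attached HBM node is taken from that example.

.. code-block:: python

   # Sketch: locate a CPU node's HBM node in a `numactl -H` distance
   # matrix, assuming (as in the example above) that distance 13
   # marks the attached HBM node.
   DISTANCES = [
       [10, 21, 13, 23],
       [21, 10, 23, 13],
       [13, 23, 10, 23],
       [23, 13, 23, 10],
   ]
   HBM_DISTANCE = 13  # assumption taken from the example matrix

   def hbm_node_of(cpu_node: int) -> int:
       """Return the node whose distance from `cpu_node` equals HBM_DISTANCE."""
       return DISTANCES[cpu_node].index(HBM_DISTANCE)

   print(hbm_node_of(0))  # node 2 is node 0's HBM node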
Result
------

After the script finishes running, you can obtain a CSV result file under the current folder. For performance results, mainly look at the columns ``1st token avg latency (ms)`` and ``2+ avg latency (ms/token)``. You can also check whether the column ``actual input/output tokens`` is consistent with the column ``input/output tokens``, and whether the parameters you specified in ``config.yaml`` were successfully applied in the benchmarking.
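As a rough sketch of inspecting the result file, the snippet below reads the columns named above with the standard-library ``csv`` module. The sample row and its latency numbers are made up for illustration; a real CSV produced by ``run.py`` will contain your measured values and may have additional columns.

.. code-block:: python

   import csv
   import io

   # Illustrative only: a tiny result file with the columns named above.
   sample_csv = """model,1st token avg latency (ms),2+ avg latency (ms/token),input/output tokens,actual input/output tokens
   meta-llama/Llama-2-7b-chat-hf,350.12,45.67,1024-128,1024-128
   """

   for row in csv.DictReader(io.StringIO(sample_csv)):
       first = float(row["1st token avg latency (ms)"])
       rest = float(row["2+ avg latency (ms/token)"])
       # Sanity check: requested vs. actual sequence lengths should match
       consistent = row["input/output tokens"] == row["actual input/output tokens"]
       print(f"{row['model']}: first token {first} ms, "
             f"next tokens {rest} ms/token, lengths consistent: {consistent}")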