revise the benchmark part in python inference docker (#11020)

parent 586a151f9c
commit 0b7e78b592

7 changed files with 42 additions and 95 deletions

````diff
@@ -1,6 +1,6 @@
 # IPEX-LLM Docker Containers

-You can run IPEX-LLM containers (via docker or k8s) for inference, serving and fine-tuning on Intel CPU and GPU. Details on how to use these containers are available at [IPEX-LLM Docker Container Guides](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Docker/index.html).
+You can run IPEX-LLM containers (via docker or k8s) for inference, serving and fine-tuning on Intel CPU and GPU. Details on how to use these containers are available at [IPEX-LLM Docker Container Guides](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/index.html).

 ### Prerequisites

````

````diff
@@ -11,7 +11,7 @@ You can run IPEX-LLM containers (via docker or k8s) for inference, serving and f


 #### Pull a IPEX-LLM Docker Image
-To pull IPEX-LLM Docker images from [Intel Analytics Docker Hub](https://hub.docker.com/u/intelanalytics), use the `docker pull` command. For instance, to pull the CPU inference image:
+To pull IPEX-LLM Docker images from [Docker Hub](https://hub.docker.com/u/intelanalytics), use the `docker pull` command. For instance, to pull the CPU inference image:
 ```bash
 docker pull intelanalytics/ipex-llm-cpu:2.1.0-SNAPSHOT
 ```
````
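
Not part of the commit, but a quick sanity check once the pull finishes: `docker images` accepts a repository name as a filter, so you can confirm the image and its tag are available locally (the repository name below is the one used in the hunk above).

```bash
# Verify that the pulled image is present locally (repository name taken from the hunk above)
docker images intelanalytics/ipex-llm-cpu
```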

````diff
@@ -29,7 +29,7 @@ Available images in hub are:
 | intelanalytics/ipex-llm-finetune-qlora-xpu:2.1.0-SNAPSHOT| GPU Finetuning|

 #### Run a Container
-Use `docker run` command to run an IPEX-LLM docker container. For detailed instructions, refer to the [IPEX-LLM Docker Container Guides](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Docker/index.html).
+Use `docker run` command to run an IPEX-LLM docker container. For detailed instructions, refer to the [IPEX-LLM Docker Container Guides](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/index.html).


 #### Build Docker Image
````
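
The linked DockerGuides pages carry the authoritative `docker run` instructions; the sketch below only illustrates the general shape of such a command for the CPU inference image pulled above. The container name, the host model folder, and the `--net=host` choice are assumptions for illustration, not values taken from the guides.

```bash
# Illustrative sketch only -- see the DockerGuides pages for the exact command and flags.
export DOCKER_IMAGE=intelanalytics/ipex-llm-cpu:2.1.0-SNAPSHOT
export MODEL_PATH=/path/to/your/models   # host folder holding downloaded models (assumption)

docker run -itd \
  --net=host \
  --name=ipex-llm-cpu-container \
  -v "$MODEL_PATH":/llm/models \
  "$DOCKER_IMAGE"
```

Mounting the model folder at `/llm/models` matches the convention the benchmark docs in this commit refer to ("Replace 'path to your local model hub' with /llm/models"); GPU containers need additional device flags, so treat this purely as a starting point.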

````diff
@@ -75,10 +75,10 @@
 </label>
 <ul class="bigdl-quicklinks-section-nav">
 <li>
-<a href="doc/LLM/Docker/docker_windows_gpu.html">Overview of IPEX-LLM Containers for Intel GPU</a>
+<a href="doc/LLM/DockerGuides/docker_windows_gpu.html">Overview of IPEX-LLM Containers for Intel GPU</a>
 </li>
 <li>
-<a href="doc/LLM/Docker/docker_pytorch_inference_gpu.html">Run PyTorch Inference on an Intel GPU via Docker</a>
+<a href="doc/LLM/DockerGuides/docker_pytorch_inference_gpu.html">Run PyTorch Inference on an Intel GPU via Docker</a>
 </li>
 </ul>
 </li>
````

````diff
@@ -15,12 +15,12 @@ subtrees:
     title: "CPU"
   - file: doc/LLM/Overview/install_gpu
     title: "GPU"
-  - file: doc/LLM/Docker/index
+  - file: doc/LLM/DockerGuides/index
     title: "Docker Guides"
     subtrees:
       - entries:
-          - file: doc/LLM/Docker/docker_windows_gpu
-          - file: doc/LLM/Docker/docker_pytorch_inference_gpu
+          - file: doc/LLM/DockerGuides/docker_windows_gpu
+          - file: doc/LLM/DockerGuides/docker_pytorch_inference_gpu
   - file: doc/LLM/Quickstart/index
     title: "Quickstart"
     subtrees:
````

````diff
@@ -91,90 +91,24 @@ cd /benchmark/all-in-one
 vim config.yaml
 ```

-**Modify config.yaml**
-```eval_rst
-.. note::
-
-   ``dtype``: The model is originally loaded in this data type. After ipex-llm conversion, all the non-linear layers remain to use this data type.
-
-   ``qtype``: ipex-llm will convert all the linear-layers' weight to this data type.
-```
-
-```yaml
-repo_id:
-  # - 'THUDM/chatglm2-6b'
-  - 'meta-llama/Llama-2-7b-chat-hf'
-  # - 'liuhaotian/llava-v1.5-7b' # requires a LLAVA_REPO_DIR env variables pointing to the llava dir; added only for gpu win related test_api now
-local_model_hub: 'path to your local model hub'
-warm_up: 1 # must set >=2 when run "pipeline_parallel_gpu" test_api
-num_trials: 3
-num_beams: 1 # default to greedy search
-low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4)
-batch_size: 1 # default to 1
-in_out_pairs:
-  - '32-32'
-  - '1024-128'
-test_api:
-  - "transformer_int4_gpu" # on Intel GPU, transformer-like API, (qtype=int4)
-  # - "transformer_int4_gpu_win" # on Intel GPU for Windows, transformer-like API, (qtype=int4)
-  # - "transformer_int4_fp16_gpu" # on Intel GPU, transformer-like API, (qtype=int4), (dtype=fp16)
-  # - "transformer_int4_fp16_gpu_win" # on Intel GPU for Windows, transformer-like API, (qtype=int4), (dtype=fp16)
-  # - "transformer_int4_loadlowbit_gpu_win" # on Intel GPU for Windows, transformer-like API, (qtype=int4), use load_low_bit API. Please make sure you have used the save.py to save the converted low bit model
-  # - "ipex_fp16_gpu" # on Intel GPU, use native transformers API, (dtype=fp16)
-  # - "bigdl_fp16_gpu" # on Intel GPU, use ipex-llm transformers API, (dtype=fp16), (qtype=fp16)
-  # - "optimize_model_gpu" # on Intel GPU, can optimize any pytorch models include transformer model
-  # - "deepspeed_optimize_model_gpu" # on Intel GPU, deepspeed autotp inference
-  # - "pipeline_parallel_gpu" # on Intel GPU, pipeline parallel inference
-  # - "speculative_gpu" # on Intel GPU, inference with self-speculative decoding
-  # - "transformer_int4" # on Intel CPU, transformer-like API, (qtype=int4)
-  # - "native_int4" # on Intel CPU
-  # - "optimize_model" # on Intel CPU, can optimize any pytorch models include transformer model
-  # - "pytorch_autocast_bf16" # on Intel CPU
-  # - "transformer_autocast_bf16" # on Intel CPU
-  # - "bigdl_ipex_bf16" # on Intel CPU, (qtype=bf16)
-  # - "bigdl_ipex_int4" # on Intel CPU, (qtype=int4)
-  # - "bigdl_ipex_int8" # on Intel CPU, (qtype=int8)
-  # - "speculative_cpu" # on Intel CPU, inference with self-speculative decoding
-  # - "deepspeed_transformer_int4_cpu" # on Intel CPU, deepspeed autotp inference
-cpu_embedding: False # whether put embedding to CPU
-streaming: False # whether output in streaming way (only avaiable now for gpu win related test_api)
-use_fp16_torch_dtype: True # whether use fp16 for non-linear layer (only avaiable now for "pipeline_parallel_gpu" test_api)
-n_gpu: 2 # number of GPUs to use (only avaiable now for "pipeline_parallel_gpu" test_api)
-```
-
-Some parameters in the yaml file that you can configure:
-
-- `repo_id`: The name of the model and its organization.
-- `local_model_hub`: The folder path where the models are stored on your machine. Replace 'path to your local model hub' with /llm/models.
-- `warm_up`: The number of warmup trials before performance benchmarking (must set to >= 2 when using "pipeline_parallel_gpu" test_api).
-- `num_trials`: The number of runs for performance benchmarking (the final result is the average of all trials).
-- `low_bit`: The low_bit precision you want to convert to for benchmarking.
-- `batch_size`: The number of samples on which the models make predictions in one forward pass.
-- `in_out_pairs`: Input sequence length and output sequence length combined by '-'.
-- `test_api`: Different test functions for different machines.
-- `cpu_embedding`: Whether to put embedding on CPU (only available for windows GPU-related test_api).
-- `streaming`: Whether to output in a streaming way (only available for GPU Windows-related test_api).
-- `use_fp16_torch_dtype`: Whether to use fp16 for the non-linear layer (only available for "pipeline_parallel_gpu" test_api).
-- `n_gpu`: Number of GPUs to use (only available for "pipeline_parallel_gpu" test_api).
-
-```eval_rst
-.. note::
-
-   If you want to benchmark the performance without warmup, you can set ``warm_up: 0`` and ``num_trials: 1`` in ``config.yaml``, and run each single model and in_out_pair separately.
-```
-
-After configuring the `config.yaml`, run the following scripts:
+In the `config.yaml`, change `repo_id` to the model you want to test and `local_model_hub` to point to your model hub path.
+
+```yaml
+...
+repo_id:
+  - 'meta-llama/Llama-2-7b-chat-hf'
+local_model_hub: '/path/to/your/mode/folder'
+...
+```
+
+After modifying `config.yaml`, run the following commands to run benchmarking:

 ```bash
 source ipex-llm-init --gpu --device <value>
 python run.py
 ```

-**Result**
+**Result Interpretation**

 After the benchmarking is completed, you can obtain a CSV result file under the current folder. You can mainly look at the results of columns `1st token avg latency (ms)` and `2+ avg latency (ms/token)` for the benchmark results. You can also check whether the column `actual input/output tokens` is consistent with the column `input/output tokens` and whether the parameters you specified in `config.yaml` have been successfully applied in the benchmarking.
````
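
Not part of the diff: the paragraph above says `run.py` writes a CSV result file into the current folder. A minimal way to locate and peek at it from inside the container is sketched below; the `*.csv` glob is an assumption, since the exact file name is generated by the benchmark script.

```bash
# Find the newest CSV produced by run.py (file-name pattern is an assumption)
latest_csv=$(ls -t *.csv | head -n 1)

# Print the header row and the latest result row; look for the
# "1st token avg latency (ms)" and "2+ avg latency (ms/token)" columns mentioned above.
head -n 1 "$latest_csv"
tail -n 1 "$latest_csv"
```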

````diff
@@ -23,12 +23,14 @@ cd ipex-llm/python/llm/dev/benchmark/all-in-one/

 ## config.yaml

 ```yaml
 repo_id:
   - 'meta-llama/Llama-2-7b-chat-hf'
-local_model_hub: '/mnt/disk1/models'
-warm_up: 1
+local_model_hub: 'path to your local model hub'
+warm_up: 1 # must set >=2 when run "pipeline_parallel_gpu" test_api
 num_trials: 3
+num_beams: 1 # default to greedy search
 low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4)
 batch_size: 1 # default to 1
 in_out_pairs:
````

````diff
@@ -36,26 +38,37 @@ in_out_pairs:
   - '1024-128'
   - '2048-256'
 test_api:
-  - "transformer_int4_gpu"
-cpu_embedding: False
+  - "transformer_int4_gpu" # on Intel GPU, transformer-like API, (qtype=int4)
+cpu_embedding: False # whether put embedding to CPU
+streaming: False # whether output in streaming way (only avaiable now for gpu win related test_api)
 ```

 Some parameters in the yaml file that you can configure:

-- repo_id: The name of the model and its organization.
-- local_model_hub: The folder path where the models are stored on your machine.
-- warm_up: The number of runs as warmup trials, executed before performance benchmarking.
-- num_trials: The number of runs for performance benchmarking. The final benchmark result would be the average of all the trials.
-- low_bit: The low_bit precision you want to convert to for benchmarking.
-- batch_size: The number of samples on which the models make predictions in one forward pass.
-- in_out_pairs: Input sequence length and output sequence length combined by '-'.
-- test_api: Use different test functions on different machines.
+- `repo_id`: The name of the model and its organization.
+- `local_model_hub`: The folder path where the models are stored on your machine. Replace 'path to your local model hub' with /llm/models.
+- `warm_up`: The number of warmup trials before performance benchmarking (must set to >= 2 when using "pipeline_parallel_gpu" test_api).
+- `num_trials`: The number of runs for performance benchmarking (the final result is the average of all trials).
+- `low_bit`: The low_bit precision you want to convert to for benchmarking.
+- `batch_size`: The number of samples on which the models make predictions in one forward pass.
+- `in_out_pairs`: Input sequence length and output sequence length combined by '-'.
+- `test_api`: Different test functions for different machines.
   - `transformer_int4_gpu` on Intel GPU for Linux
   - `transformer_int4_gpu_win` on Intel GPU for Windows
   - `transformer_int4` on Intel CPU
-- cpu_embedding: Whether to put embedding on CPU (only available now for windows gpu related test_api).
+- `cpu_embedding`: Whether to put embedding on CPU (only available for windows GPU-related test_api).
+- `streaming`: Whether to output in a streaming way (only available for GPU Windows-related test_api).
+- `use_fp16_torch_dtype`: Whether to use fp16 for the non-linear layer (only available for "pipeline_parallel_gpu" test_api).
+- `n_gpu`: Number of GPUs to use (only available for "pipeline_parallel_gpu" test_api).

-Remark: If you want to benchmark the performance without warmup, you can set `warm_up: 0` and `num_trials: 1` in `config.yaml`, and run each single model and in_out_pair separately.
+```eval_rst
+.. note::
+
+   If you want to benchmark the performance without warmup, you can set ``warm_up: 0`` and ``num_trials: 1`` in ``config.yaml``, and run each single model and in_out_pair separately.
+```

 ## Run on Windows
````
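
Not part of the diff: the note above describes a warmup-free measurement. A hedged sketch of that workflow, assuming you have already edited `config.yaml` so that it contains `warm_up: 0`, `num_trials: 1`, a single `repo_id`, and a single `in_out_pair`:

```bash
# One-off, no-warmup benchmark run as described in the note above
# (assumes config.yaml was edited accordingly beforehand).
cd ipex-llm/python/llm/dev/benchmark/all-in-one/
python run.py
```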