From 0b7e78b59235295e0cee37cadd9fc0adc04997ec Mon Sep 17 00:00:00 2001 From: Shengsheng Huang Date: Tue, 14 May 2024 18:43:41 +0800 Subject: [PATCH] revise the benchmark part in python inference docker (#11020) --- docker/llm/README.md | 6 +- .../source/_templates/sidebar_quicklinks.html | 4 +- docs/readthedocs/source/_toc.yml | 6 +- .../docker_pytorch_inference_gpu.md | 80 ++----------------- .../docker_windows_gpu.md | 0 .../LLM/{Docker => DockerGuides}/index.rst | 0 .../LLM/Quickstart/benchmark_quickstart.md | 41 ++++++---- 7 files changed, 42 insertions(+), 95 deletions(-) rename docs/readthedocs/source/doc/LLM/{Docker => DockerGuides}/docker_pytorch_inference_gpu.md (52%) rename docs/readthedocs/source/doc/LLM/{Docker => DockerGuides}/docker_windows_gpu.md (100%) rename docs/readthedocs/source/doc/LLM/{Docker => DockerGuides}/index.rst (100%) diff --git a/docker/llm/README.md b/docker/llm/README.md index bc7abeb6..8c60eccd 100644 --- a/docker/llm/README.md +++ b/docker/llm/README.md @@ -1,6 +1,6 @@ # IPEX-LLM Docker Containers -You can run IPEX-LLM containers (via docker or k8s) for inference, serving and fine-tuning on Intel CPU and GPU. Details on how to use these containers are available at [IPEX-LLM Docker Container Guides](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Docker/index.html). +You can run IPEX-LLM containers (via docker or k8s) for inference, serving and fine-tuning on Intel CPU and GPU. Details on how to use these containers are available at [IPEX-LLM Docker Container Guides](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/index.html). ### Prerequisites @@ -11,7 +11,7 @@ You can run IPEX-LLM containers (via docker or k8s) for inference, serving and f #### Pull a IPEX-LLM Docker Image -To pull IPEX-LLM Docker images from [Intel Analytics Docker Hub](https://hub.docker.com/u/intelanalytics), use the `docker pull` command. For instance, to pull the CPU inference image: +To pull IPEX-LLM Docker images from [Docker Hub](https://hub.docker.com/u/intelanalytics), use the `docker pull` command. For instance, to pull the CPU inference image: ```bash docker pull intelanalytics/ipex-llm-cpu:2.1.0-SNAPSHOT ``` @@ -29,7 +29,7 @@ Available images in hub are: | intelanalytics/ipex-llm-finetune-qlora-xpu:2.1.0-SNAPSHOT| GPU Finetuning| #### Run a Container -Use `docker run` command to run an IPEX-LLM docker container. For detailed instructions, refer to the [IPEX-LLM Docker Container Guides](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Docker/index.html). +Use `docker run` command to run an IPEX-LLM docker container. For detailed instructions, refer to the [IPEX-LLM Docker Container Guides](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/index.html). 
#### Build Docker Image diff --git a/docs/readthedocs/source/_templates/sidebar_quicklinks.html b/docs/readthedocs/source/_templates/sidebar_quicklinks.html index a2c7ccd7..2551026e 100644 --- a/docs/readthedocs/source/_templates/sidebar_quicklinks.html +++ b/docs/readthedocs/source/_templates/sidebar_quicklinks.html @@ -75,10 +75,10 @@ diff --git a/docs/readthedocs/source/_toc.yml b/docs/readthedocs/source/_toc.yml index 7ca00fac..023a642e 100644 --- a/docs/readthedocs/source/_toc.yml +++ b/docs/readthedocs/source/_toc.yml @@ -15,12 +15,12 @@ subtrees: title: "CPU" - file: doc/LLM/Overview/install_gpu title: "GPU" - - file: doc/LLM/Docker/index + - file: doc/LLM/DockerGuides/index title: "Docker Guides" subtrees: - entries: - - file: doc/LLM/Docker/docker_windows_gpu - - file: doc/LLM/Docker/docker_pytorch_inference_gpu + - file: doc/LLM/DockerGuides/docker_windows_gpu + - file: doc/LLM/DockerGuides/docker_pytorch_inference_gpu - file: doc/LLM/Quickstart/index title: "Quickstart" subtrees: diff --git a/docs/readthedocs/source/doc/LLM/Docker/docker_pytorch_inference_gpu.md b/docs/readthedocs/source/doc/LLM/DockerGuides/docker_pytorch_inference_gpu.md similarity index 52% rename from docs/readthedocs/source/doc/LLM/Docker/docker_pytorch_inference_gpu.md rename to docs/readthedocs/source/doc/LLM/DockerGuides/docker_pytorch_inference_gpu.md index f3725e91..e3245d41 100644 --- a/docs/readthedocs/source/doc/LLM/Docker/docker_pytorch_inference_gpu.md +++ b/docs/readthedocs/source/doc/LLM/DockerGuides/docker_pytorch_inference_gpu.md @@ -91,90 +91,24 @@ cd /benchmark/all-in-one vim config.yaml ``` -**Modify config.yaml** -```eval_rst -.. note:: - - ``dtype``: The model is originally loaded in this data type. After ipex-llm conversion, all the non-linear layers remain to use this data type. - - ``qtype``: ipex-llm will convert all the linear-layers' weight to this data type. -``` - +In the `config.yaml`, change `repo_id` to the model you want to test and `local_model_hub` to point to your model hub path. ```yaml +... repo_id: - # - 'THUDM/chatglm2-6b' - 'meta-llama/Llama-2-7b-chat-hf' - # - 'liuhaotian/llava-v1.5-7b' # requires a LLAVA_REPO_DIR env variables pointing to the llava dir; added only for gpu win related test_api now -local_model_hub: 'path to your local model hub' -warm_up: 1 # must set >=2 when run "pipeline_parallel_gpu" test_api -num_trials: 3 -num_beams: 1 # default to greedy search -low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4) -batch_size: 1 # default to 1 -in_out_pairs: - - '32-32' - - '1024-128' -test_api: - - "transformer_int4_gpu" # on Intel GPU, transformer-like API, (qtype=int4) - # - "transformer_int4_gpu_win" # on Intel GPU for Windows, transformer-like API, (qtype=int4) - # - "transformer_int4_fp16_gpu" # on Intel GPU, transformer-like API, (qtype=int4), (dtype=fp16) - # - "transformer_int4_fp16_gpu_win" # on Intel GPU for Windows, transformer-like API, (qtype=int4), (dtype=fp16) - # - "transformer_int4_loadlowbit_gpu_win" # on Intel GPU for Windows, transformer-like API, (qtype=int4), use load_low_bit API. 
Please make sure you have used the save.py to save the converted low bit model
-  # - "ipex_fp16_gpu" # on Intel GPU, use native transformers API, (dtype=fp16)
-  # - "bigdl_fp16_gpu" # on Intel GPU, use ipex-llm transformers API, (dtype=fp16), (qtype=fp16)
-  # - "optimize_model_gpu" # on Intel GPU, can optimize any pytorch models include transformer model
-  # - "deepspeed_optimize_model_gpu" # on Intel GPU, deepspeed autotp inference
-  # - "pipeline_parallel_gpu" # on Intel GPU, pipeline parallel inference
-  # - "speculative_gpu" # on Intel GPU, inference with self-speculative decoding
-  # - "transformer_int4" # on Intel CPU, transformer-like API, (qtype=int4)
-  # - "native_int4" # on Intel CPU
-  # - "optimize_model" # on Intel CPU, can optimize any pytorch models include transformer model
-  # - "pytorch_autocast_bf16" # on Intel CPU
-  # - "transformer_autocast_bf16" # on Intel CPU
-  # - "bigdl_ipex_bf16" # on Intel CPU, (qtype=bf16)
-  # - "bigdl_ipex_int4" # on Intel CPU, (qtype=int4)
-  # - "bigdl_ipex_int8" # on Intel CPU, (qtype=int8)
-  # - "speculative_cpu" # on Intel CPU, inference with self-speculative decoding
-  # - "deepspeed_transformer_int4_cpu" # on Intel CPU, deepspeed autotp inference
-cpu_embedding: False # whether put embedding to CPU
-streaming: False # whether output in streaming way (only avaiable now for gpu win related test_api)
-use_fp16_torch_dtype: True # whether use fp16 for non-linear layer (only avaiable now for "pipeline_parallel_gpu" test_api)
-n_gpu: 2 # number of GPUs to use (only avaiable now for "pipeline_parallel_gpu" test_api)
-```
+local_model_hub: '/path/to/your/model/folder'
+...
+```
 
-Some parameters in the yaml file that you can configure:
-
-
-- `repo_id`: The name of the model and its organization.
-- `local_model_hub`: The folder path where the models are stored on your machine. Replace 'path to your local model hub' with /llm/models.
-- `warm_up`: The number of warmup trials before performance benchmarking (must set to >= 2 when using "pipeline_parallel_gpu" test_api).
-- `num_trials`: The number of runs for performance benchmarking (the final result is the average of all trials).
-- `low_bit`: The low_bit precision you want to convert to for benchmarking.
-- `batch_size`: The number of samples on which the models make predictions in one forward pass.
-- `in_out_pairs`: Input sequence length and output sequence length combined by '-'.
-- `test_api`: Different test functions for different machines.
-- `cpu_embedding`: Whether to put embedding on CPU (only available for windows GPU-related test_api).
-- `streaming`: Whether to output in a streaming way (only available for GPU Windows-related test_api).
-- `use_fp16_torch_dtype`: Whether to use fp16 for the non-linear layer (only available for "pipeline_parallel_gpu" test_api).
-- `n_gpu`: Number of GPUs to use (only available for "pipeline_parallel_gpu" test_api).
-
-
-```eval_rst
-.. note::
-
-   If you want to benchmark the performance without warmup, you can set ``warm_up: 0`` and ``num_trials: 1`` in ``config.yaml``, and run each single model and in_out_pair separately.
-```
-
-
-After configuring the `config.yaml`, run the following scripts:
+After modifying `config.yaml`, run the following commands to start benchmarking:
 ```bash
 source ipex-llm-init --gpu --device
 python run.py
 ```
 
-**Result**
+**Result Interpretation**
 
 After the benchmarking is completed, you can obtain a CSV result file under the current folder.
You can mainly look at the results of columns `1st token avg latency (ms)` and `2+ avg latency (ms/token)` for the benchmark results. You can also check whether the column `actual input/output tokens` is consistent with the column `input/output tokens` and whether the parameters you specified in `config.yaml` have been successfully applied in the benchmarking.
diff --git a/docs/readthedocs/source/doc/LLM/Docker/docker_windows_gpu.md b/docs/readthedocs/source/doc/LLM/DockerGuides/docker_windows_gpu.md
similarity index 100%
rename from docs/readthedocs/source/doc/LLM/Docker/docker_windows_gpu.md
rename to docs/readthedocs/source/doc/LLM/DockerGuides/docker_windows_gpu.md
diff --git a/docs/readthedocs/source/doc/LLM/Docker/index.rst b/docs/readthedocs/source/doc/LLM/DockerGuides/index.rst
similarity index 100%
rename from docs/readthedocs/source/doc/LLM/Docker/index.rst
rename to docs/readthedocs/source/doc/LLM/DockerGuides/index.rst
diff --git a/docs/readthedocs/source/doc/LLM/Quickstart/benchmark_quickstart.md b/docs/readthedocs/source/doc/LLM/Quickstart/benchmark_quickstart.md
index 6a2aa5ac..6bd48e87 100644
--- a/docs/readthedocs/source/doc/LLM/Quickstart/benchmark_quickstart.md
+++ b/docs/readthedocs/source/doc/LLM/Quickstart/benchmark_quickstart.md
@@ -23,12 +23,14 @@ cd ipex-llm/python/llm/dev/benchmark/all-in-one/
 
 ## config.yaml
 
+
 ```yaml
 repo_id:
   - 'meta-llama/Llama-2-7b-chat-hf'
-local_model_hub: '/mnt/disk1/models'
-warm_up: 1
+local_model_hub: 'path to your local model hub'
+warm_up: 1 # must be set >=2 when running "pipeline_parallel_gpu" test_api
 num_trials: 3
+num_beams: 1 # default to greedy search
 low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4)
 batch_size: 1 # default to 1
 in_out_pairs:
@@ -36,26 +38,37 @@ in_out_pairs:
   - '1024-128'
   - '2048-256'
 test_api:
-  - "transformer_int4_gpu"
-cpu_embedding: False
+  - "transformer_int4_gpu" # on Intel GPU, transformer-like API, (qtype=int4)
+cpu_embedding: False # whether to put embedding on CPU
+streaming: False # whether to output in streaming way (only available now for Windows GPU-related test_api)
 ```
 
 Some parameters in the yaml file that you can configure:
+
-- repo_id: The name of the model and its organization.
-- local_model_hub: The folder path where the models are stored on your machine.
-- warm_up: The number of runs as warmup trials, executed before performance benchmarking.
-- num_trials: The number of runs for performance benchmarking. The final benchmark result would be the average of all the trials.
-- low_bit: The low_bit precision you want to convert to for benchmarking.
-- batch_size: The number of samples on which the models make predictions in one forward pass.
-- in_out_pairs: Input sequence length and output sequence length combined by '-'.
-- test_api: Use different test functions on different machines.
+
+- `repo_id`: The name of the model and its organization.
+- `local_model_hub`: The folder path where the models are stored on your machine. Replace 'path to your local model hub' with /llm/models.
+- `warm_up`: The number of warmup trials before performance benchmarking (must be set to >= 2 when using "pipeline_parallel_gpu" test_api).
+- `num_trials`: The number of runs for performance benchmarking (the final result is the average of all trials).
+- `low_bit`: The low_bit precision you want to convert to for benchmarking.
+- `batch_size`: The number of samples on which the models make predictions in one forward pass.
+- `in_out_pairs`: Input sequence length and output sequence length combined by '-'.
+- `test_api`: Different test functions for different machines.
     - `transformer_int4_gpu` on Intel GPU for Linux
     - `transformer_int4_gpu_win` on Intel GPU for Windows
     - `transformer_int4` on Intel CPU
-- cpu_embedding: Whether to put embedding on CPU (only available now for windows gpu related test_api).
+- `cpu_embedding`: Whether to put embedding on CPU (only available for Windows GPU-related test_api).
+- `streaming`: Whether to output in a streaming way (only available for Windows GPU-related test_api).
+- `use_fp16_torch_dtype`: Whether to use fp16 for the non-linear layer (only available for "pipeline_parallel_gpu" test_api).
+- `n_gpu`: Number of GPUs to use (only available for "pipeline_parallel_gpu" test_api).
+
+
+```eval_rst
+.. note::
+
+   If you want to benchmark the performance without warmup, you can set ``warm_up: 0`` and ``num_trials: 1`` in ``config.yaml``, and run each single model and in_out_pair separately.
+```
 
-Remark: If you want to benchmark the performance without warmup, you can set `warm_up: 0` and `num_trials: 1` in `config.yaml`, and run each single model and in_out_pair separately.
 
 ## Run on Windows
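As a supplementary illustration outside the patch itself, below is a minimal sketch of how one might inspect the benchmark CSV that `run.py` produces, using the column names called out in the revised docs. The pandas dependency, the `*.csv` glob, and the exact header strings are assumptions; adjust them to your environment.

```python
# Supplementary sketch (not part of the patch above): a quick way to inspect
# the CSV that run.py writes. Assumes pandas is installed and at least one
# result CSV exists in the current folder; the "*.csv" glob and the column
# names below are assumptions taken from the docs, adjust them if they differ.
import glob
import os

import pandas as pd

csv_files = glob.glob("*.csv")
if not csv_files:
    raise SystemExit("No benchmark CSV found in the current folder.")

latest = max(csv_files, key=os.path.getmtime)  # most recently written result file
df = pd.read_csv(latest)

# Columns highlighted in the docs above; guard in case header names differ.
wanted = [
    "1st token avg latency (ms)",
    "2+ avg latency (ms/token)",
    "input/output tokens",
    "actual input/output tokens",
]
present = [c for c in wanted if c in df.columns]
missing = [c for c in wanted if c not in df.columns]
if missing:
    print(f"Columns not found (header names may differ in your CSV): {missing}")

print(f"Results from {latest}:")
print(df[present].to_string(index=False))
```

Reading the file programmatically makes it easy to confirm that `actual input/output tokens` matches the requested `input/output tokens` across trials, as the docs recommend.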