From 0b7e78b59235295e0cee37cadd9fc0adc04997ec Mon Sep 17 00:00:00 2001 From: Shengsheng Huang Date: Tue, 14 May 2024 18:43:41 +0800 Subject: [PATCH] revise the benchmark part in python inference docker (#11020) --- docker/llm/README.md | 6 +- .../source/_templates/sidebar_quicklinks.html | 4 +- docs/readthedocs/source/_toc.yml | 6 +- .../docker_pytorch_inference_gpu.md | 80 ++----------------- .../docker_windows_gpu.md | 0 .../LLM/{Docker => DockerGuides}/index.rst | 0 .../LLM/Quickstart/benchmark_quickstart.md | 41 ++++++---- 7 files changed, 42 insertions(+), 95 deletions(-) rename docs/readthedocs/source/doc/LLM/{Docker => DockerGuides}/docker_pytorch_inference_gpu.md (52%) rename docs/readthedocs/source/doc/LLM/{Docker => DockerGuides}/docker_windows_gpu.md (100%) rename docs/readthedocs/source/doc/LLM/{Docker => DockerGuides}/index.rst (100%) diff --git a/docker/llm/README.md b/docker/llm/README.md index bc7abeb6..8c60eccd 100644 --- a/docker/llm/README.md +++ b/docker/llm/README.md @@ -1,6 +1,6 @@ # IPEX-LLM Docker Containers -You can run IPEX-LLM containers (via docker or k8s) for inference, serving and fine-tuning on Intel CPU and GPU. Details on how to use these containers are available at [IPEX-LLM Docker Container Guides](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Docker/index.html). +You can run IPEX-LLM containers (via docker or k8s) for inference, serving and fine-tuning on Intel CPU and GPU. Details on how to use these containers are available at [IPEX-LLM Docker Container Guides](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/index.html). ### Prerequisites @@ -11,7 +11,7 @@ You can run IPEX-LLM containers (via docker or k8s) for inference, serving and f #### Pull a IPEX-LLM Docker Image -To pull IPEX-LLM Docker images from [Intel Analytics Docker Hub](https://hub.docker.com/u/intelanalytics), use the `docker pull` command. For instance, to pull the CPU inference image: +To pull IPEX-LLM Docker images from [Docker Hub](https://hub.docker.com/u/intelanalytics), use the `docker pull` command. For instance, to pull the CPU inference image: ```bash docker pull intelanalytics/ipex-llm-cpu:2.1.0-SNAPSHOT ``` @@ -29,7 +29,7 @@ Available images in hub are: | intelanalytics/ipex-llm-finetune-qlora-xpu:2.1.0-SNAPSHOT| GPU Finetuning| #### Run a Container -Use `docker run` command to run an IPEX-LLM docker container. For detailed instructions, refer to the [IPEX-LLM Docker Container Guides](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Docker/index.html). +Use `docker run` command to run an IPEX-LLM docker container. For detailed instructions, refer to the [IPEX-LLM Docker Container Guides](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/index.html). 
#### Build Docker Image diff --git a/docs/readthedocs/source/_templates/sidebar_quicklinks.html b/docs/readthedocs/source/_templates/sidebar_quicklinks.html index a2c7ccd7..2551026e 100644 --- a/docs/readthedocs/source/_templates/sidebar_quicklinks.html +++ b/docs/readthedocs/source/_templates/sidebar_quicklinks.html @@ -75,10 +75,10 @@ diff --git a/docs/readthedocs/source/_toc.yml b/docs/readthedocs/source/_toc.yml index 7ca00fac..023a642e 100644 --- a/docs/readthedocs/source/_toc.yml +++ b/docs/readthedocs/source/_toc.yml @@ -15,12 +15,12 @@ subtrees: title: "CPU" - file: doc/LLM/Overview/install_gpu title: "GPU" - - file: doc/LLM/Docker/index + - file: doc/LLM/DockerGuides/index title: "Docker Guides" subtrees: - entries: - - file: doc/LLM/Docker/docker_windows_gpu - - file: doc/LLM/Docker/docker_pytorch_inference_gpu + - file: doc/LLM/DockerGuides/docker_windows_gpu + - file: doc/LLM/DockerGuides/docker_pytorch_inference_gpu - file: doc/LLM/Quickstart/index title: "Quickstart" subtrees: diff --git a/docs/readthedocs/source/doc/LLM/Docker/docker_pytorch_inference_gpu.md b/docs/readthedocs/source/doc/LLM/DockerGuides/docker_pytorch_inference_gpu.md similarity index 52% rename from docs/readthedocs/source/doc/LLM/Docker/docker_pytorch_inference_gpu.md rename to docs/readthedocs/source/doc/LLM/DockerGuides/docker_pytorch_inference_gpu.md index f3725e91..e3245d41 100644 --- a/docs/readthedocs/source/doc/LLM/Docker/docker_pytorch_inference_gpu.md +++ b/docs/readthedocs/source/doc/LLM/DockerGuides/docker_pytorch_inference_gpu.md @@ -91,90 +91,24 @@ cd /benchmark/all-in-one vim config.yaml ``` -**Modify config.yaml** -```eval_rst -.. note:: - - ``dtype``: The model is originally loaded in this data type. After ipex-llm conversion, all the non-linear layers remain to use this data type. - - ``qtype``: ipex-llm will convert all the linear-layers' weight to this data type. -``` - +In the `config.yaml`, change `repo_id` to the model you want to test and `local_model_hub` to point to your model hub path. ```yaml +... repo_id: - # - 'THUDM/chatglm2-6b' - 'meta-llama/Llama-2-7b-chat-hf' - # - 'liuhaotian/llava-v1.5-7b' # requires a LLAVA_REPO_DIR env variables pointing to the llava dir; added only for gpu win related test_api now -local_model_hub: 'path to your local model hub' -warm_up: 1 # must set >=2 when run "pipeline_parallel_gpu" test_api -num_trials: 3 -num_beams: 1 # default to greedy search -low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4) -batch_size: 1 # default to 1 -in_out_pairs: - - '32-32' - - '1024-128' -test_api: - - "transformer_int4_gpu" # on Intel GPU, transformer-like API, (qtype=int4) - # - "transformer_int4_gpu_win" # on Intel GPU for Windows, transformer-like API, (qtype=int4) - # - "transformer_int4_fp16_gpu" # on Intel GPU, transformer-like API, (qtype=int4), (dtype=fp16) - # - "transformer_int4_fp16_gpu_win" # on Intel GPU for Windows, transformer-like API, (qtype=int4), (dtype=fp16) - # - "transformer_int4_loadlowbit_gpu_win" # on Intel GPU for Windows, transformer-like API, (qtype=int4), use load_low_bit API. 
Please make sure you have used the save.py to save the converted low bit model
-  # - "ipex_fp16_gpu" # on Intel GPU, use native transformers API, (dtype=fp16)
-  # - "bigdl_fp16_gpu" # on Intel GPU, use ipex-llm transformers API, (dtype=fp16), (qtype=fp16)
-  # - "optimize_model_gpu" # on Intel GPU, can optimize any pytorch models include transformer model
-  # - "deepspeed_optimize_model_gpu" # on Intel GPU, deepspeed autotp inference
-  # - "pipeline_parallel_gpu" # on Intel GPU, pipeline parallel inference
-  # - "speculative_gpu" # on Intel GPU, inference with self-speculative decoding
-  # - "transformer_int4" # on Intel CPU, transformer-like API, (qtype=int4)
-  # - "native_int4" # on Intel CPU
-  # - "optimize_model" # on Intel CPU, can optimize any pytorch models include transformer model
-  # - "pytorch_autocast_bf16" # on Intel CPU
-  # - "transformer_autocast_bf16" # on Intel CPU
-  # - "bigdl_ipex_bf16" # on Intel CPU, (qtype=bf16)
-  # - "bigdl_ipex_int4" # on Intel CPU, (qtype=int4)
-  # - "bigdl_ipex_int8" # on Intel CPU, (qtype=int8)
-  # - "speculative_cpu" # on Intel CPU, inference with self-speculative decoding
-  # - "deepspeed_transformer_int4_cpu" # on Intel CPU, deepspeed autotp inference
-cpu_embedding: False # whether put embedding to CPU
-streaming: False # whether output in streaming way (only avaiable now for gpu win related test_api)
-use_fp16_torch_dtype: True # whether use fp16 for non-linear layer (only avaiable now for "pipeline_parallel_gpu" test_api)
-n_gpu: 2 # number of GPUs to use (only avaiable now for "pipeline_parallel_gpu" test_api)
-```
+local_model_hub: '/path/to/your/model/folder'
+...
+```
 
-Some parameters in the yaml file that you can configure:
-
-
-- `repo_id`: The name of the model and its organization.
-- `local_model_hub`: The folder path where the models are stored on your machine. Replace 'path to your local model hub' with /llm/models.
-- `warm_up`: The number of warmup trials before performance benchmarking (must set to >= 2 when using "pipeline_parallel_gpu" test_api).
-- `num_trials`: The number of runs for performance benchmarking (the final result is the average of all trials).
-- `low_bit`: The low_bit precision you want to convert to for benchmarking.
-- `batch_size`: The number of samples on which the models make predictions in one forward pass.
-- `in_out_pairs`: Input sequence length and output sequence length combined by '-'.
-- `test_api`: Different test functions for different machines.
-- `cpu_embedding`: Whether to put embedding on CPU (only available for windows GPU-related test_api).
-- `streaming`: Whether to output in a streaming way (only available for GPU Windows-related test_api).
-- `use_fp16_torch_dtype`: Whether to use fp16 for the non-linear layer (only available for "pipeline_parallel_gpu" test_api).
-- `n_gpu`: Number of GPUs to use (only available for "pipeline_parallel_gpu" test_api).
-
-
-```eval_rst
-.. note::
-
-   If you want to benchmark the performance without warmup, you can set ``warm_up: 0`` and ``num_trials: 1`` in ``config.yaml``, and run each single model and in_out_pair separately.
-```
-
-
-After configuring the `config.yaml`, run the following scripts:
+After modifying `config.yaml`, run the following commands to start benchmarking:
 ```bash
 source ipex-llm-init --gpu --device
 python run.py
 ```
 
-**Result**
+**Result Interpretation**
 
 After the benchmarking is completed, you can obtain a CSV result file under the current folder.
You can mainly look at the results of columns `1st token avg latency (ms)` and `2+ avg latency (ms/token)` for the benchmark results. You can also check whether the column `actual input/output tokens` is consistent with the column `input/output tokens` and whether the parameters you specified in `config.yaml` have been successfully applied in the benchmarking.
diff --git a/docs/readthedocs/source/doc/LLM/Docker/docker_windows_gpu.md b/docs/readthedocs/source/doc/LLM/DockerGuides/docker_windows_gpu.md
similarity index 100%
rename from docs/readthedocs/source/doc/LLM/Docker/docker_windows_gpu.md
rename to docs/readthedocs/source/doc/LLM/DockerGuides/docker_windows_gpu.md
diff --git a/docs/readthedocs/source/doc/LLM/Docker/index.rst b/docs/readthedocs/source/doc/LLM/DockerGuides/index.rst
similarity index 100%
rename from docs/readthedocs/source/doc/LLM/Docker/index.rst
rename to docs/readthedocs/source/doc/LLM/DockerGuides/index.rst
diff --git a/docs/readthedocs/source/doc/LLM/Quickstart/benchmark_quickstart.md b/docs/readthedocs/source/doc/LLM/Quickstart/benchmark_quickstart.md
index 6a2aa5ac..6bd48e87 100644
--- a/docs/readthedocs/source/doc/LLM/Quickstart/benchmark_quickstart.md
+++ b/docs/readthedocs/source/doc/LLM/Quickstart/benchmark_quickstart.md
@@ -23,12 +23,14 @@ cd ipex-llm/python/llm/dev/benchmark/all-in-one/
 
 ## config.yaml
 
+
 ```yaml
 repo_id:
   - 'meta-llama/Llama-2-7b-chat-hf'
-local_model_hub: '/mnt/disk1/models'
-warm_up: 1
+local_model_hub: 'path to your local model hub'
+warm_up: 1 # must be set >=2 when running "pipeline_parallel_gpu" test_api
 num_trials: 3
+num_beams: 1 # default to greedy search
 low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4)
 batch_size: 1 # default to 1
 in_out_pairs:
@@ -36,26 +38,37 @@ in_out_pairs:
   - '1024-128'
   - '2048-256'
 test_api:
-  - "transformer_int4_gpu"
-cpu_embedding: False
+  - "transformer_int4_gpu" # on Intel GPU, transformer-like API, (qtype=int4)
+cpu_embedding: False # whether to put embedding on CPU
+streaming: False # whether to output in streaming way (only available now for Windows GPU-related test_api)
 ```
 
 Some parameters in the yaml file that you can configure:
+
-- repo_id: The name of the model and its organization.
-- local_model_hub: The folder path where the models are stored on your machine.
-- warm_up: The number of runs as warmup trials, executed before performance benchmarking.
-- num_trials: The number of runs for performance benchmarking. The final benchmark result would be the average of all the trials.
-- low_bit: The low_bit precision you want to convert to for benchmarking.
-- batch_size: The number of samples on which the models make predictions in one forward pass.
-- in_out_pairs: Input sequence length and output sequence length combined by '-'.
-- test_api: Use different test functions on different machines.
+
+- `repo_id`: The name of the model and its organization.
+- `local_model_hub`: The folder path where the models are stored on your machine. Replace 'path to your local model hub' with /llm/models.
+- `warm_up`: The number of warmup trials before performance benchmarking (must be set to >= 2 when using "pipeline_parallel_gpu" test_api).
+- `num_trials`: The number of runs for performance benchmarking (the final result is the average of all trials).
+- `low_bit`: The low_bit precision you want to convert to for benchmarking.
+- `batch_size`: The number of samples on which the models make predictions in one forward pass.
+- `in_out_pairs`: Input sequence length and output sequence length combined by '-'.
+- `test_api`: Different test functions for different machines.
     - `transformer_int4_gpu` on Intel GPU for Linux
     - `transformer_int4_gpu_win` on Intel GPU for Windows
     - `transformer_int4` on Intel CPU
-- cpu_embedding: Whether to put embedding on CPU (only available now for windows gpu related test_api).
+- `cpu_embedding`: Whether to put embedding on CPU (only available for Windows GPU-related test_api).
+- `streaming`: Whether to output in a streaming way (only available for Windows GPU-related test_api).
+- `use_fp16_torch_dtype`: Whether to use fp16 for the non-linear layer (only available for "pipeline_parallel_gpu" test_api).
+- `n_gpu`: Number of GPUs to use (only available for "pipeline_parallel_gpu" test_api).
+
+
+```eval_rst
+.. note::
+
+   If you want to benchmark the performance without warmup, you can set ``warm_up: 0`` and ``num_trials: 1`` in ``config.yaml``, and run each single model and in_out_pair separately.
+```
 
-Remark: If you want to benchmark the performance without warmup, you can set `warm_up: 0` and `num_trials: 1` in `config.yaml`, and run each single model and in_out_pair separately.
 
 ## Run on Windows
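As a supplementary illustration outside the patch itself, below is a minimal sketch of how one might inspect the benchmark CSV that `run.py` produces, using the column names called out in the revised docs. The pandas dependency, the `*.csv` glob, and the exact header strings are assumptions; adjust them to your environment.

```python
# Supplementary sketch (not part of the patch above): a quick way to inspect
# the CSV that run.py writes. Assumes pandas is installed and at least one
# result CSV exists in the current folder; the "*.csv" glob and the column
# names below are assumptions taken from the docs, adjust them if they differ.
import glob
import os

import pandas as pd

csv_files = glob.glob("*.csv")
if not csv_files:
    raise SystemExit("No benchmark CSV found in the current folder.")

latest = max(csv_files, key=os.path.getmtime)  # most recently written result file
df = pd.read_csv(latest)

# Columns highlighted in the docs above; guard in case header names differ.
wanted = [
    "1st token avg latency (ms)",
    "2+ avg latency (ms/token)",
    "input/output tokens",
    "actual input/output tokens",
]
present = [c for c in wanted if c in df.columns]
missing = [c for c in wanted if c not in df.columns]
if missing:
    print(f"Columns not found (header names may differ in your CSV): {missing}")

print(f"Results from {latest}:")
print(df[present].to_string(index=False))
```

Reading the file programmatically makes it easy to confirm that `actual input/output tokens` matches the requested `input/output tokens` across trials, as the docs recommend.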