Add more examples for pipeline parallel inference (#11372)
* add more model examples for pipeline parallel inference
* add mixtral and vicuna models
* add yi model and past_kv support for chatglm family
* add docs
* doc update
* add license
* update
This commit is contained in:

parent 2004fe1a43
commit 0c67639539

7 changed files with 279 additions and 3 deletions
@@ -17,7 +17,17 @@ To run this example with IPEX-LLM on Intel GPUs, we have some recommended requirements

- [baichuan-inc/Baichuan2-13B-Chat](./run_baichuan2_arc_2_card.sh)
- [microsoft/Phi-3-mini-4k-instruct](./run_phi3_arc_2_card.sh)
- [microsoft/Phi-3-medium-4k-instruct](./run_phi3_arc_2_card.sh)
- [mistralai/Mistral-7B-v0.1](./run_mistral_arc_2_card.sh)
- [mistralai/Mixtral-8x7B-Instruct-v0.1](./run_mistral_arc_2_card.sh)
- [01-ai/Yi-6B-Chat](./run_yi_arc_2_card.sh)
- [01-ai/Yi-34B-Chat](./run_yi_arc_2_card.sh)
- [codellama/CodeLlama-7b-Instruct-hf](./run_codellama_arc_2_card.sh)
- [codellama/CodeLlama-13b-Instruct-hf](./run_codellama_arc_2_card.sh)
- [codellama/CodeLlama-34b-Instruct-hf](./run_codellama_arc_2_card.sh)
- [upstage/SOLAR-10.7B-Instruct-v1.0](./run_solar_arc_2_card.sh)
- [lmsys/vicuna-7b-v1.3](./run_vicuna_arc_2_card.sh)
- [lmsys/vicuna-13b-v1.3](./run_vicuna_arc_2_card.sh)
- [lmsys/vicuna-33b-v1.3](./run_vicuna_arc_2_card.sh)
## Example: Run pipeline parallel inference on multiple GPUs
@@ -95,7 +105,6 @@ bash run_chatglm_arc_2_card.sh

You could specify `--repo-id-or-model-path` in the test script to be the huggingface repo id for Baichuan2 to be downloaded, or the path to the huggingface checkpoint folder. Besides, you could change `NUM_GPUS` to the number of GPUs you have on your machine.

```bash
- pip install transformers==4.37.0
bash run_baichuan2_arc_2_card.sh
```
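For context, the launch scripts in this example (shown later in this commit) start a `generate.py` in the same folder through `torchrun`, passing `--repo-id-or-model-path` and `--gpu-num`. Below is a minimal sketch of what such a loader might look like with IPEX-LLM; the `pipeline_parallel_stages` keyword and the exact `from_pretrained` arguments are assumptions based on other IPEX-LLM GPU examples (and the real script runs under `torchrun` with one process per GPU), so consult the actual `generate.py` for the authoritative version.

```python
# Hypothetical, simplified generate.py-style loader (not the actual script from
# this commit). Arguments marked "assumption" may differ in the real example.
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM  # IPEX-LLM drop-in for HF transformers

MODEL_PATH = "baichuan-inc/Baichuan2-13B-Chat"  # repo id or local checkpoint folder
NUM_GPUS = 2                                    # should match NUM_GPUS in the launch script

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    load_in_4bit=True,                  # low-bit weights, as in other IPEX-LLM GPU examples
    optimize_model=True,
    trust_remote_code=True,
    use_cache=True,
    pipeline_parallel_stages=NUM_GPUS,  # assumption: splits the layers across NUM_GPUS devices
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

with torch.inference_mode():
    inputs = tokenizer("What is AI?", return_tensors="pt").to("xpu")  # first stage sits on an XPU device
    output = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```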
@@ -117,6 +126,83 @@ bash run_phi3_arc_2_card.sh

</details>

<details>
<summary> Show Mistral/Mixtral example </summary>

#### Run Mistral-7B-v0.1 / Mixtral-8x7B-Instruct-v0.1 on two Intel Arc A770

You could specify `--repo-id-or-model-path` in the test script to be the huggingface repo id for Mistral / Mixtral to be downloaded, or the path to the huggingface checkpoint folder. Besides, you could change `NUM_GPUS` to the number of GPUs you have on your machine.

```bash
pip install transformers==4.37.0
bash run_mistral_arc_2_card.sh
```

</details>

<details>
<summary> Show Yi example </summary>

#### Run Yi-6B-Chat / Yi-34B-Chat on two Intel Arc A770

You could specify `--repo-id-or-model-path` in the test script to be the huggingface repo id for Yi to be downloaded, or the path to the huggingface checkpoint folder. Besides, you could change `NUM_GPUS` to the number of GPUs you have on your machine.

```bash
bash run_yi_arc_2_card.sh
```

</details>

<details>
<summary> Show Codellama example </summary>

#### Run CodeLlama-7b-Instruct-hf / CodeLlama-13b-Instruct-hf / CodeLlama-34b-Instruct-hf on two Intel Arc A770

You could specify `--repo-id-or-model-path` in the test script to be the huggingface repo id for Codellama to be downloaded, or the path to the huggingface checkpoint folder. Besides, you could change `NUM_GPUS` to the number of GPUs you have on your machine.

```bash
bash run_codellama_arc_2_card.sh
```

</details>

<details>
<summary> Show Solar example </summary>

#### Run SOLAR-10.7B-Instruct-v1.0 on two Intel Arc A770

You could specify `--repo-id-or-model-path` in the test script to be the huggingface repo id for Solar to be downloaded, or the path to the huggingface checkpoint folder. Besides, you could change `NUM_GPUS` to the number of GPUs you have on your machine.

```bash
bash run_solar_arc_2_card.sh
```

</details>

<details>
<summary> Show Vicuna example </summary>

#### Run vicuna-7b-v1.3 / vicuna-13b-v1.3 / vicuna-33b-v1.3 on two Intel Arc A770

You could specify `--repo-id-or-model-path` in the test script to be the huggingface repo id for Vicuna to be downloaded, or the path to the huggingface checkpoint folder. Besides, you could change `NUM_GPUS` to the number of GPUs you have on your machine.

```bash
bash run_vicuna_arc_2_card.sh
```

</details>
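Each section above suggests changing `NUM_GPUS` to the number of GPUs on your machine. One quick way to check how many XPU devices PyTorch can actually see is the rough sanity check below; it assumes `intel_extension_for_pytorch` is installed (which the IPEX-LLM GPU setup already requires) and a reasonably recent release.

```python
# Count the Intel GPUs (XPU devices) visible to PyTorch before editing NUM_GPUS.
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers the 'xpu' backend)

print(f"visible XPU devices: {torch.xpu.device_count()}")
for i in range(torch.xpu.device_count()):
    print(f"  xpu:{i} -> {torch.xpu.get_device_name(i)}")
```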
### 3. Sample Output

#### [meta-llama/Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf)
run_codellama_arc_2_card.sh
@@ -0,0 +1,41 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

source /opt/intel/oneapi/setvars.sh   # set up the oneAPI environment
export MASTER_ADDR=127.0.0.1          # torchrun rendezvous address and port
export MASTER_PORT=9090
export FI_PROVIDER=tcp                # libfabric provider used by oneCCL
export USE_XETLA=OFF
export OMP_NUM_THREADS=6
export IPEX_LLM_QUANTIZE_KV_CACHE=1   # quantize the KV cache to save GPU memory
# Note: KERNEL_VERSION is not set in this script; if it is unset (or does not
# contain "6.5"), the export below applies.
if [[ $KERNEL_VERSION != *"6.5"* ]]; then
    export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
fi
export TORCH_LLM_ALLREDUCE=0

NUM_GPUS=2 # number of GPUs to use

# To run CodeLlama-7b-Instruct-hf
CCL_ZE_IPC_EXCHANGE=sockets torchrun --standalone --nnodes=1 --nproc-per-node $NUM_GPUS \
  generate.py --repo-id-or-model-path 'codellama/CodeLlama-7b-Instruct-hf' --gpu-num $NUM_GPUS

# To run CodeLlama-13b-Instruct-hf
# CCL_ZE_IPC_EXCHANGE=sockets torchrun --standalone --nnodes=1 --nproc-per-node $NUM_GPUS \
#   generate.py --repo-id-or-model-path 'codellama/CodeLlama-13b-Instruct-hf' --gpu-num $NUM_GPUS

# To run CodeLlama-34b-Instruct-hf
# CCL_ZE_IPC_EXCHANGE=sockets torchrun --standalone --nnodes=1 --nproc-per-node $NUM_GPUS \
#   generate.py --repo-id-or-model-path 'codellama/CodeLlama-34b-Instruct-hf' --gpu-num $NUM_GPUS
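As a side note on how these launch commands work: `torchrun --nproc-per-node $NUM_GPUS` starts one process per GPU and hands each process its rank through environment variables, which a script like `generate.py` can read to decide which pipeline stage it is. The snippet below is a generic illustration of those variables, not code from this repository; the `xpu:<local_rank>` mapping is an assumption.

```python
# Generic illustration of what torchrun provides to each worker process.
import os

local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # set by torchrun: index of this process on the node
world_size = int(os.environ.get("WORLD_SIZE", "1"))  # set by torchrun: total number of processes (NUM_GPUS)

# Assumption: each pipeline stage is placed on the matching Intel GPU.
print(f"pipeline stage {local_rank} of {world_size}, presumably running on xpu:{local_rank}")
```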
run_mistral_arc_2_card.sh
@@ -0,0 +1,37 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

source /opt/intel/oneapi/setvars.sh
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=9090
export FI_PROVIDER=tcp
export USE_XETLA=OFF
export OMP_NUM_THREADS=6
export IPEX_LLM_QUANTIZE_KV_CACHE=1
if [[ $KERNEL_VERSION != *"6.5"* ]]; then
    export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
fi
export TORCH_LLM_ALLREDUCE=0

NUM_GPUS=2 # number of GPUs to use

# To run Mistral-7B-v0.1
CCL_ZE_IPC_EXCHANGE=sockets torchrun --standalone --nnodes=1 --nproc-per-node $NUM_GPUS \
  generate.py --repo-id-or-model-path 'mistralai/Mistral-7B-v0.1' --gpu-num $NUM_GPUS

# To run Mixtral-8x7B-Instruct-v0.1
# CCL_ZE_IPC_EXCHANGE=sockets torchrun --standalone --nnodes=1 --nproc-per-node $NUM_GPUS \
#   generate.py --repo-id-or-model-path 'mistralai/Mixtral-8x7B-Instruct-v0.1' --gpu-num $NUM_GPUS
run_solar_arc_2_card.sh
@@ -0,0 +1,33 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

source /opt/intel/oneapi/setvars.sh
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=9090
export FI_PROVIDER=tcp
export USE_XETLA=OFF
export OMP_NUM_THREADS=6
export IPEX_LLM_QUANTIZE_KV_CACHE=1
if [[ $KERNEL_VERSION != *"6.5"* ]]; then
    export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
fi
export TORCH_LLM_ALLREDUCE=0

NUM_GPUS=2 # number of GPUs to use

# To run SOLAR-10.7B-Instruct-v1.0
CCL_ZE_IPC_EXCHANGE=sockets torchrun --standalone --nnodes=1 --nproc-per-node $NUM_GPUS \
  generate.py --repo-id-or-model-path 'upstage/SOLAR-10.7B-Instruct-v1.0' --gpu-num $NUM_GPUS
run_vicuna_arc_2_card.sh
@@ -0,0 +1,41 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

source /opt/intel/oneapi/setvars.sh
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=9090
export FI_PROVIDER=tcp
export USE_XETLA=OFF
export OMP_NUM_THREADS=6
export IPEX_LLM_QUANTIZE_KV_CACHE=1
if [[ $KERNEL_VERSION != *"6.5"* ]]; then
    export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
fi
export TORCH_LLM_ALLREDUCE=0

NUM_GPUS=2 # number of GPUs to use

# To run vicuna-7b-v1.3
CCL_ZE_IPC_EXCHANGE=sockets torchrun --standalone --nnodes=1 --nproc-per-node $NUM_GPUS \
  generate.py --repo-id-or-model-path 'lmsys/vicuna-7b-v1.3' --gpu-num $NUM_GPUS

# To run vicuna-13b-v1.3
# CCL_ZE_IPC_EXCHANGE=sockets torchrun --standalone --nnodes=1 --nproc-per-node $NUM_GPUS \
#   generate.py --repo-id-or-model-path 'lmsys/vicuna-13b-v1.3' --gpu-num $NUM_GPUS

# To run vicuna-33b-v1.3
# CCL_ZE_IPC_EXCHANGE=sockets torchrun --standalone --nnodes=1 --nproc-per-node $NUM_GPUS \
#   generate.py --repo-id-or-model-path 'lmsys/vicuna-33b-v1.3' --gpu-num $NUM_GPUS
run_yi_arc_2_card.sh
@@ -0,0 +1,37 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

source /opt/intel/oneapi/setvars.sh
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=9090
export FI_PROVIDER=tcp
export USE_XETLA=OFF
export OMP_NUM_THREADS=6
export IPEX_LLM_QUANTIZE_KV_CACHE=1
if [[ $KERNEL_VERSION != *"6.5"* ]]; then
    export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
fi
export TORCH_LLM_ALLREDUCE=0

NUM_GPUS=2 # number of GPUs to use

# To run Yi-6B-Chat
CCL_ZE_IPC_EXCHANGE=sockets torchrun --standalone --nnodes=1 --nproc-per-node $NUM_GPUS \
  generate.py --repo-id-or-model-path '01-ai/Yi-6B-Chat' --gpu-num $NUM_GPUS

# To run Yi-34B-Chat
# CCL_ZE_IPC_EXCHANGE=sockets torchrun --standalone --nnodes=1 --nproc-per-node $NUM_GPUS \
#   generate.py --repo-id-or-model-path '01-ai/Yi-34B-Chat' --gpu-num $NUM_GPUS
@@ -269,7 +269,8 @@ def pipeline_parallel_generate(self,
                                 "make sure that `pad_token_id` is defined.")
             next_ids = next_ids * unfinished_sequences + pad_token_id * (1 - unfinished_sequences)

-            if isinstance(outputs.past_key_values, tuple) and local_rank != 0:
+            # Temporarily specify as Baichuan and ChatGLM
+            if self.config.model_type in ["baichuan", "chatglm"] and local_rank != 0:
                 value_placeholder = torch.empty_like((outputs.past_key_values)[-1][0])
                 past_key_values_placeholder = tuple(
                     (value_placeholder, value_placeholder) for _ in range(layer_start)
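To make the intent of this change a little more concrete, here is a small, self-contained sketch of the placeholder construction it performs for Baichuan / ChatGLM on non-zero ranks. The handling of the layers that actually live on the current rank (after `layer_start`) is an assumption here, since the diff is cut off at that point.

```python
# Sketch of the placeholder past_key_values built on ranks other than 0 for
# Baichuan / ChatGLM, mirroring the lines shown above. Not the full function.
import torch

def build_placeholder_past(past_key_values, layer_start):
    # Empty tensor shaped like one key/value state of the last layer; it only
    # keeps the tuple structure consistent for layers owned by earlier stages.
    value_placeholder = torch.empty_like(past_key_values[-1][0])
    placeholder = tuple(
        (value_placeholder, value_placeholder) for _ in range(layer_start)
    )
    # Assumption: the real cache entries computed on this rank follow the placeholders.
    return placeholder + tuple(past_key_values[layer_start:])
```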