From 61c36ba08595d5d60891b2c30414d8af40a1f5cb Mon Sep 17 00:00:00 2001
From: "Wang, Jian4" <61138589+hzjane@users.noreply.github.com>
Date: Wed, 3 Jul 2024 14:57:09 +0800
Subject: [PATCH] Add pp_serving verified models (#11498)

* add verified models

* update

* verify large model

* update command
---
 .../GPU/Pipeline-Parallel-FastAPI/README.md   | 25 +++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/python/llm/example/GPU/Pipeline-Parallel-FastAPI/README.md b/python/llm/example/GPU/Pipeline-Parallel-FastAPI/README.md
index b01e9282..1c994d29 100644
--- a/python/llm/example/GPU/Pipeline-Parallel-FastAPI/README.md
+++ b/python/llm/example/GPU/Pipeline-Parallel-FastAPI/README.md
@@ -6,6 +6,27 @@ This example demonstrates how to run IPEX-LLM serving on multiple [Intel GPUs](.
 
 To run this example with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information. For this particular example, you will need at least two GPUs on your machine.
 
+## Verified Models
+- meta-llama/Llama-2-7b-chat-hf
+- meta-llama/Llama-2-13b-chat-hf
+- meta-llama/Meta-Llama-3-8B-Instruct
+- meta-llama/Meta-Llama-3-70B-Instruct
+- codellama/CodeLlama-13b-hf
+- Qwen/Qwen2-7B-Instruct
+- Qwen/Qwen1.5-7B-Chat
+- Qwen/Qwen1.5-14B-Chat
+- Qwen/Qwen1.5-32B-Chat
+- Qwen/Qwen1.5-MoE-A2.7B-Chat
+- Qwen/CodeQwen1.5-7B-Chat
+- THUDM/chatglm3-6b
+- THUDM/glm-4-9b-chat
+- baichuan-inc/Baichuan2-7B-Chat
+- baichuan-inc/Baichuan2-13B-Chat
+- microsoft/Phi-3-mini-4k-instruct
+- mistralai/Mistral-7B-v0.1
+- mistralai/Mixtral-8x7B-Instruct-v0.1
+- 01-ai/Yi-6B-Chat
+
 ## Example
 
 ### 1. Install
@@ -23,6 +44,10 @@ pip install gradio # for gradio web UI
 conda install -c conda-forge -y gperftools=2.10 # to enable tcmalloc
 
 pip install transformers==4.37.0
+
+# only for Qwen1.5-MoE-A2.7B
+pip install transformers==4.40.0
+pip install trl==0.8.1
 ```
 
 ### 2. Run pipeline parallel serving on multiple GPUs
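
Usage note (not part of the patch): once the dependencies above are installed and step 2 ("Run pipeline parallel serving on multiple GPUs") is running, the service can be exercised with a plain HTTP request. The sketch below is illustrative only; the port, the `/generate/` route, and the `prompt`/`n_predict` fields are assumptions and may not match the routes actually defined in the Pipeline-Parallel-FastAPI example.

```bash
# Hypothetical request against the local serving endpoint.
# The URL path and JSON fields below are assumptions for illustration,
# not confirmed by this patch -- check the example's FastAPI app for the real routes.
curl -X POST 'http://localhost:8000/generate/' \
  -H 'Content-Type: application/json' \
  -d '{
        "prompt": "What is AI?",
        "n_predict": 32
      }'
```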