Add Modelscope option for chatglm3 on GPU (#12545)

* Add Modelscope option for GPU model chatglm3

* Update readme

* Update readme

* Update readme

* Update readme

* format update

---------

Co-authored-by: ATMxsp01 <shou.xu@intel.com>
Xu, Shuo 2024-12-16 20:00:37 +08:00 committed by GitHub
parent 5ae0006103
commit ccc18eefb5
3 changed files with 62 additions and 19 deletions


@@ -1,6 +1,6 @@
 # ChatGLM3
-In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on ChatGLM3 models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [THUDM/chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b) as a reference ChatGLM3 model.
+In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on ChatGLM3 models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [THUDM/chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b) (or [ZhipuAI/chatglm3-6b](https://www.modelscope.cn/models/ZhipuAI/chatglm3-6b) for ModelScope) as a reference ChatGLM3 model.
 ## 0. Requirements
 To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information.
@@ -13,6 +13,9 @@ conda create -n llm python=3.11
 conda activate llm
 # below command will install intel_extension_for_pytorch==2.1.10+xpu as default
 pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+# [optional] only needed if you would like to use ModelScope as model hub
+pip install modelscope==1.11.0
 ```
 ### 1.2 Installation on Windows
@@ -23,6 +26,9 @@ conda activate llm
 # below command will install intel_extension_for_pytorch==2.1.10+xpu as default
 pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+# [optional] only needed if you would like to use ModelScope as model hub
+pip install modelscope==1.11.0
 ```
 ## 2. Configures OneAPI environment variables for Linux
@@ -93,14 +99,19 @@ set SYCL_CACHE_PERSISTENT=1
 ### Example 1: Predict Tokens using `generate()` API
 In the example [generate.py](./generate.py), we show a basic use case for a ChatGLM3 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel GPUs.
-```
+```bash
+# for Hugging Face model hub
 python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
+# for ModelScope model hub
+python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT --modelscope
 ```
 Arguments info:
-- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the ChatGLM3 model to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'THUDM/chatglm3-6b'`.
+- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the **Hugging Face** or **ModelScope** repo id for the ChatGLM3 model to be downloaded, or the path to the checkpoint folder. It is default to be `'THUDM/chatglm3-6b'` for **Hugging Face** or `ZhipuAI/chatglm3-6b` for **ModelScope**.
 - `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'AI是什么'`.
 - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`.
+- `--modelscope`: using **ModelScope** as model hub instead of **Hugging Face**.
 #### Sample Output
 #### [THUDM/chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b)
@@ -133,16 +144,25 @@ AI stands for Artificial Intelligence. It refers to the development of computer
 In the example [streamchat.py](./streamchat.py), we show a basic use case for a ChatGLM3 model to stream chat, with IPEX-LLM INT4 optimizations.
 **Stream Chat using `stream_chat()` API**:
-```
+```bash
+# for Hugging Face model hub
 python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION
+# for ModelScope model hub
+python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION --modelscope
 ```
 **Chat using `chat()` API**:
-```
+```bash
+# for Hugging Face model hub
 python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION --disable-stream
+# for ModelScope model hub
+python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION --disable-stream --modelscope
 ```
 Arguments info:
-- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the ChatGLM3 model to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'THUDM/chatglm3-6b'`.
+- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the **Hugging Face** or **ModelScope** repo id for the ChatGLM3 model to be downloaded, or the path to the checkpoint folder. It is default to be `'THUDM/chatglm3-6b'` for **Hugging Face** or `ZhipuAI/chatglm3-6b` for **ModelScope**.
 - `--question QUESTION`: argument defining the question to ask. It is default to be `"晚上睡不着应该怎么办"`.
 - `--disable-stream`: argument defining whether to stream chat. If include `--disable-stream` when running the script, the stream chat is disabled and `chat()` API is used.
+- `--modelscope`: using **ModelScope** as model hub instead of **Hugging Face**.
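
Reading note: the argument descriptions in the README hunks imply that the default repo id depends on whether `--modelscope` is passed. The sketch below isolates that resolution step so it can be checked without loading a model. It is a minimal illustration under stated assumptions, not the commit's code verbatim: the function name `resolve_model_source` is hypothetical, and only the two flags relevant to hub selection are parsed.

```python
import argparse


def resolve_model_source(argv):
    """Return (model_path, model_hub) for a list of CLI arguments.

    Hypothetical helper for illustration only; it mirrors the defaults
    documented in the README hunks of this commit.
    """
    parser = argparse.ArgumentParser()
    # No default here: the fallback depends on the selected hub.
    parser.add_argument('--repo-id-or-model-path', type=str)
    parser.add_argument('--modelscope', action='store_true')
    args = parser.parse_args(argv)

    model_hub = 'modelscope' if args.modelscope else 'huggingface'
    default_repo = 'ZhipuAI/chatglm3-6b' if args.modelscope else 'THUDM/chatglm3-6b'
    # An explicit path or repo id always wins over the hub default.
    model_path = args.repo_id_or_model_path or default_repo
    return model_path, model_hub


if __name__ == '__main__':
    print(resolve_model_source([]))
    print(resolve_model_source(['--modelscope']))
    print(resolve_model_source(['--repo-id-or-model-path', './local-ckpt', '--modelscope']))
```

Leaving the argparse default unset is what makes the hub-specific fallback possible: a fixed default of `THUDM/chatglm3-6b` would be indistinguishable from a user-supplied value once `--modelscope` is set.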


@@ -20,7 +20,6 @@ import argparse
 import numpy as np
 from ipex_llm.transformers import AutoModel
-from transformers import AutoTokenizer
 # you could tune the prompt based on your own model,
 # here the prompt tuning refers to https://github.com/THUDM/ChatGLM3/blob/main/PROMPT.md
@@ -28,16 +27,27 @@ CHATGLM_V3_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"
 if __name__ == '__main__':
     parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for ChatGLM3 model')
-    parser.add_argument('--repo-id-or-model-path', type=str, default="THUDM/chatglm3-6b",
-                        help='The huggingface repo id for the ChatGLM3 model to be downloaded'
-                             ', or the path to the huggingface checkpoint folder')
+    parser.add_argument('--repo-id-or-model-path', type=str,
+                        help='The Hugging Face or ModelScope repo id for the ChatGLM3 model to be downloaded'
+                             ', or the path to the checkpoint folder')
     parser.add_argument('--prompt', type=str, default="AI是什么",
                         help='Prompt to infer')
     parser.add_argument('--n-predict', type=int, default=32,
                         help='Max tokens to predict')
+    parser.add_argument('--modelscope', action="store_true", default=False,
+                        help="Use models from modelscope")
     args = parser.parse_args()
-    model_path = args.repo_id_or_model_path
+    if args.modelscope:
+        from modelscope import AutoTokenizer
+        model_hub = 'modelscope'
+    else:
+        from transformers import AutoTokenizer
+        model_hub = 'huggingface'
+    model_path = args.repo_id_or_model_path if args.repo_id_or_model_path else \
+        ("ZhipuAI/chatglm3-6b" if args.modelscope else "THUDM/chatglm3-6b")
     # Load model in 4 bit,
     # which convert the relevant layers in the model into INT4 format
@@ -47,7 +57,8 @@ if __name__ == '__main__':
                                       load_in_4bit=True,
                                       optimize_model=True,
                                       trust_remote_code=True,
-                                      use_cache=True)
+                                      use_cache=True,
+                                      model_hub=model_hub)
     model = model.half().to('xpu')
     # Load tokenizer


@@ -20,21 +20,32 @@ import argparse
 import numpy as np
 from ipex_llm.transformers import AutoModel
-from transformers import AutoTokenizer
 if __name__ == '__main__':
     parser = argparse.ArgumentParser(description='Stream Chat for ChatGLM3 model')
-    parser.add_argument('--repo-id-or-model-path', type=str, default="THUDM/chatglm3-6b",
-                        help='The huggingface repo id for the ChatGLM3 model to be downloaded'
-                             ', or the path to the huggingface checkpoint folder')
+    parser.add_argument('--repo-id-or-model-path', type=str,
+                        help='The Hugging Face or ModelScope repo id for the ChatGLM3 model to be downloaded'
+                             ', or the path to the checkpoint folder')
     parser.add_argument('--question', type=str, default="晚上睡不着应该怎么办",
                         help='Qustion you want to ask')
     parser.add_argument('--disable-stream', action="store_true",
                         help='Disable stream chat')
+    parser.add_argument('--modelscope', action="store_true", default=False,
+                        help="Use models from modelscope")
     args = parser.parse_args()
-    model_path = args.repo_id_or_model_path
+    if args.modelscope:
+        from modelscope import AutoTokenizer
+        model_hub = 'modelscope'
+    else:
+        from transformers import AutoTokenizer
+        model_hub = 'huggingface'
+    model_path = args.repo_id_or_model_path if args.repo_id_or_model_path else \
+        ("ZhipuAI/chatglm3-6b" if args.modelscope else "THUDM/chatglm3-6b")
     disable_stream = args.disable_stream
     # Load model in 4 bit,
@@ -44,8 +55,9 @@ if __name__ == '__main__':
     model = AutoModel.from_pretrained(model_path,
                                       load_in_4bit=True,
                                       trust_remote_code=True,
-                                      optimize_model=True)
-    model.to('xpu')
+                                      optimize_model=True,
+                                      model_hub=model_hub)
+    model = model.half().to('xpu')
     # Load tokenizer
     tokenizer = AutoTokenizer.from_pretrained(model_path,