Add --modelscope in GPU examples for glm4, codegeex2, qwen2 and qwen2.5 (#12561)

* Add --modelscope for more models

* Improve README

---------

Co-authored-by: ATMxsp01 <shou.xu@intel.com>
Xu, Shuo 2024-12-19 10:00:39 +08:00 committed by GitHub
parent 28e81fda8e
commit 47e90a362f
10 changed files with 125 additions and 125 deletions

File 1 of 10: ChatGLM3 GPU example README

@@ -108,7 +108,7 @@ python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROM
 ```
 Arguments info:
-- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the **Hugging Face** or **ModelScope** repo id for the ChatGLM3 model to be downloaded, or the path to the checkpoint folder. It is default to be `'THUDM/chatglm3-6b'` for **Hugging Face** or `ZhipuAI/chatglm3-6b` for **ModelScope**.
+- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the **Hugging Face** or **ModelScope** repo id for the ChatGLM3 model to be downloaded, or the path to the checkpoint folder. It is default to be `'THUDM/chatglm3-6b'` for **Hugging Face** or `'ZhipuAI/chatglm3-6b'` for **ModelScope**.
 - `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It is default to be `'AI是什么'`.
 - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`.
 - `--modelscope`: using **ModelScope** as model hub instead of **Hugging Face**.
@@ -162,7 +162,7 @@ python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question
 ```
 Arguments info:
-- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the **Hugging Face** or **ModelScope** repo id for the ChatGLM3 model to be downloaded, or the path to the checkpoint folder. It is default to be `'THUDM/chatglm3-6b'` for **Hugging Face** or `ZhipuAI/chatglm3-6b` for **ModelScope**.
+- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the **Hugging Face** or **ModelScope** repo id for the ChatGLM3 model to be downloaded, or the path to the checkpoint folder. It is default to be `'THUDM/chatglm3-6b'` for **Hugging Face** or `'ZhipuAI/chatglm3-6b'` for **ModelScope**.
 - `--question QUESTION`: argument defining the question to ask. It is default to be `"晚上睡不着应该怎么办"`.
 - `--disable-stream`: argument defining whether to stream chat. If `--disable-stream` is included when running the script, the stream chat is disabled and the `chat()` API is used.
 - `--modelscope`: using **ModelScope** as model hub instead of **Hugging Face**.
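Every script touched by this commit wires up `--modelscope` the same way: the flag decides where `AutoTokenizer` is imported from and which hub name ipex-llm is told to pull weights from. The sketch below condenses that pattern from the generate.py diffs further down; the wrapper around it (parser description, the closing print) is illustrative rather than copied from the commit.

```python
# Condensed sketch of the hub-selection pattern added by this commit
# (see the generate.py diffs below; the surrounding wrapper code is illustrative).
import argparse

parser = argparse.ArgumentParser(description='Choose a model hub')
parser.add_argument('--repo-id-or-model-path', type=str)
parser.add_argument('--modelscope', action="store_true", default=False,
                    help="Use models from ModelScope instead of Hugging Face")
args = parser.parse_args()

if args.modelscope:
    from modelscope import AutoTokenizer  # ModelScope mirrors the transformers tokenizer API
    model_hub = 'modelscope'
else:
    from transformers import AutoTokenizer
    model_hub = 'huggingface'

# ipex-llm later receives the chosen hub when loading the model, e.g.:
#   AutoModel.from_pretrained(model_path, load_in_4bit=True, ..., model_hub=model_hub)
print(f"Tokenizer class comes from {AutoTokenizer.__module__}, model_hub={model_hub}")
```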

File 2 of 10: CodeGeeX2 GPU example README

@@ -1,6 +1,6 @@
 # CodeGeeX2
-In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on CodeGeeX2 models which is implemented based on the ChatGLM2 architecture trained on more code data on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [THUDM/codegeex-6b](https://huggingface.co/THUDM/codegeex2-6b) as a reference CodeGeeX2 model.
+In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on CodeGeeX2 models, which are implemented based on the ChatGLM2 architecture and trained on more code data, on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [THUDM/codegeex2-6b](https://huggingface.co/THUDM/codegeex2-6b) (or [ZhipuAI/codegeex2-6b](https://www.modelscope.cn/models/ZhipuAI/codegeex2-6b) for ModelScope) as a reference CodeGeeX2 model.
 ## 0. Requirements
 To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine; please refer to [here](../../../README.md#requirements) for more information.
@@ -16,6 +16,9 @@ conda create -n llm python=3.11
 conda activate llm
 # below command will install intel_extension_for_pytorch==2.1.10+xpu as default
 pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+# [optional] only needed if you would like to use ModelScope as model hub
+pip install modelscope==1.11.0
 ```
 #### 1.2 Installation on Windows
@@ -26,10 +29,13 @@ conda activate llm
 # below command will install intel_extension_for_pytorch==2.1.10+xpu as default
 pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+# [optional] only needed if you would like to use ModelScope as model hub
+pip install modelscope==1.11.0
 ```
 ### 2. Download Model and Replace File
-If you select the codegeex2-6b model ([THUDM/codegeex-6b](https://huggingface.co/THUDM/codegeex2-6b)), please note that their code (`tokenization_chatglm.py`) initialized tokenizer after the call of `__init__` of its parent class, which may result in error during loading tokenizer. To address issue, we have provided an updated file ([tokenization_chatglm.py](./codegeex2-6b/tokenization_chatglm.py))
+If you select the codegeex2-6b model ([THUDM/codegeex2-6b](https://huggingface.co/THUDM/codegeex2-6b) (for **Hugging Face**) or [ZhipuAI/codegeex2-6b](https://www.modelscope.cn/models/ZhipuAI/codegeex2-6b) (for **ModelScope**)), please note that its code (`tokenization_chatglm.py`) initializes the tokenizer after the call to its parent class's `__init__`, which may result in an error when loading the tokenizer. To address this issue, we have provided an updated file ([tokenization_chatglm.py](./codegeex2-6b/tokenization_chatglm.py))
 ```python
 def __init__(self, vocab_file, padding_side="left", clean_up_tokenization_spaces=False, **kwargs):
@@ -37,7 +43,7 @@ def __init__(self, vocab_file, padding_side="left", clean_up_tokenization_spaces
     super().__init__(padding_side=padding_side, clean_up_tokenization_spaces=clean_up_tokenization_spaces, **kwargs)
 ```
-You could download the model from [THUDM/codegeex-6b](https://huggingface.co/THUDM/codegeex2-6b), and replace the file `tokenization_chatglm.py` with [tokenization_chatglm.py](./codegeex2-6b/tokenization_chatglm.py).
+You could download the model from [THUDM/codegeex2-6b](https://huggingface.co/THUDM/codegeex2-6b) (for **Hugging Face**) or [ZhipuAI/codegeex2-6b](https://www.modelscope.cn/models/ZhipuAI/codegeex2-6b) (for **ModelScope**), and replace the file `tokenization_chatglm.py` with [tokenization_chatglm.py](./codegeex2-6b/tokenization_chatglm.py).
 ### 3. Configures OneAPI environment variables for Linux
@@ -104,17 +110,22 @@ set SYCL_CACHE_PERSISTENT=1
 > For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
 ### 5. Running examples
-```
+```bash
+# for Hugging Face model hub
 python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
+# for ModelScope model hub
+python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT --modelscope
 ```
 Arguments info:
-- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the CodeGeeX2 model to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'THUDM/codegeex-6b'`.
+- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the **Hugging Face** or **ModelScope** repo id for the CodeGeeX2 model to be downloaded, or the path to the checkpoint folder. It is default to be `'THUDM/codegeex2-6b'` for **Hugging Face** or `'ZhipuAI/codegeex2-6b'` for **ModelScope**.
 - `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It is default to be `'# language: Python\n# write a bubble sort function\n'`.
 - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `128`.
+- `--modelscope`: using **ModelScope** as model hub instead of **Hugging Face**.
 #### Sample Output
-#### [THUDM/codegeex-6b](https://huggingface.co/THUDM/codegeex-6b)
+#### [THUDM/codegeex2-6b](https://huggingface.co/THUDM/codegeex2-6b)
 ```log
 Inference time: xxxx s
 -------------------- Prompt --------------------
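The download-and-replace step from Section 2 of this README can also be scripted. The sketch below is not part of this commit; the snapshot-download calls, the resulting cache layout, and the `USE_MODELSCOPE` toggle are assumptions.

```python
# Hypothetical helper: fetch codegeex2-6b from the chosen hub, then overwrite the
# stock tokenization_chatglm.py with the patched copy shipped in this example folder.
import os
import shutil

USE_MODELSCOPE = True  # set to False to pull from Hugging Face instead

if USE_MODELSCOPE:
    from modelscope import snapshot_download
    model_dir = snapshot_download('ZhipuAI/codegeex2-6b')
else:
    from huggingface_hub import snapshot_download
    model_dir = snapshot_download('THUDM/codegeex2-6b')

# Replace the problematic tokenizer implementation with the fixed one.
shutil.copy('./codegeex2-6b/tokenization_chatglm.py',
            os.path.join(model_dir, 'tokenization_chatglm.py'))
print(f'Patched tokenizer in {model_dir}')
```

The resulting `model_dir` can then be passed to `generate.py` via `--repo-id-or-model-path`.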

File 3 of 10: CodeGeeX2 GPU example generate.py

@@ -28,18 +28,29 @@ CODEGEEX_PROMPT_FORMAT = "{prompt}"
 if __name__ == '__main__':
-    parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for ChatGLM2 model')
-    parser.add_argument('--repo-id-or-model-path', type=str, default="THUDM/codegeex2-6b",
-                        help='The huggingface repo id for the CodeGeeX2 model to be downloaded'
-                             ', or the path to the huggingface checkpoint folder')
+    parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for CodeGeeX2 model')
+    parser.add_argument('--repo-id-or-model-path', type=str,
+                        help='The Hugging Face or ModelScope repo id for the CodeGeeX2 model to be downloaded'
+                             ', or the path to the checkpoint folder')
     parser.add_argument('--prompt', type=str, default="# language: Python\n# write a bubble sort function\n",
                         help='Prompt to infer')
     parser.add_argument('--n-predict', type=int, default=128,
                         help='Max tokens to predict')
+    parser.add_argument('--modelscope', action="store_true", default=False,
+                        help="Use models from modelscope")
     args = parser.parse_args()
-    model_path = args.repo_id_or_model_path
+
+    if args.modelscope:
+        from modelscope import AutoTokenizer
+        model_hub = 'modelscope'
+    else:
+        from transformers import AutoTokenizer
+        model_hub = 'huggingface'
+
+    model_path = args.repo_id_or_model_path if args.repo_id_or_model_path else \
+        ("ZhipuAI/codegeex2-6b" if args.modelscope else "THUDM/codegeex2-6b")
     # Load model in 4 bit,
     # which convert the relevant layers in the model into INT4 format
     # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
@@ -48,7 +59,8 @@ if __name__ == '__main__':
                                       load_in_4bit=True,
                                       optimize_model=True,
                                       trust_remote_code=True,
-                                      use_cache=True)
+                                      use_cache=True,
+                                      model_hub=model_hub)
     model = model.half().to('xpu')
     # Load tokenizer
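The rest of this generate.py is untouched by the commit, so the diff stops here. For orientation, the continuation is roughly the usual tokenize, generate, decode sequence; the sketch below reuses the names from the hunk above (`model`, `model_path`, `args`, `CODEGEEX_PROMPT_FORMAT`, plus the script's `torch` and `time` imports), and its details are assumed rather than copied from the repository.

```python
# Rough sketch of the continuation (not part of this commit's diff).
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

prompt = CODEGEEX_PROMPT_FORMAT.format(prompt=args.prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')

with torch.inference_mode():
    st = time.time()
    output = model.generate(input_ids, max_new_tokens=args.n_predict)
    end = time.time()

print(f'Inference time: {end - st} s')
print(tokenizer.decode(output[0], skip_special_tokens=True))
```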

File 4 of 10: GLM-4 GPU example README

@@ -1,5 +1,5 @@
 # GLM-4
-In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on GLM-4 models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat) as a reference InternLM model.
+In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on GLM-4 models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat) (or [ZhipuAI/glm4-9b-chat](https://www.modelscope.cn/models/ZhipuAI/glm4-9b-chat) for ModelScope) as a reference GLM-4 model.
 ## 0. Requirements
 To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine; please refer to [here](../../../README.md#requirements) for more information.
@@ -15,6 +15,9 @@ pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-exte
 # install packages required for GLM-4; it is recommended to use transformers>=4.44 for THUDM/glm-4-9b-chat updated after August 12, 2024
 pip install "tiktoken>=0.7.0" transformers==4.44 "trl<0.12.0"
+# [optional] only needed if you would like to use ModelScope as model hub
+pip install modelscope==1.11.0
 ```
 ### 1.2 Installation on Windows
@@ -28,6 +31,9 @@ pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-exte
 # install packages required for GLM-4; it is recommended to use transformers>=4.44 for THUDM/glm-4-9b-chat updated after August 12, 2024
 pip install "tiktoken>=0.7.0" transformers==4.44 "trl<0.12.0"
+# [optional] only needed if you would like to use ModelScope as model hub
+pip install modelscope==1.11.0
 ```
 ## 2. Configures OneAPI environment variables for Linux
@@ -98,14 +104,19 @@ set SYCL_CACHE_PERSISTENT=1
 ### Example 1: Predict Tokens using `generate()` API
 In the example [generate.py](./generate.py), we show a basic use case for a GLM-4 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel GPUs.
-```
+```bash
+# for Hugging Face model hub
 python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
+# for ModelScope model hub
+python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT --modelscope
 ```
 Arguments info:
-- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the GLM-4 model (e.g. `THUDM/glm-4-9b-chat`) to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'THUDM/glm-4-9b-chat'`.
+- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the **Hugging Face** or **ModelScope** repo id for the GLM-4 model (e.g. `THUDM/glm-4-9b-chat`) to be downloaded, or the path to the checkpoint folder. It is default to be `'THUDM/glm-4-9b-chat'` for **Hugging Face** or `'ZhipuAI/glm-4-9b-chat'` for **ModelScope**.
 - `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It is default to be `'AI是什么'`.
 - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`.
+- `--modelscope`: using **ModelScope** as model hub instead of **Hugging Face**.
 #### Sample Output
 #### [THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat)
@@ -134,21 +145,3 @@ What is AI?
 Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. The term "art
 ```
-### Example 2: Stream Chat using `stream_chat()` API
-In the example [streamchat.py](./streamchat.py), we show a basic use case for a GLM-4 model to stream chat, with IPEX-LLM INT4 optimizations.
-**Stream Chat using `stream_chat()` API**:
-```
-python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION
-```
-**Chat using `chat()` API**:
-```
-python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION --disable-stream
-```
-Arguments info:
-- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the GLM-4 model to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'THUDM/glm-4-9b-chat'`.
-- `--question QUESTION`: argument defining the question to ask. It is default to be `"AI是什么"`.
-- `--disable-stream`: argument defining whether to stream chat. If include `--disable-stream` when running the script, the stream chat is disabled and `chat()` API is used.

File 5 of 10: GLM-4 GPU example generate.py

@@ -20,7 +20,6 @@ import argparse
 import numpy as np
 from ipex_llm.transformers import AutoModel
-from transformers import AutoTokenizer
 # you could tune the prompt based on your own model,
 # here the prompt tuning refers to https://huggingface.co/THUDM/glm-4-9b-chat/blob/main/tokenization_chatglm.py
@@ -28,16 +27,27 @@ GLM4_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"
 if __name__ == '__main__':
     parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for GLM-4 model')
-    parser.add_argument('--repo-id-or-model-path', type=str, default="THUDM/glm-4-9b-chat",
-                        help='The huggingface repo id for the GLM-4 model to be downloaded'
-                             ', or the path to the huggingface checkpoint folder')
+    parser.add_argument('--repo-id-or-model-path', type=str,
+                        help='The Hugging Face or ModelScope repo id for the GLM-4 model to be downloaded'
+                             ', or the path to the checkpoint folder')
     parser.add_argument('--prompt', type=str, default="AI是什么",
                         help='Prompt to infer')
     parser.add_argument('--n-predict', type=int, default=32,
                         help='Max tokens to predict')
+    parser.add_argument('--modelscope', action="store_true", default=False,
+                        help="Use models from modelscope")
     args = parser.parse_args()
-    model_path = args.repo_id_or_model_path
+
+    if args.modelscope:
+        from modelscope import AutoTokenizer
+        model_hub = 'modelscope'
+    else:
+        from transformers import AutoTokenizer
+        model_hub = 'huggingface'
+
+    model_path = args.repo_id_or_model_path if args.repo_id_or_model_path else \
+        ("ZhipuAI/glm-4-9b-chat" if args.modelscope else "THUDM/glm-4-9b-chat")
     # Load model in 4 bit,
     # which convert the relevant layers in the model into INT4 format
@@ -47,8 +57,9 @@ if __name__ == '__main__':
                                       load_in_4bit=True,
                                       optimize_model=True,
                                       trust_remote_code=True,
-                                      use_cache=True)
-    model = model.to("xpu")
+                                      use_cache=True,
+                                      model_hub=model_hub)
+    model = model.half().to("xpu")
     # Load tokenizer
     tokenizer = AutoTokenizer.from_pretrained(model_path,

File 6 of 10: GLM-4 GPU example streamchat.py (file deleted)

@@ -1,69 +0,0 @@
-#
-# Copyright 2016 The BigDL Authors.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-import torch
-import time
-import argparse
-import numpy as np
-
-from ipex_llm.transformers import AutoModel
-from transformers import AutoTokenizer
-
-if __name__ == '__main__':
-    parser = argparse.ArgumentParser(description='Stream Chat for GLM-4 model')
-    parser.add_argument('--repo-id-or-model-path', type=str, default="THUDM/glm-4-9b-chat",
-                        help='The huggingface repo id for the GLM-4 model to be downloaded'
-                             ', or the path to the huggingface checkpoint folder')
-    parser.add_argument('--question', type=str, default="晚上睡不着应该怎么办",
-                        help='Qustion you want to ask')
-    parser.add_argument('--disable-stream', action="store_true",
-                        help='Disable stream chat')
-
-    args = parser.parse_args()
-    model_path = args.repo_id_or_model_path
-    disable_stream = args.disable_stream
-
-    # Load model in 4 bit,
-    # which convert the relevant layers in the model into INT4 format
-    # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
-    model = AutoModel.from_pretrained(model_path,
-                                      trust_remote_code=True,
-                                      load_in_4bit=True,
-                                      optimize_model=True,
-                                      use_cache=True,
-                                      cpu_embedding=True)
-    model = model.to('xpu')
-
-    # Load tokenizer
-    tokenizer = AutoTokenizer.from_pretrained(model_path,
-                                              trust_remote_code=True)
-
-    with torch.inference_mode():
-        if disable_stream:
-            # Chat
-            response, history = model.chat(tokenizer, args.question, history=[])
-            print('-'*20, 'Chat Output', '-'*20)
-            print(response)
-        else:
-            # Stream chat
-            response_ = ""
-            print('-'*20, 'Stream Chat Output', '-'*20)
-            for response, history in model.stream_chat(tokenizer, args.question, history=[]):
-                print(response.replace(response_, ""), end="")
-                response_ = response
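With streamchat.py removed, streaming output for GLM-4 is no longer covered by these examples. If needed, it can be approximated with the standard transformers streamer on top of the generate.py flow above; the sketch below is an illustration only, reusing `model`, `tokenizer`, and `GLM4_PROMPT_FORMAT` from generate.py, and is not part of this commit or the repository.

```python
# Illustrative only: stream tokens to stdout via transformers' TextStreamer,
# reusing the model and tokenizer loaded by the GLM-4 generate.py shown above.
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
prompt = GLM4_PROMPT_FORMAT.format(prompt="晚上睡不着应该怎么办")
input_ids = tokenizer.encode(prompt, return_tensors="pt").to("xpu")

with torch.inference_mode():
    # Generated tokens are printed as they arrive.
    model.generate(input_ids, max_new_tokens=128, streamer=streamer)
```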

File 7 of 10: Qwen2.5 GPU example README

@@ -1,5 +1,5 @@
 # Qwen2.5
-In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on Qwen2.5 models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct), [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) and [Qwen/Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) as reference Qwen2.5 models.
+In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on Qwen2.5 models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct), [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) and [Qwen/Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) (or [Qwen/Qwen2.5-3B-Instruct](https://www.modelscope.cn/models/Qwen/Qwen2.5-3B-Instruct), [Qwen/Qwen2.5-7B-Instruct](https://www.modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct) and [Qwen/Qwen2.5-14B-Instruct](https://www.modelscope.cn/models/Qwen/Qwen2.5-14B-Instruct) for ModelScope) as reference Qwen2.5 models.
 ## 0. Requirements
 To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine; please refer to [here](../../../README.md#requirements) for more information.
@@ -14,6 +14,9 @@ conda create -n llm python=3.11
 conda activate llm
 # below command will install intel_extension_for_pytorch==2.1.10+xpu as default
 pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+# [optional] only needed if you would like to use ModelScope as model hub
+pip install modelscope==1.11.0
 ```
 #### 1.2 Installation on Windows
@@ -24,6 +27,9 @@ conda activate llm
 # below command will install intel_extension_for_pytorch==2.1.10+xpu as default
 pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+# [optional] only needed if you would like to use ModelScope as model hub
+pip install modelscope==1.11.0
 ```
 ### 2. Configures OneAPI environment variables for Linux
@@ -91,14 +97,19 @@ set SYCL_CACHE_PERSISTENT=1
 > For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
 ### 4. Running examples
-```
+```bash
+# for Hugging Face model hub
 python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
+# for ModelScope model hub
+python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT --modelscope
 ```
 Arguments info:
-- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Qwen2.5 model (e.g. `Qwen/Qwen2.5-7B-Instruct`) to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'Qwen/Qwen2.5-7B-Instruct'`.
+- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the **Hugging Face** or **ModelScope** repo id for the Qwen2.5 model (e.g. `Qwen/Qwen2.5-7B-Instruct`) to be downloaded, or the path to the checkpoint folder. It is default to be `'Qwen/Qwen2.5-7B-Instruct'`.
 - `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It is default to be `'AI是什么'`.
 - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`.
+- `--modelscope`: using **ModelScope** as model hub instead of **Hugging Face**.
 #### Sample Output
 ##### [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct)

File 8 of 10: Qwen2.5 GPU example generate.py

@@ -18,20 +18,29 @@ import torch
 import time
 import argparse
-from transformers import AutoTokenizer
 if __name__ == '__main__':
     parser = argparse.ArgumentParser(description='Predict Tokens using generate() API for Qwen2.5 model')
     parser.add_argument('--repo-id-or-model-path', type=str, default="Qwen/Qwen2.5-7B-Instruct",
-                        help='The huggingface repo id for the Qwen2.5 model to be downloaded'
+                        help='The Hugging Face or ModelScope repo id for the Qwen2.5 model to be downloaded'
                              ', or the path to the huggingface checkpoint folder')
     parser.add_argument('--prompt', type=str, default="AI是什么",
                         help='Prompt to infer')
     parser.add_argument('--n-predict', type=int, default=32,
                         help='Max tokens to predict')
+    parser.add_argument('--modelscope', action="store_true", default=False,
+                        help="Use models from modelscope")
     args = parser.parse_args()
+
+    if args.modelscope:
+        from modelscope import AutoTokenizer
+        model_hub = 'modelscope'
+    else:
+        from transformers import AutoTokenizer
+        model_hub = 'huggingface'
+
     model_path = args.repo_id_or_model_path
@@ -42,7 +51,8 @@ if __name__ == '__main__':
                                       load_in_4bit=True,
                                       optimize_model=True,
                                       trust_remote_code=True,
-                                      use_cache=True)
+                                      use_cache=True,
+                                      model_hub=model_hub)
     model = model.half().to("xpu")
     # Load tokenizer
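The prompt construction that follows this hunk is unchanged and therefore not shown. For a Qwen2/Qwen2.5 instruct model it typically goes through the tokenizer's chat template, roughly as sketched below; this is an assumed continuation reusing `model`, `tokenizer`, `args`, and `torch` from the script, not code taken from the commit.

```python
# Assumed continuation (not part of this commit's diff).
messages = [{"role": "user", "content": args.prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to("xpu")

with torch.inference_mode():
    generated_ids = model.generate(model_inputs.input_ids,
                                   max_new_tokens=args.n_predict)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = generated_ids[0][model_inputs.input_ids.shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```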

File 9 of 10: Qwen2 GPU example README

@@ -1,5 +1,5 @@
 # Qwen2
-In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on Qwen2 models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) and [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) as reference Qwen2 models.
+In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on Qwen2 models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) and [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) (or [Qwen/Qwen2-7B-Instruct](https://www.modelscope.cn/models/Qwen/Qwen2-7B-Instruct) and [Qwen/Qwen2-1.5B-Instruct](https://www.modelscope.cn/models/Qwen/Qwen2-1.5B-Instruct) for ModelScope) as reference Qwen2 models.
 ## 0. Requirements
 To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine; please refer to [here](../../../README.md#requirements) for more information.
@@ -16,6 +16,9 @@ conda activate llm
 pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
 pip install transformers==4.37.0 # install transformers which supports Qwen2
+# [optional] only needed if you would like to use ModelScope as model hub
+pip install modelscope==1.11.0
 ```
 #### 1.2 Installation on Windows
@@ -28,6 +31,9 @@ conda activate llm
 pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
 pip install transformers==4.37.0 # install transformers which supports Qwen2
+# [optional] only needed if you would like to use ModelScope as model hub
+pip install modelscope==1.11.0
 ```
 ### 2. Configures OneAPI environment variables for Linux
@@ -95,14 +101,19 @@ set SYCL_CACHE_PERSISTENT=1
 > For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
 ### 4. Running examples
-```
+```bash
+# for Hugging Face model hub
 python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
+# for ModelScope model hub
+python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT --modelscope
 ```
 Arguments info:
-- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Qwen2 model (e.g. `Qwen/Qwen2-7B-Instruct`) to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'Qwen/Qwen2-7B-Instruct'`.
+- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the **Hugging Face** or **ModelScope** repo id for the Qwen2 model (e.g. `Qwen/Qwen2-7B-Instruct`) to be downloaded, or the path to the checkpoint folder. It is default to be `'Qwen/Qwen2-7B-Instruct'`.
 - `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It is default to be `'AI是什么'`.
 - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`.
+- `--modelscope`: using **ModelScope** as model hub instead of **Hugging Face**.
 #### Sample Output
 ##### [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct)

File 10 of 10: Qwen2 GPU example generate.py

@@ -18,21 +18,30 @@ import torch
 import time
 import argparse
-from transformers import AutoTokenizer
 import numpy as np
 if __name__ == '__main__':
-    parser = argparse.ArgumentParser(description='Qwen2-7B-Instruct')
+    parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for Qwen2 model')
     parser.add_argument('--repo-id-or-model-path', type=str, default="Qwen/Qwen2-7B-Instruct",
-                        help='The huggingface repo id for the Qwen2 model to be downloaded'
-                             ', or the path to the huggingface checkpoint folder')
+                        help='The Hugging Face or ModelScope repo id for the Qwen2 model to be downloaded'
+                             ', or the path to the checkpoint folder')
     parser.add_argument('--prompt', type=str, default="AI是什么",
                         help='Prompt to infer')
     parser.add_argument('--n-predict', type=int, default=32,
                         help='Max tokens to predict')
+    parser.add_argument('--modelscope', action="store_true", default=False,
+                        help="Use models from modelscope")
     args = parser.parse_args()
+
+    if args.modelscope:
+        from modelscope import AutoTokenizer
+        model_hub = 'modelscope'
+    else:
+        from transformers import AutoTokenizer
+        model_hub = 'huggingface'
+
     model_path = args.repo_id_or_model_path
@@ -43,7 +52,8 @@ if __name__ == '__main__':
                                       load_in_4bit=True,
                                       optimize_model=True,
                                       trust_remote_code=True,
-                                      use_cache=True)
+                                      use_cache=True,
+                                      model_hub=model_hub)
     model = model.half().to("xpu")
     # Load tokenizer