LLM: add chatglm2 example for Arc (#8741)

* add chatglm2 example

* update

* fix readme
Ruonan Wang 2023-08-14 10:43:08 +08:00 committed by GitHub
parent b10d7e1adf
commit faaccb64a2
3 changed files with 143 additions and 0 deletions


@@ -3,8 +3,11 @@ You can use BigDL-LLM to run almost every Huggingface Transformer models with IN
## Recommended Requirements
To apply Intel® Arc™ A-Series Graphics acceleration, there are several steps for tools installation and environment preparation.
Step 1, only Linux is supported for now; Ubuntu 22.04 is preferred.
Step 2, please refer to our [driver installation](https://dgpu-docs.intel.com/installation-guides/index.html#intel-arc-gpus) for general purpose GPU capabilities.
Step 3, you also need to download and install the [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html). OneMKL and the DPC++ compiler are required; the other components are optional.
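As an optional sanity check after Step 2 (our addition, not part of the original steps), a correctly installed GPU driver exposes a DRM render node for the card:
```bash
# Optional check (our addition): once the driver is installed, a render
# node such as renderD128 should appear under /dev/dri.
ls /dev/dri
```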
## Best Known Configuration on Linux


@@ -0,0 +1,67 @@
# ChatGLM2
In this directory, you will find examples of how you can apply BigDL-LLM INT4 optimizations to ChatGLM2 models on Intel® Arc™ A-Series Graphics. For illustration purposes, we use [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b) as the reference ChatGLM2 model.
## 0. Requirements
To run these examples with BigDL-LLM on Intel® Arc™ A-Series Graphics, there are some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
## Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case in which a ChatGLM2 model predicts the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations on Intel® Arc™ A-Series Graphics.
### 1. Install
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
conda activate llm
# the command below installs intel_extension_for_pytorch==2.0.110+xpu by default
# you can install a specific ipex/torch version for your needs
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
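Before moving on, you may want to confirm that the XPU backend is usable. Below is a minimal sanity check we have added (not part of the original instructions), assuming the packages above installed cleanly:
```python
# Minimal sanity check (our addition): confirm IPEX is importable
# and the Arc GPU is visible to PyTorch as an 'xpu' device.
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device type

print(torch.xpu.is_available())      # expected: True on a working Arc setup
print(torch.xpu.get_device_name(0))  # expected: the Arc A-Series device name
```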
### 2. Configure oneAPI environment variables
```bash
source /opt/intel/oneapi/setvars.sh
```
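To verify that the environment is set up, `sycl-ls` (shipped with the oneAPI DPC++ runtime) should now list the Arc GPU; this check is our addition, not part of the original instructions:
```bash
# Optional check (our addition): after sourcing setvars.sh, sycl-ls should
# report the Arc GPU, e.g. a Level Zero entry containing "Intel(R) Arc(TM)".
sycl-ls
```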
### 3. Run
For optimal performance on Arc, it is recommended to set several environment variables first:
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
Then run the example with:
```bash
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
Arguments info (an example invocation follows the list):
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: the Hugging Face repo id of the ChatGLM2 model to be downloaded, or the path to a local Hugging Face checkpoint folder. Defaults to `'THUDM/chatglm2-6b'`.
- `--prompt PROMPT`: the prompt on which inference is run (the integrated chat prompt format is applied automatically). Defaults to `'AI是什么'` (Chinese for "What is AI?").
- `--n-predict N_PREDICT`: the maximum number of tokens to predict. Defaults to `32`.
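For example, to reproduce the English sample output shown below with the default model:
```bash
python ./generate.py --repo-id-or-model-path THUDM/chatglm2-6b \
                     --prompt "What is AI?" \
                     --n-predict 32
```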
#### Sample Output
#### [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b)
```log
Inference time: xxxx s
-------------------- Prompt --------------------
AI是什么
答:
-------------------- Output --------------------
AI是什么
答: AI指的是人工智能,是一种能够通过学习和推理来执行任务的计算机程序。它可以模仿人类的思维方式,做出类似人类的决策,并且具有自主学习、自我
```
```log
Inference time: xxxx s
-------------------- Prompt --------------------
What is AI?
答:
-------------------- Output --------------------
What is AI?
答: Artificial Intelligence (AI) refers to the ability of a computer or machine to perform tasks that typically require human-like intelligence, such as understanding language, recognizing patterns
```


@@ -0,0 +1,73 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import torch
import time
import argparse

from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer
import intel_extension_for_pytorch as ipex  # noqa: F401 -- needed to enable the 'xpu' device

# You could tune the prompt based on your own model;
# here the prompt format follows https://huggingface.co/THUDM/chatglm2-6b/blob/main/modeling_chatglm.py#L1007
CHATGLM_V2_PROMPT_FORMAT = "问:{prompt}\n\n答:"

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for ChatGLM2 model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="THUDM/chatglm2-6b",
                        help='The huggingface repo id for the ChatGLM2 model to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default="AI是什么",
                        help='Prompt to infer')
    parser.add_argument('--n-predict', type=int, default=32,
                        help='Max tokens to predict')
    args = parser.parse_args()
    model_path = args.repo_id_or_model_path

    # Load the model in 4 bit, which converts the relevant layers
    # in the model into INT4 format
    model = AutoModel.from_pretrained(model_path,
                                      load_in_4bit=True,
                                      optimize_model=False,
                                      trust_remote_code=True)
    model = model.half().to('xpu')

    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True)

    # Generate predicted tokens
    with torch.inference_mode():
        prompt = CHATGLM_V2_PROMPT_FORMAT.format(prompt=args.prompt)
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
        st = time.time()
        # If your selected model can use previous key/value attentions
        # to speed up decoding but has `"use_cache": false` in its model
        # config, set `use_cache=True` explicitly in the `generate` call
        # to obtain optimal performance with BigDL-LLM INT4 optimizations
        output = model.generate(input_ids,
                                max_new_tokens=args.n_predict)
        torch.xpu.synchronize()
        end = time.time()
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
        print(f'Inference time: {end - st} s')
        print('-' * 20, 'Prompt', '-' * 20)
        print(prompt)
        print('-' * 20, 'Output', '-' * 20)
        print(output_str)
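
        # A possible follow-up sketch (our addition, assuming your installed
        # bigdl-llm version provides the save_low_bit/load_low_bit helpers):
        # cache the INT4-converted weights so that later runs can skip the
        # FP16-to-INT4 conversion step entirely.
        save_dir = './chatglm2-6b-int4'  # hypothetical local path
        model.save_low_bit(save_dir)     # persist the INT4-converted weights
        # On a later run, the cached checkpoint could then be loaded directly:
        #   model = AutoModel.load_low_bit(save_dir, trust_remote_code=True)
        #   model = model.half().to('xpu')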