LLM: add mpt example on arc (#8723)

2023-08-14 09:40:01 +08:00 · b10d7e1adf
commit b10d7e1adf
parent e9a1afffc5
2 changed files with 135 additions and 0 deletions
--- a/python/llm/example/transformers/transformers_int4/GPU/mpt/README.md
+++ b/python/llm/example/transformers/transformers_int4/GPU/mpt/README.md
@@ -0,0 +1,56 @@
 # MPT
 In this directory, you will find examples of how to apply BigDL-LLM INT4 optimizations to MPT models on any Intel® Arc™ A-Series Graphics. For illustration purposes, we use [mosaicml/mpt-7b-chat](https://huggingface.co/mosaicml/mpt-7b-chat) as a reference MPT model.
 ## 0. Requirements
 To run these examples with BigDL-LLM on Intel® Arc™ A-Series Graphics, your machine should meet some recommended requirements. Please refer to [here](../README.md#recommended-requirements) for more information.
 ## Example: Predict Tokens using `generate()` API
 In the example [generate.py](./generate.py), we show a basic use case for an MPT model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations on Intel® Arc™ A-Series Graphics.
 ### 1. Install
 We suggest using conda to manage the environment:
 ```bash
 conda create -n llm python=3.9
 conda activate llm
 # the command below installs intel_extension_for_pytorch==2.0.110+xpu by default
 # you can install a specific ipex/torch version if needed
 pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
 pip install einops  # additional package required for mpt-7b-chat and mpt-30b-chat to conduct generation
 ```
 ### 2. Configure oneAPI environment variables
 ```bash
 source /opt/intel/oneapi/setvars.sh
 ```
 ### 3. Run
 For optimal performance on Arc, it is recommended to set several environment variables.
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
 ```bash
 python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
 ```
 Arguments info:
 - `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: the Hugging Face repo id of the MPT model to be downloaded (e.g. `mosaicml/mpt-7b-chat`), or the path to a local Hugging Face checkpoint folder. Defaults to `'mosaicml/mpt-7b-chat'`.
 - `--prompt PROMPT`: the prompt to run inference on (the chat prompt format is applied automatically). Defaults to `'What is AI?'`.
 - `--n-predict N_PREDICT`: the maximum number of tokens to predict. Defaults to `32`.
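The way these flags map to values can be sketched in isolation. The snippet below is a minimal illustration, not the script itself: `build_parser` is a hypothetical helper that mirrors the three flags and their defaults as listed above, and it parses an explicit argument list so it can run without a GPU or a model download. Note that `argparse` turns hyphens in flag names into underscores on the resulting namespace.

```python
import argparse


def build_parser():
    # Hypothetical helper (not part of generate.py); it only mirrors the
    # three flags and defaults documented in the argument list above.
    parser = argparse.ArgumentParser(description="Sketch of the generate.py flags")
    parser.add_argument("--repo-id-or-model-path", type=str,
                        default="mosaicml/mpt-7b-chat")
    parser.add_argument("--prompt", type=str, default="What is AI?")
    parser.add_argument("--n-predict", type=int, default=32)
    return parser


# Parse an explicit list instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args(["--n-predict", "64"])

# Hyphens in flag names become underscores on the parsed namespace.
print(args.repo_id_or_model_path)
print(args.prompt)
print(args.n_predict)
```

Only `--n-predict` is overridden here, so the other two values fall back to their defaults.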
 #### Sample Output
 #### [mosaicml/mpt-7b-chat](https://huggingface.co/mosaicml/mpt-7b-chat)
 ```log
 Inference time: xxxx s
 -------------------- Prompt --------------------
 <|im_start|>user
 What is AI?<|im_end|>
 <|im_start|>assistant
 -------------------- Output --------------------
 user
 What is AI?
 assistant
 AI, or artificial intelligence, is the simulation of human intelligence in machines that are programmed to think and learn like humans. AI systems can perform tasks that typically require
 ```
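The difference between the Prompt and Output sections of the log can be illustrated with plain string handling. This is a rough sketch under stated assumptions: `MPT_PROMPT_FORMAT` is copied from `generate.py`, while `strip_markers` is a hypothetical stand-in that only imitates the visible effect of decoding with `skip_special_tokens=True`. It does not reproduce how the tokenizer actually works.

```python
# Prompt template as defined in generate.py.
MPT_PROMPT_FORMAT = "<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"

# Assumption for illustration: these two markers are the special tokens
# that disappear from the decoded output in the sample log.
SPECIAL_MARKERS = ("<|im_start|>", "<|im_end|>")


def strip_markers(text):
    # Hypothetical stand-in for the visible effect of
    # tokenizer.decode(..., skip_special_tokens=True); plain string
    # replacement, not real detokenization.
    for marker in SPECIAL_MARKERS:
        text = text.replace(marker, "")
    return text


prompt = MPT_PROMPT_FORMAT.format(prompt="What is AI?")
echoed = strip_markers(prompt)

print(prompt)   # what the script prints under "Prompt"
print(echoed)   # how the echoed prompt appears at the start of "Output"
```

Because `generate()` returns the input tokens followed by the newly generated ones, the Output section begins with the echoed prompt, minus the chat markers, before the model's answer.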
--- a/python/llm/example/transformers/transformers_int4/GPU/mpt/generate.py
+++ b/python/llm/example/transformers/transformers_int4/GPU/mpt/generate.py
@@ -0,0 +1,79 @@
 #
 # Copyright 2016 The BigDL Authors.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
 import torch
 import time
 import argparse
 from bigdl.llm.transformers import AutoModelForCausalLM
 from transformers import AutoTokenizer, GenerationConfig
 import intel_extension_for_pytorch as ipex
 # you could tune the prompt based on your own model;
 # the prompt format here follows https://huggingface.co/spaces/mosaicml/mpt-30b-chat/blob/main/app.py
 MPT_PROMPT_FORMAT = "<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
 if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for MPT model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="mosaicml/mpt-7b-chat",
                        help='The huggingface repo id for the MPT models '
                             '(e.g. `mosaicml/mpt-7b-chat`) to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default="What is AI?",
                        help='Prompt to infer')
    parser.add_argument('--n-predict', type=int, default=32,
                        help='Max tokens to predict')
    args = parser.parse_args()
    model_path = args.repo_id_or_model_path
    # Load the model in 4-bit,
    # which converts the relevant layers in the model into INT4 format
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 optimize_model=False,
                                                 trust_remote_code=True)
    model = model.half().to('xpu')
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True)
    # Generate predicted tokens
    with torch.inference_mode():
        prompt = MPT_PROMPT_FORMAT.format(prompt=args.prompt)
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
        # enabling `use_cache=True` allows the model to reuse the cached
        # attention key/value states from previous steps to speed up decoding;
        # to obtain optimal performance with BigDL-LLM INT4 optimizations,
        # it is important to set use_cache=True for MPT models
        mpt_generation_config = GenerationConfig(
            max_new_tokens=args.n_predict,
            use_cache=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
        st = time.time()
        output = model.generate(input_ids,
                                generation_config=mpt_generation_config)
        end = time.time()
        output = output.cpu()
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
        print(f'Inference time: {end-st} s')
        print('-'*20, 'Prompt', '-'*20)
        print(prompt)
        print('-'*20, 'Output', '-'*20)
        print(output_str)