add gpu more data types example (#9592)

* add gpu more data types example

* add int8
dingbaorong 2023-12-05 15:45:38 +08:00 committed by GitHub
parent 65934c9f4f
commit a66fbedd7e
3 changed files with 105 additions and 0 deletions

@ -0,0 +1,45 @@
# BigDL-LLM Transformers Low-Bit Inference Pipeline for Large Language Models
In this example, we show a pipeline that applies BigDL-LLM low-bit optimizations (including FP8/INT8/MixedFP8/FP4/MixedFP4) to any Hugging Face Transformers model, and then runs inference on the optimized low-bit model.
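The key step is a single extra argument to `from_pretrained`; the snippet below is a minimal sketch of that call, taken from the full example script further down (the model id and the `fp4` choice are just this example's defaults):
```python
from bigdl.llm.transformers import AutoModelForCausalLM

# Quantize the weights while loading; load_in_low_bit also accepts
# fp8, sym_int8, mixed_fp8 and mixed_fp4
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                             load_in_low_bit="fp4",
                                             trust_remote_code=True)
model = model.to('xpu')  # move the low-bit model to the Intel GPU
```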
## Prepare Environment
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
conda activate llm
# the command below installs intel_extension_for_pytorch==2.0.110+xpu by default
# you can install a specific ipex/torch version for your needs
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
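Before running the example, you can optionally sanity-check that PyTorch can see the Intel GPU. A minimal sketch, assuming the IPEX XPU backend is installed as above (the `torch.xpu` namespace is provided by `intel_extension_for_pytorch`):
```python
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device with PyTorch

# assumption: torch.xpu becomes available once intel_extension_for_pytorch is imported
print(torch.xpu.is_available())      # expect True on a supported Intel GPU
print(torch.xpu.get_device_name(0))  # name of the first XPU device
```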
## Run Example
```bash
python ./transformers_low_bit_pipeline.py --repo-id-or-model-path meta-llama/Llama-2-7b-chat-hf --low-bit fp4 --save-path ./llama-2-7b-fp4
```
Arguments info:
- `--repo-id-or-model-path`: str value, the Hugging Face repo id of the large language model to be downloaded, or the path to a Hugging Face checkpoint folder. The default is `meta-llama/Llama-2-7b-chat-hf`.
- `--low-bit`: str value, one of `fp8`, `sym_int8`, `fp4`, `mixed_fp8` or `mixed_fp4`. The corresponding low-bit optimization will be applied to the model.
- `--save-path`: optional str value, the path to save the low-bit model, so that you can load it directly later (see the sketch after this list).
- `--load-path`: optional str value, the path to load a previously saved low-bit model.
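`--save-path` and `--load-path` are meant to be used together: quantize once from the original checkpoint, then reload the already-converted weights directly. A minimal sketch of that flow, using the same calls as the example script and the paths from the command above:
```python
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

# First run: quantize from the original checkpoint and save the result
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                             load_in_low_bit="fp4",
                                             trust_remote_code=True)
model.save_low_bit("./llama-2-7b-fp4")
AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf").save_pretrained("./llama-2-7b-fp4")

# Later runs: load the already-quantized weights without the original checkpoint
model = AutoModelForCausalLM.load_low_bit("./llama-2-7b-fp4").to('xpu')
tokenizer = AutoTokenizer.from_pretrained("./llama-2-7b-fp4")
```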
## Sample Output for Inference
### `meta-llama/Llama-2-7b-chat-hf` Model
```log
Prompt: Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun
Output: Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun. But her parents were always telling her to stay at home and be careful. They were worried about her safety and didn't want her to get hurt
Model and tokenizer are saved to ./llama-2-7b-fp4
```
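Note that the pipeline's `generated_text` includes the prompt itself, and generation stops after `max_new_tokens=32` new tokens, which is why the continuation above breaks off mid-sentence.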
### Load low-bit model
Command to run:
```bash
python ./transformers_low_bit_pipeline.py --load-path ./llama-2-7b-fp4
```
Output log:
```log
Prompt: Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun
Output: Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun. But her parents were always telling her to stay at home and be careful. They were worried about her safety and didn't want her to get hurt
```
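For reference, the same generation step can also be written without `TextGenerationPipeline`, using the standard Transformers `generate` API. A minimal sketch, assuming `model`, `tokenizer` and `input_str` as in the example script:
```python
import torch

# Tokenize on CPU, then move the input ids to the XPU where the model lives
input_ids = tokenizer.encode(input_str, return_tensors="pt").to('xpu')
with torch.inference_mode():
    output_ids = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```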

@ -0,0 +1,60 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import torch
import intel_extension_for_pytorch as ipex  # imported for its side effect: registers the 'xpu' device with PyTorch
import argparse
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer, TextGenerationPipeline

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Transformer save_load example')
    parser.add_argument('--repo-id-or-model-path', type=str, default="meta-llama/Llama-2-7b-chat-hf",
                        help='The huggingface repo id for the large language model to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--low-bit', type=str, default="fp4",
                        choices=['fp8', 'sym_int8', 'fp4', 'mixed_fp8', 'mixed_fp4'],
                        help='The quantization type the model will be converted to.')
    parser.add_argument('--save-path', type=str, default=None,
                        help='The path to save the low-bit model.')
    parser.add_argument('--load-path', type=str, default=None,
                        help='The path to load a previously saved low-bit model.')
    args = parser.parse_args()
    model_path = args.repo_id_or_model_path
    low_bit = args.low_bit
    load_path = args.load_path
    if load_path:
        # Load a previously saved low-bit model directly, skipping quantization
        model = AutoModelForCausalLM.load_low_bit(load_path)
        model = model.to('xpu')
        tokenizer = AutoTokenizer.from_pretrained(load_path)
    else:
        # load_in_low_bit in bigdl.llm.transformers converts the relevant
        # layers in the model into the corresponding low-bit format
        model = AutoModelForCausalLM.from_pretrained(model_path, load_in_low_bit=low_bit, trust_remote_code=True)
        model = model.to('xpu')
        tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, max_new_tokens=32, device="xpu")
    input_str = "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun"
    output = pipeline(input_str)[0]["generated_text"]
    print(f"Prompt: {input_str}")
    print(f"Output: {output}")
    save_path = args.save_path
    if save_path:
        # Persist the quantized weights and tokenizer so later runs can use --load-path
        model.save_low_bit(save_path)
        tokenizer.save_pretrained(save_path)
        print(f"Model and tokenizer are saved to {save_path}")