[LLM] Unify transformers-like API example for 3 different model families (#8315)
* Refactor bigdl-llm transformers-like API to unify them
* Small fix

parent c4028d507c
commit f83c48280f

8 changed files with 244 additions and 457 deletions

@@ -1,33 +0,0 @@
# Inference Pipeline for BLOOM Family Models in INT4 Data Type

In this example, we show a pipeline to conduct inference on a converted low-precision (INT4) large language model in the BLOOM family, using `bigdl-llm`.

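At its core, the pipeline boils down to one load call and one inference call. Below is a minimal sketch of what the accompanying `bloom.py` script (shown later in this commit) does, assuming the default `'bigscience/bloomz-7b1'` checkpoint and `n_threads=2`:

```python
from bigdl.llm.ggml.transformers import AutoModelForCausalLM

# download (or reuse) the checkpoint, convert it to an INT4 binary in `cache_dir`,
# and load it for CPU inference with the given number of threads
llm = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path='bigscience/bloomz-7b1',  # or a local checkpoint folder
    model_family='bloom',
    dtype='int4',
    cache_dir='./',
    n_threads=2)

# "fast forward" inference; the prompt must be a plain string here
output = llm('Q: What is AI? A:', max_tokens=32)
print(output)
```
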
## Prepare Environment
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
conda activate llm

pip install bigdl-llm[all]
```

## Run Example
```bash
python ./bloom.py --thread-num THREAD_NUM
```
arguments info:
- `--thread-num THREAD_NUM`: required argument defining the number of threads to use for inference. It defaults to `2`.
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: optional argument defining the huggingface repo id from which the BLOOM family model is downloaded, or the path to the huggingface checkpoint folder for the BLOOM family model. It defaults to `'bigscience/bloomz-7b1'`.
- `--prompt PROMPT`: optional argument defining the prompt to be inferred. It defaults to `'Q: What is AI? A:'`.

## Sample Output for Inference
```log
inference:    mem per token = 24471324 bytes
inference:      sample time =     xxxx ms
inference: evel prompt time =     xxxx ms / 5 tokens / xxxx ms per token
inference:     predict time =     xxxx ms / 2 tokens / xxxx ms per token
inference:       total time =     xxxx ms
Inference time (fast forward): xxxx s
Output:
{'id': 'cmpl-bb268afb-e088-4729-91fa-8746ea4fa706', 'object': 'text_completion', 'created': 1686294707, 'model': '/disk5/yuwen/bloom/bigdl_llm_bloom_q4_0.bin', 'choices': [{'text': 'Q: What is AI? A: artificial intelligence</s>', 'index': 0, 'logprobs': None, 'finish_reason': None}], 'usage': {'prompt_tokens': None, 'completion_tokens': None, 'total_tokens': None}}
```

@@ -1,89 +0,0 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


import time
import argparse


def convert_and_load(repo_id_or_model_path, n_threads):

    from bigdl.llm.ggml.transformers import AutoModelForCausalLM

    # Here you may input the HuggingFace repo id directly as the value of `pretrained_model_name_or_path`.
    # This will allow the pre-trained model to be downloaded directly from the HuggingFace repository.
    # The downloaded model will then be converted to binary format with int4 dtype weights,
    # and saved into the cache_dir folder.
    #
    # If you already have the pre-trained model downloaded, you can provide the path to
    # the downloaded folder as the value of `pretrained_model_name_or_path`.
    llm = AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path=repo_id_or_model_path,
        model_family='bloom',
        dtype='int4',
        cache_dir='./',
        n_threads=n_threads)

    # If you want to explicitly convert the pre-trained model, you can use the `convert_model` API
    # to convert the downloaded HuggingFace checkpoint first,
    # and then load the binary checkpoint directly.
    #
    # from bigdl.llm.ggml import convert_model
    #
    # model_path = repo_id_or_model_path
    # output_ckpt_path = convert_model(
    #     input_path=model_path,
    #     output_path='./',
    #     dtype='int4',
    #     model_family='bloom')
    #
    # llm = AutoModelForCausalLM.from_pretrained(
    #     pretrained_model_name_or_path=output_ckpt_path,
    #     model_family='bloom',
    #     n_threads=n_threads)

    return llm


def inference(llm, prompt):

    st = time.time()

    output = llm(prompt,
                 max_tokens=32)

    print(f'Inference time (fast forward): {time.time()-st} s')
    print(f'Output:\n{output}')


def main():
    parser = argparse.ArgumentParser(description='BLOOM pipeline example')
    parser.add_argument('--thread-num', type=int, default=2, required=True,
                        help='Number of threads to use for inference')
    parser.add_argument('--repo-id-or-model-path', type=str, default="bigscience/bloomz-7b1",
                        help='The huggingface repo id for BLOOM family model to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default='Q: What is AI? A:',
                        help='Prompt to infer')
    args = parser.parse_args()

    # Step 1: convert and load int4 model
    llm = convert_and_load(repo_id_or_model_path=args.repo_id_or_model_path, n_threads=args.thread_num)

    # Step 2: conduct inference
    inference(llm=llm, prompt=args.prompt)


if __name__ == '__main__':
    main()

@@ -1,46 +0,0 @@
# Inference Pipeline for GPT-NeoX Family Models in INT4 Data Type

In this example, we show a pipeline to conduct inference on a converted low-precision (INT4) large language model in the GPT-NeoX family, using `bigdl-llm`.

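Besides the "fast forward" call, the accompanying `gptneox.py` script (shown later in this commit) also demonstrates tokenizer-based generation. Below is a rough sketch of that path, assuming the default `'togethercomputer/RedPajama-INCITE-7B-Chat'` checkpoint and `n_threads=2`:

```python
from transformers import AutoTokenizer
from bigdl.llm.ggml.transformers import AutoModelForCausalLM

# convert the GPT-NeoX family checkpoint to INT4 and load it
llm = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path='togethercomputer/RedPajama-INCITE-7B-Chat',
    model_family='gptneox',
    dtype='int4',
    cache_dir='./',
    n_threads=2)

# tokenize with the HuggingFace tokenizer, generate with the INT4 model, then decode
tokenizer = AutoTokenizer.from_pretrained('togethercomputer/RedPajama-INCITE-7B-Chat')
tokens_id = tokenizer('Q: What is AI? A:').input_ids
output_tokens_id = llm.generate(tokens_id, max_new_tokens=32)
print(tokenizer.batch_decode(output_tokens_id))
```
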
## Prepare Environment
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
conda activate llm

pip install bigdl-llm[all]
```

## Run Example
```bash
python ./gptneox.py --thread-num THREAD_NUM
```
arguments info:
- `--thread-num THREAD_NUM`: required argument defining the number of threads to use for inference. It defaults to `2`.
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: optional argument defining the huggingface repo id from which the GPT-NeoX family model is downloaded, or the path to the huggingface checkpoint folder for the GPT-NeoX family model. It defaults to `'togethercomputer/RedPajama-INCITE-7B-Chat'`.
- `--prompt PROMPT`: optional argument defining the prompt to be inferred. It defaults to `'Q: What is AI? A:'`.

## Sample Output for Inference
```log
--------------------  HuggingFace transformers tokenizer  --------------------
Please note that the loading of transformers tokenizer may take some time.

Inference time: xxxx s
Output:
[' The term "AI" itself is a bit of a red herring, as real intelligence is impossible to fully replicate in a machine. However, it\'s commonly accepted']
--------------------  bigdl-llm based tokenizer  --------------------
Inference time: xxxx s
Output:
[' Artificial Intelligence is the development of computer systems which can carry out activities which normally require human intelligence, such as visual perception, speech recognition, decision-making, and']
--------------------  fast forward  --------------------
Gptneox.generate: prefix-match hit

gptneox_print_timings:        load time =  xxxx ms
gptneox_print_timings:      sample time =  xxxx ms /    32 runs   (    xxxx ms per run)
gptneox_print_timings: prompt eval time =  xxxx ms /     8 tokens (    xxxx ms per token)
gptneox_print_timings:        eval time =  xxxx ms /    31 runs   (    xxxx ms per run)
gptneox_print_timings:       total time =  xxxx ms
Inference time (fast forward): xxxx s
Output:
{'id': 'cmpl-f598d623-5186-44c9-ba58-d8bc76634b3c', 'object': 'text_completion', 'created': 1686294834, 'model': '/disk5/yuwen/gptneox/bigdl_llm_gptneox_q4_0.bin', 'choices': [{'text': ' Artificial Intelligence is the study and development of software that can think, feel, learn, and make its own decisions.\n<human>: Classify each one', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 9, 'completion_tokens': 32, 'total_tokens': 41}}
```

@@ -1,120 +0,0 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


import time
import argparse


def convert_and_load(repo_id_or_model_path, n_threads):

    from bigdl.llm.ggml.transformers import AutoModelForCausalLM

    # Here you may input the HuggingFace repo id directly as the value of `pretrained_model_name_or_path`.
    # This will allow the pre-trained model to be downloaded directly from the HuggingFace repository.
    # The downloaded model will then be converted to binary format with int4 dtype weights,
    # and saved into the cache_dir folder.
    #
    # If you already have the pre-trained model downloaded, you can provide the path to
    # the downloaded folder as the value of `pretrained_model_name_or_path`.
    llm = AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path=repo_id_or_model_path,
        model_family='gptneox',
        dtype='int4',
        cache_dir='./',
        n_threads=n_threads)

    # If you want to explicitly convert the pre-trained model, you can use the `convert_model` API
    # to convert the downloaded HuggingFace checkpoint first,
    # and then load the binary checkpoint directly.
    #
    # from bigdl.llm.ggml import convert_model
    #
    # model_path = repo_id_or_model_path
    # output_ckpt_path = convert_model(
    #     input_path=model_path,
    #     output_path='./',
    #     dtype='int4',
    #     model_family='gptneox')
    #
    # llm = AutoModelForCausalLM.from_pretrained(
    #     pretrained_model_name_or_path=output_ckpt_path,
    #     model_family='gptneox',
    #     n_threads=n_threads)

    return llm


def inference(llm, prompt, repo_id_or_model_path):

    # Option 1: Use HuggingFace transformers tokenizer
    print('-'*20, ' HuggingFace transformers tokenizer ', '-'*20)
    from transformers import AutoTokenizer

    print('Please note that the loading of HuggingFace transformers tokenizer may take some time.\n')
    tokenizer = AutoTokenizer.from_pretrained(repo_id_or_model_path)

    st = time.time()

    # Please note that the prompt here can either be a string or a list of strings
    tokens_id = tokenizer(prompt).input_ids
    output_tokens_id = llm.generate(tokens_id, max_new_tokens=32)
    output = tokenizer.batch_decode(output_tokens_id)

    print(f'Inference time: {time.time()-st} s')
    print(f'Output:\n{output}')

    # Option 2: Use bigdl-llm based tokenizer
    print('-'*20, ' bigdl-llm based tokenizer ', '-'*20)
    st = time.time()

    # Please note that the prompt here can either be a string or a list of strings
    tokens_id = llm.tokenize(prompt)
    output_tokens_id = llm.generate(tokens_id, max_new_tokens=32)
    output = llm.batch_decode(output_tokens_id)

    print(f'Inference time: {time.time()-st} s')
    print(f'Output:\n{output}')

    # Option 3: fast forward
    print('-'*20, ' fast forward ', '-'*20)
    st = time.time()

    output = llm(prompt, # please note that the prompt here can ONLY be a string
                 max_tokens=32)

    print(f'Inference time (fast forward): {time.time()-st} s')
    print(f'Output:\n{output}')


def main():
    parser = argparse.ArgumentParser(description='GPT-NeoX pipeline example')
    parser.add_argument('--thread-num', type=int, default=2, required=True,
                        help='Number of threads to use for inference')
    parser.add_argument('--repo-id-or-model-path', type=str, default="togethercomputer/RedPajama-INCITE-7B-Chat",
                        help='The huggingface repo id for GPT-NeoX family model to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default='Q: What is AI? A:',
                        help='Prompt to infer')
    args = parser.parse_args()

    # Step 1: convert and load int4 model
    llm = convert_and_load(repo_id_or_model_path=args.repo_id_or_model_path, n_threads=args.thread_num)

    # Step 2: conduct inference
    inference(llm=llm, prompt=args.prompt, repo_id_or_model_path=args.repo_id_or_model_path)


if __name__ == '__main__':
    main()

@@ -1,49 +0,0 @@
# Inference Pipeline for LLaMA Family Models in INT4 Data Type

In this example, we show a pipeline to conduct inference on a converted low-precision (INT4) large language model in the LLaMA family, using `bigdl-llm`.

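The accompanying `llama.py` script (shown later in this commit) converts the checkpoint on the fly inside `from_pretrained`; its comments also note an explicit two-step variant. Below is a rough sketch of that variant, where `model_path` is a placeholder for a locally downloaded LLaMA checkpoint folder:

```python
from bigdl.llm.ggml import convert_model
from bigdl.llm.ggml.transformers import AutoModelForCausalLM

# placeholder path to an already-downloaded HuggingFace checkpoint folder
model_path = './llama-7b-hf'

# Step 1: explicitly convert the checkpoint to an INT4 binary checkpoint
output_ckpt_path = convert_model(
    input_path=model_path,
    output_path='./',
    dtype='int4',
    model_family='llama')

# Step 2: load the converted binary checkpoint directly
llm = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=output_ckpt_path,
    model_family='llama',
    n_threads=2)
```
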
## Prepare Environment
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
conda activate llm

pip install bigdl-llm[all]
```

## Run Example
```bash
python ./llama.py --thread-num THREAD_NUM
```
arguments info:
- `--thread-num THREAD_NUM`: required argument defining the number of threads to use for inference. It defaults to `2`.
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: optional argument defining the huggingface repo id from which the LLaMA family model is downloaded, or the path to the huggingface checkpoint folder for the LLaMA family model. It defaults to `'decapoda-research/llama-7b-hf'`.
- `--prompt PROMPT`: optional argument defining the prompt to be inferred. It defaults to `'Q: What is AI? A:'`.

## Sample Output for Inference
```log
--------------------  HuggingFace transformers tokenizer  --------------------
Please note that the loading of HuggingFace transformers tokenizer may take some time.

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
Inference time: xxxx s
Output:
['It’s the ability of computers to perform tasks that usually require human intelligence.\n WORLD WAR II: 75 YEARS LAT']
--------------------  bigdl-llm based tokenizer  --------------------
Inference time: xxxx s
Output:
[" It's everything\nEthics and artificial intelligence have been a hot topic this year, as researchers and the public wrestle with the implications of this"]
--------------------  fast forward  --------------------
Llama.generate: prefix-match hit

llama_print_timings:        load time =  xxxx ms
llama_print_timings:      sample time =  xxxx ms /    32 runs   (   xxxx ms per token)
llama_print_timings: prompt eval time =  xxxx ms /     9 tokens (   xxxx ms per token)
llama_print_timings:        eval time =  xxxx ms /    31 runs   (   xxxx ms per token)
llama_print_timings:       total time =  xxxx ms
Inference time (fast forward): xxxx s
Output:
{'id': 'cmpl-f3c5482a-b84e-4363-a85c-89cf7d23ff51', 'object': 'text_completion', 'created': 1686294953, 'model': '/disk5/yuwen/llama/bigdl_llm_llama_q4_0.bin', 'choices': [{'text': ' It’s the latest hot topic in tech. From virtual assistants to driverless cars, machine learning to big data analytics. We hear it a', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 10, 'completion_tokens': 32, 'total_tokens': 42}}
```

@@ -1,120 +0,0 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


import time
import argparse


def convert_and_load(repo_id_or_model_path, n_threads):

    from bigdl.llm.ggml.transformers import AutoModelForCausalLM

    # Here you may input the HuggingFace repo id directly as the value of `pretrained_model_name_or_path`.
    # This will allow the pre-trained model to be downloaded directly from the HuggingFace repository.
    # The downloaded model will then be converted to binary format with int4 dtype weights,
    # and saved into the cache_dir folder.
    #
    # If you already have the pre-trained model downloaded, you can provide the path to
    # the downloaded folder as the value of `pretrained_model_name_or_path`.
    llm = AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path=repo_id_or_model_path,
        model_family='llama',
        dtype='int4',
        cache_dir='./',
        n_threads=n_threads)

    # If you want to explicitly convert the pre-trained model, you can use the `convert_model` API
    # to convert the downloaded HuggingFace checkpoint first,
    # and then load the binary checkpoint directly.
    #
    # from bigdl.llm.ggml import convert_model
    #
    # model_path = repo_id_or_model_path
    # output_ckpt_path = convert_model(
    #     input_path=model_path,
    #     output_path='./',
    #     dtype='int4',
    #     model_family='llama')
    #
    # llm = AutoModelForCausalLM.from_pretrained(
    #     pretrained_model_name_or_path=output_ckpt_path,
    #     model_family='llama',
    #     n_threads=n_threads)

    return llm


def inference(llm, prompt, repo_id_or_model_path):

    # Option 1: Use HuggingFace transformers tokenizer
    print('-'*20, ' HuggingFace transformers tokenizer ', '-'*20)
    from transformers import LlamaTokenizer

    print('Please note that the loading of HuggingFace transformers tokenizer may take some time.\n')
    tokenizer = LlamaTokenizer.from_pretrained(repo_id_or_model_path)

    st = time.time()

    # Please note that the prompt here can either be a string or a list of strings
    tokens_id = tokenizer(prompt).input_ids
    output_tokens_id = llm.generate(tokens_id, max_new_tokens=32)
    output = tokenizer.batch_decode(output_tokens_id)

    print(f'Inference time: {time.time()-st} s')
    print(f'Output:\n{output}')

    # Option 2: Use bigdl-llm based tokenizer
    print('-'*20, ' bigdl-llm based tokenizer ', '-'*20)
    st = time.time()

    # Please note that the prompt here can either be a string or a list of strings
    tokens_id = llm.tokenize(prompt)
    output_tokens_id = llm.generate(tokens_id, max_new_tokens=32)
    output = llm.batch_decode(output_tokens_id)

    print(f'Inference time: {time.time()-st} s')
    print(f'Output:\n{output}')

    # Option 3: fast forward
    print('-'*20, ' fast forward ', '-'*20)
    st = time.time()

    output = llm(prompt, # please note that the prompt here can ONLY be a string
                 max_tokens=32)

    print(f'Inference time (fast forward): {time.time()-st} s')
    print(f'Output:\n{output}')


def main():
    parser = argparse.ArgumentParser(description='LLaMA pipeline example')
    parser.add_argument('--thread-num', type=int, default=2, required=True,
                        help='Number of threads to use for inference')
    parser.add_argument('--repo-id-or-model-path', type=str, default="decapoda-research/llama-7b-hf",
                        help='The huggingface repo id for LLaMA family model to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default='Q: What is AI? A:',
                        help='Prompt to infer')
    args = parser.parse_args()

    # Step 1: convert and load int4 model
    llm = convert_and_load(repo_id_or_model_path=args.repo_id_or_model_path, n_threads=args.thread_num)

    # Step 2: conduct inference
    inference(llm=llm, prompt=args.prompt, repo_id_or_model_path=args.repo_id_or_model_path)


if __name__ == '__main__':
    main()

python/llm/example/transformers/README.md (new file, 96 lines)

@@ -0,0 +1,96 @@
# INT4 Inference Pipeline for Large Language Model using BigDL-LLM Transformers-like API

In this example, we show a pipeline to convert a large language model to low precision (INT4), and then conduct inference on the converted INT4 model, using the BigDL-LLM transformers-like API.

> **Note**: BigDL-LLM currently supports the LLaMA, GPT-NeoX, and BLOOM model families.

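All three families go through the same transformers-like API; only the `model_family` value (and the matching checkpoint) changes. Below is a minimal sketch of what `int4_pipeline.py` (added in this commit) does for the fast-forward path, using the LLaMA defaults as an example:

```python
from bigdl.llm.ggml.transformers import AutoModelForCausalLM

# the same call works for 'llama', 'gptneox', and 'bloom';
# just switch `model_family` and the checkpoint it points to
llm = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path='decapoda-research/llama-7b-hf',
    model_family='llama',
    dtype='int4',
    cache_dir='./',
    n_threads=2)

# "fast forward" inference; the prompt must be a plain string here
output = llm('Q: What is CPU? A:', max_tokens=32)
print(output)
```
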
## Prepare Environment
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
conda activate llm

pip install --pre --upgrade bigdl-llm[all]
```

## Run Example
```bash
python ./int4_pipeline.py --thread-num THREAD_NUM --model-family MODEL_FAMILY
```
arguments info:
- `--thread-num THREAD_NUM`: **required** argument defining the number of threads to use for inference. It defaults to `2`.
- `--model-family MODEL_FAMILY`: **required** argument defining the model family of the large language model (supported options: `'llama'`, `'gptneox'`, `'bloom'`). It defaults to `'llama'`.
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: optional argument defining the huggingface repo id from which the large language model is downloaded, or the path to the huggingface checkpoint folder for the model.

  - When the model family is `'llama'`, it defaults to `'decapoda-research/llama-7b-hf'`.
  - When the model family is `'gptneox'`, it defaults to `'togethercomputer/RedPajama-INCITE-7B-Chat'`.
  - When the model family is `'bloom'`, it defaults to `'bigscience/bloomz-7b1'`.

  > **Note**: `REPO_ID_OR_MODEL_PATH` should match the `MODEL_FAMILY` you input.
- `--prompt PROMPT`: optional argument defining the prompt to be inferred. It defaults to `'Q: What is CPU? A:'`.

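For the `'llama'` and `'gptneox'` families, the script additionally runs two tokenizer-based options on top of the fast-forward path; the BLOOM family currently only supports fast forward. Below is a rough sketch of the bigdl-llm based tokenizer option, reusing an `llm` loaded as in the sketch above:

```python
import time

st = time.time()

# tokenize with the bigdl-llm based tokenizer; the prompt can be
# a string or a list of strings
tokens_id = llm.tokenize('Q: What is CPU? A:')
output_tokens_id = llm.generate(tokens_id, max_new_tokens=32)
output = llm.batch_decode(output_tokens_id)

print(f'Inference time: {time.time()-st} s')
print(f'Output:\n{output}')
```
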
## Sample Output for Inference
### Model family LLaMA
```log
--------------------  HuggingFace transformers tokenizer  --------------------
Please note that the loading of HuggingFace transformers tokenizer may take some time.

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
Inference time: xxxx s
Output:
["The Central Processing Unit (CPU) is the brains of your computer, and is also known as the microprocessor. It's where all the action"]
--------------------  bigdl-llm based tokenizer  --------------------
Inference time: xxxx s
Output:
[' It’s the acronym for “Central Processing Unit,” and in modern personal computers it means a single microprocessor chip that is used to control various']
--------------------  fast forward  --------------------
Llama.generate: prefix-match hit

llama_print_timings:        load time =     xxxx ms
llama_print_timings:      sample time =     xxxx ms /    32 runs   (    xxxx ms per token)
llama_print_timings: prompt eval time =     xxxx ms /     8 tokens (    xxxx ms per token)
llama_print_timings:        eval time =     xxxx ms /    31 runs   (    xxxx ms per token)
llama_print_timings:       total time =     xxxx ms
Inference time (fast forward): xxxx s
Output:
{'id': 'cmpl-5aa68120-c94b-4433-92f4-b75cc323c22f', 'object': 'text_completion', 'created': 1686557904, 'model': './bigdl_llm_llama_q4_0.bin', 'choices': [{'text': ' It’s a small, compact computer unit that runs on a single chip. This can be connected to various peripheral devices, including printers and displays', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 9, 'completion_tokens': 32, 'total_tokens': 41}}
```

### Model family GPT-NeoX
```log
--------------------  HuggingFace transformers tokenizer  --------------------
Please note that the loading of HuggingFace transformers tokenizer may take some time.

Inference time: xxxx s
Output:
[' The Central Processing Unit, or CPU, is the component of a computer that executes all instructions for carrying out different functions. It is the brains of the operation, and']
--------------------  bigdl-llm based tokenizer  --------------------
Inference time: xxxx s
Output:
[' Central processing unit, also known as processor, is a specialized microchip designed to execute all the instructions of computer programs rapidly and efficiently. Most personal computers have one or']
--------------------  fast forward  --------------------
Gptneox.generate: prefix-match hit

gptneox_print_timings:        load time =     xxxx ms
gptneox_print_timings:      sample time =     xxxx ms /    32 runs   (    xxxx ms per run)
gptneox_print_timings: prompt eval time =     xxxx ms /     8 tokens (    xxxx ms per token)
gptneox_print_timings:        eval time =     xxxx ms /    31 runs   (    xxxx ms per run)
gptneox_print_timings:       total time =     xxxx ms
Inference time (fast forward): xxxx s
Output:
{'id': 'cmpl-a20fc4a1-3a00-4e77-a6cf-0dd0da6b9a59', 'object': 'text_completion', 'created': 1686557799, 'model': './bigdl_llm_gptneox_q4_0.bin', 'choices': [{'text': ' Core Processing Unit  or Central Processing Unit  is the brain of your computer, system software runs on it and handles all important tasks in your computer. i', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 9, 'completion_tokens': 32, 'total_tokens': 41}}
```

### Model family BLOOM
```log
inference:    mem per token = 24471324 bytes
inference:      sample time =     xxxx ms
inference: evel prompt time =     xxxx ms / 5 tokens / xxxx ms per token
inference:     predict time =     xxxx ms / 3 tokens / xxxx ms per token
inference:       total time =     xxxx ms
Inference time (fast forward): xxxx s
Output:
{'id': 'cmpl-a0ab2953-e08c-449c-b476-e21ad5bb84b0', 'object': 'text_completion', 'created': 1686557434, 'model': './bigdl_llm_bloom_q4_0.bin', 'choices': [{'text': 'Q: What is CPU? A: central processing unit</s>', 'index': 0, 'logprobs': None, 'finish_reason': None}], 'usage': {'prompt_tokens': None, 'completion_tokens': None, 'total_tokens': None}}
```

python/llm/example/transformers/int4_pipeline.py (new file, 148 lines)

@@ -0,0 +1,148 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


import time
import argparse


def convert_and_load(repo_id_or_model_path, model_family, n_threads):

    from bigdl.llm.ggml.transformers import AutoModelForCausalLM

    # Here you may input the HuggingFace repo id directly as the value of `pretrained_model_name_or_path`.
    # This will allow the pre-trained model to be downloaded directly from the HuggingFace repository.
    # The downloaded model will then be converted to binary format with int4 dtype weights,
    # and saved into the cache_dir folder.
    #
    # If you already have the pre-trained model downloaded, you can provide the path to
    # the downloaded folder as the value of `pretrained_model_name_or_path`.
    llm = AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path=repo_id_or_model_path,
        model_family=model_family,
        dtype='int4',
        cache_dir='./',
        n_threads=n_threads)

    # If you want to explicitly convert the pre-trained model, you can use the `convert_model` API
    # to convert the downloaded HuggingFace checkpoint first,
    # and then load the binary checkpoint directly.
    #
    # from bigdl.llm.ggml import convert_model
    #
    # model_path = repo_id_or_model_path
    # output_ckpt_path = convert_model(
    #     input_path=model_path,
    #     output_path='./',
    #     dtype='int4',
    #     model_family=model_family)
    #
    # llm = AutoModelForCausalLM.from_pretrained(
    #     pretrained_model_name_or_path=output_ckpt_path,
    #     model_family=model_family,
    #     n_threads=n_threads)

    return llm


def inference(llm, repo_id_or_model_path, model_family, prompt):

    if model_family in ['llama', 'gptneox']:
        # Option 1: Use HuggingFace transformers tokenizer
        print('-'*20, ' HuggingFace transformers tokenizer ', '-'*20)

        print('Please note that the loading of HuggingFace transformers tokenizer may take some time.\n')
        # This is only a workaround for the default example model 'decapoda-research/llama-7b-hf' in the LLaMA family,
        # due to the out-of-date 'tokenizer_class' defined in its tokenizer_config.json.
        #
        # For most cases, you could use `AutoTokenizer`.
        if model_family == 'llama':
            from transformers import LlamaTokenizer
            tokenizer = LlamaTokenizer.from_pretrained(repo_id_or_model_path)
        else:
            from transformers import AutoTokenizer
            tokenizer = AutoTokenizer.from_pretrained(repo_id_or_model_path)

        st = time.time()

        # Please note that the prompt here can either be a string or a list of strings
        tokens_id = tokenizer(prompt).input_ids
        output_tokens_id = llm.generate(tokens_id, max_new_tokens=32)
        output = tokenizer.batch_decode(output_tokens_id)

        print(f'Inference time: {time.time()-st} s')
        print(f'Output:\n{output}')

        # Option 2: Use bigdl-llm based tokenizer
        print('-'*20, ' bigdl-llm based tokenizer ', '-'*20)
        st = time.time()

        # Please note that the prompt here can either be a string or a list of strings
        tokens_id = llm.tokenize(prompt)
        output_tokens_id = llm.generate(tokens_id, max_new_tokens=32)
        output = llm.batch_decode(output_tokens_id)

        print(f'Inference time: {time.time()-st} s')
        print(f'Output:\n{output}')

    if model_family in ['llama', 'gptneox', 'bloom']:
        # Option 3: fast forward
        # Note that currently the BLOOM family model only supports the fast forward inference method
        print('-'*20, ' fast forward ', '-'*20)
        st = time.time()

        output = llm(prompt, # please note that the prompt here can ONLY be a string
                     max_tokens=32)

        print(f'Inference time (fast forward): {time.time()-st} s')
        print(f'Output:\n{output}')


def main():
    parser = argparse.ArgumentParser(description='INT4 pipeline example')
    parser.add_argument('--thread-num', type=int, default=2, required=True,
                        help='Number of threads to use for inference')
    parser.add_argument('--model-family', type=str, default='llama', required=True,
                        help="The model family of the large language model (supported options: 'llama', "
                             "'gptneox', 'bloom')")
    parser.add_argument('--repo-id-or-model-path', type=str,
                        help='The huggingface repo id for the large language model to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default='Q: What is CPU? A:',
                        help='Prompt to infer')
    args = parser.parse_args()

    repo_id_or_model_path = args.repo_id_or_model_path
    if args.repo_id_or_model_path is None:
        if args.model_family == 'llama':
            repo_id_or_model_path = 'decapoda-research/llama-7b-hf'
        elif args.model_family == 'gptneox':
            repo_id_or_model_path = 'togethercomputer/RedPajama-INCITE-7B-Chat'
        elif args.model_family == 'bloom':
            repo_id_or_model_path = 'bigscience/bloomz-7b1'

    # Step 1: convert and load int4 model
    llm = convert_and_load(repo_id_or_model_path=repo_id_or_model_path,
                           model_family=args.model_family,
                           n_threads=args.thread_num)

    # Step 2: conduct inference
    inference(llm=llm,
              repo_id_or_model_path=repo_id_or_model_path,
              model_family=args.model_family,
              prompt=args.prompt)


if __name__ == '__main__':
    main()