LLM: add llama2-7b native int4 example (#8629)
This commit is contained in:
parent fb32fefcbe
commit 3dbab9087b
2 changed files with 31 additions and 28 deletions

@@ -2,7 +2,7 @@
In this example, we show a pipeline to convert a large language model to BigDL-LLM native INT4 format, and then run inference on the converted INT4 model.
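In outline, the pipeline has two steps: convert the Hugging Face checkpoint into a native INT4 (q4_0) `.bin` file, then load that converted file and generate from it. The following is a pseudocode-style sketch only — the `llm_convert` and `Llama` names and their parameters are illustrative assumptions, and the example script in this folder shows the authoritative usage.

```python
# Pseudocode sketch -- the API names below are assumptions for illustration,
# not verified imports; consult the example script for the real usage.
from bigdl.llm import llm_convert          # hypothetical conversion helper
from bigdl.llm.models import Llama         # hypothetical native-format model class

# Step 1: convert the checkpoint to BigDL-LLM native INT4 (q4_0) format.
int4_model_path = llm_convert(
    model="/path/to/Llama-2-7b-chat-hf/",  # corresponds to --repo-id-or-model-path
    outfile="./",                          # where the converted .bin is written
    outtype="int4",
    model_family="llama",                  # must match the checkpoint's family
)

# Step 2: load the converted INT4 model and run inference on it.
llm = Llama(model_path=int4_model_path)
result = llm("Once upon a time, there existed a little girl who liked to have adventures. ",
             max_tokens=32)
print(result["choices"][0]["text"])
```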

-> **Note**: BigDL-LLM native INT4 format currently supports model family **LLaMA** (such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.), **LLaMA 2** (such as Llama-2-13B), **GPT-NeoX** (such as RedPajama), **BLOOM** (such as Phoenix) and **StarCoder**.
+> **Note**: BigDL-LLM native INT4 format currently supports model family **LLaMA** (such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.), **LLaMA 2** (such as Llama-2-7B-chat, Llama-2-13B-chat), **GPT-NeoX** (such as RedPajama), **BLOOM** (such as Phoenix) and **StarCoder**.

## Prepare Environment

We suggest using conda to manage the environment:

@@ -23,48 +23,49 @@ arguments info:
- `--repo-id-or-model-path MODEL_PATH`: **required** argument defining the path to the huggingface checkpoint folder for the model.

> **Note**: `MODEL_PATH` should match the `MODEL_FAMILY` you input.
-- `--prompt PROMPT`: optional argument defining the prompt to be inferred. It defaults to `'Q: What is CPU? A:'`.
+- `--prompt PROMPT`: optional argument defining the prompt to be inferred. It defaults to `'Once upon a time, there existed a little girl who liked to have adventures. '`.
- `--tmp-path TMP_PATH`: optional argument defining the path to store the intermediate model during the conversion process. It defaults to `'/tmp'`.
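As the note above says, the checkpoint you pass must belong to the model family you select. A small, hypothetical helper (not part of the example script) can make that pairing concrete, using only the families this README names:

```python
# Hypothetical helper, not part of the example script: guess the
# --model-family value from a checkpoint path. The substring hints mirror
# the model families listed in this README.
def guess_model_family(model_path: str) -> str:
    name = model_path.lower()
    hints = {
        "llama": "llama",        # covers both LLaMA and LLaMA 2 checkpoints
        "vicuna": "llama",
        "redpajama": "gptneox",
        "gptneox": "gptneox",
        "phoenix": "bloom",
        "bloom": "bloom",
        "starcoder": "starcoder",
    }
    for hint, family in hints.items():
        if hint in name:
            return family
    raise ValueError(f"cannot infer model family from {model_path!r}")

print(guess_model_family("meta-llama/Llama-2-7b-chat-hf"))  # llama
```

This only illustrates the pairing; the script itself trusts whatever `--model-family` you pass.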
## Sample Output for Inference

### Model family LLaMA

#### [lmsys/vicuna-13b-v1.3](https://huggingface.co/lmsys/vicuna-13b-v1.3)

```log
-------------------- bigdl-llm based tokenizer --------------------
Inference time: xxxx s
Output:
-[' It stands for Central Processing Unit. It’s the part of your computer that does the actual computing, or calculating. The first computers were all about adding machines']
+['\n She was always exploring new places and meeting new people. One day, she stumbled upon a mysterious door in the woods that led her to']
-------------------- HuggingFace transformers tokenizer --------------------
Please note that the loading of HuggingFace transformers tokenizer may take some time.

Inference time: xxxx s
Output:
-['Central Processing Unit (CPU) is the main component of a computer system, also known as microprocessor. It executes the instructions of software programmes (also']
+['\nShe had read so many stories about brave heroes and their magical journeys that she decided to set out on her own adventure. \n']
-------------------- fast forward --------------------

bigdl-llm timings: load time = xxxx ms
bigdl-llm timings: sample time = xxxx ms / 32 runs ( xxxx ms per token)
-bigdl-llm timings: prompt eval time = xxxx ms / 9 tokens ( xxxx ms per token)
-bigdl-llm timings: eval time = xxxx ms / 31 runs ( xxxx ms per token)
+bigdl-llm timings: prompt eval time = xxxx ms / 1 tokens ( xxxx ms per token)
+bigdl-llm timings: eval time = xxxx ms / 32 runs ( xxxx ms per token)
bigdl-llm timings: total time = xxxx ms
Inference time (fast forward): xxxx s
Output:
-{'id': 'cmpl-c87e5562-281a-4837-8665-7b122948e0e8', 'object': 'text_completion', 'created': 1688368515, 'model': './bigdl_llm_llama_q4_0.bin', 'choices': [{'text': ' CPU stands for Central Processing Unit. This means that the processors in your computer are what make it run, so if you have a Pentium 4', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 9, 'completion_tokens': 32, 'total_tokens': 41}}
+{'id': 'cmpl-e5811030-cc60-462b-9857-13d43e3a1896', 'object': 'text_completion', 'created': 1690450682, 'model': './bigdl_llm_llama_q4_0.bin', 'choices': [{'text': '\nShe was a curious and brave child, always eager to explore the world around her. She loved nothing more than setting off into the woods or down to the', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 19, 'completion_tokens': 32, 'total_tokens': 51}}
```

### Model family LLaMA 2

#### [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)

```log
-------------------- bigdl-llm based tokenizer --------------------
Inference time: xxxx s
Output:
-[' The CPU (Central Processing Unit) is the brain of your computer. It is responsible for executing most instructions that your computer receives from the operating system and']
+[' She lived in a small village surrounded by vast fields of golden wheat and blue skies. One day, she decided to go on an adventure to']
-------------------- HuggingFace transformers tokenizer --------------------
Please note that the loading of HuggingFace transformers tokenizer may take some time.

You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
Llama.generate: prefix-match hit
Inference time: xxxx s
Output:
-['Central Processing Unit (CPU) is the brain of any computer system. It performs all the calculations and executes all the instructions that are given to it by']
+['She was so curious and eager to explore the world around her that she would often find herself in unexpected situations. \nOne day, while wandering through the']
-------------------- fast forward --------------------
Llama.generate: prefix-match hit

@@ -75,80 +76,82 @@ bigdl-llm timings: eval time = xxxx ms / 32 runs ( xxxx ms per token)
bigdl-llm timings: total time = xxxx ms
Inference time (fast forward): xxxx s
Output:
-{'id': 'cmpl-680b5482-2ce8-4a04-a799-41845aa76939', 'object': 'text_completion', 'created': 1690275575, 'model': './bigdl_llm_llama_q4_0.bin', 'choices': [{'text': ' CPU stands for Central Processing Unit. It is the brain of any computer, responsible for executing most instructions that make up a computer program. The CPU retrieves', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 9, 'completion_tokens': 32, 'total_tokens': 41}}
+{'id': 'cmpl-556b831b-749f-4b06-801e-c920620cb8f5', 'object': 'text_completion', 'created': 1690449478, 'model': './bigdl_llm_llama_q4_0.bin', 'choices': [{'text': ' She lived in a small village at the edge of a big forest, surrounded by tall trees and sparkling streams. One day, while wandering around the', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 19, 'completion_tokens': 32, 'total_tokens': 51}}
```

### Model family GPT-NeoX

#### [togethercomputer/RedPajama-INCITE-7B-Chat](https://huggingface.co/togethercomputer/RedPajama-INCITE-7B-Chat)

```log
-------------------- bigdl-llm based tokenizer --------------------
Inference time: xxxx s
Output:
-[' Central processing unit, also known as processor, is a specialized microchip designed to execute all the instructions of computer programs rapidly and efficiently. Most personal computers have one or']
+['\nThis was no surprise since her mom and dad both loved adventure too. But what really stood out about this little girl is that she loved the stories! Her']
-------------------- HuggingFace transformers tokenizer --------------------
Please note that the loading of HuggingFace transformers tokenizer may take some time.

Inference time: xxxx s
Output:
-[' The Central Processing Unit, or CPU, is the component of a computer that executes all instructions for carrying out different functions. It is the brains of the operation, and']
+['\nFirst she got lost in the woods and it took some really tough searching by her parents to find her. But they did! Then one day when she was']
-------------------- fast forward --------------------
Gptneox.generate: prefix-match hit

gptneox_print_timings: load time = xxxx ms
gptneox_print_timings: sample time = xxxx ms / 32 runs ( xxxx ms per run)
-gptneox_print_timings: prompt eval time = xxxx ms / 8 tokens ( xxxx ms per token)
+gptneox_print_timings: prompt eval time = xxxx ms / 18 tokens ( xxxx ms per token)
gptneox_print_timings: eval time = xxxx ms / 31 runs ( xxxx ms per run)
gptneox_print_timings: total time = xxxx ms
Inference time (fast forward): xxxx s
Output:
-{'id': 'cmpl-a20fc4a1-3a00-4e77-a6cf-0dd0da6b9a59', 'object': 'text_completion', 'created': 1686557799, 'model': './bigdl_llm_gptneox_q4_0.bin', 'choices': [{'text': ' Core Processing Unit or Central Processing Unit is the brain of your computer, system software runs on it and handles all important tasks in your computer. i', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 9, 'completion_tokens': 32, 'total_tokens': 41}}
+{'id': 'cmpl-8b17585d-635a-43af-94a0-bd9c19ffc5a8', 'object': 'text_completion', 'created': 1690451587, 'model': './bigdl_llm_gptneox_q4_0.bin', 'choices': [{'text': '\nOn one fine day her mother brought home an old shoe box full of toys and gave it to her daughter as she was not able to make the toy house', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 18, 'completion_tokens': 32, 'total_tokens': 50}}
```

### Model family BLOOM

#### [FreedomIntelligence/phoenix-inst-chat-7b](https://huggingface.co/FreedomIntelligence/phoenix-inst-chat-7b)

```log
-------------------- bigdl-llm based tokenizer --------------------
Inference time: xxxx s
Output:
-[' Central Processing Unit</s>The present invention relates to a method of manufacturing an LED device, and more particularly to the manufacture of high-powered LED devices. The inventive']
+[' She was always eager to explore new places and meet new people. One day, she decided to embark on an epic journey across the land of the giants']
-------------------- HuggingFace transformers tokenizer --------------------
Please note that the loading of HuggingFace transformers tokenizer may take some time.

Inference time: xxxx s
Output:
-[' Central Processing Unit</s>The present invention relates to a method of manufacturing an LED device, and more particularly to the manufacture of high-powered LED devices. The inventive']
+[' She loved exploring the world and trying new things. One day, she decided to embark on an epic journey across the land of the giants. The little']
-------------------- fast forward --------------------

-inference: mem per token = 24471324 bytes
+inference: mem per token = xxxx bytes
inference: sample time = xxxx ms
-inference: evel prompt time = xxxx ms / 1 tokens / xxxx ms per token
-inference: predict time = xxxx ms / 4 tokens / xxxx ms per token
+inference: evel prompt time = xxxx ms / 12 tokens / xxxx ms per token
+inference: predict time = xxxx ms / 31 tokens / xxxx ms per token
inference: total time = xxxx ms
Inference time (fast forward): xxxx s
Output:
-{'id': 'cmpl-4ec29030-f0c4-43d6-80b0-5f5fb76c169d', 'object': 'text_completion', 'created': 1687852341, 'model': './bigdl_llm_bloom_q4_0.bin', 'choices': [{'text': ' the Central Processing Unit</s>', 'index': 0, 'logprobs': None, 'finish_reason': None}], 'usage': {'prompt_tokens': 6, 'completion_tokens': 5, 'total_tokens': 11}}
+{'id': 'cmpl-e7039a29-dc80-4729-a446-301573a5315f', 'object': 'text_completion', 'created': 1690449783, 'model': './bigdl_llm_bloom_q4_0.bin', 'choices': [{'text': ' She had the spirit of exploration, and her adventurous nature drove her to seek out new things every day. Little did she know that her adventures would take an', 'index': 0, 'logprobs': None, 'finish_reason': None}], 'usage': {'prompt_tokens': 17, 'completion_tokens': 32, 'total_tokens': 49}}
```

### Model family StarCoder

#### [bigcode/starcoder](https://huggingface.co/bigcode/starcoder)

```log
-------------------- bigdl-llm based tokenizer --------------------
Inference time: xxxx s
Output:
-[' 2.56 GHz, 2.56 GHz, 2.56 GHz, 2.56 GHz, ']
+['\nOne day, she went on an adventure with a dragon. \nThe dragon was very angry, and he wanted to eat her.']
-------------------- HuggingFace transformers tokenizer --------------------
Please note that the loading of HuggingFace transformers tokenizer may take some time.

Inference time: xxxx s
Output:
-[' 2.56 GHz, 2.56 GHz, 2.56 GHz, 2.56 GHz, ']
+[' She was called "Alice". She was very clever, and she loved to play with puzzles. One day, she was playing with']
-------------------- fast forward --------------------

-bigdl-llm: mem per token = 313720 bytes
+bigdl-llm: mem per token = xxxx bytes
bigdl-llm: sample time = xxxx ms
-bigdl-llm: evel prompt time = xxxx ms
+bigdl-llm: evel prompt time = xxxx ms / 11 tokens / xxxx ms per token
bigdl-llm: predict time = xxxx ms / 31 tokens / xxxx ms per token
bigdl-llm: total time = xxxx ms
Inference time (fast forward): xxxx s
Output:
-{'id': 'cmpl-72bc4d13-d8c9-4bcb-b3f4-50a69863d534', 'object': 'text_completion', 'created': 1687852580, 'model': './bigdl_llm_starcoder_q4_0.bin', 'choices': [{'text': ' 0.50, B: 0.25, C: 0.125, D: 0.0625', 'index': 0, 'logprobs': None, 'finish_reason': None}], 'usage': {'prompt_tokens': 8, 'completion_tokens': 32, 'total_tokens': 40}}
+{'id': 'cmpl-d0266eb2-5e18-4fbc-bcc4-dec236f506f6', 'object': 'text_completion', 'created': 1690450075, 'model': './bigdl_llm_starcoder_q4_0.bin', 'choices': [{'text': ' She loved to play with dolls and other stuff, but she loved the most to play with cats and other dogs. She loved to', 'index': 0, 'logprobs': None, 'finish_reason': None}], 'usage': {'prompt_tokens': 21, 'completion_tokens': 32, 'total_tokens': 53}}
```

@@ -100,7 +100,7 @@ def main():
                                "'gptneox', 'bloom', 'starcoder')")
    parser.add_argument('--repo-id-or-model-path', type=str, required=True,
                        help='The path to the huggingface checkpoint folder')
-    parser.add_argument('--prompt', type=str, default='Q: What is CPU? A:',
+    parser.add_argument('--prompt', type=str, default='Once upon a time, there existed a little girl who liked to have adventures. ',
                        help='Prompt to infer')
    parser.add_argument('--tmp-path', type=str, default='/tmp',
                        help='path to store intermediate model during the conversion process')