LLM: add readme for transformer examples (#8444)

binbin Deng 2023-07-04 17:25:58 +08:00 committed by GitHub
parent e3e95e92ca
commit 1970bcf14e
3 changed files with 170 additions and 3 deletions


@@ -0,0 +1,127 @@
# BigDL-LLM Native INT4 Inference Pipeline for Large Language Model
In this example, we show a pipeline to convert a large language model to BigDL-LLM native INT4 format, and then run inference on the converted INT4 model.
> **Note**: BigDL-LLM native INT4 format currently supports the LLaMA, GPT-NeoX, BLOOM and StarCoder model families.
## Prepare Environment
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
conda activate llm
pip install --pre --upgrade bigdl-llm[all]
```
## Run Example
```bash
python ./native_int4_pipeline.py --thread-num THREAD_NUM --model-family MODEL_FAMILY --repo-id-or-model-path MODEL_PATH
```
arguments info:
- `--thread-num THREAD_NUM`: **required** argument defining the number of threads to use for inference. It defaults to `2`.
- `--model-family MODEL_FAMILY`: **required** argument defining the model family of the large language model (supported options: `'llama'`, `'gptneox'`, `'bloom'`, `'starcoder'`). It defaults to `'llama'`.
- `--repo-id-or-model-path MODEL_PATH`: **required** argument defining the path to the huggingface checkpoint folder for the model.
> **Note**: `MODEL_PATH` should match the `MODEL_FAMILY` you input.
- `--prompt PROMPT`: optional argument defining the prompt used for inference. It defaults to `'Q: What is CPU? A:'`.
- `--tmp-path TMP_PATH`: optional argument defining the path used to store intermediate models during the conversion process. It defaults to `'/tmp'`.
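Under the hood, `native_int4_pipeline.py` does two things: it converts the Hugging Face checkpoint to a native INT4 binary, then loads that binary for inference. The sketch below illustrates the idea for the LLaMA family; the import paths and signatures (`llm_convert`, the family-specific `Llama` class, `n_threads`, `max_tokens`) are assumptions based on the bigdl-llm documentation of this period, so treat the example script as the source of truth.
```python
# Minimal sketch of the native INT4 pipeline (LLaMA family).
# NOTE: the APIs below are assumptions; see native_int4_pipeline.py for exact usage.
from bigdl.llm import llm_convert       # convert a HF checkpoint to native INT4 (assumed helper)
from bigdl.llm.models import Llama      # family-specific native model class (assumed import path)

# 1. Convert the Hugging Face checkpoint to a native INT4 binary (e.g. bigdl_llm_llama_q4_0.bin).
bigdl_llm_path = llm_convert(model='/path/to/llama-7b-hf',
                             outfile='./',
                             outtype='int4',
                             model_family='llama')

# 2. Load the converted binary and run inference (the "fast forward" path).
llm = Llama(bigdl_llm_path, n_threads=2)
result = llm('Q: What is CPU? A:', max_tokens=32)   # returns a text_completion dict
print(result['choices'][0]['text'])
```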
## Sample Output for Inference
### Model family LLaMA
```log
-------------------- bigdl-llm based tokenizer --------------------
Inference time: xxxx s
Output:
[' It stands for Central Processing Unit. Its the part of your computer that does the actual computing, or calculating. The first computers were all about adding machines']
-------------------- HuggingFace transformers tokenizer --------------------
Please note that the loading of HuggingFace transformers tokenizer may take some time.
Inference time: xxxx s
Output:
['Central Processing Unit (CPU) is the main component of a computer system, also known as microprocessor. It executes the instructions of software programmes (also']
-------------------- fast forward --------------------
bigdl-llm timings: load time = xxxx ms
bigdl-llm timings: sample time = xxxx ms / 32 runs ( xxxx ms per token)
bigdl-llm timings: prompt eval time = xxxx ms / 9 tokens ( xxxx ms per token)
bigdl-llm timings: eval time = xxxx ms / 31 runs ( xxxx ms per token)
bigdl-llm timings: total time = xxxx ms
Inference time (fast forward): xxxx s
Output:
{'id': 'cmpl-c87e5562-281a-4837-8665-7b122948e0e8', 'object': 'text_completion', 'created': 1688368515, 'model': './bigdl_llm_llama_q4_0.bin', 'choices': [{'text': ' CPU stands for Central Processing Unit. This means that the processors in your computer are what make it run, so if you have a Pentium 4', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 9, 'completion_tokens': 32, 'total_tokens': 41}}
```
### Model family GPT-NeoX
```log
-------------------- bigdl-llm based tokenizer --------------------
Inference time: xxxx s
Output:
[' Central processing unit, also known as processor, is a specialized microchip designed to execute all the instructions of computer programs rapidly and efficiently. Most personal computers have one or']
-------------------- HuggingFace transformers tokenizer --------------------
Please note that the loading of HuggingFace transformers tokenizer may take some time.
Inference time: xxxx s
Output:
[' The Central Processing Unit, or CPU, is the component of a computer that executes all instructions for carrying out different functions. It is the brains of the operation, and']
-------------------- fast forward --------------------
Gptneox.generate: prefix-match hit
gptneox_print_timings: load time = xxxx ms
gptneox_print_timings: sample time = xxxx ms / 32 runs ( xxxx ms per run)
gptneox_print_timings: prompt eval time = xxxx ms / 8 tokens ( xxxx ms per token)
gptneox_print_timings: eval time = xxxx ms / 31 runs ( xxxx ms per run)
gptneox_print_timings: total time = xxxx ms
Inference time (fast forward): xxxx s
Output:
{'id': 'cmpl-a20fc4a1-3a00-4e77-a6cf-0dd0da6b9a59', 'object': 'text_completion', 'created': 1686557799, 'model': './bigdl_llm_gptneox_q4_0.bin', 'choices': [{'text': ' Core Processing Unit or Central Processing Unit is the brain of your computer, system software runs on it and handles all important tasks in your computer. i', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 9, 'completion_tokens': 32, 'total_tokens': 41}}
```
### Model family BLOOM
```log
-------------------- bigdl-llm based tokenizer --------------------
Inference time: xxxx s
Output:
[' Central Processing Unit</s>The present invention relates to a method of manufacturing an LED device, and more particularly to the manufacture of high-powered LED devices. The inventive']
-------------------- HuggingFace transformers tokenizer --------------------
Please note that the loading of HuggingFace transformers tokenizer may take some time.
Inference time: xxxx s
Output:
[' Central Processing Unit</s>The present invention relates to a method of manufacturing an LED device, and more particularly to the manufacture of high-powered LED devices. The inventive']
-------------------- fast forward --------------------
inference: mem per token = 24471324 bytes
inference: sample time = xxxx ms
inference: evel prompt time = xxxx ms / 1 tokens / xxxx ms per token
inference: predict time = xxxx ms / 4 tokens / xxxx ms per token
inference: total time = xxxx ms
Inference time (fast forward): xxxx s
Output:
{'id': 'cmpl-4ec29030-f0c4-43d6-80b0-5f5fb76c169d', 'object': 'text_completion', 'created': 1687852341, 'model': './bigdl_llm_bloom_q4_0.bin', 'choices': [{'text': ' the Central Processing Unit</s>', 'index': 0, 'logprobs': None, 'finish_reason': None}], 'usage': {'prompt_tokens': 6, 'completion_tokens': 5, 'total_tokens': 11}}
```
### Model family StarCoder
```log
-------------------- bigdl-llm based tokenizer --------------------
Inference time: xxxx s
Output:
[' 2.56 GHz, 2.56 GHz, 2.56 GHz, 2.56 GHz, ']
-------------------- HuggingFace transformers tokenizer --------------------
Please note that the loading of HuggingFace transformers tokenizer may take some time.
Inference time: xxxx s
Output:
[' 2.56 GHz, 2.56 GHz, 2.56 GHz, 2.56 GHz, ']
-------------------- fast forward --------------------
bigdl-llm: mem per token = 313720 bytes
bigdl-llm: sample time = xxxx ms
bigdl-llm: evel prompt time = xxxx ms
bigdl-llm: predict time = xxxx ms / 31 tokens / xxxx ms per token
bigdl-llm: total time = xxxx ms
Inference time (fast forward): xxxx s
Output:
{'id': 'cmpl-72bc4d13-d8c9-4bcb-b3f4-50a69863d534', 'object': 'text_completion', 'created': 1687852580, 'model': './bigdl_llm_starcoder_q4_0.bin', 'choices': [{'text': ' 0.50, B: 0.25, C: 0.125, D: 0.0625', 'index': 0, 'logprobs': None, 'finish_reason': None}], 'usage': {'prompt_tokens': 8, 'completion_tokens': 32, 'total_tokens': 40}}
```


@@ -25,7 +25,7 @@ if __name__ == '__main__':
     parser = argparse.ArgumentParser(description='Transformer INT4 example')
     parser.add_argument('--repo-id-or-model-path', type=str, default="decapoda-research/llama-7b-hf",
                         choices=['decapoda-research/llama-7b-hf', 'THUDM/chatglm-6b'],
-                        help='The huggingface repo id for the larga language model to be downloaded'
+                        help='The huggingface repo id for the large language model to be downloaded'
                              ', or the path to the huggingface checkpoint folder')
     args = parser.parse_args()
     model_path = args.repo_id_or_model_path
@@ -43,7 +43,8 @@ if __name__ == '__main__':
         output = model.generate(input_ids, do_sample=False, max_new_tokens=32)
         output_str = tokenizer.decode(output[0], skip_special_tokens=True)
         end = time.time()
-        print(output_str)
+        print('Prompt:', input_str)
+        print('Output:', output_str)
         print(f'Inference time: {end-st} s')
     elif model_path == 'THUDM/chatglm-6b':
         model = AutoModel.from_pretrained(model_path, trust_remote_code=True, load_in_4bit=True)
@@ -57,5 +58,6 @@ if __name__ == '__main__':
         output = model.generate(input_ids, do_sample=False, max_new_tokens=32)
         output_str = tokenizer.decode(output[0], skip_special_tokens=True)
         end = time.time()
-        print(output_str)
+        print('Prompt:', input_str)
+        print('Output:', output_str)
         print(f'Inference time: {end-st} s')


@@ -0,0 +1,38 @@
# BigDL-LLM Transformers INT4 Inference Pipeline for Large Language Model
In this example, we show a pipeline to apply BigDL-LLM INT4 optimizations to any Hugging Face Transformers model, and then run inference on the optimized INT4 model.
## Prepare Environment
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
conda activate llm
pip install --pre --upgrade bigdl-llm[all]
```
## Run Example
```bash
python ./transformers_int4_pipeline.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH
```
arguments info:
- `--repo-id-or-model-path MODEL_PATH`: argument defining the huggingface repo id for the large language model to be downloaded, or the path to the huggingface checkpoint folder.
> **Note**: In this example, `--repo-id-or-model-path MODEL_PATH` is limited to one of `['decapoda-research/llama-7b-hf', 'THUDM/chatglm-6b']` to better demonstrate English and Chinese support. It defaults to `'decapoda-research/llama-7b-hf'`.
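As the diff above shows, the core of `transformers_int4_pipeline.py` is simply loading the checkpoint through BigDL-LLM's Transformers-style API with `load_in_4bit=True` and then running a standard `generate` call. The sketch below illustrates the LLaMA path; the exact class names (`AutoModelForCausalLM`, `LlamaTokenizer`) are assumptions, so refer to the example script for the authoritative usage.
```python
# Minimal sketch of the Transformers INT4 pipeline (LLaMA path).
# NOTE: class names are assumptions; see transformers_int4_pipeline.py for exact usage.
import time
from bigdl.llm.transformers import AutoModelForCausalLM  # assumed drop-in class with INT4 support
from transformers import LlamaTokenizer

model_path = 'decapoda-research/llama-7b-hf'

# Load the model with BigDL-LLM INT4 optimizations applied at load time.
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
tokenizer = LlamaTokenizer.from_pretrained(model_path)

input_str = 'Once upon a time, there existed a little girl who liked to have adventures.'
input_ids = tokenizer.encode(input_str, return_tensors='pt')

st = time.time()
output = model.generate(input_ids, do_sample=False, max_new_tokens=32)
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
end = time.time()

print('Prompt:', input_str)
print('Output:', output_str)
print(f'Inference time: {end-st} s')
```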
## Sample Output for Inference
### 'decapoda-research/llama-7b-hf' Model
```log
Prompt: Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun
Output: Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun. She wanted to be a hero. She wanted to be a hero, but she didn't know how. She didn't know how to be a
Inference time: xxxx s
```
### 'THUDM/chatglm-6b' Model
```log
Prompt: 晚上睡不着应该怎么办
Output: 晚上睡不着应该怎么办 晚上睡不着可能会让人感到焦虑和不安,但以下是一些可能有用的建议:
1. 放松身体和思维:尝试进行深呼吸、渐进性
Inference time: xxxx s
```