LLM: add CPU More-Data-Types and Save-Load examples (#9179)
parent c0497ab41b
commit d946bd7c55
6 changed files with 268 additions and 0 deletions

@@ -0,0 +1,54 @@
# BigDL-LLM Low Bit Optimization for Large Language Model

In this example, we show how to apply BigDL-LLM low-bit optimizations (including INT8/INT5/INT4) to a Llama2 model, and then run inference on the optimized low-bit model.

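As a rough guide to why the bit width matters, the arithmetic below estimates the memory taken by the weights alone at different widths. It is only a back-of-the-envelope sketch: the parameter count `N_PARAMS` and the 16-bit baseline are assumptions for illustration, and the estimate ignores quantization scales, offsets, and any layers kept in higher precision.

```python
def approx_weight_bytes(n_params, bits_per_weight):
    """Weights-only estimate: parameters times bits per weight, in bytes."""
    return n_params * bits_per_weight / 8


# Nominal parameter count, assumed for illustration only.
N_PARAMS = 7_000_000_000

for name, bits in [("16-bit baseline", 16), ("INT8", 8), ("INT5", 5), ("INT4", 4)]:
    gib = approx_weight_bytes(N_PARAMS, bits) / 2**30
    print(f"{name}: about {gib:.1f} GiB of weights")
```

The real footprint depends on how the library stores its low-bit blocks, so treat these numbers as a lower bound rather than a measurement.
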
## 0. Requirements
To run this example with BigDL-LLM, we have some recommended requirements for your machine; please refer to [here](../../README.md#system-support) for more information.

## Example: Load Model in Low-Bit Optimization
In the example [generate.py](./generate.py), we show a basic use case of low-bit optimizations (including INT8/INT5/INT4) on a Llama2 model to predict the next N tokens using the `generate()` API. By specifying the `--low-bit` argument, you can apply a different low-bit optimization (e.g. INT4/INT5) to the model.
### 1. Install
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
conda activate llm

pip install --pre --upgrade bigdl-llm[all] # install bigdl-llm with 'all' option
```

### 2. Run
The following command loads the model with symmetric INT8 optimization:
```
python ./generate.py --low-bit sym_int8
```
In the example, several arguments can be passed to satisfy your requirements:

- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: the huggingface repo id of the Llama2 model to be downloaded (e.g. `meta-llama/Llama-2-7b-chat-hf` and `meta-llama/Llama-2-13b-chat-hf`), or the path to the huggingface checkpoint folder. The default is `'meta-llama/Llama-2-7b-chat-hf'`.
- `--low-bit`: the low-bit optimization data type; options are `sym_int4`, `asym_int4`, `sym_int5`, `asym_int5` or `sym_int8` (`sym_int4` means symmetric INT4, `asym_int4` means asymmetric INT4, etc.). The corresponding low-bit optimization is applied to the model. The default is `sym_int8`.
- `--prompt PROMPT`: the prompt to run inference on (wrapped in the integrated prompt format for chat). The default is `'What is AI?'`.
- `--n-predict N_PREDICT`: the maximum number of tokens to predict. The default is `32`.

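To make the `sym_*` versus `asym_*` distinction concrete, here is a minimal, generic sketch of symmetric and asymmetric integer quantization on a short list of floats. It is not BigDL-LLM's implementation (the library's actual kernels and block layout are not shown in this commit); all function names below are hypothetical and exist only for illustration.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integers in [-qmax, qmax] using a single scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize_symmetric(q, scale):
    return [x * scale for x in q]


def quantize_asymmetric(values, bits):
    """Map floats to unsigned integers in [0, qmax] using a scale and an offset."""
    qmax = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = [max(0, min(qmax, round((v - lo) / scale))) for v in values]
    return q, scale, lo


def dequantize_asymmetric(q, scale, lo):
    return [x * scale + lo for x in q]


# Toy "weights" that are skewed to one side of zero.
weights = [0.10, 0.25, 0.40, 0.55, 0.90]

for bits in (4, 5, 8):
    qs, s = quantize_symmetric(weights, bits)
    qa, sa, lo = quantize_asymmetric(weights, bits)
    err_sym = max(abs(a - b) for a, b in zip(weights, dequantize_symmetric(qs, s)))
    err_asym = max(abs(a - b) for a, b in zip(weights, dequantize_asymmetric(qa, sa, lo)))
    print(f"int{bits}: symmetric max error {err_sym:.4f}, asymmetric max error {err_asym:.4f}")
```

The symmetric scheme stores only a scale, while the asymmetric one also stores an offset, which lets it spend its integer range on the interval the values actually occupy. Fewer bits mean a coarser grid in both cases.
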
### 3. Sample Output
#### [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
```log
Inference time: xxxx s
-------------------- Output --------------------
### HUMAN:
What is AI?

### RESPONSE:

AI is a term used to describe the development of computer systems that can perform tasks that typically require human intelligence, such as understanding natural language, recognizing images
```

#### [meta-llama/Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf)
```log
Inference time: xxxx s
-------------------- Output --------------------
### HUMAN:
What is AI?

### RESPONSE:

AI, or artificial intelligence, refers to the ability of machines to perform tasks that would normally require human intelligence, such as learning, problem-solving,
```

@@ -0,0 +1,71 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import torch
import time
import argparse

from transformers import AutoModelForCausalLM, LlamaTokenizer
from bigdl.llm import optimize_model

# You can tune the prompt for your own model;
# the prompt format here follows https://huggingface.co/georgesung/llama2_7b_chat_uncensored#prompt-style
LLAMA2_PROMPT_FORMAT = """### HUMAN:
{prompt}

### RESPONSE:
"""

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Example of applying low-bit optimizations on model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="meta-llama/Llama-2-7b-chat-hf",
                        help='The huggingface repo id for the Llama2 (e.g. `meta-llama/Llama-2-7b-chat-hf` and `meta-llama/Llama-2-13b-chat-hf`) to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--low-bit', type=str, default="sym_int8",
                        choices=['sym_int4', 'asym_int4', 'sym_int5', 'asym_int5', 'sym_int8'],
                        help='The quantization type the model will convert to.')
    parser.add_argument('--prompt', type=str, default="What is AI?",
                        help='Prompt to infer')
    parser.add_argument('--n-predict', type=int, default=32,
                        help='Max tokens to predict')

    args = parser.parse_args()
    model_path = args.repo_id_or_model_path
    low_bit = args.low_bit

    # Load model
    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)

    # Enable BigDL-LLM optimization on the model with a single line.
    # The `low_bit` param supports `sym_int4`, `asym_int4`, `sym_int5`, `asym_int5` and `sym_int8`.
    # The low-bit optimization named by `low_bit` is applied to the model.
    model = optimize_model(model, low_bit=low_bit)

    # Load tokenizer
    tokenizer = LlamaTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Generate predicted tokens
    with torch.inference_mode():
        prompt = LLAMA2_PROMPT_FORMAT.format(prompt=args.prompt)
        input_ids = tokenizer.encode(prompt, return_tensors="pt")
        st = time.time()
        output = model.generate(input_ids,
                                max_new_tokens=args.n_predict)
        end = time.time()
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
        print(f'Inference time: {end-st} s')
        print('-'*20, 'Output', '-'*20)
        print(output_str)

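The sample output in the README shows that the decoded text echoes the filled-in prompt before the model's answer. The sketch below reproduces how the template is filled and shows one possible way to keep only the answer. `build_prompt` and `extract_response` are hypothetical helpers that are not part of the script, and the `decoded` string is a stand-in for what `tokenizer.decode(...)` would return.

```python
LLAMA2_PROMPT_FORMAT = """### HUMAN:
{prompt}

### RESPONSE:
"""


def build_prompt(user_prompt):
    """Fill the chat template the same way the script does."""
    return LLAMA2_PROMPT_FORMAT.format(prompt=user_prompt)


def extract_response(decoded_output):
    """Return the text after the '### RESPONSE:' marker, or the whole text if absent."""
    marker = "### RESPONSE:"
    _, found, tail = decoded_output.partition(marker)
    return tail.strip() if found else decoded_output.strip()


prompt = build_prompt("What is AI?")
# Stand-in for the decoded model output: the prompt followed by generated text.
decoded = prompt + "\nAI is a term used to describe ..."
print(extract_response(decoded))
```

Splitting on the marker is a simple choice that works for this template; a different prompt style would need a different marker.
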
62
python/llm/example/CPU/PyTorch-Models/Save-Load/README.md
Normal file
@@ -0,0 +1,62 @@
# Save/Load Low-Bit Models with BigDL-LLM Optimizations

In this example, we show how to save and load a model with BigDL-LLM low-bit optimizations (including INT8/INT5/INT4), and then run inference on the optimized low-bit model.

## 0. Requirements
To run this example with BigDL-LLM, we have some recommended requirements for your machine; please refer to [here](../../README.md#system-support) for more information.

## Example: Save/Load Model in Low-Bit Optimization
In the example [generate.py](./generate.py), we show a basic use case of saving and loading a model in low-bit optimizations to predict the next N tokens using the `generate()` API. Saving and loading are platform-independent, so you can save the model on one platform and load it on another.
### 1. Install
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
conda activate llm

pip install --pre --upgrade bigdl-llm[all] # install bigdl-llm with 'all' option
```

### 2. Run
If you want to save the optimized low-bit model, run:
```
python ./generate.py --save-path path/to/save/model
```

If you want to load the optimized low-bit model, run:
```
python ./generate.py --load-path path/to/load/model
```

In the example, several arguments can be passed to satisfy your requirements:

- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: the huggingface repo id of the Llama2 model to be downloaded (e.g. `meta-llama/Llama-2-7b-chat-hf` and `meta-llama/Llama-2-13b-chat-hf`), or the path to the huggingface checkpoint folder. The default is `'meta-llama/Llama-2-7b-chat-hf'`.
- `--low-bit`: the low-bit optimization data type; options are `sym_int4`, `asym_int4`, `sym_int5`, `asym_int5` or `sym_int8` (`sym_int4` means symmetric INT4, `asym_int4` means asymmetric INT4, etc.). The corresponding low-bit optimization is applied to the model. The default is `sym_int4`.
- `--save-path`: the path where the low-bit model is saved. The saved low-bit model can later be loaded directly with `--load-path`.
- `--load-path`: the path from which a previously saved low-bit model is loaded.
- `--prompt PROMPT`: the prompt to run inference on (wrapped in the integrated prompt format for chat). The default is `'What is AI?'`.
- `--n-predict N_PREDICT`: the maximum number of tokens to predict. The default is `32`.

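The idea behind `--save-path` and `--load-path` is to quantize once and reuse the low-bit result afterwards. The toy round trip below illustrates that idea with plain Python and JSON. It is only a conceptual sketch: the file name, the JSON layout, and the helper names are all assumptions made for illustration and say nothing about the on-disk format BigDL-LLM actually uses.

```python
import json
import os
import tempfile


def quantize_sym_int8(values):
    """Toy symmetric INT8 quantization: integers plus one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return {"scale": scale, "q": [round(v / scale) for v in values]}


def save_toy(state, path):
    """Write the already-quantized state to a directory (hypothetical layout)."""
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "weights.json"), "w") as f:
        json.dump(state, f)


def load_toy(path):
    """Read the quantized state back without touching the original floats."""
    with open(os.path.join(path, "weights.json")) as f:
        return json.load(f)


weights = [0.5, -1.0, 0.25]
state = quantize_sym_int8(weights)

with tempfile.TemporaryDirectory() as tmp:
    save_toy(state, tmp)
    restored = load_toy(tmp)

print("round trip preserved the quantized state:", restored == state)
```

Because only the integers and the scale are stored, loading skips the quantization step entirely, which is the saving the `--load-path` option is meant to provide.
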
### 3. Sample Output
#### [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
```log
Inference time: xxxx s
-------------------- Output --------------------
### HUMAN:
What is AI?

### RESPONSE:

AI is a term used to describe the development of computer systems that can perform tasks that typically require human intelligence, such as understanding natural language, recognizing images
```

#### [meta-llama/Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf)
```log
Inference time: xxxx s
-------------------- Output --------------------
### HUMAN:
What is AI?

### RESPONSE:

AI, or artificial intelligence, refers to the ability of machines to perform tasks that would typically require human intelligence, such as learning, problem-solving,
```

81
python/llm/example/CPU/PyTorch-Models/Save-Load/generate.py
Normal file
@@ -0,0 +1,81 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import torch
import time
import argparse
from bigdl.llm import optimize_model
from bigdl.llm.optimize import low_memory_init, load_low_bit
from transformers import AutoModelForCausalLM, LlamaTokenizer

# You can tune the prompt for your own model;
# the prompt format here follows https://huggingface.co/georgesung/llama2_7b_chat_uncensored#prompt-style
LLAMA2_PROMPT_FORMAT = """### HUMAN:
{prompt}

### RESPONSE:
"""

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Example of saving and loading the optimized model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="meta-llama/Llama-2-7b-chat-hf",
                        help='The huggingface repo id for the Llama2 (e.g. `meta-llama/Llama-2-7b-chat-hf` and `meta-llama/Llama-2-13b-chat-hf`) to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--low-bit', type=str, default="sym_int4",
                        choices=['sym_int4', 'asym_int4', 'sym_int5', 'asym_int5', 'sym_int8'],
                        help='The quantization type the model will convert to.')
    parser.add_argument('--save-path', type=str, default=None,
                        help='The path to save the low-bit model.')
    parser.add_argument('--load-path', type=str, default=None,
                        help='The path to load the low-bit model.')
    parser.add_argument('--prompt', type=str, default="What is AI?",
                        help='Prompt to infer')
    parser.add_argument('--n-predict', type=int, default=32,
                        help='Max tokens to predict')
    args = parser.parse_args()
    model_path = args.repo_id_or_model_path
    low_bit = args.low_bit
    load_path = args.load_path
    if load_path:
        # Fast and low-cost: initialize the model on the meta device, then load the saved low-bit weights
        with low_memory_init():
            model = AutoModelForCausalLM.from_pretrained(load_path, torch_dtype="auto", trust_remote_code=True)
        model = load_low_bit(model, load_path)
        tokenizer = LlamaTokenizer.from_pretrained(load_path)
    else:
        model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
        model = optimize_model(model, low_bit=low_bit)
        tokenizer = LlamaTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Generate predicted tokens
    with torch.inference_mode():
        prompt = LLAMA2_PROMPT_FORMAT.format(prompt=args.prompt)
        input_ids = tokenizer.encode(prompt, return_tensors="pt")
        st = time.time()
        output = model.generate(input_ids,
                                max_new_tokens=args.n_predict)
        end = time.time()
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
        print(f'Inference time: {end-st} s')
        print('-'*20, 'Output', '-'*20)
        print(output_str)


    save_path = args.save_path
    if save_path:
        model.save_low_bit(save_path)
        tokenizer.save_pretrained(save_path)
        print(f"Model and tokenizer are saved to {save_path}")
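The two branches of the script can also be read as two separate steps: quantize and save once, then load the saved copy in later runs. The sketch below wraps the same calls the script makes into two functions. The function names `quantize_and_save` and `load_saved` are hypothetical, and the imports are deferred into the function bodies so that merely defining the helpers does not require `transformers` or `bigdl-llm` to be installed; calling them does.

```python
def quantize_and_save(model_path, save_path, low_bit="sym_int4"):
    """Quantize a model once and persist the low-bit weights plus tokenizer."""
    # Deferred imports: only needed when the function is actually called.
    from transformers import AutoModelForCausalLM, LlamaTokenizer
    from bigdl.llm import optimize_model

    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
    model = optimize_model(model, low_bit=low_bit)
    tokenizer = LlamaTokenizer.from_pretrained(model_path, trust_remote_code=True)

    model.save_low_bit(save_path)
    tokenizer.save_pretrained(save_path)
    return model, tokenizer


def load_saved(load_path):
    """Load a previously saved low-bit model and its tokenizer."""
    from transformers import AutoModelForCausalLM, LlamaTokenizer
    from bigdl.llm.optimize import low_memory_init, load_low_bit

    # Initialize the model on the meta device, as the script does.
    with low_memory_init():
        model = AutoModelForCausalLM.from_pretrained(
            load_path, torch_dtype="auto", trust_remote_code=True
        )
    # Attach the saved low-bit weights.
    model = load_low_bit(model, load_path)
    tokenizer = LlamaTokenizer.from_pretrained(load_path)
    return model, tokenizer
```

A typical use would call `quantize_and_save` on the machine that has the original checkpoint and `load_saved` wherever inference runs, mirroring the `--save-path` and `--load-path` invocations in the README.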