LLM: update example layout (#9046)

binbin Deng 2023-10-09 15:36:39 +08:00 committed by GitHub
parent 4c4f8d1663
commit 5e9962b60e
118 changed files with 204 additions and 185 deletions

View file

@@ -12,8 +12,8 @@
> *It is built on top of the excellent work of [llama.cpp](https://github.com/ggerganov/llama.cpp), [ggml](https://github.com/ggerganov/ggml), [gptq](https://github.com/IST-DASLab/gptq), [bitsandbytes](https://github.com/TimDettmers/bitsandbytes), [qlora](https://github.com/artidoro/qlora), [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), [gptq_for_llama](https://github.com/qwopqwop200/GPTQ-for-LLaMa), [chatglm.cpp](https://github.com/li-plus/chatglm.cpp), [redpajama.cpp](https://github.com/togethercomputer/redpajama.cpp), [gptneox.cpp](https://github.com/byroneverson/gptneox.cpp), [bloomz.cpp](https://github.com/NouamaneTazi/bloomz.cpp/), etc.*
### Latest update
-- **[New]** `bigdl-llm` now supports QLoRA finetuning on Intel GPU; see the example [here](python/llm/example/gpu/qlora_finetuning).
+- **[New]** `bigdl-llm` now supports QLoRA finetuning on Intel GPU; see the example [here](python/llm/example/GPU/QLoRA-FineTuning).
-- `bigdl-llm` now supports Intel GPU (including Arc, Flex and MAX); see the latest GPU examples [here](python/llm/example/gpu).
+- `bigdl-llm` now supports Intel GPU (including Arc, Flex and MAX); see the latest GPU examples [here](python/llm/example/GPU).
- `bigdl-llm` tutorial is released [here](https://github.com/intel-analytics/bigdl-llm-tutorial).
- Over 20 models have been optimized/verified on `bigdl-llm`, including *LLaMA/LLaMA2, ChatGLM/ChatGLM2, MPT, Falcon, Dolly, StarCoder, Whisper, InternLM, QWen, Baichuan, Aquila, MOSS,* and more; see the complete list [here](python/llm/README.md#verified-models).
@@ -76,7 +76,7 @@ input_ids = tokenizer.encode(input_str, ...)
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids)
```
-*See the complete examples [here](python/llm/example/transformers/transformers_int4/).*
+*See the complete examples [here](python/llm/example/CPU/HF-Transformers-AutoModels/Model).*
#### GPU INT4
##### Install
@@ -105,7 +105,7 @@ input_ids = tokenizer.encode(input_str, ...).to('xpu')
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids.cpu())
```
-*See the complete examples [here](python/llm/example/gpu/).*
+*See the complete examples [here](python/llm/example/GPU).*
#### More Low-Bit Support
##### Save and load
@@ -115,7 +115,7 @@ After the model is optimized using `bigdl-llm`, you may save and load the model
model.save_low_bit(model_path)
new_model = AutoModelForCausalLM.load_low_bit(model_path)
```
-*See the complete example [here](python/llm/example/transformers/transformers_low_bit/).*
+*See the complete example [here](python/llm/example/CPU/HF-Transformers-AutoModels/Save-Load).*
##### Additional data types
@@ -123,7 +123,7 @@ In addition to INT4, you may apply other low bit optimizations (such as *INT8*,
```python
model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int8")
```
-*See the complete example [here](python/llm/example/transformers/transformers_low_bit/).*
+*See the complete example [here](python/llm/example/CPU/HF-Transformers-AutoModels/More-Data-Types).*
***For more details, please refer to the `bigdl-llm` [Document](https://test-bigdl-llm.readthedocs.io/en/main/doc/LLM/index.html), [Readme](python/llm), [Tutorial](https://github.com/intel-analytics/bigdl-llm-tutorial) and [API Doc](https://bigdl.readthedocs.io/en/latest/doc/PythonAPI/LLM/index.html).***
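Pulling the fragments above together, a minimal end-to-end sketch of the CPU INT4 flow; the model path, prompt and generation settings below are illustrative placeholders, not part of this commit:

```python
# Minimal sketch of the CPU INT4 flow excerpted above; paths and prompt are placeholders.
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = '/path/to/model/'  # placeholder: any supported Hugging Face checkpoint
# load_in_low_bit="sym_int4" applies the symmetric INT4 optimization described above
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_low_bit="sym_int4", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

input_ids = tokenizer.encode("Once upon a time", return_tensors="pt")
output_ids = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```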

View file

@@ -40,23 +40,24 @@ Over 20 models have been optimized/verified on `bigdl-llm`, including *LLaMA/LLa
| Model | Example |
|-----------|----------------------------------------------------------|
-| LLaMA *(such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.)* | [link1](example/transformers/native_int4), [link2](example/transformers/transformers_int4/vicuna) |
+| LLaMA *(such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.)* | [link1](example/CPU/Native-Models), [link2](example/CPU/HF-Transformers-AutoModels/Model/vicuna) |
-| LLaMA 2 | [link](example/transformers/transformers_int4/llama2) |
+| LLaMA 2 | [link](example/CPU/HF-Transformers-AutoModels/Model/llama2) |
-| MPT | [link](example/transformers/transformers_int4/mpt) |
+| MPT | [link](example/CPU/HF-Transformers-AutoModels/Model/mpt) |
-| Falcon | [link](example/transformers/transformers_int4/falcon) |
+| Falcon | [link](example/CPU/HF-Transformers-AutoModels/Model/falcon) |
-| ChatGLM | [link](example/transformers/transformers_int4/chatglm) |
+| ChatGLM | [link](example/CPU/HF-Transformers-AutoModels/Model/chatglm) |
-| ChatGLM2 | [link](example/transformers/transformers_int4/chatglm2) |
+| ChatGLM2 | [link](example/CPU/HF-Transformers-AutoModels/Model/chatglm2) |
-| Qwen | [link](example/transformers/transformers_int4/qwen) |
+| Qwen | [link](example/CPU/HF-Transformers-AutoModels/Model/qwen) |
-| MOSS | [link](example/transformers/transformers_int4/moss) |
+| MOSS | [link](example/CPU/HF-Transformers-AutoModels/Model/moss) |
-| Baichuan | [link](example/transformers/transformers_int4/baichuan) |
+| Baichuan | [link](example/CPU/HF-Transformers-AutoModels/Model/baichuan) |
-| Baichuan2 | [link](example/transformers/transformers_int4/baichuan2) |
+| Baichuan2 | [link](example/CPU/HF-Transformers-AutoModels/Model/baichuan2) |
-| Dolly-v1 | [link](example/transformers/transformers_int4/dolly_v1) |
+| Dolly-v1 | [link](example/CPU/HF-Transformers-AutoModels/Model/dolly_v1) |
-| Dolly-v2 | [link](example/transformers/transformers_int4/dolly_v2) |
+| Dolly-v2 | [link](example/CPU/HF-Transformers-AutoModels/Model/dolly_v2) |
-| RedPajama | [link1](example/transformers/native_int4), [link2](example/transformers/transformers_int4/redpajama) |
+| RedPajama | [link1](example/CPU/Native-Models), [link2](example/CPU/HF-Transformers-AutoModels/Model/redpajama) |
-| Phoenix | [link1](example/transformers/native_int4), [link2](example/transformers/transformers_int4/phoenix) |
+| Phoenix | [link1](example/CPU/Native-Models), [link2](example/CPU/HF-Transformers-AutoModels/Model/phoenix) |
-| StarCoder | [link1](example/transformers/native_int4), [link2](example/transformers/transformers_int4/starcoder) |
+| StarCoder | [link1](example/CPU/Native-Models), [link2](example/CPU/HF-Transformers-AutoModels/Model/starcoder) |
-| InternLM | [link](example/transformers/transformers_int4/internlm) |
+| InternLM | [link](example/CPU/HF-Transformers-AutoModels/Model/internlm) |
-| Whisper | [link](example/transformers/transformers_int4/whisper) |
+| Whisper | [link](example/CPU/HF-Transformers-AutoModels/Model/whisper) |
+| Aquila | [link](example/CPU/HF-Transformers-AutoModels/Model/aquila) |
</details>
@@ -119,7 +120,7 @@ output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids)
```
-See the complete examples [here](example/transformers/transformers_int4/).
+See the complete examples [here](example/CPU/HF-Transformers-AutoModels/Model/).
###### GPU INT4
You may apply INT4 optimizations to any Hugging Face *Transformers* model on Intel GPU as follows.
@@ -138,7 +139,7 @@ input_ids = tokenizer.encode(input_str, ...).to('xpu')
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids.cpu())
```
-See the complete examples [here](example/gpu/).
+See the complete examples [here](example/GPU).
###### More Low-Bit Support
- Save and load
@@ -148,7 +149,7 @@ See the complete examples [here](example/gpu/).
model.save_low_bit(model_path)
new_model = AutoModelForCausalLM.load_low_bit(model_path)
```
-*See the complete example [here](example/transformers/transformers_low_bit/).*
+*See the complete example [here](example/CPU/HF-Transformers-AutoModels/Save-Load).*
- Additional data types
@@ -157,7 +158,7 @@ See the complete examples [here](example/gpu/).
```python
model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int8")
```
-*See the complete example [here](example/transformers/transformers_low_bit/).*
+*See the complete example [here](example/CPU/HF-Transformers-AutoModels/More-Data-Types).*
##### 2. Native INT4 model
@@ -182,7 +183,7 @@ output_ids = llm.generate(input_ids, ...)
output = llm.batch_decode(output_ids)
```
-See the complete example [here](example/transformers/native_int4/native_int4_pipeline.py).
+See the complete example [here](example/CPU/Native-Models/native_int4_pipeline.py).
##### 3. LangChain API
You may run the models using the LangChain API in `bigdl-llm`.
@@ -202,7 +203,7 @@ You may run the models using the LangChain API in `bigdl-llm`.
doc_chain = load_qa_chain(bigdl_llm, ...)
output = doc_chain.run(...)
```
-See the examples [here](example/langchain/transformers_int4).
+See the examples [here](example/CPU/LangChain/transformers_int4).
- **Using native INT4 model**
@@ -224,7 +225,7 @@ You may run the models using the LangChain API in `bigdl-llm`.
doc_chain.run(...)
```
-See the examples [here](example/langchain/native_int4).
+See the examples [here](example/CPU/LangChain/native_int4).
##### 4. CLI Tool
>**Note**: Currently `bigdl-llm` CLI supports *LLaMA* (e.g., *vicuna*), *GPT-NeoX* (e.g., *redpajama*), *BLOOM* (e.g., *phoenix*) and *GPT2* (e.g., *starcoder*) model architectures; for other models, you may use the Hugging Face `transformers` or LangChain APIs.
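For reference, a rough sketch of the LangChain flow excerpted above; the `TransformersLLM` wrapper class and its `from_model_id` constructor are assumptions about the `bigdl-llm` LangChain integration and are not taken from this diff:

```python
# Sketch only: TransformersLLM and from_model_id are assumed names for the bigdl-llm
# LangChain wrapper; the diff above elides how `bigdl_llm` is constructed.
from langchain.chains.question_answering import load_qa_chain
from langchain.docstore.document import Document
from bigdl.llm.langchain.llms import TransformersLLM  # assumed import path

model_path = '/path/to/model/'  # placeholder
bigdl_llm = TransformersLLM.from_model_id(model_id=model_path)  # assumed constructor

docs = [Document(page_content="BigDL-LLM runs large language models with low-bit optimizations.")]
doc_chain = load_qa_chain(bigdl_llm, chain_type="stuff")
output = doc_chain.run(input_documents=docs, question="What does BigDL-LLM do?")
print(output)
```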

View file

@@ -21,6 +21,7 @@ You can use BigDL-LLM to run any Huggingface Transformer models with INT4 optimi
| InternLM | [link](internlm) |
| Whisper | [link](whisper) |
| Qwen | [link](qwen) |
+| Aquila | [link](aquila) |
## Recommended Requirements
To run the examples, we recommend using Intel® Xeon® processors (server), or >= 12th Gen Intel® Core™ processor (client).

View file

@@ -0,0 +1,7 @@
# Running Hugging Face Transformers model using BigDL-LLM on Intel CPU
This folder contains examples of running any Hugging Face Transformers model on BigDL-LLM (using the standard AutoModel APIs):
- [Model](Model): examples of running Hugging Face Transformers models (e.g., LLaMA2, ChatGLM2, Falcon, MPT, Baichuan2, etc.) using INT4 optimizations
- [More-Data-Types](More-Data-Types): examples of applying other low bit optimizations (NF4/INT5/INT8, etc.)
- [Save-Load](Save-Load): examples of saving and loading low-bit models
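In code, the [More-Data-Types](More-Data-Types) and [Save-Load](Save-Load) cases above boil down to something like the following sketch, reusing the `load_in_low_bit`, `save_low_bit` and `load_low_bit` calls that appear elsewhere in this diff (paths are placeholders):

```python
# Sketch: pick a non-default low-bit format and save the optimized model; paths are placeholders.
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('/path/to/model/',
                                             load_in_low_bit="sym_int5",
                                             trust_remote_code=True)
model.save_low_bit('/path/to/low-bit-model/')

# Later, reload the already-converted low-bit copy directly
new_model = AutoModelForCausalLM.load_low_bit('/path/to/low-bit-model/')
```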

View file

@@ -0,0 +1,43 @@
# BigDL-LLM Transformers Low-Bit Inference Pipeline for Large Language Model
In this example, we show a pipeline to apply BigDL-LLM low-bit optimizations (including INT8/INT5/INT4) to any Hugging Face Transformers model, and then run inference on the optimized low-bit model.
## Prepare Environment
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
conda activate llm
pip install --pre --upgrade bigdl-llm[all]
```
## Run Example
```bash
python ./transformers_low_bit_pipeline.py --repo-id-or-model-path decapoda-research/llama-7b-hf --low-bit sym_int5 --save-path ./llama-7b-sym_int5
```
Arguments info:
- `--repo-id-or-model-path`: str value, the Hugging Face repo id of the large language model to be downloaded, or the path to a Hugging Face checkpoint folder; the default is `'decapoda-research/llama-7b-hf'`.
- `--low-bit`: str value, one of `sym_int4`, `asym_int4`, `sym_int5`, `asym_int5` or `sym_int8` (`sym_int4` means symmetric INT4, `asym_int4` means asymmetric INT4, etc.); the corresponding low-bit optimization is applied to the model.
- `--save-path`: str value, the path where the low-bit model is saved; the saved model can later be loaded directly.
- `--load-path`: optional str value, the path from which to load a previously saved low-bit model.
## Sample Output for Inference
### 'decapoda-research/llama-7b-hf' Model
```log
Prompt: Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun
Output: Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun. She wanted to be a princess, and she wanted to be a pirate. She wanted to be a superhero, and she wanted to be
Model and tokenizer are saved to ./llama-7b-sym_int5
```
### Load low-bit model
Command to run:
```bash
python ./transformers_low_bit_pipeline.py --load-path ./llama-7b-sym_int5
```
Output log:
```log
Prompt: Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun
Output: Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun. She wanted to be a princess, and she wanted to be a pirate. She wanted to be a superhero, and she wanted to be
```
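The saved `./llama-7b-sym_int5` folder can also be picked up programmatically, mirroring what the example script does with `load_low_bit`; a minimal sketch:

```python
# Sketch: reload the folder produced by --save-path without re-converting the weights
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import LlamaTokenizer

model = AutoModelForCausalLM.load_low_bit('./llama-7b-sym_int5')
tokenizer = LlamaTokenizer.from_pretrained('./llama-7b-sym_int5')
```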

View file

@@ -0,0 +1,56 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import argparse
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import LlamaTokenizer, TextGenerationPipeline

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Transformer save_load example')
    parser.add_argument('--repo-id-or-model-path', type=str, default="decapoda-research/llama-7b-hf",
                        help='The huggingface repo id for the large language model to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--low-bit', type=str, default="sym_int4",
                        choices=['sym_int4', 'asym_int4', 'sym_int5', 'asym_int5', 'sym_int8'],
                        help='The quantization type the model will convert to.')
    parser.add_argument('--save-path', type=str, default=None,
                        help='The path to save the low-bit model.')
    parser.add_argument('--load-path', type=str, default=None,
                        help='The path to load the low-bit model.')
    args = parser.parse_args()

    model_path = args.repo_id_or_model_path
    low_bit = args.low_bit
    load_path = args.load_path
    if load_path:
        model = AutoModelForCausalLM.load_low_bit(load_path)
        tokenizer = LlamaTokenizer.from_pretrained(load_path)
    else:
        # load_in_low_bit in bigdl.llm.transformers will convert
        # the relevant layers in the model into corresponding int X format
        model = AutoModelForCausalLM.from_pretrained(model_path, load_in_low_bit=low_bit, trust_remote_code=True)
        tokenizer = LlamaTokenizer.from_pretrained(model_path, trust_remote_code=True)

    pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, max_new_tokens=32)
    input_str = "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun"
    output = pipeline(input_str)[0]["generated_text"]
    print(f"Prompt: {input_str}")
    print(f"Output: {output}")

    save_path = args.save_path
    if save_path:
        model.save_low_bit(save_path)
        tokenizer.save_pretrained(save_path)
        print(f"Model and tokenizer are saved to {save_path}")

View file

@@ -0,0 +1,7 @@
# Running PyTorch model using BigDL-LLM on Intel CPU
This folder contains examples of running any PyTorch model on BigDL-LLM (with "one-line code change"):
- [Model](Model): examples of running PyTorch models (e.g., OpenAI Whisper, LLaMA2, ChatGLM2, Falcon, MPT, Baichuan2, etc.) using INT4 optimizations
- [More-Data-Types](More-Data-Types): examples of applying other low bit optimizations (NF4/INT5/INT8, etc.)
- [Save-Load](Save-Load): examples of saving and loading low-bit models
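A sketch of the "one-line code change" these examples refer to, assuming the `optimize_model` helper exposed by `bigdl.llm`; the checkpoint being loaded is just a generic placeholder:

```python
# Sketch only: optimize_model is assumed to be the one-line optimization entry point;
# the checkpoint path is a placeholder.
from transformers import AutoModelForCausalLM  # plain Hugging Face load, no BigDL involved yet
from bigdl.llm import optimize_model           # assumed import

model = AutoModelForCausalLM.from_pretrained('/path/to/model/', trust_remote_code=True)
model = optimize_model(model)  # the one-line change: apply low-bit optimizations to the loaded model
```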

View file

@@ -0,0 +1,18 @@
# BigDL-LLM Examples on Intel CPU
This folder contains examples of running BigDL-LLM on Intel CPU:
- [HF-Transformers-AutoModels](HF-Transformers-AutoModels): running any Hugging Face Transformers model on BigDL-LLM (using the standard AutoModel APIs)
- [PyTorch-Models](PyTorch-Models): running any PyTorch model on BigDL-LLM (with "one-line code change")
- [Native-Models](Native-Models): converting & running LLM in `llama`/`chatglm`/`bloom`/`gptneox`/`starcoder` model family using native (cpp) implementation
- [LangChain](LangChain): running LangChain applications on BigDL-LLM
## System Support
**Hardware**:
- Intel® Core™ processors
- Intel® Xeon® processors
**Operating System**:
- Ubuntu 20.04 or later
- CentOS 7 or later
- Windows 10/11, with or without WSL

View file

@@ -21,6 +21,7 @@ You can use BigDL-LLM to run almost every Huggingface Transformer models with IN
- Intel Arc™ A-Series Graphics
- Intel Data Center GPU Flex Series
+- Intel Data Center GPU Max Series
## Recommended Requirements
To apply Intel GPU acceleration, there are several steps for tools installation and environment preparation.
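Once the tools and environment are ready, GPU inference mirrors the CPU flow except that the optimized model and the input tensors are moved to the `'xpu'` device, as in the snippets earlier in this diff; a minimal sketch with placeholder paths and prompt:

```python
# Sketch of the GPU (xpu) flow shown earlier in this diff; paths and prompt are placeholders.
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('/path/to/model/',
                                             load_in_low_bit="sym_int4",
                                             trust_remote_code=True)
model = model.to('xpu')  # move the optimized model to the Intel GPU
tokenizer = AutoTokenizer.from_pretrained('/path/to/model/', trust_remote_code=True)

input_ids = tokenizer.encode("Once upon a time", return_tensors="pt").to('xpu')
output_ids = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.batch_decode(output_ids.cpu(), skip_special_tokens=True)[0])
```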

Some files were not shown because too many files have changed in this diff.