Update llm readme (#9005)

Jason Dai 2023-09-19 20:01:33 +08:00 committed by GitHub
parent 249386261c
commit 51518e029d
4 changed files with 103 additions and 116 deletions


@ -12,9 +12,9 @@
> *It is built on top of the excellent work of [llama.cpp](https://github.com/ggerganov/llama.cpp), [gptq](https://github.com/IST-DASLab/gptq), [ggml](https://github.com/ggerganov/ggml), [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), [bitsandbytes](https://github.com/TimDettmers/bitsandbytes), [qlora](https://github.com/artidoro/qlora), [gptq_for_llama](https://github.com/qwopqwop200/GPTQ-for-LLaMa), [chatglm.cpp](https://github.com/li-plus/chatglm.cpp), [redpajama.cpp](https://github.com/togethercomputer/redpajama.cpp), [gptneox.cpp](https://github.com/byroneverson/gptneox.cpp), [bloomz.cpp](https://github.com/NouamaneTazi/bloomz.cpp/), etc.*
### Latest update
- `bigdl-llm` now supports Intel Arc or Flex GPU; see the latest GPU examples [here](python/llm/example/gpu).
- `bigdl-llm` now supports Intel GPU (including Arc, Flex and MAX); see the latest GPU examples [here](python/llm/example/gpu).
- `bigdl-llm` tutorial is released [here](https://github.com/intel-analytics/bigdl-llm-tutorial).
- Over 20 models have been optimized/verified on `bigdl-llm`, including *LLaMA/LLaMA2, ChatGLM/ChatGLM2, MPT, Falcon, Dolly-v1/Dolly-v2, StarCoder, Whisper, QWen, Baichuan, MOSS,* and more; see the complete list [here](python/llm/README.md#verified-models).
- Over 20 models have been optimized/verified on `bigdl-llm`, including *LLaMA/LLaMA2, ChatGLM/ChatGLM2, MPT, Falcon, Dolly-v1/Dolly-v2, StarCoder, Whisper, InternLM, QWen, Baichuan, MOSS,* and more; see the complete list [here](python/llm/README.md#verified-models).
### `bigdl-llm` Demos
See the ***optimized performance*** of `chatglm2-6b` and `llama-2-13b-chat` models on 12th Gen Intel Core CPU and Intel Arc GPU below.
@ -104,7 +104,7 @@ input_ids = tokenizer.encode(input_str, ...).to('xpu')
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids.cpu())
```
*See the complete examples [here](python/llm/example/transformers/transformers_int4/).*
*See the complete examples [here](python/llm/example/gpu/).*
#### More Low-Bit Support
##### Save and load


@ -24,9 +24,9 @@ BigDL-LLM: low-Bit LLM library
============================================
Latest update
============================================
- ``bigdl-llm`` now supports Intel Arc and Flex GPU; see the latest GPU examples `here <https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/gpu>`_.
- ``bigdl-llm`` now supports Intel GPU (including Arc, Flex and MAX); see the latest GPU examples `here <https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/gpu>`_.
- ``bigdl-llm`` tutorial is released `here <https://github.com/intel-analytics/bigdl-llm-tutorial>`_.
- Over 20 models have been verified on ``bigdl-llm``, including *LLaMA/LLaMA2, ChatGLM/ChatGLM2, MPT, Falcon, Dolly-v1/Dolly-v2, StarCoder, Whisper, QWen, Baichuan,* and more; see the complete list `here <https://github.com/intel-analytics/BigDL/tree/main/python/llm/README.md#verified-models>`_.
- Over 20 models have been verified on ``bigdl-llm``, including *LLaMA/LLaMA2, ChatGLM/ChatGLM2, MPT, Falcon, Dolly-v1/Dolly-v2, StarCoder, Whisper, InternLM, QWen, Baichuan, MOSS* and more; see the complete list `here <https://github.com/intel-analytics/BigDL/tree/main/python/llm/README.md#verified-models>`_.
============================================


@ -1,12 +1,8 @@
## BigDL-LLM
**`bigdl-llm`** is a library for running ***LLM*** (large language model) on your Intel ***laptop*** or ***GPU*** using INT4 with very low latency[^1] (for any Hugging Face *Transformers* model).
**[`bigdl-llm`](https://bigdl.readthedocs.io/en/latest/doc/LLM/index.html)** is a library for running **LLM** (large language model) on Intel **XPU** (from *Laptop* to *GPU* to *Cloud*) using **INT4** with very low latency[^1] (for any **PyTorch** model).
> *It is built on top of the excellent work of [llama.cpp](https://github.com/ggerganov/llama.cpp), [gptq](https://github.com/IST-DASLab/gptq), [ggml](https://github.com/ggerganov/ggml), [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), [bitsandbytes](https://github.com/TimDettmers/bitsandbytes), [qlora](https://github.com/artidoro/qlora), [gptq_for_llama](https://github.com/qwopqwop200/GPTQ-for-LLaMa), [chatglm.cpp](https://github.com/li-plus/chatglm.cpp), [redpajama.cpp](https://github.com/togethercomputer/redpajama.cpp), [gptneox.cpp](https://github.com/byroneverson/gptneox.cpp), [bloomz.cpp](https://github.com/NouamaneTazi/bloomz.cpp/), etc.*
### Latest update
- `bigdl-llm` now supports Intel Arc or Flex GPU; see the latest GPU examples [here](example/gpu).
### Demos
See the ***optimized performance*** of `chatglm2-6b` and `llama-2-13b-chat` models on 12th Gen Intel Core CPU and Intel Arc GPU below.
@ -37,9 +33,11 @@ See the ***optimized performance*** of `chatglm2-6b` and `llama-2-13b-chat` mode
</tr>
</table>
### Verified models
We may use any Hugging Face Transformers models on `bigdl-llm`, and the following models have been verified on Intel laptops.
Over 20 models have been optimized/verified on `bigdl-llm`, including *LLaMA/LLaMA2, ChatGLM/ChatGLM2, MPT, Falcon, Dolly-v1/Dolly-v2, StarCoder, Whisper, InternLM, QWen, Baichuan, MOSS,* and more; see the complete list below.
<details><summary>Table of verified models</summary>
| Model | Example |
|-----------|----------------------------------------------------------|
| LLaMA *(such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.)* | [link1](example/transformers/native_int4), [link2](example/transformers/transformers_int4/vicuna) |
@ -51,6 +49,7 @@ We may use any Hugging Face Transformers models on `bigdl-llm`, and the following
| Qwen | [link](example/transformers/transformers_int4/qwen) |
| MOSS | [link](example/transformers/transformers_int4/moss) |
| Baichuan | [link](example/transformers/transformers_int4/baichuan) |
| Baichuan2 | [link](example/transformers/transformers_int4/baichuan2) |
| Dolly-v1 | [link](example/transformers/transformers_int4/dolly_v1) |
| Dolly-v2 | [link](example/transformers/transformers_int4/dolly_v2) |
| RedPajama | [link1](example/transformers/native_int4), [link2](example/transformers/transformers_int4/redpajama) |
@ -59,109 +58,136 @@ We may use any Hugging Face Transformers models on `bigdl-llm`, and the following
| InternLM | [link](example/transformers/transformers_int4/internlm) |
| Whisper | [link](example/transformers/transformers_int4/whisper) |
</details>
### Working with `bigdl-llm`
<details><summary>Table of Contents</summary>
- [Install](#install)
- [Download Model](#download-model)
- [Run Model](#run-model)
- [Hugging Face `transformers` API](#hugging-face-transformers-api)
- [LangChain API](#langchain-api)
- [CLI Tool](#cli-tool)
- [Hugging Face `transformers` API](#1-hugging-face-transformers-api)
- [Native INT4 Model](#2-native-int4-model)
- [LangChain API](#3-langchain-api)
- [CLI Tool](#4-cli-tool)
- [`bigdl-llm` API Doc](#bigdl-llm-api-doc)
- [`bigdl-llm` Dependence](#bigdl-llm-dependence)
- [`bigdl-llm` Dependency](#bigdl-llm-dependency)
</details>
#### Install
You may install **`bigdl-llm`** as follows:
##### CPU
You may install **`bigdl-llm`** on Intel CPU as follows:
```bash
pip install --pre --upgrade bigdl-llm[all]
```
#### Download Model
> Note: `bigdl-llm` has been tested on Python 3.9
You may download any PyTorch model in Hugging Face *Transformers* format (including *FP16* or *FP32* or *GPTQ-4bit*).
##### GPU
You may install **`bigdl-llm`** on Intel GPU as follows:
```bash
# the command below installs intel_extension_for_pytorch==2.0.110+xpu by default
# you may install a specific ipex/torch version if needed
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
> Note: `bigdl-llm` has been tested on Python 3.9
#### Run Model
You may run the models using **`bigdl-llm`** through one of the following APIs:
1. [Hugging Face `transformers` API](#hugging-face-transformers-api)
2. [LangChain API](#langchain-api)
3. [CLI (command line interface) Tool](#cli-tool)
1. [Hugging Face `transformers` API](#1-hugging-face-transformers-api)
2. [Native INT4 Model](#2-native-int4-model)
3. [LangChain API](#3-langchain-api)
4. [CLI (command line interface) Tool](#4-cli-tool)
#### Hugging Face `transformers` API
You may run the models using `transformers`-style API in `bigdl-llm`.
##### 1. Hugging Face `transformers` API
You may run any Hugging Face *Transformers* model as follows:
- ##### Using Hugging Face `transformers` INT4 format
###### CPU INT4
You may apply INT4 optimizations to any Hugging Face *Transformers* model on Intel CPU as follows.
You may apply INT4 optimizations to any Hugging Face *Transformers* models as follows.
```python
#load Hugging Face Transformers model with INT4 optimizations
from bigdl.llm.transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)
```python
#load Hugging Face Transformers model with INT4 optimizations
from bigdl.llm.transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)
```
#run the optimized model on Intel CPU
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer.encode(input_str, ...)
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids)
```
After loading the Hugging Face Transformers model, you may easily run the optimized model as follows.
See the complete examples [here](example/transformers/transformers_int4/).
```python
#run the optimized model
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer.encode(input_str, ...)
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids)
```
###### GPU INT4
You may apply INT4 optimizations to any Hugging Face *Transformers* model on Intel GPU as follows.
See the complete examples [here](example/transformers/transformers_int4/).
```python
#load Hugging Face Transformers model with INT4 optimizations
from bigdl.llm.transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)
>**Note**: You may apply more low bit optimizations (including INT8, INT5 and INT4) as follows:
>```python
>model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int5")
>```
>See the complete example [here](example/transformers/transformers_low_bit/).
#run the optimized model on Intel GPU
model = model.to('xpu')
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer.encode(input_str, ...).to('xpu')
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids.cpu())
```
See the complete examples [here](example/gpu/).
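
The CPU and GPU snippets above differ only in where the model and inputs are placed. As a rough sketch, you may pick the device string once and reuse it. The helper name `pick_device` is hypothetical and not part of `bigdl-llm`; it only assumes that the `intel_extension_for_pytorch` package named in the install note is what enables `'xpu'`. Finding the package does not prove that a usable Intel GPU is present.

```python
import importlib.util


def pick_device(prefer_gpu=True, has_ipex=None):
    """Return 'xpu' if Intel Extension for PyTorch appears to be installed, else 'cpu'.

    Hypothetical helper (not a bigdl-llm API). `has_ipex` may be passed
    explicitly; when left as None the package is looked up without importing it.
    """
    if has_ipex is None:
        # find_spec returns None when the top-level package is not installed.
        has_ipex = importlib.util.find_spec("intel_extension_for_pytorch") is not None
    return "xpu" if (prefer_gpu and has_ipex) else "cpu"


device = pick_device()
print(f"selected device: {device}")
```

You may then write `model = model.to(device)` and `input_ids = input_ids.to(device)` in place of the hard-coded `'xpu'` strings.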
After the model is optimized using INT4 (or INT8/INT5), you may save and load the optimized model as follows:
###### More Low-Bit Support
- Save and load
After the model is optimized using `bigdl-llm`, you may save and load the model as follows:
```python
model.save_low_bit(model_path)
new_model = AutoModelForCausalLM.load_low_bit(model_path)
```
See the example [here](example/transformers/transformers_low_bit/).
*See the complete example [here](example/transformers/transformers_low_bit/).*
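
One way to use `save_low_bit` and `load_low_bit` together is to optimize the model only on the first run and reuse the saved copy afterwards. The following is a minimal sketch under that assumption: the helper `load_cached_low_bit` and its `cache_dir` argument are hypothetical, and the model class is passed in (for example `AutoModelForCausalLM` from `bigdl.llm.transformers`) so the helper itself has no `bigdl-llm` import.

```python
import os


def load_cached_low_bit(model_cls, model_path, cache_dir):
    """Load a saved low-bit model from cache_dir, or optimize and save it on first use.

    Hypothetical helper: `model_cls` is expected to provide `from_pretrained`
    and `load_low_bit`, and its instances `save_low_bit`, as shown above.
    """
    # A non-empty directory is treated as a previously saved model.
    if os.path.isdir(cache_dir) and os.listdir(cache_dir):
        return model_cls.load_low_bit(cache_dir)
    # First run: apply INT4 optimizations, then save for later runs.
    model = model_cls.from_pretrained(model_path, load_in_4bit=True)
    model.save_low_bit(cache_dir)
    return model
```

A call might look like `model = load_cached_low_bit(AutoModelForCausalLM, '/path/to/model/', '/path/to/cache/')`; both paths are placeholders.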
- ##### Using native INT4 format
- Additional data types
You may also convert Hugging Face *Transformers* models into native INT4 format for maximum performance as follows.
>**Notes**: Currently only llama/bloom/gptneox/starcoder/chatglm model families are supported; you may use the corresponding API to load the converted model. (For other models, you can use the Transformers INT4 format as described above).
In addition to INT4, you may apply other low-bit optimizations (such as *INT8*, *INT5*, *NF4*, etc.) as follows:
```python
#convert the model
from bigdl.llm import llm_convert
bigdl_llm_path = llm_convert(model='/path/to/model/',
model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int8")
```
*See the complete example [here](example/transformers/transformers_low_bit/).*
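
To give a feel for what a low-bit data type trades off, here is a generic sketch of symmetric integer quantization in plain Python. It is an illustration only and is not taken from the `bigdl-llm` implementation; reading the `sym_` prefix in names such as `sym_int8` as "symmetric" is an assumption, and the function names are made up for this example.

```python
def quantize_symmetric(values, bits=4):
    """Quantize floats to signed integers using one shared scale.

    Illustrative only: integers lie in [-qmax, qmax], where
    qmax = 2**(bits - 1) - 1 (7 for 4 bits, 127 for 8 bits).
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max((abs(v) for v in values), default=0.0)
    # Avoid dividing by zero when every value is 0.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quantized]


weights = [0.1, -0.25, 0.8]
for bits in (4, 8):
    q, scale = quantize_symmetric(weights, bits=bits)
    restored = dequantize(q, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit: ints={q}, worst error={worst:.4f}")
```

Fewer bits mean a coarser grid and a larger rounding error per value, which is the trade-off behind choosing between INT4, INT5 and INT8.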
##### 2. Native INT4 model
You may also convert Hugging Face *Transformers* models into native INT4 model format for maximum performance as follows.
>**Notes**: Currently only llama/bloom/gptneox/starcoder/chatglm model families are supported; for other models, you may use the Hugging Face `transformers` model format as described above.
```python
#convert the model
from bigdl.llm import llm_convert
bigdl_llm_path = llm_convert(model='/path/to/model/',
outfile='/path/to/output/', outtype='int4', model_family="llama")
#load the converted model
#switch to ChatGLMForCausalLM/GptneoxForCausalLM/BloomForCausalLM/StarcoderForCausalLM to load other models
from bigdl.llm.transformers import LlamaForCausalLM
llm = LlamaForCausalLM.from_pretrained("/path/to/output/model.bin", native=True, ...)
#load the converted model
#switch to ChatGLMForCausalLM/GptneoxForCausalLM/BloomForCausalLM/StarcoderForCausalLM to load other models
from bigdl.llm.transformers import LlamaForCausalLM
llm = LlamaForCausalLM.from_pretrained("/path/to/output/model.bin", native=True, ...)
#run the converted model
input_ids = llm.tokenize(prompt)
output_ids = llm.generate(input_ids, ...)
output = llm.batch_decode(output_ids)
```
#run the converted model
input_ids = llm.tokenize(prompt)
output_ids = llm.generate(input_ids, ...)
output = llm.batch_decode(output_ids)
```
See the complete example [here](example/transformers/native_int4/native_int4_pipeline.py).
See the complete example [here](example/transformers/native_int4/native_int4_pipeline.py).
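
Since each supported model family is loaded through its own class, a small lookup table can keep the choice in one place. This is a sketch, not a `bigdl-llm` API: the helper names are hypothetical, the class names are the ones listed in the comment above, and the family keys are assumed to match the `model_family` values accepted by `llm_convert`.

```python
import importlib

# Family name -> loader class name in bigdl.llm.transformers (as listed above).
NATIVE_CLASS_BY_FAMILY = {
    "llama": "LlamaForCausalLM",
    "chatglm": "ChatGLMForCausalLM",
    "gptneox": "GptneoxForCausalLM",
    "bloom": "BloomForCausalLM",
    "starcoder": "StarcoderForCausalLM",
}


def native_class_name(model_family):
    """Return the loader class name for a supported native INT4 model family."""
    try:
        return NATIVE_CLASS_BY_FAMILY[model_family.lower()]
    except KeyError:
        supported = ", ".join(sorted(NATIVE_CLASS_BY_FAMILY))
        raise ValueError(
            f"unsupported model family {model_family!r}; expected one of: {supported}"
        ) from None


def load_native_model(model_family, model_file, **kwargs):
    """Load a converted model; bigdl-llm is imported only when this is called."""
    module = importlib.import_module("bigdl.llm.transformers")
    model_cls = getattr(module, native_class_name(model_family))
    return model_cls.from_pretrained(model_file, native=True, **kwargs)
```

For an unsupported family the helper raises `ValueError`, at which point you may fall back to the Hugging Face `transformers` model format described earlier.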
#### LangChain API
##### 3. LangChain API
You may run the models using the LangChain API in `bigdl-llm`.
- **Using Hugging Face `transformers` INT4 format**
- **Using Hugging Face `transformers` model**
You may run any Hugging Face *Transformers* model (with INT4 optimizations applied) using the LangChain API as follows:
@ -178,15 +204,11 @@ You may run the models using the LangChain API in `bigdl-llm`.
```
See the examples [here](example/langchain/transformers_int4).
- **Using native INT4 format**
- **Using native INT4 model**
You may also convert Hugging Face *Transformers* models into *native INT4* format, and then run the converted models using the LangChain API as follows.
>**Notes**:
>* Currently only llama/bloom/gptneox/starcoder/chatglm model families are supported; for other models, you may use the Hugging Face `transformers` INT4 format as described above.
>* You may choose the corresponding API developed for specific native models to load the converted model.
>**Notes**: Currently only llama/bloom/gptneox/starcoder/chatglm model families are supported; for other models, you may use the Hugging Face `transformers` model format as described above.
```python
from bigdl.llm.langchain.llms import LlamaLLM
@ -204,43 +226,8 @@ You may run the models using the LangChain API in `bigdl-llm`.
See the examples [here](example/langchain/native_int4).
#### CLI Tool
>**Note**: Currently `bigdl-llm` CLI supports *LLaMA* (e.g., *vicuna*), *GPT-NeoX* (e.g., *redpajama*), *BLOOM* (e.g., *phoenix*) and *GPT2* (e.g., *starcoder*) model architectures; for other models, you may use the `transformers`-style or LangChain APIs.
- ##### Convert model
You may convert the downloaded model into native INT4 format using `llm-convert`.
```bash
#convert PyTorch (fp16 or fp32) model;
#llama/bloom/gptneox/starcoder model family is currently supported
llm-convert "/path/to/model/" --model-format pth --model-family "bloom" --outfile "/path/to/output/"
#convert GPTQ-4bit model
#only llama model family is currently supported
llm-convert "/path/to/model/" --model-format gptq --model-family "llama" --outfile "/path/to/output/"
```
- ##### Run model
You may run the converted model using `llm-cli` or `llm-chat` (*built on top of `main.cpp` in [llama.cpp](https://github.com/ggerganov/llama.cpp)*)
```bash
#help
#llama/bloom/gptneox/starcoder model family is currently supported
llm-cli -x gptneox -h
#text completion
#llama/bloom/gptneox/starcoder model family is currently supported
llm-cli -t 16 -x gptneox -m "/path/to/output/model.bin" -p 'Once upon a time,'
#chat mode
#llama/gptneox model family is currently supported
llm-chat -m "/path/to/output/model.bin" -x llama
```
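
If you drive `llm-cli` from a Python script, building the argument list explicitly avoids shell-quoting mistakes in the prompt. The sketch below only assembles the text-completion command shown above; `build_llm_cli_command` is a hypothetical helper, and it uses no flags beyond `-t`, `-x`, `-m` and `-p` from that example.

```python
import shlex


def build_llm_cli_command(model_file, family, prompt, threads=16):
    """Return the argv list for an `llm-cli` text-completion call (hypothetical helper)."""
    return [
        "llm-cli",
        "-t", str(threads),   # number of threads
        "-x", family,         # model family, e.g. "gptneox"
        "-m", model_file,     # path to the converted model
        "-p", prompt,         # prompt text
    ]


cmd = build_llm_cli_command("/path/to/output/model.bin", "gptneox", "Once upon a time,")
# shlex.join renders the list as a copy-pasteable shell command.
print(shlex.join(cmd))
```

The list may be passed to `subprocess.run(cmd, check=True)` once `bigdl-llm` is installed and the model has been converted.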
#### CLI Tool
>**Note**: Currently `bigdl-llm` CLI supports *LLaMA* (e.g., *vicuna*), *GPT-NeoX* (e.g., *redpajama*), *BLOOM* (e.g., *phoenix*) and *GPT2* (e.g., *starcoder*) model architectures; for other models, you may use the `transformers`-style or LangChain APIs.
##### 4. CLI Tool
>**Note**: Currently `bigdl-llm` CLI supports *LLaMA* (e.g., *vicuna*), *GPT-NeoX* (e.g., *redpajama*), *BLOOM* (e.g., *phoenix*) and *GPT2* (e.g., *starcoder*) model architectures; for other models, you may use the Hugging Face `transformers` or LangChain APIs.
- ##### Convert model
@ -279,7 +266,7 @@ See the initial `bigdl-llm` API Doc [here](https://bigdl.readthedocs.io/en/latest
[^1]: Performance varies by use, configuration and other factors. `bigdl-llm` may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.
### `bigdl-llm` Dependencies
### `bigdl-llm` Dependency
The native code/lib in `bigdl-llm` has been built using the following tools.
Note that lower `LIBC` version on your Linux system may be incompatible with `bigdl-llm`.


@ -1,4 +1,4 @@
# Baichuan
# Baichuan2
In this directory, you will find examples on how you could apply BigDL-LLM INT4 optimizations on Baichuan2 models. For illustration purposes, we utilize the [baichuan-inc/Baichuan2-13B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat) as a reference Baichuan model.
## 0. Requirements