Update Readme (#8959)

parent dcaa4dc130
commit 448a9e813a

2 changed files with 110 additions and 33 deletions

README.md | 67
---

## BigDL-LLM

**[`bigdl-llm`](python/llm)** is a library for running **LLM** (large language model) on Intel **XPU** (from *Laptop* to *GPU* to *Cloud*) using **INT4** with very low latency[^1] (for any **PyTorch** model).
> *It is built on top of the excellent work of [llama.cpp](https://github.com/ggerganov/llama.cpp), [gptq](https://github.com/IST-DASLab/gptq), [ggml](https://github.com/ggerganov/ggml), [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), [bitsandbytes](https://github.com/TimDettmers/bitsandbytes), [qlora](https://github.com/artidoro/qlora), [gptq_for_llama](https://github.com/qwopqwop200/GPTQ-for-LLaMa), [chatglm.cpp](https://github.com/li-plus/chatglm.cpp), [redpajama.cpp](https://github.com/togethercomputer/redpajama.cpp), [gptneox.cpp](https://github.com/byroneverson/gptneox.cpp), [bloomz.cpp](https://github.com/NouamaneTazi/bloomz.cpp/), etc.*
### Latest update

See the ***optimized performance*** of `chatglm2-6b`, `llama-2-13b-chat`, and `starcoder-15.5b` models on a 12th Gen Intel Core CPU below.
### `bigdl-llm` quickstart
- [CPU INT4](#cpu-int4)
- [GPU INT4](#gpu-int4)
- [More Low-Bit Support](#more-low-bit-support)

#### CPU INT4

##### Install

You may install **`bigdl-llm`** on Intel CPU as follows:
```bash
pip install --pre --upgrade bigdl-llm[all]
```

> Note: `bigdl-llm` has been tested on Python 3.9
##### Run Model

You may apply INT4 optimizations to any Hugging Face *Transformers* model as follows.
```python
#load Hugging Face Transformers model with INT4 optimizations
from bigdl.llm.transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)

#run the optimized model on CPU
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer.encode(input_str, ...)
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids)
```

*See the complete examples [here](python/llm/example/transformers/transformers_int4/).*
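Conceptually, `load_in_4bit=True` maps each group of float weights to small integers plus a scale. The toy sketch below illustrates the idea behind symmetric INT4 ("sym_int4") quantization with a single shared scale; it is an assumption-laden simplification, not bigdl-llm's actual kernel (which packs values into groups with per-group scales).

```python
# Conceptual sketch of symmetric INT4 quantization -- NOT bigdl-llm's kernel.
# A single scale maps floats into the signed 4-bit range [-7, 7].

def quantize_sym_int4(weights):
    """Quantize floats to integers in [-7, 7] with one shared scale."""
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate float weights."""
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.91, -0.07]
q, scale = quantize_sym_int4(weights)
approx = dequantize(q, scale)
# Each reconstructed weight is within half a quantization step of the original
assert all(abs(a - w) <= scale / 2 + 1e-9 for a, w in zip(approx, weights))
```

Each weight is stored as a 4-bit integer instead of a 16/32-bit float, which is where the memory and latency savings come from.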
#### GPU INT4

##### Install

You may install **`bigdl-llm`** on Intel GPU as follows:

```bash
# The command below installs intel_extension_for_pytorch==2.0.110+xpu by default;
# you may install a specific ipex/torch version if needed
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```

> Note: `bigdl-llm` has been tested on Python 3.9

##### Run Model
You may apply INT4 optimizations to any Hugging Face *Transformers* model as follows.
```python
|
```python
|
||||||
model.save_low_bit(model_path)
|
#load Hugging Face Transformers model with INT4 optimizations
|
||||||
|
from bigdl.llm.transformers import AutoModelForCausalLM
|
||||||
|
model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)
|
||||||
|
|
||||||
|
#run the optimized model on Intel GPU
|
||||||
|
model = model.to('xpu')
|
||||||
|
|
||||||
|
from transformers import AutoTokenizer
|
||||||
|
tokenizer = AutoTokenizer.from_pretrained(model_path)
|
||||||
|
input_ids = tokenizer.encode(input_str, ...).to('xpu')
|
||||||
|
output_ids = model.generate(input_ids, ...)
|
||||||
|
output = tokenizer.batch_decode(output_ids.cpu())
|
||||||
|
```
|
||||||
|
*See the complete examples [here](python/llm/example/transformers/transformers_int4/).*
|
||||||
#### More Low-Bit Support

##### Save and load

After the model is optimized using `bigdl-llm`, you may save and load the model as follows:

```python
model.save_low_bit(model_path)
new_model = AutoModelForCausalLM.load_low_bit(model_path)
```
*See the complete example [here](python/llm/example/transformers/transformers_low_bit/).*
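`save_low_bit` persists the already-quantized weights, so the checkpoint stays small. As a rough illustration of why 4-bit storage halves the footprint of 8-bit storage, the sketch below packs two signed 4-bit values into each byte; this is a hypothetical toy format, not bigdl-llm's actual serialization.

```python
# Toy illustration of 4-bit packing -- NOT bigdl-llm's on-disk format.

def pack_int4(values):
    """Pack integers in [-8, 7] into bytes, two values per byte."""
    assert all(-8 <= v <= 7 for v in values) and len(values) % 2 == 0
    out = bytearray()
    for lo, hi in zip(values[::2], values[1::2]):
        out.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(out)

def unpack_int4(data):
    """Recover the signed 4-bit values from packed bytes."""
    vals = []
    for b in data:
        for nib in (b & 0x0F, b >> 4):
            vals.append(nib - 16 if nib >= 8 else nib)
    return vals

q = [1, -4, 7, -1]
packed = pack_int4(q)
assert len(packed) == len(q) // 2   # half the bytes of one-byte-per-value
assert unpack_int4(packed) == q     # the round-trip is lossless
```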
##### Additional data types

In addition to INT4, you may apply other low-bit optimizations (such as *INT8*, *INT5*, *NF4*, etc.) as follows:

```python
model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int8")
```

*See the complete example [here](python/llm/example/transformers/transformers_low_bit/).*
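The wider types trade memory for fidelity: more bits means a finer quantization grid and a smaller round-trip error. The sketch below compares symmetric INT8 and INT4 reconstruction error under a single shared scale; it is a conceptual comparison only, not a measurement of bigdl-llm's grouped kernels.

```python
# Compare worst-case round-trip error of symmetric INT8 vs INT4 quantization.
# Single shared scale per vector; real kernels use per-group scales.

def roundtrip_error(weights, bits):
    qmax = 2 ** (bits - 1) - 1          # 7 for INT4, 127 for INT8
    scale = max(abs(w) for w in weights) / qmax
    err = 0.0
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))
        err = max(err, abs(q * scale - w))
    return err

weights = [0.12, -0.53, 0.91, -0.07, 0.33]
e8 = roundtrip_error(weights, 8)
e4 = roundtrip_error(weights, 4)
assert e8 < e4                           # INT8 reconstructs more faithfully
assert e8 <= (0.91 / 127) / 2 + 1e-12    # error bounded by half a step
```

This is why INT8/INT5 can be worth trying when INT4 noticeably degrades a model's output quality.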
***For more details, please refer to the `bigdl-llm` [Document](https://test-bigdl-llm.readthedocs.io/en/main/doc/LLM/index.html), [Readme](python/llm), [Tutorial](https://github.com/intel-analytics/bigdl-llm-tutorial) and [API Doc](https://bigdl.readthedocs.io/en/latest/doc/PythonAPI/LLM/index.html).***
---

## Overview of the complete BigDL project
.. meta::
   :google-site-verification: S66K6GAclKw1RroxU0Rka_2d1LZFVe27M0gRneEsIVI
################################################
The BigDL Project
################################################

------
************************************************
BigDL-LLM: Low-Bit LLM library
************************************************
.. raw:: html

   <p>
      <a href="https://github.com/intel-analytics/BigDL/tree/main/python/llm"><code><span>bigdl-llm</span></code></a> is a library for running <strong>LLM</strong> (large language model) on Intel <strong>XPU</strong> (from <em>Laptop</em> to <em>GPU</em> to <em>Cloud</em>) using <strong>INT4</strong> with very low latency <sup><a href="#footnote-perf" id="ref-perf">[1]</a></sup> (for any <strong>PyTorch</strong> model).
   </p>
.. note::

   It is built on top of the excellent work of `llama.cpp <https://github.com/ggerganov/llama.cpp>`_, `gptq <https://github.com/IST-DASLab/gptq>`_, `bitsandbytes <https://github.com/TimDettmers/bitsandbytes>`_, `qlora <https://github.com/artidoro/qlora>`_, etc.
============================================
Latest update
============================================
- ``bigdl-llm`` now supports Intel Arc and Flex GPU; see the latest GPU examples `here <https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/gpu>`_.
- ``bigdl-llm`` tutorial is released `here <https://github.com/intel-analytics/bigdl-llm-tutorial>`_.
- Over 20 models have been verified on ``bigdl-llm``, including *LLaMA/LLaMA2, ChatGLM/ChatGLM2, MPT, Falcon, Dolly-v1/Dolly-v2, StarCoder, Whisper, QWen, Baichuan,* and more; see the complete list `here <https://github.com/intel-analytics/BigDL/tree/main/python/llm/README.md#verified-models>`_.
============================================
``bigdl-llm`` demos
============================================
See the **optimized performance** of ``chatglm2-6b``, ``llama-2-13b-chat``, and ``starcoder-15.5b`` models on a 12th Gen Intel Core CPU below.

.. raw:: html

   <p>
      <img src="https://llm-assets.readthedocs.io/en/latest/_images/llm-models3.png" width='76%'>
   </p>
============================================
``bigdl-llm`` quickstart
============================================
- `CPU <#cpu-quickstart>`_
- `GPU <#gpu-quickstart>`_

--------------------------------------------
CPU Quickstart
--------------------------------------------

You may install ``bigdl-llm`` on Intel CPU as follows:
.. code-block:: console

   pip install --pre --upgrade bigdl-llm[all]

You can then apply INT4 optimizations to any Hugging Face *Transformers* model as follows.
.. code-block:: python

   #load Hugging Face Transformers model with INT4 optimizations
   from bigdl.llm.transformers import AutoModelForCausalLM
   model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)

   #run the optimized model on Intel CPU
   from transformers import AutoTokenizer
   tokenizer = AutoTokenizer.from_pretrained(model_path)
   input_ids = tokenizer.encode(input_str, ...)
   output_ids = model.generate(input_ids, ...)
   output = tokenizer.batch_decode(output_ids)
--------------------------------------------
GPU Quickstart
--------------------------------------------

You may install ``bigdl-llm`` on Intel GPU as follows:
.. code-block:: console

   # The command below installs intel_extension_for_pytorch==2.0.110+xpu by default;
   # you may install a specific ipex/torch version if needed
   pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu

.. note::

   ``bigdl-llm`` has been tested on Python 3.9.
You can then apply INT4 optimizations to any Hugging Face *Transformers* model on Intel GPU as follows.
.. code-block:: python

   #load Hugging Face Transformers model with INT4 optimizations
   from bigdl.llm.transformers import AutoModelForCausalLM
   model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)

   #run the optimized model on Intel GPU
   model = model.to('xpu')

   from transformers import AutoTokenizer
   tokenizer = AutoTokenizer.from_pretrained(model_path)
   input_ids = tokenizer.encode(input_str, ...).to('xpu')
   output_ids = model.generate(input_ids, ...)
   output = tokenizer.batch_decode(output_ids.cpu())
**For more details, please refer to the bigdl-llm** `Document <doc/LLM/index.html>`_, `Readme <https://github.com/intel-analytics/BigDL/tree/main/python/llm>`_, `Tutorial <https://github.com/intel-analytics/bigdl-llm-tutorial>`_ and `API Doc <doc/PythonAPI/LLM/index.html>`_.
------

************************************************
Overview of the complete BigDL project
************************************************
`BigDL <https://github.com/intel-analytics/bigdl>`_ seamlessly scales your data analytics & AI applications from laptop to cloud, with the following libraries:

- `LLM <https://github.com/intel-analytics/BigDL/tree/main/python/llm>`_: Low-bit (INT3/INT4/INT5/INT8) large language model library for Intel CPU/GPU
------

************************************************
Choosing the right BigDL library
************************************************

.. graphviz::