From 448a9e813ae37e831d0c606c5fd927ee194c4849 Mon Sep 17 00:00:00 2001 From: Jason Dai Date: Tue, 12 Sep 2023 17:27:26 +0800 Subject: [PATCH] Update Readme (#8959) --- README.md | 67 +++++++++++++++++++++------ docs/readthedocs/source/index.rst | 76 +++++++++++++++++++++++-------- 2 files changed, 110 insertions(+), 33 deletions(-) diff --git a/README.md b/README.md index 0e39d113..ae2935a2 100644 --- a/README.md +++ b/README.md @@ -7,7 +7,8 @@ --- ## BigDL-LLM -**[`bigdl-llm`](python/llm)** is a library for running ***LLM*** (large language model) on your Intel ***laptop*** or ***GPU*** using INT4 with very low latency[^1] (for any **PyTorch** model). +**[`bigdl-llm`](python/llm)** is a library for running **LLM** (large language model) on Intel **XPU** (from *Laptop* to *GPU* to *Cloud*) using **INT4** with very low latency[^1] (for any **PyTorch** model). + > *It is built on top of the excellent work of [llama.cpp](https://github.com/ggerganov/llama.cpp), [gptq](https://github.com/IST-DASLab/gptq), [ggml](https://github.com/ggerganov/ggml), [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), [bitsandbytes](https://github.com/TimDettmers/bitsandbytes), [qlora](https://github.com/artidoro/qlora), [gptq_for_llama](https://github.com/qwopqwop200/GPTQ-for-LLaMa), [chatglm.cpp](https://github.com/li-plus/chatglm.cpp), [redpajama.cpp](https://github.com/togethercomputer/redpajama.cpp), [gptneox.cpp](https://github.com/byroneverson/gptneox.cpp), [bloomz.cpp](https://github.com/NouamaneTazi/bloomz.cpp/), etc.* ### Latest update @@ -25,15 +26,19 @@ See the ***optimized performance*** of `chatglm2-6b`, `llama-2-13b-chat`, and `s ### `bigdl-llm` quickstart -#### Install -You may install **`bigdl-llm`** as follows: +- [CPU INT4](#cpu-int4) +- [GPU INT4](#gpu-int4) +- [More Low-Bit Support](#more-low-bit-support) + +#### CPU INT4 +##### Install +You may install **`bigdl-llm`** on Intel CPU as follows: ```bash pip install --pre --upgrade bigdl-llm[all] ``` > Note: `bigdl-llm` has been tested on Python 3.9 -#### Run Model - +##### Run Model You may apply INT4 optimizations to any Hugging Face *Transformers* models as follows. ```python @@ -41,7 +46,7 @@ You may apply INT4 optimizations to any Hugging Face *Transformers* models as fo from bigdl.llm.transformers import AutoModelForCausalLM model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True) -#run the optimized model +#run the optimized model on CPU from transformers import AutoTokenizer tokenizer = AutoTokenizer.from_pretrained(model_path) input_ids = tokenizer.encode(input_str, ...) 
@@ -50,23 +55,55 @@ output = tokenizer.batch_decode(output_ids)
```
*See the complete examples [here](python/llm/example/transformers/transformers_int4/).*

->**Note**: You may apply more low bit optimizations (including INT8, INT5 and INT4) as follows:
->```python
->model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int5")
->```
->*See the complete example [here](python/llm/example/transformers/transformers_low_bit/).*
-
+#### GPU INT4
+##### Install
+You may install **`bigdl-llm`** on Intel GPU as follows:
+```bash
+# the command below installs intel_extension_for_pytorch==2.0.110+xpu by default
+# you may install a specific ipex/torch version to suit your needs
+pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+```
+> Note: `bigdl-llm` has been tested on Python 3.9

-After the model is optimizaed using INT4 (or INT8/INT5), you may also save and load the optimized model as follows:
+##### Run Model
+You may apply INT4 optimizations to any Hugging Face *Transformers* models on Intel GPU as follows.
```python
-model.save_low_bit(model_path)
+#load Hugging Face Transformers model with INT4 optimizations
+from bigdl.llm.transformers import AutoModelForCausalLM
+model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)
+
+#run the optimized model on Intel GPU
+model = model.to('xpu')
+
+from transformers import AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained(model_path)
+input_ids = tokenizer.encode(input_str, ...).to('xpu')
+output_ids = model.generate(input_ids, ...)
+output = tokenizer.batch_decode(output_ids.cpu())
+```
+*See the complete examples [here](python/llm/example/transformers/transformers_int4/).*
+
+#### More Low-Bit Support
+##### Save and load
+
+After the model is optimized using `bigdl-llm`, you may save and load the model as follows:
+```python
+model.save_low_bit(model_path)
new_model = AutoModelForCausalLM.load_low_bit(model_path)
```
*See the complete example [here](python/llm/example/transformers/transformers_low_bit/).*

-***For more details, please refer to the `bigdl-llm` [Readme](python/llm), [Tutorial](https://github.com/intel-analytics/bigdl-llm-tutorial) and [API Doc](https://bigdl.readthedocs.io/en/latest/doc/PythonAPI/LLM/index.html).***
+##### Additional data types
+
+In addition to INT4, you may apply other low-bit optimizations (such as *INT8*, *INT5*, *NF4*, etc.) as follows:
+```python
+model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int8")
+```
+*See the complete example [here](python/llm/example/transformers/transformers_low_bit/).*
+
+
+***For more details, please refer to the `bigdl-llm` [Document](https://test-bigdl-llm.readthedocs.io/en/main/doc/LLM/index.html), [Readme](python/llm), [Tutorial](https://github.com/intel-analytics/bigdl-llm-tutorial) and [API Doc](https://bigdl.readthedocs.io/en/latest/doc/PythonAPI/LLM/index.html).***

---
## Overview of the complete BigDL project
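The hunks above introduce the CPU INT4 flow piece by piece, with `...` placeholders for the prompt and generation arguments. For reference, here is a minimal end-to-end sketch of that flow; the checkpoint path, prompt, and generation settings are hypothetical stand-ins, not values taken from this patch:

```python
# Minimal end-to-end sketch of the CPU INT4 flow shown in the README hunks above.
# The model path, prompt, and generation settings are hypothetical placeholders.
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = '/path/to/model/'  # hypothetical local Hugging Face checkpoint

# load the Transformers model with INT4 optimizations applied
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# run the optimized model on CPU
input_ids = tokenizer.encode("What is AI?", return_tensors="pt")
output_ids = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])

# save the converted low-bit model once, then reload it later
# without repeating the INT4 conversion
model.save_low_bit('/path/to/saved-model/')  # hypothetical output directory
new_model = AutoModelForCausalLM.load_low_bit('/path/to/saved-model/')
```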
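A matching sketch of the GPU flow, under the same assumptions; it additionally assumes an XPU build of `intel_extension_for_pytorch` is present (installed via `bigdl-llm[xpu]` as above) so that the `'xpu'` device is available:

```python
# Minimal sketch of the Intel GPU (XPU) INT4 flow shown above;
# the model path and prompt are the same hypothetical placeholders.
import intel_extension_for_pytorch as ipex  # side-effect import: registers the 'xpu' device
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = '/path/to/model/'  # hypothetical local Hugging Face checkpoint

model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
model = model.to('xpu')  # move the INT4-optimized model to the Intel GPU

tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer.encode("What is AI?", return_tensors="pt").to('xpu')
output_ids = model.generate(input_ids, max_new_tokens=32)
# move outputs back to CPU before decoding
print(tokenizer.batch_decode(output_ids.cpu(), skip_special_tokens=True)[0])
```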
diff --git a/docs/readthedocs/source/index.rst b/docs/readthedocs/source/index.rst
index 3fb1f454..3d7ae561 100644
--- a/docs/readthedocs/source/index.rst
+++ b/docs/readthedocs/source/index.rst
@@ -1,37 +1,37 @@
.. meta::
   :google-site-verification: S66K6GAclKw1RroxU0Rka_2d1LZFVe27M0gRneEsIVI

-=================================================
+################################################
The BigDL Project
-=================================================
+################################################

------

----------------------------------
+************************************************
BigDL-LLM: low-Bit LLM library
----------------------------------
+************************************************

.. raw:: html

- bigdl-llm is a library for running LLM (large language model) on your Intel laptop or GPU using INT4 with very low latency [1] (for any PyTorch model). + bigdl-llm is a library for running LLM (large language model) on Intel XPU (from Laptop to GPU to Cloud) using INT4 with very low latency [1] (for any PyTorch model).

.. note:: It is built on top of the excellent work of `llama.cpp <https://github.com/ggerganov/llama.cpp>`_, `gptq <https://github.com/IST-DASLab/gptq>`_, `bitsandbytes <https://github.com/TimDettmers/bitsandbytes>`_, `qlora <https://github.com/artidoro/qlora>`_, etc.

-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+============================================
Latest update
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+============================================
- ``bigdl-llm`` now supports Intel Arc and Flex GPU; see the latest GPU examples `here `_.
- ``bigdl-llm`` tutorial is released `here <https://github.com/intel-analytics/bigdl-llm-tutorial>`_.
- Over 20 models have been verified on ``bigdl-llm``, including *LLaMA/LLaMA2, ChatGLM/ChatGLM2, MPT, Falcon, Dolly-v1/Dolly-v2, StarCoder, Whisper, QWen, Baichuan,* and more; see the complete list `here `_.

-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+============================================
``bigdl-llm`` demos
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+============================================

See the **optimized performance** of ``chatglm2-6b``, ``llama-2-13b-chat``, and ``starcoder-15.5b`` models on a 12th Gen Intel Core CPU below.
@@ -42,11 +42,18 @@ See the **optimized performance** of ``chatglm2-6b``, ``llama-2-13b-chat``, and

-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+============================================
``bigdl-llm`` quickstart
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+============================================

-You may install ``bigdl-llm`` as follows:
+- `CPU <#cpu-quickstart>`_
+- `GPU <#gpu-quickstart>`_
+
+--------------------------------------------
+CPU Quickstart
+--------------------------------------------
+
+You may install ``bigdl-llm`` on Intel CPU as follows:

.. code-block:: console

@@ -64,20 +71,53 @@ You can then apply INT4 optimizations to any Hugging Face *Transformers* models

   from bigdl.llm.transformers import AutoModelForCausalLM
   model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)

-  #run the optimized model
+  #run the optimized model on Intel CPU
   from transformers import AutoTokenizer
   tokenizer = AutoTokenizer.from_pretrained(model_path)
   input_ids = tokenizer.encode(input_str, ...)
   output_ids = model.generate(input_ids, ...)
   output = tokenizer.batch_decode(output_ids)

-**For more details, please refer to the bigdl-llm** `Readme `_, `Tutorial <https://github.com/intel-analytics/bigdl-llm-tutorial>`_ and `API Doc <https://bigdl.readthedocs.io/en/latest/doc/PythonAPI/LLM/index.html>`_.
+--------------------------------------------
+GPU Quickstart
+--------------------------------------------
+
+You may install ``bigdl-llm`` on Intel GPU as follows:
+
+.. code-block:: console
+
+   # the command below installs intel_extension_for_pytorch==2.0.110+xpu by default
+   # you may install a specific ipex/torch version to suit your needs
+   pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+
+.. note::
+
+   ``bigdl-llm`` has been tested on Python 3.9.
+
+You can then apply INT4 optimizations to any Hugging Face *Transformers* models on Intel GPU as follows.
+
+.. code-block:: python
+
+   #load Hugging Face Transformers model with INT4 optimizations
+   from bigdl.llm.transformers import AutoModelForCausalLM
+   model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)
+
+   #run the optimized model on Intel GPU
+   model = model.to('xpu')
+
+   from transformers import AutoTokenizer
+   tokenizer = AutoTokenizer.from_pretrained(model_path)
+   input_ids = tokenizer.encode(input_str, ...).to('xpu')
+   output_ids = model.generate(input_ids, ...)
+   output = tokenizer.batch_decode(output_ids.cpu())
+
+**For more details, please refer to the bigdl-llm** `Document <https://test-bigdl-llm.readthedocs.io/en/main/doc/LLM/index.html>`_, `Readme `_, `Tutorial <https://github.com/intel-analytics/bigdl-llm-tutorial>`_ and `API Doc <https://bigdl.readthedocs.io/en/latest/doc/PythonAPI/LLM/index.html>`_.

------

----------------------------------
+************************************************
Overview of the complete BigDL project
----------------------------------
+************************************************
`BigDL `_ seamlessly scales your data analytics & AI applications from laptop to cloud, with the following libraries:

- `LLM `_: Low-bit (INT3/INT4/INT5/INT8) large language model library for Intel CPU/GPU
@@ -90,9 +130,9 @@ Overview of the complete BigDL project

------

----------------------------------
+************************************************
Choosing the right BigDL library
----------------------------------
+************************************************

.. graphviz::