From bda404fc8f9302423e870cdbb19015dfcbc4df15 Mon Sep 17 00:00:00 2001
From: Jason Dai
Date: Thu, 30 Nov 2023 22:45:52 +0800
Subject: [PATCH] Update readme (#9575)

---
 README.md                                        | 16 ++++++++++------
 docs/readthedocs/source/index.rst                | 13 ++++++++-----
 .../llm/example/CPU/GGUF-Models/llama2/README.md | 11 ++++++-----
 3 files changed, 24 insertions(+), 16 deletions(-)

diff --git a/README.md b/README.md
index 4fb996de..10bcbade 100644
--- a/README.md
+++ b/README.md
@@ -7,15 +7,19 @@
 ---
 ## BigDL-LLM
-**[`bigdl-llm`](python/llm)** is a library for running **LLM** (large language model) on Intel **XPU** (from *Laptop* to *GPU* to *Cloud*) using **INT4** with very low latency[^1] (for any **PyTorch** model).
+**[`bigdl-llm`](python/llm)** is a library for running **LLM** (large language model) on Intel **XPU** (from *Laptop* to *GPU* to *Cloud*) using **INT4/FP4/INT8/FP8** with very low latency[^1] (for any **PyTorch** model).
 
-> *It is built on top of the excellent work of [llama.cpp](https://github.com/ggerganov/llama.cpp), [ggml](https://github.com/ggerganov/ggml), [gptq](https://github.com/IST-DASLab/gptq), [bitsandbytes](https://github.com/TimDettmers/bitsandbytes), [qlora](https://github.com/artidoro/qlora), [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), [gptq_for_llama](https://github.com/qwopqwop200/GPTQ-for-LLaMa), [chatglm.cpp](https://github.com/li-plus/chatglm.cpp), [redpajama.cpp](https://github.com/togethercomputer/redpajama.cpp), [gptneox.cpp](https://github.com/byroneverson/gptneox.cpp), [bloomz.cpp](https://github.com/NouamaneTazi/bloomz.cpp/), etc.*
+> *It is built on the excellent work of [llama.cpp](https://github.com/ggerganov/llama.cpp), [bitsandbytes](https://github.com/TimDettmers/bitsandbytes), [qlora](https://github.com/artidoro/qlora), [gptq](https://github.com/IST-DASLab/gptq), [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), [awq](https://github.com/mit-han-lab/llm-awq), [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), [vLLM](https://github.com/vllm-project/vllm), [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), [gptq_for_llama](https://github.com/qwopqwop200/GPTQ-for-LLaMa), [chatglm.cpp](https://github.com/li-plus/chatglm.cpp), [redpajama.cpp](https://github.com/togethercomputer/redpajama.cpp), [gptneox.cpp](https://github.com/byroneverson/gptneox.cpp), [bloomz.cpp](https://github.com/NouamaneTazi/bloomz.cpp/), etc.*
 
 ### Latest update
-- **[New]** `bigdl-llm` now supports QLoRA fintuning on Intel GPU; see the the example [here](python/llm/example/GPU/QLoRA-FineTuning).
-- `bigdl-llm` now supports Intel GPU (including Arc, Flex and MAX); see the the latest GPU examples [here](python/llm/example/GPU).
-- `bigdl-llm` tutorial is released [here](https://github.com/intel-analytics/bigdl-llm-tutorial).
-- Over 30 models have been optimized/verified on `bigdl-llm`, including *LLaMA/LLaMA2, ChatGLM2/ChatGLM3, Mistral, Falcon, MPT, LLaVA, WizardCoder, Dolly, Whisper, Baichuan/Baichuan2, InternLM, Skywork, QWen/Qwen-VL, Aquila, MOSS,* and more; see the complete list [here](#verified-models).
+- [2023/11] Initial support for directly loading [GGUF](python/llm/example/CPU/GGUF-Models/llama2), [AWQ](python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/AWQ) and [GPTQ](python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GPTQ) models into `bigdl-llm` is available.
+- [2023/11] Initial support for [vLLM continuous batching](python/llm/example/CPU/vLLM-Serving) is available on Intel ***CPU***.
+- [2023/11] Initial support for [vLLM continuous batching](python/llm/example/GPU/vLLM-Serving) is available on Intel ***GPU***.
+- [2023/10] [QLoRA finetuning](python/llm/example/CPU/QLoRA-FineTuning) on Intel ***CPU*** is available.
+- [2023/10] [QLoRA finetuning](python/llm/example/GPU/QLoRA-FineTuning) on Intel ***GPU*** is available.
+- [2023/09] `bigdl-llm` now supports [Intel GPU](python/llm/example/GPU) (including Arc, Flex and MAX).
+- [2023/09] `bigdl-llm` [tutorial](https://github.com/intel-analytics/bigdl-llm-tutorial) is released.
+- [2023/09] Over 30 models have been optimized/verified on `bigdl-llm`, including *LLaMA/LLaMA2, ChatGLM2/ChatGLM3, Mistral, Falcon, MPT, LLaVA, WizardCoder, Dolly, Whisper, Baichuan/Baichuan2, InternLM, Skywork, QWen/Qwen-VL, Aquila, MOSS,* and more; see the complete list [here](#verified-models).
 
 ### `bigdl-llm` Demos
 See the ***optimized performance*** of `chatglm2-6b` and `llama-2-13b-chat` models on 12th Gen Intel Core CPU and Intel Arc GPU below.
 
diff --git a/docs/readthedocs/source/index.rst b/docs/readthedocs/source/index.rst
index c78a3398..560905fe 100644
--- a/docs/readthedocs/source/index.rst
+++ b/docs/readthedocs/source/index.rst
@@ -14,7 +14,7 @@ BigDL-LLM: low-Bit LLM library
 .. raw:: html

- bigdl-llm is a library for running LLM (large language model) on Intel XPU (from Laptop to GPU to Cloud) using INT4 with very low latency [1] (for any PyTorch model). + bigdl-llm is a library for running LLM (large language model) on Intel XPU (from Laptop to GPU to Cloud) using INT4/FP4/INT8/FP8 with very low latency [1] (for any PyTorch model).

 .. note::

@@ -24,12 +24,15 @@ BigDL-LLM: low-Bit LLM library
 ============================================
 Latest update
 ============================================
-- **[New]** ``bigdl-llm`` now supports QLoRA fintuning on Intel GPU; see the the example `here <https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU/QLoRA-FineTuning>`_.
-- ``bigdl-llm`` now supports Intel GPU (including Arc, Flex and MAX); see the the latest GPU examples `here <https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU>`_.
-- ``bigdl-llm`` tutorial is released `here <https://github.com/intel-analytics/bigdl-llm-tutorial>`_.
+- [2023/11] Initial support for directly loading `GGUF <https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/CPU/GGUF-Models/llama2>`_, `AWQ <https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/AWQ>`_ and `GPTQ <https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GPTQ>`_ models into ``bigdl-llm`` is available.
+- [2023/11] Initial support for `vLLM continuous batching <https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/CPU/vLLM-Serving>`_ is available on Intel **CPU**.
+- [2023/11] Initial support for `vLLM continuous batching <https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU/vLLM-Serving>`_ is available on Intel **GPU**.
+- [2023/10] `QLoRA finetuning <https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/CPU/QLoRA-FineTuning>`_ on Intel **CPU** is available.
+- [2023/10] `QLoRA finetuning <https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU/QLoRA-FineTuning>`_ on Intel **GPU** is available.
+- [2023/09] ``bigdl-llm`` now supports `Intel GPU <https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU>`_ (including Arc, Flex and MAX).
+- [2023/09] ``bigdl-llm`` `tutorial <https://github.com/intel-analytics/bigdl-llm-tutorial>`_ is released.
 - Over 30 models have been verified on ``bigdl-llm``, including *LLaMA/LLaMA2, ChatGLM2/ChatGLM3, Mistral, Falcon, MPT, LLaVA, WizardCoder, Dolly, Whisper, Baichuan/Baichuan2, InternLM, Skywork, QWen/Qwen-VL, Aquila, MOSS* and more; see the complete list `here `_.
-
 ============================================
 ``bigdl-llm`` demos
 ============================================

diff --git a/python/llm/example/CPU/GGUF-Models/llama2/README.md b/python/llm/example/CPU/GGUF-Models/llama2/README.md
index ea0f5051..962389ba 100644
--- a/python/llm/example/CPU/GGUF-Models/llama2/README.md
+++ b/python/llm/example/CPU/GGUF-Models/llama2/README.md
@@ -1,11 +1,12 @@
-# Llama2
-In this directory, you will find examples on how you could load gguf Llama2 model and convert it to bigdl-llm model. For illustration purposes, we utilize the [llama-2-7b-chat.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/tree/main) and [llama-2-7b-chat.Q4_1.gguf](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/tree/main) as reference Llama2 gguf models.
+# Loading GGUF models
+In this directory, you will find examples on how to load a GGUF model into `bigdl-llm`. For illustration purposes, we utilize the [llama-2-7b-chat.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/tree/main) and [llama-2-7b-chat.Q4_1.gguf](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/tree/main) as reference LLaMA2 GGUF models.
+>Note: Only LLaMA2 family models are currently supported.
 ## Requirements
 To run these examples with BigDL-LLM, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information.
 
 ## Example: Load gguf model using `from_gguf()` API
-In the example [generate.py](./generate.py), we show a basic use case to load a gguf Llama2 model and convert it to a bigdl-llm model using `from_gguf()` API, with BigDL-LLM optimizations.
+In the example [generate.py](./generate.py), we show a basic use case to load a GGUF LLaMA2 model into `bigdl-llm` using the `from_gguf()` API, with BigDL-LLM optimizations.
 ### 1. Install
 We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
@@ -45,7 +46,7 @@ More information about arguments can be found in [Arguments Info](#23-arguments-
 #### 2.3 Arguments Info
 In the example, several arguments can be passed to satisfy your requirements:
-- `--model`: path to gguf model, it should be a file with name like `llama-2-7b-chat.Q4_0.gguf`
+- `--model`: path to the GGUF model; it should be a file with a name like `llama-2-7b-chat.Q4_0.gguf`
 - `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'What is AI?'`.
 - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`.
 
@@ -72,4 +73,4 @@ What is AI?
 ### RESPONSE:
 Artificial intelligence (AI) is the field of study focused on creating machines that can perform tasks that typically require human intelligence, such as understanding language,
-```
\ No newline at end of file
+```
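---

For readers following the llama2 README changes above, the sketch below illustrates the `from_gguf()` loading flow that the patched README describes. It is not the shipped [generate.py](python/llm/example/CPU/GGUF-Models/llama2/generate.py); the `bigdl.llm.transformers` import path and the assumption that `from_gguf()` returns both the converted model and a matching tokenizer are inferred from the README text, not confirmed by this patch.

```python
# Minimal sketch: load a LLaMA2-family GGUF file into bigdl-llm.
# Assumptions (not confirmed by this patch): the import path below, and that
# from_gguf() returns a (model, tokenizer) pair for the converted low-bit model.
from bigdl.llm.transformers import AutoModelForCausalLM

# Convert the GGUF weights into an optimized bigdl-llm model plus tokenizer.
model, tokenizer = AutoModelForCausalLM.from_gguf("llama-2-7b-chat.Q4_0.gguf")

# Generate a short completion, mirroring the README's default arguments
# (prompt 'What is AI?', 32 predicted tokens).
input_ids = tokenizer.encode("What is AI?", return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Run it from the directory containing the downloaded `llama-2-7b-chat.Q4_0.gguf` file, or pass an absolute path, matching the `--model` argument described in the README.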