diff --git a/docs/readthedocs/source/_toc.yml b/docs/readthedocs/source/_toc.yml index 4e25302f..507aa869 100644 --- a/docs/readthedocs/source/_toc.yml +++ b/docs/readthedocs/source/_toc.yml @@ -59,5 +59,3 @@ subtrees: - file: doc/LLM/Overview/FAQ/faq title: "FAQ" - - entries: - - file: doc/Application/blogs diff --git a/docs/readthedocs/source/doc/LLM/Quickstart/bigdl_llm_migration.md b/docs/readthedocs/source/doc/LLM/Quickstart/bigdl_llm_migration.md index afc79586..5ccdd457 100644 --- a/docs/readthedocs/source/doc/LLM/Quickstart/bigdl_llm_migration.md +++ b/docs/readthedocs/source/doc/LLM/Quickstart/bigdl_llm_migration.md @@ -20,12 +20,18 @@ pip install --pre --upgrade ipex-llm[all] # for cpu ### For GPU ```eval_rst .. tabs:: + .. tab:: US + .. code-block:: cmd + pip uninstall -y bigdl-llm pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ + .. tab:: CN + .. code-block:: cmd + pip uninstall -y bigdl-llm pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/ ``` diff --git a/docs/readthedocs/source/index.rst b/docs/readthedocs/source/index.rst index f664973a..a9e90c9a 100644 --- a/docs/readthedocs/source/index.rst +++ b/docs/readthedocs/source/index.rst @@ -1,45 +1,74 @@ .. meta:: :google-site-verification: S66K6GAclKw1RroxU0Rka_2d1LZFVe27M0gRneEsIVI +.. important:: + + .. raw:: html + +

+ + bigdl-llm has now become ipex-llm (see the migration guide here); you may find the original BigDL project here. + +
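For readers coming from ``bigdl-llm``, the user-visible change beyond the package rename described above is the import namespace; a minimal sketch is shown below (the model path is a placeholder, and the commented-out ``bigdl.llm`` import reflects the pre-migration layout).

.. code-block:: python

   # Before the migration, the optimized model classes were imported from bigdl.llm:
   # from bigdl.llm.transformers import AutoModelForCausalLM

   # After migrating to ipex-llm, only the import namespace changes:
   from ipex_llm.transformers import AutoModelForCausalLM

   # Usage stays the same, e.g. loading a Hugging Face model with INT4 optimizations
   model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)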

+ +------ + ################################################ -IPEX-LLM +💫 IPEX-LLM ################################################ .. raw:: html

- ipex-llm is a library for running LLM (large language model) on Intel XPU (from Laptop to GPU to Cloud) using INT4/FP4/INT8/FP8 with very low latency [1] (for any PyTorch model). + IPEX-LLM is a PyTorch library for running LLM on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max) with very low latency [1].
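To make the description above concrete, here is a minimal usage sketch adapted from the quickstart code that appears (as removed lines) further down in this diff; the model path, prompt, and generation arguments are placeholders.

.. code-block:: python

   # Load a Hugging Face Transformers model with ipex-llm INT4 optimizations
   from ipex_llm.transformers import AutoModelForCausalLM
   from transformers import AutoTokenizer

   model_path = '/path/to/model/'   # placeholder: any supported Hugging Face checkpoint
   model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
   tokenizer = AutoTokenizer.from_pretrained(model_path)

   # On Intel GPU, move the model (and the inputs below) to the 'xpu' device first:
   # model = model.to('xpu')

   # Run inference with the optimized model
   input_ids = tokenizer.encode("What is IPEX-LLM?", return_tensors="pt")
   output_ids = model.generate(input_ids, max_new_tokens=32)
   print(tokenizer.batch_decode(output_ids))  # on GPU, call output_ids.cpu() before decoding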

.. note:: - It is built on top of the excellent work of `llama.cpp `_, `gptq `_, `bitsandbytes `_, `qlora `_, etc. + .. raw:: html + +

+

+

************************************************ Latest update 🔥 ************************************************ -- [2024/03] **LangChain** added support for ``ipex-llm``; see the details `here `_. -- [2024/02] ``ipex-llm`` now supports directly loading model from `ModelScope `_ (`魔搭 `_). -- [2024/02] ``ipex-llm`` added inital **INT2** support (based on llama.cpp `IQ2 `_ mechanism), which makes it possible to run large-size LLM (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM. -- [2024/02] Users can now use ``ipex-llm`` through `Text-Generation-WebUI `_ GUI. -- [2024/02] ``ipex-llm`` now supports `Self-Speculative Decoding `_, which in practice brings **~30% speedup** for FP16 and BF16 inference latency on Intel `GPU `_ and `CPU `_ respectively. -- [2024/02] ``ipex-llm`` now supports a comprehensive list of LLM finetuning on Intel GPU (including `LoRA `_, `QLoRA `_, `DPO `_, `QA-LoRA `_ and `ReLoRA `_). -- [2024/01] Using ``ipex-llm`` `QLoRA `_, we managed to finetune LLaMA2-7B in **21 minutes** and LLaMA2-70B in **3.14 hours** on 8 Intel Max 1550 GPU for `Standford-Alpaca `_ (see the blog `here `_). -- [2024/01] 🔔🔔🔔 **The default** ``ipex-llm`` **GPU Linux installation has switched from PyTorch 2.0 to PyTorch 2.1, which requires new oneAPI and GPU driver versions. (See the** `GPU installation guide `_ **for more details.)** -- [2023/12] ``ipex-llm`` now supports `ReLoRA `_ (see `"ReLoRA: High-Rank Training Through Low-Rank Updates" `_). -- [2023/12] ``ipex-llm`` now supports `Mixtral-8x7B `_ on both Intel `GPU `_ and `CPU `_. -- [2023/12] ``ipex-llm`` now supports `QA-LoRA `_ (see `"QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models" `_). -- [2023/12] ``ipex-llm`` now supports `FP8 and FP4 inference `_ on Intel **GPU**. -- [2023/11] Initial support for directly loading `GGUF `_, `AWQ `_ and `GPTQ `_ models in to ``ipex-llm`` is available. -- [2023/11] ``ipex-llm`` now supports `vLLM continuous batching `_ on both Intel `GPU `_ and `CPU `_. -- [2023/10] ``ipex-llm`` now supports `QLoRA finetuning `_ on both Intel `GPU `_ and `CPU `_. -- [2023/10] ``ipex-llm`` now supports `FastChat serving `_ on on both Intel CPU and GPU. -- [2023/09] ``ipex-llm`` now supports `Intel GPU `_ (including Arc, Flex and MAX) -- [2023/09] ``ipex-llm`` `tutorial `_ is released. -- Over 30 models have been verified on ``ipex-llm``, including *LLaMA/LLaMA2, ChatGLM2/ChatGLM3, Mistral, Falcon, MPT, LLaVA, WizardCoder, Dolly, Whisper, Baichuan/Baichuan2, InternLM, Skywork, QWen/Qwen-VL, Aquila, MOSS* and more; see the complete list `here `_. +* [2024/03] ``bigdl-llm`` has now become ``ipex-llm`` (see the migration guide `here `_); you may find the original ``BigDL`` project `here `_. +* [2024/02] ``ipex-llm`` now supports directly loading model from `ModelScope `_ (`魔搭 `_). +* [2024/02] ``ipex-llm`` added initial **INT2** support (based on llama.cpp `IQ2 `_ mechanism), which makes it possible to run large-size LLM (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM. +* [2024/02] Users can now use ``ipex-llm`` through `Text-Generation-WebUI `_ GUI. +* [2024/02] ``ipex-llm`` now supports `Self-Speculative Decoding `_, which in practice brings **~30% speedup** for FP16 and BF16 inference latency on Intel `GPU `_ and `CPU `_ respectively. +* [2024/02] ``ipex-llm`` now supports a comprehensive list of LLM finetuning on Intel GPU (including `LoRA `_, `QLoRA `_, `DPO `_, `QA-LoRA `_ and `ReLoRA `_).
+* [2024/01] Using ``ipex-llm`` `QLoRA `_, we managed to finetune LLaMA2-7B in **21 minutes** and LLaMA2-70B in **3.14 hours** on 8 Intel Max 1550 GPU for `Stanford-Alpaca `_ (see the blog `here `_). + + .. dropdown:: More updates + :color: primary + + * [2023/12] ``ipex-llm`` now supports `ReLoRA `_ (see `"ReLoRA: High-Rank Training Through Low-Rank Updates" `_). + * [2023/12] ``ipex-llm`` now supports `Mixtral-8x7B `_ on both Intel `GPU `_ and `CPU `_. + * [2023/12] ``ipex-llm`` now supports `QA-LoRA `_ (see `"QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models" `_). + * [2023/12] ``ipex-llm`` now supports `FP8 and FP4 inference `_ on Intel **GPU**. + * [2023/11] Initial support for directly loading `GGUF `_, `AWQ `_ and `GPTQ `_ models into ``ipex-llm`` is available. + * [2023/11] ``ipex-llm`` now supports `vLLM continuous batching `_ on both Intel `GPU `_ and `CPU `_. + * [2023/10] ``ipex-llm`` now supports `QLoRA finetuning `_ on both Intel `GPU `_ and `CPU `_. + * [2023/10] ``ipex-llm`` now supports `FastChat serving `_ on both Intel CPU and GPU. + * [2023/09] ``ipex-llm`` now supports `Intel GPU `_ (including iGPU, Arc, Flex and MAX). + * [2023/09] ``ipex-llm`` `tutorial `_ is released. ************************************************ -``ipex-llm`` demos +``ipex-llm`` Demos ************************************************ See the **optimized performance** of ``chatglm2-6b`` and ``llama-2-13b-chat`` models on 12th Gen Intel Core CPU and Intel Arc GPU below. @@ -74,82 +103,85 @@ See the **optimized performance** of ``chatglm2-6b`` and ``llama-2-13b-chat`` mo ************************************************ -``ipex-llm`` quickstart +``ipex-llm`` Quickstart ************************************************ -- `Windows GPU installation `_ -- `Run IPEX-LLM in Text-Generation-WebUI `_ -- `Run IPEX-LLM using Docker `_ -- `CPU quickstart <#cpu-quickstart>`_ -- `GPU quickstart <#gpu-quickstart>`_ +* `Windows GPU `_: installing ``ipex-llm`` on Windows with Intel GPU +* `Linux GPU `_: installing ``ipex-llm`` on Linux with Intel GPU +* `Docker `_: using ``ipex-llm`` Docker images on Intel CPU and GPU + +.. seealso:: + + For more details, please refer to the `installation guide `_ ============================================ -CPU Quickstart +Run ``ipex-llm`` ============================================ -You may install ``ipex-llm`` on Intel CPU as follows as follows: - -.. note:: - - See the `CPU installation guide `_ for more details. - -.. code-block:: console - - pip install --pre --upgrade ipex-llm[all] - -.. note:: - - ``ipex-llm`` has been tested on Python 3.9, 3.10 and 3.11 - -You can then apply INT4 optimizations to any Hugging Face *Transformers* models as follows. - -.. code-block:: python - - #load Hugging Face Transformers model with INT4 optimizations - from ipex_llm.transformers import AutoModelForCausalLM - model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True) - - #run the optimized model on Intel CPU - from transformers import AutoTokenizer - tokenizer = AutoTokenizer.from_pretrained(model_path) - input_ids = tokenizer.encode(input_str, ...) - output_ids = model.generate(input_ids, ...)
- output = tokenizer.batch_decode(output_ids) +* `llama.cpp `_: running **ipex-llm for llama.cpp** (*using C++ interface of* ``ipex-llm`` *as an accelerated backend for* ``llama.cpp`` *on Intel GPU*) +* `vLLM `_: running ``ipex-llm`` in ``vLLM`` on both Intel `GPU `_ and `CPU `_ +* `FastChat `_: running ``ipex-llm`` in ``FastChat`` serving on both Intel GPU and CPU +* `LangChain-Chatchat RAG `_: running ``ipex-llm`` in ``LangChain-Chatchat`` (*Knowledge Base QA using* **RAG** *pipeline*) +* `Text-Generation-WebUI `_: running ``ipex-llm`` in ``oobabooga`` **WebUI** +* `Benchmarking `_: running (latency and throughput) benchmarks for ``ipex-llm`` on Intel CPU and GPU ============================================ -GPU Quickstart +Code Examples ============================================ +* Low bit inference -You may install ``ipex-llm`` on Intel GPU as follows as follows: + * `INT4 inference `_: **INT4** LLM inference on Intel `GPU `_ and `CPU `_ + * `FP8/FP4 inference `_: **FP8** and **FP4** LLM inference on Intel `GPU `_ + * `INT8 inference `_: **INT8** LLM inference on Intel `GPU `_ and `CPU `_ + * `INT2 inference `_: **INT2** LLM inference (based on llama.cpp IQ2 mechanism) on Intel `GPU `_ -.. note:: +* FP16/BF16 inference - See the `GPU installation guide `_ for more details. + * **FP16** LLM inference on Intel `GPU `_, with possible `self-speculative decoding `_ optimization + * **BF16** LLM inference on Intel `CPU `_, with possible `self-speculative decoding `_ optimization -.. code-block:: console +* Save and load - # below command will install intel_extension_for_pytorch==2.1.10+xpu as default - pip install --pre --upgrade ipex-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu + * `Low-bit models `_: saving and loading ``ipex-llm`` low-bit models (a brief sketch appears at the end of this page) + * `GGUF `_: directly loading GGUF models into ``ipex-llm`` + * `AWQ `_: directly loading AWQ models into ``ipex-llm`` + * `GPTQ `_: directly loading GPTQ models into ``ipex-llm`` -.. note:: +* Finetuning - ``ipex-llm`` has been tested on Python 3.9, 3.10 and 3.11 + * LLM finetuning on Intel `GPU `_, including `LoRA `_, `QLoRA `_, `DPO `_, `QA-LoRA `_ and `ReLoRA `_ + * QLoRA finetuning on Intel `CPU `_ -You can then apply INT4 optimizations to any Hugging Face *Transformers* models on Intel GPU as follows. +* Integration with community libraries -.. code-block:: python + * `HuggingFace transformers `_ + * `Standard PyTorch model `_ + * `DeepSpeed-AutoTP `_ + * `HuggingFace PEFT `_ + * `HuggingFace TRL `_ + * `LangChain `_ + * `LlamaIndex `_ + * `AutoGen `_ + * `ModelScope `_ - #load Hugging Face Transformers model with INT4 optimizations - from ipex_llm.transformers import AutoModelForCausalLM - model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True) +* `Tutorials `_ - #run the optimized model on Intel GPU - model = model.to('xpu') - from transformers import AutoTokenizer - tokenizer = AutoTokenizer.from_pretrained(model_path) - input_ids = tokenizer.encode(input_str, ...).to('xpu') - output_ids = model.generate(input_ids, ...) - output = tokenizer.batch_decode(output_ids.cpu()) +.. seealso:: -**For more details, please refer to the ipex-llm** `Document `_, `Readme `_, `Tutorial `_ and `API Doc `_. + For more details, please refer to the |ipex_llm_document|_. + +.. |ipex_llm_document| replace:: ``ipex-llm`` document +.. _ipex_llm_document: doc/LLM/index.html + +------ + +.. raw:: html + +
+

+ [1] + Performance varies by use, configuration and other factors. ipex-llm may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex. + +

+
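As referenced in the "Save and load" list above, converted low-bit weights can be persisted once and reloaded without repeating the conversion. The sketch below assumes the ``save_low_bit``/``load_low_bit`` helpers used in the linked low-bit example and uses placeholder paths; see that example for the authoritative version.

.. code-block:: python

   from ipex_llm.transformers import AutoModelForCausalLM

   # Convert once: load the original checkpoint with INT4 optimizations applied ...
   model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)
   # ... then save the already-converted low-bit weights to disk
   model.save_low_bit('/path/to/low-bit-model/')

   # Later runs reload the low-bit weights directly, skipping the conversion step
   model = AutoModelForCausalLM.load_low_bit('/path/to/low-bit-model/')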