diff --git a/docs/readthedocs/source/_toc.yml b/docs/readthedocs/source/_toc.yml
index 4e25302f..507aa869 100644
--- a/docs/readthedocs/source/_toc.yml
+++ b/docs/readthedocs/source/_toc.yml
@@ -59,5 +59,3 @@ subtrees:
- file: doc/LLM/Overview/FAQ/faq
title: "FAQ"
- - entries:
- - file: doc/Application/blogs
diff --git a/docs/readthedocs/source/doc/LLM/Quickstart/bigdl_llm_migration.md b/docs/readthedocs/source/doc/LLM/Quickstart/bigdl_llm_migration.md
index afc79586..5ccdd457 100644
--- a/docs/readthedocs/source/doc/LLM/Quickstart/bigdl_llm_migration.md
+++ b/docs/readthedocs/source/doc/LLM/Quickstart/bigdl_llm_migration.md
@@ -20,12 +20,18 @@ pip install --pre --upgrade ipex-llm[all] # for cpu
### For GPU
```eval_rst
.. tabs::
+
.. tab:: US
+
.. code-block:: cmd
+
pip uninstall -y bigdl-llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+
.. tab:: CN
+
.. code-block:: cmd
+
pip uninstall -y bigdl-llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/
```
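+
+After reinstalling, the remaining change is typically the Python import path. The sketch below is a minimal illustration rather than a full migration: it assumes your code previously imported the transformers-style API from `bigdl.llm`, and the model path is a placeholder.
+
+```python
+# Before (bigdl-llm), shown for comparison only:
+#   from bigdl.llm.transformers import AutoModelForCausalLM
+
+# After (ipex-llm): load a Hugging Face Transformers model with INT4 optimizations
+from ipex_llm.transformers import AutoModelForCausalLM
+
+model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)  # placeholder path
+```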
diff --git a/docs/readthedocs/source/index.rst b/docs/readthedocs/source/index.rst
index f664973a..a9e90c9a 100644
--- a/docs/readthedocs/source/index.rst
+++ b/docs/readthedocs/source/index.rst
@@ -1,45 +1,74 @@
.. meta::
:google-site-verification: S66K6GAclKw1RroxU0Rka_2d1LZFVe27M0gRneEsIVI
+.. important::
+
+ .. raw:: html
+
+
+
+ bigdl-llm has now become ipex-llm (see the migration guide here); you may find the original BigDL project here.
+
+
+
+------
+
################################################
-IPEX-LLM
+💫 IPEX-LLM
################################################
.. raw:: html
- ipex-llm is a library for running LLM (large language model) on Intel XPU (from Laptop to GPU to Cloud) using INT4/FP4/INT8/FP8 with very low latency [1] (for any PyTorch model).
+ IPEX-LLM is a PyTorch library for running LLM on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max) with very low latency [1].
.. note::
- It is built on top of the excellent work of `llama.cpp `_, `gptq `_, `bitsandbytes `_, `qlora `_, etc.
+ .. raw:: html
+
+
+
+ -
+ It is built on top of Intel Extension for PyTorch (IPEX), as well as the excellent work of
llama.cpp, bitsandbytes, vLLM, qlora, AutoGPTQ, AutoAWQ, etc.
+
+ -
+ It provides seamless integration with llama.cpp, Text-Generation-WebUI, HuggingFace transformers, HuggingFace PEFT, LangChain, LlamaIndex, DeepSpeed-AutoTP, vLLM, FastChat, HuggingFace TRL, AutoGen, ModelScope, etc.
+
+ -
+ 50+ models have been optimized/verified on
ipex-llm (including LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM, Baichuan, Qwen, RWKV, and more); see the complete list here.
+
+
+
************************************************
Latest update 🔥
************************************************
-- [2024/03] **LangChain** added support for ``ipex-llm``; see the details `here `_.
-- [2024/02] ``ipex-llm`` now supports directly loading model from `ModelScope `_ (`魔搭 `_).
-- [2024/02] ``ipex-llm`` added inital **INT2** support (based on llama.cpp `IQ2 `_ mechanism), which makes it possible to run large-size LLM (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM.
-- [2024/02] Users can now use ``ipex-llm`` through `Text-Generation-WebUI `_ GUI.
-- [2024/02] ``ipex-llm`` now supports `Self-Speculative Decoding `_, which in practice brings **~30% speedup** for FP16 and BF16 inference latency on Intel `GPU `_ and `CPU `_ respectively.
-- [2024/02] ``ipex-llm`` now supports a comprehensive list of LLM finetuning on Intel GPU (including `LoRA `_, `QLoRA `_, `DPO `_, `QA-LoRA `_ and `ReLoRA `_).
-- [2024/01] Using ``ipex-llm`` `QLoRA `_, we managed to finetune LLaMA2-7B in **21 minutes** and LLaMA2-70B in **3.14 hours** on 8 Intel Max 1550 GPU for `Standford-Alpaca `_ (see the blog `here `_).
-- [2024/01] 🔔🔔🔔 **The default** ``ipex-llm`` **GPU Linux installation has switched from PyTorch 2.0 to PyTorch 2.1, which requires new oneAPI and GPU driver versions. (See the** `GPU installation guide `_ **for more details.)**
-- [2023/12] ``ipex-llm`` now supports `ReLoRA `_ (see `"ReLoRA: High-Rank Training Through Low-Rank Updates" `_).
-- [2023/12] ``ipex-llm`` now supports `Mixtral-8x7B `_ on both Intel `GPU `_ and `CPU `_.
-- [2023/12] ``ipex-llm`` now supports `QA-LoRA `_ (see `"QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models" `_).
-- [2023/12] ``ipex-llm`` now supports `FP8 and FP4 inference `_ on Intel **GPU**.
-- [2023/11] Initial support for directly loading `GGUF `_, `AWQ `_ and `GPTQ `_ models in to ``ipex-llm`` is available.
-- [2023/11] ``ipex-llm`` now supports `vLLM continuous batching `_ on both Intel `GPU `_ and `CPU `_.
-- [2023/10] ``ipex-llm`` now supports `QLoRA finetuning `_ on both Intel `GPU `_ and `CPU `_.
-- [2023/10] ``ipex-llm`` now supports `FastChat serving `_ on on both Intel CPU and GPU.
-- [2023/09] ``ipex-llm`` now supports `Intel GPU `_ (including Arc, Flex and MAX)
-- [2023/09] ``ipex-llm`` `tutorial `_ is released.
-- Over 30 models have been verified on ``ipex-llm``, including *LLaMA/LLaMA2, ChatGLM2/ChatGLM3, Mistral, Falcon, MPT, LLaVA, WizardCoder, Dolly, Whisper, Baichuan/Baichuan2, InternLM, Skywork, QWen/Qwen-VL, Aquila, MOSS* and more; see the complete list `here `_.
+* [2024/03] ``bigdl-llm`` has now become ``ipex-llm`` (see the migration guide `here `_); you may find the original ``BigDL`` project `here `_.
+* [2024/02] ``ipex-llm`` now supports directly loading model from `ModelScope `_ (`魔搭 `_).
+* [2024/02] ``ipex-llm`` added initial **INT2** support (based on llama.cpp `IQ2 `_ mechanism), which makes it possible to run large-size LLM (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM.
+* [2024/02] Users can now use ``ipex-llm`` through `Text-Generation-WebUI `_ GUI.
+* [2024/02] ``ipex-llm`` now supports `Self-Speculative Decoding `_, which in practice brings **~30% speedup** for FP16 and BF16 inference latency on Intel `GPU `_ and `CPU `_ respectively.
+* [2024/02] ``ipex-llm`` now supports a comprehensive list of LLM finetuning on Intel GPU (including `LoRA `_, `QLoRA `_, `DPO `_, `QA-LoRA `_ and `ReLoRA `_).
+* [2024/01] Using ``ipex-llm`` `QLoRA `_, we managed to finetune LLaMA2-7B in **21 minutes** and LLaMA2-70B in **3.14 hours** on 8 Intel Max 1550 GPUs for `Stanford-Alpaca `_ (see the blog `here `_).
+
+
+.. dropdown:: More updates
+ :color: primary
+
+ * [2023/12] ``ipex-llm`` now supports `ReLoRA `_ (see `"ReLoRA: High-Rank Training Through Low-Rank Updates" `_).
+ * [2023/12] ``ipex-llm`` now supports `Mixtral-8x7B `_ on both Intel `GPU `_ and `CPU `_.
+ * [2023/12] ``ipex-llm`` now supports `QA-LoRA `_ (see `"QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models" `_).
+ * [2023/12] ``ipex-llm`` now supports `FP8 and FP4 inference `_ on Intel **GPU**.
+ * [2023/11] Initial support for directly loading `GGUF `_, `AWQ `_ and `GPTQ `_ models into ``ipex-llm`` is available.
+ * [2023/11] ``ipex-llm`` now supports `vLLM continuous batching `_ on both Intel `GPU `_ and `CPU `_.
+ * [2023/10] ``ipex-llm`` now supports `QLoRA finetuning `_ on both Intel `GPU `_ and `CPU `_.
+ * [2023/10] ``ipex-llm`` now supports `FastChat serving `_ on both Intel CPU and GPU.
+ * [2023/09] ``ipex-llm`` now supports `Intel GPU `_ (including iGPU, Arc, Flex and MAX).
+ * [2023/09] ``ipex-llm`` `tutorial `_ is released.
************************************************
-``ipex-llm`` demos
+``ipex-llm`` Demos
************************************************
See the **optimized performance** of ``chatglm2-6b`` and ``llama-2-13b-chat`` models on 12th Gen Intel Core CPU and Intel Arc GPU below.
@@ -74,82 +103,85 @@ See the **optimized performance** of ``chatglm2-6b`` and ``llama-2-13b-chat`` mo
************************************************
-``ipex-llm`` quickstart
+``ipex-llm`` Quickstart
************************************************
-- `Windows GPU installation `_
-- `Run IPEX-LLM in Text-Generation-WebUI `_
-- `Run IPEX-LLM using Docker `_
-- `CPU quickstart <#cpu-quickstart>`_
-- `GPU quickstart <#gpu-quickstart>`_
+* `Windows GPU `_: installing ``ipex-llm`` on Windows with Intel GPU
+* `Linux GPU `_: installing ``ipex-llm`` on Linux with Intel GPU
+* `Docker `_: using ``ipex-llm`` Docker images on Intel CPU and GPU
+
+.. seealso::
+
+ For more details, please refer to the `installation guide `_.
============================================
-CPU Quickstart
+Run ``ipex-llm``
============================================
-You may install ``ipex-llm`` on Intel CPU as follows as follows:
-
-.. note::
-
- See the `CPU installation guide `_ for more details.
-
-.. code-block:: console
-
- pip install --pre --upgrade ipex-llm[all]
-
-.. note::
-
- ``ipex-llm`` has been tested on Python 3.9, 3.10 and 3.11
-
-You can then apply INT4 optimizations to any Hugging Face *Transformers* models as follows.
-
-.. code-block:: python
-
- #load Hugging Face Transformers model with INT4 optimizations
- from ipex_llm.transformers import AutoModelForCausalLM
- model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)
-
- #run the optimized model on Intel CPU
- from transformers import AutoTokenizer
- tokenizer = AutoTokenizer.from_pretrained(model_path)
- input_ids = tokenizer.encode(input_str, ...)
- output_ids = model.generate(input_ids, ...)
- output = tokenizer.batch_decode(output_ids)
+* `llama.cpp `_: running **ipex-llm for llama.cpp** (*using the C++ interface of* ``ipex-llm`` *as an accelerated backend for* ``llama.cpp`` *on Intel GPU*)
+* `vLLM `_: running ``ipex-llm`` in ``vLLM`` on both Intel `GPU `_ and `CPU `_
+* `FastChat `_: running ``ipex-llm`` in ``FastChat`` serving on both Intel GPU and CPU
+* `LangChain-Chatchat RAG `_: running ``ipex-llm`` in ``LangChain-Chatchat`` (*Knowledge Base QA using* **RAG** *pipeline*)
+* `Text-Generation-WebUI `_: running ``ipex-llm`` in ``oobabooga`` **WebUI**
+* `Benchmarking `_: running (latency and throughput) benchmarks for ``ipex-llm`` on Intel CPU and GPU
============================================
-GPU Quickstart
+Code Examples
============================================
+* Low bit inference (a short INT4 sketch follows this list)
-You may install ``ipex-llm`` on Intel GPU as follows as follows:
+ * `INT4 inference `_: **INT4** LLM inference on Intel `GPU `_ and `CPU `_
+ * `FP8/FP4 inference `_: **FP8** and **FP4** LLM inference on Intel `GPU `_
+ * `INT8 inference `_: **INT8** LLM inference on Intel `GPU `_ and `CPU `_
+ * `INT2 inference `_: **INT2** LLM inference (based on llama.cpp IQ2 mechanism) on Intel `GPU `_
-.. note::
+* FP16/BF16 inference
- See the `GPU installation guide `_ for more details.
+ * **FP16** LLM inference on Intel `GPU `_, with possible `self-speculative decoding `_ optimization
+ * **BF16** LLM inference on Intel `CPU `_, with possible `self-speculative decoding `_ optimization
-.. code-block:: console
+* Save and load
- # below command will install intel_extension_for_pytorch==2.1.10+xpu as default
- pip install --pre --upgrade ipex-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+ * `Low-bit models `_: saving and loading ``ipex-llm`` low-bit models
+ * `GGUF `_: directly loading GGUF models into ``ipex-llm``
+ * `AWQ `_: directly loading AWQ models into ``ipex-llm``
+ * `GPTQ `_: directly loading GPTQ models into ``ipex-llm``
-.. note::
+* Finetuning
- ``ipex-llm`` has been tested on Python 3.9, 3.10 and 3.11
+ * LLM finetuning on Intel `GPU `_, including `LoRA `_, `QLoRA `_, `DPO `_, `QA-LoRA `_ and `ReLoRA `_
+ * QLoRA finetuning on Intel `CPU `_
-You can then apply INT4 optimizations to any Hugging Face *Transformers* models on Intel GPU as follows.
+* Integration with community libraries
-.. code-block:: python
+ * `HuggingFace transformers `_
+ * `Standard PyTorch model `_
+ * `DeepSpeed-AutoTP `_
+ * `HuggingFace PEFT `_
+ * `HuggingFace TRL `_
+ * `LangChain `_
+ * `LlamaIndex `_
+ * `AutoGen `_
+ * `ModelScope `_
- #load Hugging Face Transformers model with INT4 optimizations
- from ipex_llm.transformers import AutoModelForCausalLM
- model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)
+* `Tutorials `_
- #run the optimized model on Intel GPU
- model = model.to('xpu')
- from transformers import AutoTokenizer
- tokenizer = AutoTokenizer.from_pretrained(model_path)
- input_ids = tokenizer.encode(input_str, ...).to('xpu')
- output_ids = model.generate(input_ids, ...)
- output = tokenizer.batch_decode(output_ids.cpu())
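+
+As a quick illustration of the low-bit inference entries above, here is a minimal INT4 sketch; the model path, prompt and generation settings are placeholders, and the ``to('xpu')`` calls assume an Intel GPU (omit them to run on CPU).
+
+.. code-block:: python
+
+   # load a Hugging Face Transformers model with INT4 optimizations (placeholder model path)
+   from ipex_llm.transformers import AutoModelForCausalLM
+   model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)
+
+   # move the optimized model to Intel GPU
+   model = model.to('xpu')
+
+   # tokenize a placeholder prompt, generate, and decode the result
+   from transformers import AutoTokenizer
+   tokenizer = AutoTokenizer.from_pretrained('/path/to/model/')
+   input_ids = tokenizer.encode("What is AI?", return_tensors="pt").to('xpu')
+   output_ids = model.generate(input_ids, max_new_tokens=32)
+   output = tokenizer.batch_decode(output_ids.cpu())
+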
+.. seealso::
-**For more details, please refer to the ipex-llm** `Document `_, `Readme `_, `Tutorial `_ and `API Doc `_.
+ For more details, please refer to the |ipex_llm_document|_.
+
+.. |ipex_llm_document| replace:: ``ipex-llm`` document
+.. _ipex_llm_document: doc/LLM/index.html
+
+------
+
+.. raw:: html
+
+
+
+
+ Performance varies by use, configuration and other factors. ipex-llm may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.
+
+
+