.. meta::
   :google-site-verification: S66K6GAclKw1RroxU0Rka_2d1LZFVe27M0gRneEsIVI
.. important::

   ``bigdl-llm`` has now become ``ipex-llm`` (see the migration guide here); you may find the original ``BigDL`` project here.
------
################################################
💫 IPEX-LLM
################################################
IPEX-LLM is a PyTorch library for running LLMs on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max) with very low latency [1]_.
.. note::

   - It is built on top of **Intel Extension for PyTorch** (``IPEX``), as well as the excellent work of ``llama.cpp``, ``bitsandbytes``, ``vLLM``, ``qlora``, ``AutoGPTQ``, ``AutoAWQ``, etc.
   - It provides seamless integration with llama.cpp, Text-Generation-WebUI, HuggingFace transformers, HuggingFace PEFT, LangChain, LlamaIndex, DeepSpeed-AutoTP, vLLM, FastChat, HuggingFace TRL, AutoGen, ModelScope, etc.
   - 50+ models have been optimized/verified on ``ipex-llm`` (including LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM, Baichuan, Qwen, RWKV, and more); see the complete list here.
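To make the low-bit optimization above concrete, here is a minimal sketch of INT4 inference through the HuggingFace ``transformers``-style API of ``ipex-llm`` (the model ID is only a placeholder; set ``device`` to ``"cpu"`` to run without an Intel GPU):

.. code-block:: python

   import torch
   from transformers import AutoTokenizer
   from ipex_llm.transformers import AutoModelForCausalLM  # drop-in replacement for transformers

   model_path = "meta-llama/Llama-2-7b-chat-hf"  # placeholder: any supported HF model
   device = "xpu"                                # Intel GPU; use "cpu" to run on CPU

   # load_in_4bit=True quantizes the weights to INT4 while the model is loaded
   model = AutoModelForCausalLM.from_pretrained(model_path,
                                                load_in_4bit=True,
                                                trust_remote_code=True).to(device)
   tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

   with torch.inference_mode():
       input_ids = tokenizer.encode("What is AI?", return_tensors="pt").to(device)
       output = model.generate(input_ids, max_new_tokens=32)
       print(tokenizer.decode(output[0], skip_special_tokens=True))

The code examples listed later on this page cover other precisions (FP8/FP4, INT8, INT2, FP16/BF16) as well as finetuning and integrations.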
************************************************
Latest update 🔥
************************************************
* [2024/03] ``bigdl-llm`` has now become ``ipex-llm`` (see the migration guide `here `_); you may find the original ``BigDL`` project `here `_.
* [2024/02] ``ipex-llm`` now supports directly loading model from `ModelScope `_ (`魔搭 `_).
* [2024/02] ``ipex-llm`` added initial **INT2** support (based on llama.cpp `IQ2 `_ mechanism), which makes it possible to run large LLMs (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM.
* [2024/02] Users can now use ``ipex-llm`` through `Text-Generation-WebUI `_ GUI.
* [2024/02] ``ipex-llm`` now supports `Self-Speculative Decoding `_, which in practice brings **~30% speedup** for FP16 and BF16 inference latency on Intel `GPU `_ and `CPU `_, respectively.
* [2024/02] ``ipex-llm`` now supports a comprehensive list of LLM finetuning on Intel GPU (including `LoRA `_, `QLoRA `_, `DPO `_, `QA-LoRA `_ and `ReLoRA `_).
* [2024/01] Using ``ipex-llm`` `QLoRA `_, we managed to finetune LLaMA2-7B in **21 minutes** and LLaMA2-70B in **3.14 hours** on 8 Intel Max 1550 GPUs for `Stanford-Alpaca `_ (see the blog `here `_).
.. dropdown:: More updates
   :color: primary

   * [2023/12] ``ipex-llm`` now supports `ReLoRA `_ (see `"ReLoRA: High-Rank Training Through Low-Rank Updates" `_).
   * [2023/12] ``ipex-llm`` now supports `Mixtral-8x7B `_ on both Intel `GPU `_ and `CPU `_.
   * [2023/12] ``ipex-llm`` now supports `QA-LoRA `_ (see `"QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models" `_).
   * [2023/12] ``ipex-llm`` now supports `FP8 and FP4 inference `_ on Intel **GPU**.
   * [2023/11] Initial support for directly loading `GGUF `_, `AWQ `_ and `GPTQ `_ models into ``ipex-llm`` is available.
   * [2023/11] ``ipex-llm`` now supports `vLLM continuous batching `_ on both Intel `GPU `_ and `CPU `_.
   * [2023/10] ``ipex-llm`` now supports `QLoRA finetuning `_ on both Intel `GPU `_ and `CPU `_.
   * [2023/10] ``ipex-llm`` now supports `FastChat serving `_ on both Intel CPU and GPU.
   * [2023/09] ``ipex-llm`` now supports `Intel GPU `_ (including iGPU, Arc, Flex and Max).
   * [2023/09] ``ipex-llm`` `tutorial `_ is released.
************************************************
``ipex-llm`` Demos
************************************************
See the **optimized performance** of ``chatglm2-6b`` and ``llama-2-13b-chat`` models on 12th Gen Intel Core CPU and Intel Arc GPU below.
(Demo videos: chatglm2-6b and llama-2-13b-chat on 12th Gen Intel Core CPU; chatglm2-6b and llama-2-13b-chat on Intel Arc GPU.)
************************************************
``ipex-llm`` Quickstart
************************************************
* `Windows GPU `_: installing ``ipex-llm`` on Windows with Intel GPU
* `Linux GPU `_: installing ``ipex-llm`` on Linux with Intel GPU
* `Docker `_: using ``ipex-llm`` Docker images on Intel CPU and GPU
.. seealso::

   For more details, please refer to the `installation guide `_.
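After installation, a quick sanity check like the sketch below can confirm that PyTorch sees the Intel GPU (``xpu``) device; it assumes the GPU flavor of ``ipex-llm`` (and hence ``intel_extension_for_pytorch``) is installed in the current environment:

.. code-block:: python

   # Sanity-check sketch: verify that the Intel GPU ("xpu") device is visible.
   # Assumes a GPU installation of ipex-llm, which provides
   # intel_extension_for_pytorch for the xpu backend.
   import torch
   import intel_extension_for_pytorch as ipex  # noqa: F401  (registers the xpu backend)

   print("XPU available:", torch.xpu.is_available())
   if torch.xpu.is_available():
       print("Device name:", torch.xpu.get_device_name(0))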
============================================
Run ``ipex-llm``
============================================
* `llama.cpp `_: running **ipex-llm for llama.cpp** (*using the C++ interface of* ``ipex-llm`` *as an accelerated backend for* ``llama.cpp`` *on Intel GPU*)
* `vLLM `_: running ``ipex-llm`` in ``vLLM`` on both Intel `GPU `_ and `CPU `_
* `FastChat `_: running ``ipex-llm`` in ``FastChat`` serving on both Intel GPU and CPU
* `LangChain-Chatchat RAG `_: running ``ipex-llm`` in ``LangChain-Chatchat`` (*Knowledge Base QA using* **RAG** *pipeline*)
* `Text-Generation-WebUI `_: running ``ipex-llm`` in ``oobabooga`` **WebUI**
* `Benchmarking `_: running (latency and throughput) benchmarks for ``ipex-llm`` on Intel CPU and GPU
============================================
Code Examples
============================================
* Low-bit inference

  * `INT4 inference `_: **INT4** LLM inference on Intel `GPU `_ and `CPU `_
  * `FP8/FP4 inference `_: **FP8** and **FP4** LLM inference on Intel `GPU `_
  * `INT8 inference `_: **INT8** LLM inference on Intel `GPU `_ and `CPU `_
  * `INT2 inference `_: **INT2** LLM inference (based on llama.cpp IQ2 mechanism) on Intel `GPU `_

* FP16/BF16 inference

  * **FP16** LLM inference on Intel `GPU `_, with possible `self-speculative decoding `_ optimization
  * **BF16** LLM inference on Intel `CPU `_, with possible `self-speculative decoding `_ optimization

* Save and load

  * `Low-bit models `_: saving and loading ``ipex-llm`` low-bit models (see the sketch after this list)
  * `GGUF `_: directly loading GGUF models into ``ipex-llm``
  * `AWQ `_: directly loading AWQ models into ``ipex-llm``
  * `GPTQ `_: directly loading GPTQ models into ``ipex-llm``

* Finetuning

  * LLM finetuning on Intel `GPU `_, including `LoRA `_, `QLoRA `_, `DPO `_, `QA-LoRA `_ and `ReLoRA `_
  * QLoRA finetuning on Intel `CPU `_

* Integration with community libraries

  * `HuggingFace transformers `_
  * `Standard PyTorch model `_
  * `DeepSpeed-AutoTP `_
  * `HuggingFace PEFT `_
  * `HuggingFace TRL `_
  * `LangChain `_
  * `LlamaIndex `_
  * `AutoGen `_
  * `ModelScope `_

* `Tutorials `_
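For the low-bit save/load item above, the rough sketch below shows the intended flow: quantize once, persist the low-bit weights, and reload them directly on later runs. The ``save_low_bit``/``load_low_bit`` helpers and all paths here are assumptions for illustration; refer to the linked example for the authoritative API.

.. code-block:: python

   # Sketch of saving and re-loading an ipex-llm low-bit model so that
   # quantization is not repeated on every start-up.
   # save_low_bit/load_low_bit, the model ID and the paths are placeholders.
   from transformers import AutoTokenizer
   from ipex_llm.transformers import AutoModelForCausalLM

   model_path = "meta-llama/Llama-2-7b-chat-hf"   # placeholder model
   low_bit_path = "./llama-2-7b-chat-int4"        # placeholder output directory

   # First run: quantize while loading, then persist the low-bit weights.
   model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
   model.save_low_bit(low_bit_path)
   AutoTokenizer.from_pretrained(model_path).save_pretrained(low_bit_path)

   # Later runs: load the already-quantized weights directly.
   model = AutoModelForCausalLM.load_low_bit(low_bit_path)
   tokenizer = AutoTokenizer.from_pretrained(low_bit_path)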
.. seealso::

   For more details, please refer to the |ipex_llm_document|_.
.. |ipex_llm_document| replace:: ``ipex-llm`` document
.. _ipex_llm_document: doc/LLM/index.html
------
.. [1] Performance varies by use, configuration and other factors. ``ipex-llm`` may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.