.. meta::
   :google-site-verification: S66K6GAclKw1RroxU0Rka_2d1LZFVe27M0gRneEsIVI

.. important::

   ``bigdl-llm`` has now become ``ipex-llm`` (see the migration guide here); you may find the original ``BigDL`` project here.
------
################################################
💫 Intel® LLM library for PyTorch*
################################################
IPEX-LLM is a PyTorch library for running LLMs on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPUs such as Arc, Flex and Max) with very low latency [1].
.. note::

   - It runs on top of Intel Extension for PyTorch (IPEX), and is built on top of the excellent work of llama.cpp, transformers, bitsandbytes, vLLM, qlora, AutoGPTQ, AutoAWQ, etc.
   - It provides seamless integration with llama.cpp, ollama, Text-Generation-WebUI, HuggingFace transformers, HuggingFace PEFT, LangChain, LlamaIndex, DeepSpeed-AutoTP, vLLM, FastChat, HuggingFace TRL, AutoGen, ModelScope, etc.
   - 50+ models have been optimized/verified on ``ipex-llm`` (including LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM, Baichuan, Qwen, RWKV, and more); see the complete list here.
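
As a taste of what these integrations look like in practice, below is a minimal, illustrative sketch of low-bit (INT4) inference using the HuggingFace ``transformers``-style API of ``ipex-llm``. The model id and generation arguments are placeholders, and the quickstarts and examples referenced later on this page remain the authoritative reference.

.. code-block:: python

   # Minimal INT4 inference sketch (illustrative; see the quickstarts for the
   # authoritative, up-to-date usage).
   import torch
   from transformers import AutoTokenizer
   from ipex_llm.transformers import AutoModelForCausalLM  # drop-in replacement class

   model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model id

   # Load the model and quantize its weights to INT4 on the fly.
   model = AutoModelForCausalLM.from_pretrained(model_id,
                                                load_in_4bit=True,
                                                trust_remote_code=True)
   # On some GPU setups, `import intel_extension_for_pytorch` may additionally be
   # required before the "xpu" device is available.
   model = model.to("xpu")  # move to an Intel GPU; omit this line to run on CPU

   tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
   inputs = tokenizer("What is IPEX-LLM?", return_tensors="pt").to("xpu")

   with torch.inference_mode():
       output = model.generate(**inputs, max_new_tokens=64)
   print(tokenizer.decode(output[0], skip_special_tokens=True))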
************************************************
Latest update 🔥
************************************************
* [2024/05] ``ipex-llm`` now supports **Axolotl** for LLM finetuning on Intel GPU; see the quickstart `here `_.
* [2024/04] You can now run **Open WebUI** on Intel GPU using ``ipex-llm``; see the quickstart `here `_.
* [2024/04] You can now run **Llama 3** on Intel GPU using ``llama.cpp`` and ``ollama``; see the quickstart `here `_.
* [2024/04] ``ipex-llm`` now supports **Llama 3** on Intel `GPU `_ and `CPU `_.
* [2024/04] ``ipex-llm`` now provides a C++ interface, which can be used as an accelerated backend for running `llama.cpp `_ and `ollama `_ on Intel GPU.
* [2024/03] ``bigdl-llm`` has now become ``ipex-llm`` (see the migration guide `here `_); you may find the original ``BigDL`` project `here `_.
* [2024/02] ``ipex-llm`` now supports directly loading model from `ModelScope `_ (`魔搭 `_).
* [2024/02] ``ipex-llm`` added initial **INT2** support (based on llama.cpp `IQ2 `_ mechanism), which makes it possible to run large-size LLMs (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM.
* [2024/02] Users can now use ``ipex-llm`` through `Text-Generation-WebUI `_ GUI.
* [2024/02] ``ipex-llm`` now supports `Self-Speculative Decoding `_, which in practice brings **~30% speedup** for FP16 and BF16 inference latency on Intel `GPU `_ and `CPU `_ respectively.
* [2024/02] ``ipex-llm`` now supports a comprehensive list of LLM finetuning on Intel GPU (including `LoRA `_, `QLoRA `_, `DPO `_, `QA-LoRA `_ and `ReLoRA `_).
* [2024/01] Using ``ipex-llm`` `QLoRA `_, we managed to finetune LLaMA2-7B in **21 minutes** and LLaMA2-70B in **3.14 hours** on 8 Intel Max 1550 GPUs for `Stanford-Alpaca `_ (see the blog `here `_).
.. dropdown:: More updates
:color: primary
* [2023/12] ``ipex-llm`` now supports `ReLoRA `_ (see `"ReLoRA: High-Rank Training Through Low-Rank Updates" `_).
* [2023/12] ``ipex-llm`` now supports `Mixtral-8x7B `_ on both Intel `GPU `_ and `CPU `_.
* [2023/12] ``ipex-llm`` now supports `QA-LoRA `_ (see `"QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models" `_).
* [2023/12] ``ipex-llm`` now supports `FP8 and FP4 inference `_ on Intel **GPU**.
* [2023/11] Initial support for directly loading `GGUF `_, `AWQ `_ and `GPTQ `_ models into ``ipex-llm`` is available.
* [2023/11] ``ipex-llm`` now supports `vLLM continuous batching `_ on both Intel `GPU `_ and `CPU `_.
* [2023/10] ``ipex-llm`` now supports `QLoRA finetuning `_ on both Intel `GPU `_ and `CPU `_.
* [2023/10] ``ipex-llm`` now supports `FastChat serving `_ on both Intel CPU and GPU.
* [2023/09] ``ipex-llm`` now supports `Intel GPU `_ (including iGPU, Arc, Flex and MAX).
* [2023/09] ``ipex-llm`` `tutorial `_ is released.
************************************************
``ipex-llm`` Demos
************************************************
See the **optimized performance** of ``chatglm2-6b`` and ``llama-2-13b-chat`` models on 12th Gen Intel Core CPU and Intel Arc GPU below.
(Demo videos: ``chatglm2-6b`` and ``llama-2-13b-chat`` on a 12th Gen Intel Core CPU, and ``chatglm2-6b`` and ``llama-2-13b-chat`` on an Intel Arc GPU.)
************************************************
``ipex-llm`` Quickstart
************************************************
============================================
Docker
============================================
* `GPU Inference in C++ `_: running ``llama.cpp``, ``ollama``, ``OpenWebUI``, etc., with ``ipex-llm`` on Intel GPU
* `GPU Inference in Python `_: running HuggingFace ``transformers``, ``LangChain``, ``LlamaIndex``, ``ModelScope``, etc. with ``ipex-llm`` on Intel GPU
* `vLLM on GPU `_: running ``vLLM`` serving with ``ipex-llm`` on Intel GPU
* `FastChat on GPU `_: running ``FastChat`` serving with ``ipex-llm`` on Intel GPU
============================================
Run
============================================
* `llama.cpp `_: running **llama.cpp** (*using the C++ interface of* ``ipex-llm`` *as an accelerated backend for* ``llama.cpp``) on Intel GPU
* `ollama `_: running **ollama** (*using the C++ interface of* ``ipex-llm`` *as an accelerated backend for* ``ollama``) on Intel GPU
* `vLLM `_: running ``ipex-llm`` in ``vLLM`` on both Intel `GPU `_ and `CPU `_
* `FastChat `_: running ``ipex-llm`` in ``FastChat`` serving on both Intel GPU and CPU
* `LangChain-Chatchat RAG `_: running ``ipex-llm`` in ``LangChain-Chatchat`` (*Knowledge Base QA using* **RAG** *pipeline*)
* `Text-Generation-WebUI `_: running ``ipex-llm`` in ``oobabooga`` **WebUI**
* `Benchmarking `_: running (latency and throughput) benchmarks for ``ipex-llm`` on Intel CPU and GPU
============================================
Install
============================================
* `Windows GPU `_: installing ``ipex-llm`` on Windows with Intel GPU
* `Linux GPU `_: installing ``ipex-llm`` on Linux with Intel GPU
.. seealso::

   For more details, please refer to the `installation guide `_.
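
For orientation only (the installation guide above is the authoritative reference, and package extras or wheel indexes may change between releases): GPU installs typically use the ``ipex-llm[xpu]`` pip extra, while CPU-only installs typically use ``ipex-llm[all]``. A quick, hedged way to confirm the resulting environment is a check along these lines:

.. code-block:: python

   # Illustrative post-install sanity check; attribute lookups are guarded
   # because names can vary between ipex-llm / PyTorch releases.
   import torch
   import ipex_llm

   print("ipex-llm version:", getattr(ipex_llm, "__version__", "unknown"))

   # The "xpu" device is only present when the Intel GPU stack is installed;
   # on CPU-only setups this prints False.
   has_xpu = hasattr(torch, "xpu") and torch.xpu.is_available()
   print("Intel GPU (xpu) available:", has_xpu)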
============================================
Code Examples
============================================
* Low bit inference
* `INT4 inference `_: **INT4** LLM inference on Intel `GPU `_ and `CPU `_
* `FP8/FP4 inference `_: **FP8** and **FP4** LLM inference on Intel `GPU `_
* `INT8 inference `_: **INT8** LLM inference on Intel `GPU `_ and `CPU `_
* `INT2 inference `_: **INT2** LLM inference (based on llama.cpp IQ2 mechanism) on Intel `GPU `_
* FP16/BF16 inference
* **FP16** LLM inference on Intel `GPU `_, with possible `self-speculative decoding `_ optimization
* **BF16** LLM inference on Intel `CPU `_, with possible `self-speculative decoding `_ optimization
* Save and load (see the sketch below)
* `Low-bit models `_: saving and loading ``ipex-llm`` low-bit models
* `GGUF `_: directly loading GGUF models into ``ipex-llm``
* `AWQ `_: directly loading AWQ models into ``ipex-llm``
* `GPTQ `_: directly loading GPTQ models into ``ipex-llm``
* Finetuning
* LLM finetuning on Intel `GPU `_, including `LoRA `_, `QLoRA `_, `DPO `_, `QA-LoRA `_ and `ReLoRA `_
* QLoRA finetuning on Intel `CPU `_
* Integration with community libraries
* `HuggingFace transformers `_
* `Standard PyTorch model `_
* `DeepSpeed-AutoTP `_
* `HuggingFace PEFT `_
* `HuggingFace TRL `_
* `LangChain `_
* `LlamaIndex `_
* `AutoGen `_
* `ModelScope `_
* `Tutorials `_
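
To make the "Save and load" entries above more concrete, here is a minimal, illustrative sketch of persisting an INT4-converted model so that later runs can skip the conversion step. The method names follow the ``ipex-llm`` low-bit examples, the model id and paths are placeholders, and the linked examples remain the authoritative reference.

.. code-block:: python

   # Illustrative sketch: convert a model to INT4 once, save the low-bit weights,
   # and reload them later without repeating the quantization step.
   from transformers import AutoTokenizer
   from ipex_llm.transformers import AutoModelForCausalLM

   model_id = "meta-llama/Llama-2-7b-chat-hf"   # placeholder model id
   save_dir = "./llama-2-7b-chat-int4"          # placeholder output directory

   # First run: load + quantize to INT4, then persist the low-bit weights.
   model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)
   model.save_low_bit(save_dir)
   AutoTokenizer.from_pretrained(model_id).save_pretrained(save_dir)

   # Later runs: load the already-quantized weights directly.
   model = AutoModelForCausalLM.load_low_bit(save_dir)
   tokenizer = AutoTokenizer.from_pretrained(save_dir)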
.. seealso::

   For more details, please refer to the |ipex_llm_document|_.

.. |ipex_llm_document| replace:: ``ipex-llm`` document
.. _ipex_llm_document: doc/LLM/index.html
************************************************
Verified Models
************************************************
.. csv-table::
   :header: "Model", "CPU Example", "GPU Example"
   :widths: 30, 35, 35

   "LLaMA (such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.)", "link1, link2", "link, link"
   "LLaMA 2", "link1, link2", "link, link"
   "LLaMA 3", "link", "link"
   "ChatGLM", "link", ""
   "ChatGLM2", "link", "link"
   "ChatGLM3", "link", "link"
   "Mistral", "link", "link"
   "Mixtral", "link", "link"
   "Falcon", "link", "link"
   "MPT", "link", "link"
   "Dolly-v1", "link", "link"
   "Dolly-v2", "link", "link"
   "Replit Code", "link", "link"
   "RedPajama", "link1, link2", ""
   "Phoenix", "link1, link2", ""
   "StarCoder", "link1, link2", "link"
   "Baichuan", "link", "link"
   "Baichuan2", "link", "link"
   "InternLM", "link", "link"
   "Qwen", "link", "link"
   "Qwen1.5", "link", "link"
   "Qwen-VL", "link", "link"
   "Aquila", "link", "link"
   "Aquila2", "link", "link"
   "MOSS", "link", ""
   "Whisper", "link", "link"
   "Phi-1_5", "link", "link"
   "Flan-t5", "link", "link"
   "LLaVA", "link", "link"
   "CodeLlama", "link", "link"
   "Skywork", "link", ""
   "InternLM-XComposer", "link", ""
   "WizardCoder-Python", "link", ""
   "CodeShell", "link", ""
   "Fuyu", "link", ""
   "Distil-Whisper", "link", "link"
   "Yi", "link", "link"
   "BlueLM", "link", "link"
   "Mamba", "link", "link"
   "SOLAR", "link", "link"
   "Phixtral", "link", "link"
   "InternLM2", "link", "link"
   "RWKV4", "", "link"
   "RWKV5", "", "link"
   "Bark", "link", "link"
   "SpeechT5", "", "link"
   "DeepSeek-MoE", "link", ""
   "Ziya-Coding-34B-v1.0", "link", ""
   "Phi-2", "link", "link"
   "Phi-3", "link", "link"
   "Phi-3-vision", "link", "link"
   "Yuan2", "link", "link"
   "Gemma", "link", "link"
   "DeciLM-7B", "link", "link"
   "Deepseek", "link", "link"
   "StableLM", "link", "link"
   "CodeGemma", "link", "link"
   "Command-R/cohere", "link", "link"
   "CodeGeeX2", "link", "link"
************************************************
Get Support
************************************************
* Please report a bug or raise a feature request by opening a `GitHub Issue `_
* Please report a vulnerability by opening a draft `GitHub Security Advisory `_
------
[1] Performance varies by use, configuration and other factors. ``ipex-llm`` may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.