.. meta::
   :google-site-verification: S66K6GAclKw1RroxU0Rka_2d1LZFVe27M0gRneEsIVI
.. important::

   ``bigdl-llm`` has now become ``ipex-llm`` (see the migration guide `here `_); you may find the original ``BigDL`` project `here `_.
------
################################################
💫 Intel® LLM library for PyTorch*
################################################
**IPEX-LLM** is a PyTorch library for running LLMs on Intel CPU and GPU (e.g., local PC with iGPU, or discrete GPUs such as Arc, Flex and Max) with very low latency [1].
.. note::

   - It is built on top of the excellent work of ``llama.cpp``, ``transformers``, ``bitsandbytes``, ``vLLM``, ``qlora``, ``AutoGPTQ``, ``AutoAWQ``, etc.
   - It provides seamless integration with llama.cpp, ollama, Text-Generation-WebUI, HuggingFace transformers, HuggingFace PEFT, LangChain, LlamaIndex, DeepSpeed-AutoTP, vLLM, FastChat, HuggingFace TRL, AutoGen, ModelScope, etc. (a minimal usage sketch follows below).
   - 50+ models have been optimized/verified on ``ipex-llm`` (including LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM, Baichuan, Qwen, RWKV, and more); see the complete list `here `_.
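As a minimal sketch of the HuggingFace ``transformers`` integration mentioned above (the model id and prompt are placeholders; see the quickstart guides for platform-specific steps):

.. code-block:: python

   # Minimal sketch: INT4 inference with ipex-llm on an Intel GPU.
   # Assumes an ipex-llm[xpu] environment; model id and prompt are placeholders.
   import torch
   from ipex_llm.transformers import AutoModelForCausalLM  # drop-in replacement for transformers
   from transformers import AutoTokenizer

   model_path = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model id

   # load_in_4bit=True applies INT4 optimizations while loading the checkpoint
   model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
   tokenizer = AutoTokenizer.from_pretrained(model_path)

   model = model.to("xpu")  # move the optimized model to the Intel GPU; omit this line for CPU

   with torch.inference_mode():
       input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids.to("xpu")
       output = model.generate(input_ids, max_new_tokens=32)
       print(tokenizer.decode(output[0], skip_special_tokens=True))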
************************************************
Latest update 🔥
************************************************
* [2024/05] ``ipex-llm`` now supports **Axolotl** for LLM finetuning on Intel GPU; see the quickstart `here `_.
* [2024/04] You can now run **Open WebUI** on Intel GPU using ``ipex-llm``; see the quickstart `here `_.
* [2024/04] You can now run **Llama 3** on Intel GPU using ``llama.cpp`` and ``ollama``; see the quickstart `here `_.
* [2024/04] ``ipex-llm`` now supports **Llama 3** on Intel `GPU `_ and `CPU `_.
* [2024/04] ``ipex-llm`` now provides a C++ interface, which can be used as an accelerated backend for running `llama.cpp `_ and `ollama `_ on Intel GPU.
* [2024/03] ``bigdl-llm`` has now become ``ipex-llm`` (see the migration guide `here `_); you may find the original ``BigDL`` project `here `_.
* [2024/02] ``ipex-llm`` now supports directly loading model from `ModelScope `_ (`魔搭 `_).
* [2024/02] ``ipex-llm`` added initial **INT2** support (based on the llama.cpp `IQ2 `_ mechanism), which makes it possible to run large-size LLMs (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM.
* [2024/02] Users can now use ``ipex-llm`` through `Text-Generation-WebUI `_ GUI.
* [2024/02] ``ipex-llm`` now supports `Self-Speculative Decoding `_, which in practice brings **~30% speedup** for FP16 and BF16 inference latency on Intel `GPU `_ and `CPU `_ respectively.
* [2024/02] ``ipex-llm`` now supports a comprehensive list of LLM finetuning on Intel GPU (including `LoRA `_, `QLoRA `_, `DPO `_, `QA-LoRA `_ and `ReLoRA `_).
* [2024/01] Using ``ipex-llm`` `QLoRA `_, we managed to finetune LLaMA2-7B in **21 minutes** and LLaMA2-70B in **3.14 hours** on 8 Intel Max 1550 GPUs for `Stanford-Alpaca `_ (see the blog `here `_).
.. dropdown:: More updates
   :color: primary

   * [2023/12] ``ipex-llm`` now supports `ReLoRA `_ (see `"ReLoRA: High-Rank Training Through Low-Rank Updates" `_).
   * [2023/12] ``ipex-llm`` now supports `Mixtral-8x7B `_ on both Intel `GPU `_ and `CPU `_.
   * [2023/12] ``ipex-llm`` now supports `QA-LoRA `_ (see `"QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models" `_).
   * [2023/12] ``ipex-llm`` now supports `FP8 and FP4 inference `_ on Intel **GPU**.
   * [2023/11] Initial support for directly loading `GGUF `_, `AWQ `_ and `GPTQ `_ models into ``ipex-llm`` is available.
   * [2023/11] ``ipex-llm`` now supports `vLLM continuous batching `_ on both Intel `GPU  `_ and `CPU `_.
   * [2023/10] ``ipex-llm`` now supports `QLoRA finetuning `_ on both Intel `GPU `_ and `CPU `_.
   * [2023/10] ``ipex-llm`` now supports `FastChat serving `_ on both Intel CPU and GPU.
   * [2023/09] ``ipex-llm`` now supports `Intel GPU `_ (including iGPU, Arc, Flex and MAX).
   * [2023/09] ``ipex-llm`` `tutorial `_ is released.
************************************************
``ipex-llm`` Performance
************************************************
See the Token Generation Speed on Intel Core Ultra and Intel Arc GPU below [1] (and refer to [2][3][4] for more details).

*(Performance charts: Token Generation Speed on Intel Core Ultra and on Intel Arc GPU.)*
You may follow the `guide `_ to run the ``ipex-llm`` performance benchmark yourself.
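As a simplified illustration only (not the official benchmark scripts linked above; the model id is a placeholder and warm-up/averaging are kept minimal), per-token latency can be estimated along these lines:

.. code-block:: python

   # Rough first-token / next-token latency measurement; not the official ipex-llm benchmark.
   import time
   import torch
   from ipex_llm.transformers import AutoModelForCausalLM
   from transformers import AutoTokenizer

   model_path = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model id
   model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True).to("xpu")
   tokenizer = AutoTokenizer.from_pretrained(model_path)
   input_ids = tokenizer("Tell me a story.", return_tensors="pt").input_ids.to("xpu")

   with torch.inference_mode():
       model.generate(input_ids, max_new_tokens=32).cpu()   # warm-up run

       start = time.perf_counter()
       model.generate(input_ids, max_new_tokens=1).cpu()    # .cpu() forces device sync
       first = time.perf_counter() - start                  # ~ first-token latency

       start = time.perf_counter()
       model.generate(input_ids, max_new_tokens=33).cpu()   # first + 32 next tokens
       total = time.perf_counter() - start

   print(f"first token: {first*1000:.0f} ms, "
         f"next token: {(total - first) / 32 * 1000:.1f} ms/token")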
************************************************
``ipex-llm`` Demos
************************************************
See demos of running local LLMs *on Intel Iris iGPU, Intel Core Ultra iGPU, single-card Arc GPU, or multi-card Arc GPUs* using ``ipex-llm`` below.
*(Demo videos: Intel Iris iGPU, Intel Core Ultra iGPU, single-card Arc GPU, and multi-card Arc GPUs.)*
************************************************
``ipex-llm`` Quickstart
************************************************
============================================
Docker
============================================
* `GPU Inference in C++ `_: running ``llama.cpp``, ``ollama``, ``OpenWebUI``, etc., with ``ipex-llm`` on Intel GPU
* `GPU Inference in Python `_: running HuggingFace ``transformers``, ``LangChain``, ``LlamaIndex``, ``ModelScope``, etc. with ``ipex-llm`` on Intel GPU
* `vLLM on GPU `_: running ``vLLM`` serving with ``ipex-llm`` on Intel GPU 
* `FastChat on GPU `_: running ``FastChat`` serving with ``ipex-llm`` on Intel GPU
============================================
Run
============================================
* `llama.cpp `_: running **llama.cpp** (*using C++ interface of* ``ipex-llm`` *as an accelerated backend for* ``llama.cpp``) on Intel GPU
* `ollama `_: running **ollama** (*using C++ interface of* ``ipex-llm`` *as an accelerated backend for* ``ollama``) on Intel GPU
* `vLLM `_: running ``ipex-llm`` in ``vLLM`` on both Intel `GPU `_ and `CPU `_
* `FastChat `_: running ``ipex-llm`` in ``FastChat`` serving on both Intel GPU and CPU
* `LangChain-Chatchat RAG `_: running ``ipex-llm`` in ``LangChain-Chatchat`` (*Knowledge Base QA using* **RAG** *pipeline*)
* `Text-Generation-WebUI `_: running ``ipex-llm`` in ``oobabooga`` **WebUI**
* `Benchmarking `_: running latency and throughput benchmarks for ``ipex-llm`` on Intel CPU and GPU
============================================
Install
============================================
* `Windows GPU `_: installing ``ipex-llm`` on Windows with Intel GPU
* `Linux GPU `_: installing ``ipex-llm`` on Linux with Intel GPU
.. seealso::
   For more details, please refer to the `installation guide `_
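After installation, a quick sanity check that PyTorch can see the Intel GPU may look like the following (assuming an ``ipex-llm[xpu]`` environment is active):

.. code-block:: python

   # Verify that the XPU (Intel GPU) backend is available.
   import torch
   import intel_extension_for_pytorch as ipex  # registers the 'xpu' device with PyTorch

   print(torch.xpu.is_available())       # True if the GPU driver/runtime are correctly set up
   print(torch.xpu.get_device_name(0))   # e.g. the name of your Arc / Flex / Max GPU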
============================================
Code Examples
============================================
* Low bit inference
  * `INT4 inference `_: **INT4** LLM inference on Intel `GPU `_ and `CPU `_
  * `FP8/FP4 inference `_: **FP8** and **FP4** LLM inference on Intel `GPU `_
  * `INT8 inference `_: **INT8** LLM inference on Intel `GPU `_ and `CPU `_
  * `INT2 inference `_: **INT2** LLM inference (based on llama.cpp IQ2 mechanism) on Intel `GPU `_
* FP16/BF16 inference
  * **FP16** LLM inference on Intel `GPU `_, with possible `self-speculative decoding `_ optimization
  * **BF16** LLM inference on Intel `CPU `_, with possible `self-speculative decoding `_ optimization 
* Save and load
  * `Low-bit models `_: saving and loading ``ipex-llm`` low-bit models (see the sketch after this list)
  * `GGUF `_: directly loading GGUF models into ``ipex-llm``
  * `AWQ `_: directly loading AWQ models into ``ipex-llm``
  * `GPTQ `_: directly loading GPTQ models into ``ipex-llm``
* Finetuning
  * LLM finetuning on Intel `GPU `_, including `LoRA `_, `QLoRA `_, `DPO `_, `QA-LoRA `_ and `ReLoRA `_
  * QLoRA finetuning on Intel `CPU `_
* Integration with community libraries
  * `HuggingFace transformers `_
  * `Standard PyTorch model `_
  * `DeepSpeed-AutoTP `_
  * `HuggingFace PEFT `_
  * `HuggingFace TRL `_
  * `LangChain `_
  * `LlamaIndex `_
  * `AutoGen `_
  * `ModelScope `_
* `Tutorials `_
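As a minimal sketch of the save-and-load flow listed above (``save_low_bit``/``load_low_bit`` usage as assumed here; model id and paths are placeholders):

.. code-block:: python

   # Convert a model to low-bit once, save it, and reload it directly in later runs.
   from ipex_llm.transformers import AutoModelForCausalLM
   from transformers import AutoTokenizer

   model_path = "meta-llama/Llama-2-7b-chat-hf"   # placeholder model id
   save_dir = "./llama-2-7b-int4"                 # placeholder output directory

   # One-time conversion: load with INT4 optimizations, then persist the low-bit weights
   model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
   model.save_low_bit(save_dir)
   AutoTokenizer.from_pretrained(model_path).save_pretrained(save_dir)

   # Later runs: load the saved low-bit model directly (no re-quantization needed)
   model = AutoModelForCausalLM.load_low_bit(save_dir)
   tokenizer = AutoTokenizer.from_pretrained(save_dir)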
.. seealso::
   For more details, please refer to the |ipex_llm_document|_.
.. |ipex_llm_document| replace:: ``ipex-llm`` document
.. _ipex_llm_document: doc/LLM/index.html
************************************************
Verified Models
************************************************
.. csv-table::
   :header: "Model", "CPU Example", "GPU Example"
   :widths: 30, 35, 35

   "LLaMA (such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.)", "link1, link2", "link, link"
   "LLaMA 2", "link1, link2", "link, link"
   "LLaMA 3", "link", "link"
   "ChatGLM", "link", ""
   "ChatGLM2", "link", "link"
   "ChatGLM3", "link", "link"
   "GLM-4", "link", "link"
   "Mistral", "link", "link"
   "Mixtral", "link", "link"
   "Falcon", "link", "link"
   "MPT", "link", "link"
   "Dolly-v1", "link", "link"
   "Dolly-v2", "link", "link"
   "Replit Code", "link", "link"
   "RedPajama", "link1, link2", ""
   "Phoenix", "link1, link2", ""
   "StarCoder", "link1, link2", "link"
   "Baichuan", "link", "link"
   "Baichuan2", "link", "link"
   "InternLM", "link", "link"
   "Qwen", "link", "link"
   "Qwen1.5", "link", "link"
   "Qwen2", "link", "link"
   "Qwen-VL", "link", "link"
   "Aquila", "link", "link"
   "Aquila2", "link", "link"
   "MOSS", "link", ""
   "Whisper", "link", "link"
   "Phi-1_5", "link", "link"
   "Flan-t5", "link", "link"
   "LLaVA", "link", "link"
   "CodeLlama", "link", "link"
   "Skywork", "link", ""
   "InternLM-XComposer", "link", ""
   "WizardCoder-Python", "link", ""
   "CodeShell", "link", ""
   "Fuyu", "link", ""
   "Distil-Whisper", "link", "link"
   "Yi", "link", "link"
   "BlueLM", "link", "link"
   "Mamba", "link", "link"
   "SOLAR", "link", "link"
   "Phixtral", "link", "link"
   "InternLM2", "link", "link"
   "RWKV4", "", "link"
   "RWKV5", "", "link"
   "Bark", "link", "link"
   "SpeechT5", "", "link"
   "DeepSeek-MoE", "link", ""
   "Ziya-Coding-34B-v1.0", "link", ""
   "Phi-2", "link", "link"
   "Phi-3", "link", "link"
   "Phi-3-vision", "link", "link"
   "Yuan2", "link", "link"
   "Gemma", "link", "link"
   "DeciLM-7B", "link", "link"
   "Deepseek", "link", "link"
   "StableLM", "link", "link"
   "CodeGemma", "link", "link"
   "Command-R/cohere", "link", "link"
   "CodeGeeX2", "link", "link"
   "MiniCPM", "link", "link"
************************************************
Get Support
************************************************
* Please report a bug or raise a feature request by opening a `GitHub Issue `_
* Please report a vulnerability by opening a draft `GitHub Security Advisory `_
------
Performance varies by use, configuration and other factors. ``ipex-llm`` may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.