.. meta::
   :google-site-verification: S66K6GAclKw1RroxU0Rka_2d1LZFVe27M0gRneEsIVI
.. important::

   ``bigdl-llm`` has now become ``ipex-llm`` (see the migration guide here); you may find the original ``BigDL`` project here.
------
################################################
💫 IPEX-LLM
################################################
IPEX-LLM is a PyTorch library for running LLMs on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max) with very low latency [1]_.
.. note::

   - It is built on top of **Intel Extension for PyTorch** (``IPEX``), as well as the excellent work of ``llama.cpp``, ``bitsandbytes``, ``vLLM``, ``qlora``, ``AutoGPTQ``, ``AutoAWQ``, etc.
   - It provides seamless integration with llama.cpp, Text-Generation-WebUI, HuggingFace transformers, HuggingFace PEFT, LangChain, LlamaIndex, DeepSpeed-AutoTP, vLLM, FastChat, HuggingFace TRL, AutoGen, ModelScope, etc.
   - 50+ models have been optimized/verified on ``ipex-llm`` (including LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM, Baichuan, Qwen, RWKV, and more); see the complete list here.
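To make the low-bit optimization above concrete, here is a minimal sketch of INT4 inference through the HuggingFace ``transformers``-style API of ``ipex-llm`` (the model ID is only a placeholder; set ``device`` to ``"cpu"`` to run without an Intel GPU):

.. code-block:: python

   import torch
   from transformers import AutoTokenizer
   from ipex_llm.transformers import AutoModelForCausalLM  # drop-in replacement for transformers

   model_path = "meta-llama/Llama-2-7b-chat-hf"  # placeholder: any supported HF model
   device = "xpu"                                # Intel GPU; use "cpu" to run on CPU

   # load_in_4bit=True quantizes the weights to INT4 while the model is loaded
   model = AutoModelForCausalLM.from_pretrained(model_path,
                                                load_in_4bit=True,
                                                trust_remote_code=True).to(device)
   tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

   with torch.inference_mode():
       input_ids = tokenizer.encode("What is AI?", return_tensors="pt").to(device)
       output = model.generate(input_ids, max_new_tokens=32)
       print(tokenizer.decode(output[0], skip_special_tokens=True))

The code examples listed later on this page cover other precisions (FP8/FP4, INT8, INT2, FP16/BF16) as well as finetuning and integrations.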
************************************************
Latest update 🔥
************************************************
* [2024/03] ``bigdl-llm`` has now become ``ipex-llm`` (see the migration guide `here `_); you may find the original ``BigDL`` project `here `_.
* [2024/02] ``ipex-llm`` now supports directly loading model from `ModelScope `_ (`魔搭 `_).
* [2024/02] ``ipex-llm`` added initial **INT2** support (based on llama.cpp `IQ2 `_ mechanism), which makes it possible to run large LLMs (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM.
* [2024/02] Users can now use ``ipex-llm`` through `Text-Generation-WebUI `_ GUI.
* [2024/02] ``ipex-llm`` now supports `Self-Speculative Decoding `_, which in practice brings **~30% speedup** for FP16 and BF16 inference latency on Intel `GPU `_ and `CPU `_, respectively.
* [2024/02] ``ipex-llm`` now supports a comprehensive list of LLM finetuning on Intel GPU (including `LoRA `_, `QLoRA `_, `DPO `_, `QA-LoRA `_ and `ReLoRA `_).
* [2024/01] Using ``ipex-llm`` `QLoRA `_, we managed to finetune LLaMA2-7B in **21 minutes** and LLaMA2-70B in **3.14 hours** on 8 Intel Max 1550 GPUs for `Stanford-Alpaca `_ (see the blog `here `_).
.. dropdown:: More updates
   :color: primary

   * [2023/12] ``ipex-llm`` now supports `ReLoRA `_ (see `"ReLoRA: High-Rank Training Through Low-Rank Updates" `_).
   * [2023/12] ``ipex-llm`` now supports `Mixtral-8x7B `_ on both Intel `GPU `_ and `CPU `_.
   * [2023/12] ``ipex-llm`` now supports `QA-LoRA `_ (see `"QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models" `_).
   * [2023/12] ``ipex-llm`` now supports `FP8 and FP4 inference `_ on Intel **GPU**.
   * [2023/11] Initial support for directly loading `GGUF `_, `AWQ `_ and `GPTQ `_ models into ``ipex-llm`` is available.
   * [2023/11] ``ipex-llm`` now supports `vLLM continuous batching `_ on both Intel `GPU `_ and `CPU `_.
   * [2023/10] ``ipex-llm`` now supports `QLoRA finetuning `_ on both Intel `GPU `_ and `CPU `_.
   * [2023/10] ``ipex-llm`` now supports `FastChat serving `_ on both Intel CPU and GPU.
   * [2023/09] ``ipex-llm`` now supports `Intel GPU `_ (including iGPU, Arc, Flex and Max).
   * [2023/09] ``ipex-llm`` `tutorial `_ is released.
************************************************
``ipex-llm`` Demos
************************************************
See the **optimized performance** of ``chatglm2-6b`` and ``llama-2-13b-chat`` models on 12th Gen Intel Core CPU and Intel Arc GPU below.
(Demo videos: chatglm2-6b and llama-2-13b-chat on 12th Gen Intel Core CPU; chatglm2-6b and llama-2-13b-chat on Intel Arc GPU.)
************************************************
``ipex-llm`` Quickstart
************************************************
* `Windows GPU `_: installing ``ipex-llm`` on Windows with Intel GPU
* `Linux GPU `_: installing ``ipex-llm`` on Linux with Intel GPU
* `Docker `_: using ``ipex-llm`` Docker images on Intel CPU and GPU
.. seealso::

   For more details, please refer to the `installation guide `_.
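After installation, a quick sanity check like the sketch below can confirm that PyTorch sees the Intel GPU (``xpu``) device; it assumes the GPU flavor of ``ipex-llm`` (and hence ``intel_extension_for_pytorch``) is installed in the current environment:

.. code-block:: python

   # Sanity-check sketch: verify that the Intel GPU ("xpu") device is visible.
   # Assumes a GPU installation of ipex-llm, which provides
   # intel_extension_for_pytorch for the xpu backend.
   import torch
   import intel_extension_for_pytorch as ipex  # noqa: F401  (registers the xpu backend)

   print("XPU available:", torch.xpu.is_available())
   if torch.xpu.is_available():
       print("Device name:", torch.xpu.get_device_name(0))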
============================================
Run ``ipex-llm``
============================================
* `llama.cpp `_: running **ipex-llm for llama.cpp** (*using the C++ interface of* ``ipex-llm`` *as an accelerated backend for* ``llama.cpp`` *on Intel GPU*)
* `vLLM `_: running ``ipex-llm`` in ``vLLM`` on both Intel `GPU `_ and `CPU `_
* `FastChat `_: running ``ipex-llm`` in ``FastChat`` serving on both Intel GPU and CPU
* `LangChain-Chatchat RAG `_: running ``ipex-llm`` in ``LangChain-Chatchat`` (*Knowledge Base QA using* **RAG** *pipeline*)
* `Text-Generation-WebUI `_: running ``ipex-llm`` in ``oobabooga`` **WebUI**
* `Benchmarking `_: running (latency and throughput) benchmarks for ``ipex-llm`` on Intel CPU and GPU
============================================
Code Examples
============================================
* Low-bit inference

  * `INT4 inference `_: **INT4** LLM inference on Intel `GPU `_ and `CPU `_
  * `FP8/FP4 inference `_: **FP8** and **FP4** LLM inference on Intel `GPU `_
  * `INT8 inference `_: **INT8** LLM inference on Intel `GPU `_ and `CPU `_
  * `INT2 inference `_: **INT2** LLM inference (based on llama.cpp IQ2 mechanism) on Intel `GPU `_

* FP16/BF16 inference

  * **FP16** LLM inference on Intel `GPU `_, with possible `self-speculative decoding `_ optimization
  * **BF16** LLM inference on Intel `CPU `_, with possible `self-speculative decoding `_ optimization

* Save and load

  * `Low-bit models `_: saving and loading ``ipex-llm`` low-bit models (see the sketch after this list)
  * `GGUF `_: directly loading GGUF models into ``ipex-llm``
  * `AWQ `_: directly loading AWQ models into ``ipex-llm``
  * `GPTQ `_: directly loading GPTQ models into ``ipex-llm``

* Finetuning

  * LLM finetuning on Intel `GPU `_, including `LoRA `_, `QLoRA `_, `DPO `_, `QA-LoRA `_ and `ReLoRA `_
  * QLoRA finetuning on Intel `CPU `_

* Integration with community libraries

  * `HuggingFace transformers `_
  * `Standard PyTorch model `_
  * `DeepSpeed-AutoTP `_
  * `HuggingFace PEFT `_
  * `HuggingFace TRL `_
  * `LangChain `_
  * `LlamaIndex `_
  * `AutoGen `_
  * `ModelScope `_

* `Tutorials `_
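For the low-bit save/load item above, the rough sketch below shows the intended flow: quantize once, persist the low-bit weights, and reload them directly on later runs. The ``save_low_bit``/``load_low_bit`` helpers and all paths here are assumptions for illustration; refer to the linked example for the authoritative API.

.. code-block:: python

   # Sketch of saving and re-loading an ipex-llm low-bit model so that
   # quantization is not repeated on every start-up.
   # save_low_bit/load_low_bit, the model ID and the paths are placeholders.
   from transformers import AutoTokenizer
   from ipex_llm.transformers import AutoModelForCausalLM

   model_path = "meta-llama/Llama-2-7b-chat-hf"   # placeholder model
   low_bit_path = "./llama-2-7b-chat-int4"        # placeholder output directory

   # First run: quantize while loading, then persist the low-bit weights.
   model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
   model.save_low_bit(low_bit_path)
   AutoTokenizer.from_pretrained(model_path).save_pretrained(low_bit_path)

   # Later runs: load the already-quantized weights directly.
   model = AutoModelForCausalLM.load_low_bit(low_bit_path)
   tokenizer = AutoTokenizer.from_pretrained(low_bit_path)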
.. seealso::

   For more details, please refer to the |ipex_llm_document|_.
.. |ipex_llm_document| replace:: ``ipex-llm`` document
.. _ipex_llm_document: doc/LLM/index.html
------
.. [1] Performance varies by use, configuration and other factors. ``ipex-llm`` may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.