.. meta::
   :google-site-verification: S66K6GAclKw1RroxU0Rka_2d1LZFVe27M0gRneEsIVI

.. important::

   ``bigdl-llm`` has now become ``ipex-llm`` (see the migration guide here); you may find the original ``BigDL`` project here.

------

################################################
💫 IPEX-LLM
################################################

IPEX-LLM is a PyTorch library for running LLMs on Intel CPU and GPU (e.g., a local PC with iGPU, or a discrete GPU such as Arc, Flex and Max) with very low latency [1].
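
As a quick taste, the minimal sketch below loads a Hugging Face model through ``ipex-llm`` with INT4 optimization and runs it on an Intel GPU. The model id and prompt are placeholders, and the ``xpu`` device assumes the Intel GPU driver and oneAPI runtime are already set up (use ``cpu`` otherwise).

.. code-block:: python

   # Minimal sketch: load a model with ipex-llm INT4 optimization and generate on an Intel GPU.
   # The model id and prompt are placeholders; adjust them to your setup.
   import torch
   from transformers import AutoTokenizer
   from ipex_llm.transformers import AutoModelForCausalLM  # drop-in replacement for transformers

   model_path = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model id
   model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
   model = model.to("xpu")  # use "cpu" on machines without an Intel GPU
   tokenizer = AutoTokenizer.from_pretrained(model_path)

   with torch.inference_mode():
       input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids.to("xpu")
       output = model.generate(input_ids, max_new_tokens=32)
       print(tokenizer.decode(output[0], skip_special_tokens=True))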


************************************************
Latest update 🔥
************************************************

* [2024/03] ``bigdl-llm`` has now become ``ipex-llm`` (see the migration guide `here `_); you may find the original ``BigDL`` project `here `_.
* [2024/02] ``ipex-llm`` now supports directly loading models from `ModelScope `_ (`魔搭 `_); a hedged usage sketch is shown after the demo section below.
* [2024/02] ``ipex-llm`` added initial **INT2** support (based on the llama.cpp `IQ2 `_ mechanism), which makes it possible to run large LLMs (e.g., Mixtral-8x7B) on an Intel GPU with 16GB VRAM.
* [2024/02] Users can now use ``ipex-llm`` through the `Text-Generation-WebUI `_ GUI.
* [2024/02] ``ipex-llm`` now supports `*Self-Speculative Decoding* `_, which in practice brings **~30% speedup** for FP16 and BF16 inference latency on Intel `GPU `_ and `CPU `_ respectively.
* [2024/02] ``ipex-llm`` now supports a comprehensive list of LLM finetuning methods on Intel GPU (including `LoRA `_, `QLoRA `_, `DPO `_, `QA-LoRA `_ and `ReLoRA `_).
* [2024/01] Using ``ipex-llm`` `QLoRA `_, we managed to finetune LLaMA2-7B in **21 minutes** and LLaMA2-70B in **3.14 hours** on 8 Intel Max 1550 GPUs for `Stanford-Alpaca `_ (see the blog `here `_).

.. dropdown:: More updates
   :color: primary

   * [2023/12] ``ipex-llm`` now supports `ReLoRA `_ (see `"ReLoRA: High-Rank Training Through Low-Rank Updates" `_).
   * [2023/12] ``ipex-llm`` now supports `Mixtral-8x7B `_ on both Intel `GPU `_ and `CPU `_.
   * [2023/12] ``ipex-llm`` now supports `QA-LoRA `_ (see `"QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models" `_).
   * [2023/12] ``ipex-llm`` now supports `FP8 and FP4 inference `_ on Intel **GPU**.
   * [2023/11] Initial support for directly loading `GGUF `_, `AWQ `_ and `GPTQ `_ models into ``ipex-llm`` is available.
   * [2023/11] ``ipex-llm`` now supports `vLLM continuous batching `_ on both Intel `GPU `_ and `CPU `_.
   * [2023/10] ``ipex-llm`` now supports `QLoRA finetuning `_ on both Intel `GPU `_ and `CPU `_.
   * [2023/10] ``ipex-llm`` now supports `FastChat serving `_ on both Intel CPU and GPU.
   * [2023/09] ``ipex-llm`` now supports `Intel GPU `_ (including iGPU, Arc, Flex and MAX).
   * [2023/09] The ``ipex-llm`` `tutorial `_ is released.

************************************************
``ipex-llm`` Demos
************************************************

See the **optimized performance** of ``chatglm2-6b`` and ``llama-2-13b-chat`` models on a 12th Gen Intel Core CPU and an Intel Arc GPU below.
(Demo recordings: ``chatglm2-6b`` and ``llama-2-13b-chat`` on a 12th Gen Intel Core CPU, and ``chatglm2-6b`` and ``llama-2-13b-chat`` on an Intel Arc GPU.)
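
To illustrate the ModelScope support mentioned in the updates above, here is a hedged sketch of loading a model by its ModelScope id. The model id is a placeholder and the ``model_hub="modelscope"`` argument is an assumption based on the ModelScope example; check that example for the exact usage.

.. code-block:: python

   # Hedged sketch: load a model from ModelScope instead of the Hugging Face Hub.
   # The model id and the model_hub argument are assumptions; see the ModelScope
   # example referenced in the updates above for the exact usage.
   from ipex_llm.transformers import AutoModelForCausalLM

   model = AutoModelForCausalLM.from_pretrained(
       "ZhipuAI/chatglm2-6b",      # placeholder ModelScope model id
       load_in_4bit=True,
       trust_remote_code=True,
       model_hub="modelscope",     # assumed switch for downloading from ModelScope
   )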
************************************************
``ipex-llm`` Quickstart
************************************************

* `Windows GPU `_: installing ``ipex-llm`` on Windows with Intel GPU
* `Linux GPU `_: installing ``ipex-llm`` on Linux with Intel GPU
* `Docker `_: using ``ipex-llm`` dockers on Intel CPU and GPU

.. seealso::

   For more details, please refer to the `installation guide `_.

============================================
Run ``ipex-llm``
============================================

* `llama.cpp `_: running **ipex-llm for llama.cpp** (*using the C++ interface of* ``ipex-llm`` *as an accelerated backend for* ``llama.cpp`` *on Intel GPU*)
* `vLLM `_: running ``ipex-llm`` in ``vLLM`` on both Intel `GPU `_ and `CPU `_
* `FastChat `_: running ``ipex-llm`` in ``FastChat`` serving on both Intel GPU and CPU
* `LangChain-Chatchat RAG `_: running ``ipex-llm`` in ``LangChain-Chatchat`` (*Knowledge Base QA using a* **RAG** *pipeline*)
* `Text-Generation-WebUI `_: running ``ipex-llm`` in the ``oobabooga`` **WebUI**
* `Benchmarking `_: running (latency and throughput) benchmarks for ``ipex-llm`` on Intel CPU and GPU

============================================
Code Examples
============================================

* Low-bit inference

  * `INT4 inference `_: **INT4** LLM inference on Intel `GPU `_ and `CPU `_
  * `FP8/FP4 inference `_: **FP8** and **FP4** LLM inference on Intel `GPU `_
  * `INT8 inference `_: **INT8** LLM inference on Intel `GPU `_ and `CPU `_
  * `INT2 inference `_: **INT2** LLM inference (based on the llama.cpp IQ2 mechanism) on Intel `GPU `_

* FP16/BF16 inference

  * **FP16** LLM inference on Intel `GPU `_, with possible `self-speculative decoding `_ optimization
  * **BF16** LLM inference on Intel `CPU `_, with possible `self-speculative decoding `_ optimization

* Save and load (a hedged sketch follows this list)

  * `Low-bit models `_: saving and loading ``ipex-llm`` low-bit models
  * `GGUF `_: directly loading GGUF models into ``ipex-llm``
  * `AWQ `_: directly loading AWQ models into ``ipex-llm``
  * `GPTQ `_: directly loading GPTQ models into ``ipex-llm``

* Finetuning

  * LLM finetuning on Intel `GPU `_, including `LoRA `_, `QLoRA `_, `DPO `_, `QA-LoRA `_ and `ReLoRA `_
  * QLoRA finetuning on Intel `CPU `_

* Integration with community libraries

  * `HuggingFace transformers `_
  * `Standard PyTorch model `_
  * `DeepSpeed-AutoTP `_
  * `HuggingFace PEFT `_
  * `HuggingFace TRL `_
  * `LangChain `_
  * `LlamaIndex `_
  * `AutoGen `_
  * `ModelScope `_

* `Tutorials `_

.. seealso::

   For more details, please refer to the |ipex_llm_document|_.

.. |ipex_llm_document| replace:: ``ipex-llm`` document
.. _ipex_llm_document: doc/LLM/index.html
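
To make the "Save and load" items above concrete, here is a hedged sketch of converting a model to a low-bit format once and reloading it later; the model id and directory are placeholders, and the flow follows the low-bit save/load example.

.. code-block:: python

   # Minimal sketch of the low-bit save/load flow; model id and paths are placeholders.
   from ipex_llm.transformers import AutoModelForCausalLM

   model_path = "meta-llama/Llama-2-7b-chat-hf"   # placeholder model id
   saved_dir = "./llama-2-7b-chat-int4"           # placeholder output directory

   # Convert once: load with INT4 optimization and save the low-bit weights.
   model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
   model.save_low_bit(saved_dir)

   # Later runs: reload the already-converted low-bit model directly,
   # skipping the original FP16/FP32 checkpoint.
   model = AutoModelForCausalLM.load_low_bit(saved_dir)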

------

[1] Performance varies by use, configuration and other factors. ``ipex-llm`` may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.