.. meta::
   :google-site-verification: S66K6GAclKw1RroxU0Rka_2d1LZFVe27M0gRneEsIVI

################################################
The IPEX-LLM Project
################################################

------

************************************************
IPEX-LLM
************************************************

``ipex-llm`` is a library for running LLMs (large language models) on Intel XPU (from laptop to GPU to cloud) using INT4/FP4/INT8/FP8 with very low latency [1] (for any PyTorch model).
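To give a rough sense of why the low-bit formats listed above matter, the sketch below estimates weight-only memory at different bit widths. The 7-billion-parameter model size is a hypothetical example, and the estimate ignores activations, the KV cache and quantization metadata, so real usage will be higher. It is not taken from ``ipex-llm`` itself.

```python
# Rough weight-only memory estimate; ignores activations, KV cache and
# quantization metadata (scales, zero points), so real usage is higher.


def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Return the approximate size of the model weights in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / (1024 ** 3)


if __name__ == "__main__":
    num_params = 7e9  # hypothetical 7B-parameter model (assumption)
    for label, bits in [("FP16", 16), ("INT8/FP8", 8), ("INT4/FP4", 4)]:
        print(f"{label:>9}: {weight_memory_gib(num_params, bits):.1f} GiB")
```

Halving the bit width halves the weight footprint, which is what lets larger models fit into a fixed memory budget.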

.. note::

   It is built on top of the excellent work of `llama.cpp `_, `gptq `_, `bitsandbytes `_, `qlora `_, etc.

============================================
Latest update 🔥
============================================

- [2024/03] **LangChain** added support for ``ipex-llm``; see the details `here `_.
- [2024/02] ``ipex-llm`` now supports directly loading models from `ModelScope `_ (`魔搭 `_).
- [2024/02] ``ipex-llm`` added initial **INT2** support (based on the llama.cpp `IQ2 `_ mechanism), which makes it possible to run large LLMs (e.g., Mixtral-8x7B) on an Intel GPU with 16GB VRAM.
- [2024/02] Users can now use ``ipex-llm`` through the `Text-Generation-WebUI `_ GUI.
- [2024/02] ``ipex-llm`` now supports `Self-Speculative Decoding `_, which in practice brings a **~30% speedup** in FP16 and BF16 inference latency on Intel `GPU `_ and `CPU `_ respectively.
- [2024/02] ``ipex-llm`` now supports a comprehensive list of LLM finetuning methods on Intel GPU (including `LoRA `_, `QLoRA `_, `DPO `_, `QA-LoRA `_ and `ReLoRA `_).
- [2024/01] Using ``ipex-llm`` `QLoRA `_, we managed to finetune LLaMA2-7B in **21 minutes** and LLaMA2-70B in **3.14 hours** on 8 Intel Max 1550 GPUs for `Stanford-Alpaca `_ (see the blog `here `_).
- [2024/01] 🔔🔔🔔 **The default** ``ipex-llm`` **GPU Linux installation has switched from PyTorch 2.0 to PyTorch 2.1, which requires new oneAPI and GPU driver versions. (See the** `GPU installation guide `_ **for more details.)**
- [2023/12] ``ipex-llm`` now supports `ReLoRA `_ (see `"ReLoRA: High-Rank Training Through Low-Rank Updates" `_).
- [2023/12] ``ipex-llm`` now supports `Mixtral-8x7B `_ on both Intel `GPU `_ and `CPU `_.
- [2023/12] ``ipex-llm`` now supports `QA-LoRA `_ (see `"QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models" `_).
- [2023/12] ``ipex-llm`` now supports `FP8 and FP4 inference `_ on Intel **GPU**.
- [2023/11] Initial support for directly loading `GGUF `_, `AWQ `_ and `GPTQ `_ models into ``ipex-llm`` is available.
- [2023/11] ``ipex-llm`` now supports `vLLM continuous batching `_ on both Intel `GPU `_ and `CPU `_.
- [2023/10] ``ipex-llm`` now supports `QLoRA finetuning `_ on both Intel `GPU `_ and `CPU `_.
- [2023/10] ``ipex-llm`` now supports `FastChat serving `_ on both Intel CPU and GPU.
- [2023/09] ``ipex-llm`` now supports `Intel GPU `_ (including Arc, Flex and MAX).
- [2023/09] The ``ipex-llm`` `tutorial `_ is released.
- Over 30 models have been verified on ``ipex-llm``, including *LLaMA/LLaMA2, ChatGLM2/ChatGLM3, Mistral, Falcon, MPT, LLaVA, WizardCoder, Dolly, Whisper, Baichuan/Baichuan2, InternLM, Skywork, QWen/Qwen-VL, Aquila, MOSS* and more; see the complete list `here `_.

============================================
``ipex-llm`` demos
============================================

See the **optimized performance** of ``chatglm2-6b`` and ``llama-2-13b-chat`` models on 12th Gen Intel Core CPU and Intel Arc GPU below.

.. list-table::
   :header-rows: 1

   * - Hardware
     - Models shown
   * - 12th Gen Intel Core CPU
     - ``chatglm2-6b``, ``llama-2-13b-chat``
   * - Intel Arc GPU
     - ``chatglm2-6b``, ``llama-2-13b-chat``

============================================
``ipex-llm`` quickstart
============================================

- `Windows GPU installation `_
- `Run IPEX-LLM in Text-Generation-WebUI `_
- `Run IPEX-LLM using Docker `_
- `CPU quickstart <#cpu-quickstart>`_
- `GPU quickstart <#gpu-quickstart>`_

--------------------------------------------
CPU Quickstart
--------------------------------------------

You may install ``ipex-llm`` on Intel CPU as follows:

.. note::

   See the `CPU installation guide `_ for more details.

.. code-block:: console

   pip install --pre --upgrade ipex-llm[all]

.. note::

   ``ipex-llm`` has been tested on Python 3.9, 3.10 and 3.11.

You can then apply INT4 optimizations to any Hugging Face *Transformers* model as follows.

.. code-block:: python

   # Load a Hugging Face Transformers model with INT4 optimizations
   from ipex_llm.transformers import AutoModelForCausalLM
   from transformers import AutoTokenizer

   model_path = '/path/to/model/'
   model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)

   # Run the optimized model on Intel CPU
   tokenizer = AutoTokenizer.from_pretrained(model_path)
   input_str = 'What is AI?'  # example prompt; replace with your own
   input_ids = tokenizer.encode(input_str, return_tensors='pt')
   output_ids = model.generate(input_ids, max_new_tokens=32)  # example length
   output = tokenizer.batch_decode(output_ids)

--------------------------------------------
GPU Quickstart
--------------------------------------------

You may install ``ipex-llm`` on Intel GPU as follows:

.. note::

   See the `GPU installation guide `_ for more details.

.. code-block:: console

   # the command below installs intel_extension_for_pytorch==2.1.10+xpu by default
   pip install --pre --upgrade ipex-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu

.. note::

   ``ipex-llm`` has been tested on Python 3.9, 3.10 and 3.11.

You can then apply INT4 optimizations to any Hugging Face *Transformers* model on Intel GPU as follows.

.. code-block:: python

   # Load a Hugging Face Transformers model with INT4 optimizations
   from ipex_llm.transformers import AutoModelForCausalLM
   from transformers import AutoTokenizer

   model_path = '/path/to/model/'
   model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)

   # Run the optimized model on Intel GPU
   model = model.to('xpu')
   tokenizer = AutoTokenizer.from_pretrained(model_path)
   input_str = 'What is AI?'  # example prompt; replace with your own
   input_ids = tokenizer.encode(input_str, return_tensors='pt').to('xpu')
   output_ids = model.generate(input_ids, max_new_tokens=32)  # example length
   output = tokenizer.batch_decode(output_ids.cpu())

**For more details, please refer to the ipex-llm** `Document `_, `Readme `_, `Tutorial `_ and `API Doc `_.
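The quickstart notes that ``ipex-llm`` has been tested on Python 3.9, 3.10 and 3.11. A minimal pre-install check along those lines might look like the sketch below; the helper name ``is_tested_python`` is hypothetical and not part of ``ipex-llm``, and an untested version is not necessarily unsupported.

```python
import sys

# Versions the quickstart lists as tested; other versions may still work.
TESTED_PYTHON_VERSIONS = {(3, 9), (3, 10), (3, 11)}


def is_tested_python(version_info=None) -> bool:
    """Return True if the (major, minor) pair is in the tested set."""
    info = sys.version_info if version_info is None else version_info
    return (info[0], info[1]) in TESTED_PYTHON_VERSIONS


if __name__ == "__main__":
    if not is_tested_python():
        major, minor = sys.version_info[0], sys.version_info[1]
        print(f"Warning: Python {major}.{minor} is outside the tested "
              "set (3.9, 3.10, 3.11).")
```

Running the check before ``pip install`` gives an early warning rather than a hard failure, which suits a version list that is described as tested rather than required.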