Important
`bigdl-llm` has now become `ipex-llm` (see the migration guide here); you may find the original `BigDL` project here.
💫 Intel® LLM Library for PyTorch*
< English | 中文 >
IPEX-LLM is an LLM acceleration library for Intel GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max), NPU and CPU [1].
Note
- It is built on top of the excellent work of `llama.cpp`, `transformers`, `bitsandbytes`, `vLLM`, `qlora`, `AutoGPTQ`, `AutoAWQ`, etc.
- It provides seamless integration with llama.cpp, Ollama, HuggingFace transformers, LangChain, LlamaIndex, vLLM, Text-Generation-WebUI, DeepSpeed-AutoTP, FastChat, Axolotl, HuggingFace PEFT, HuggingFace TRL, AutoGen, ModelScope, etc. (a minimal usage sketch is shown right after this note).
- 70+ models have been optimized/verified on `ipex-llm` (e.g., Llama, Phi, Mistral, Mixtral, Whisper, Qwen, MiniCPM, Qwen-VL, MiniCPM-V and more), with state-of-the-art LLM optimizations, XPU acceleration and low-bit (FP8/FP6/FP4/INT4) support; see the complete list here.
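For a first taste of the Python interface mentioned above, the snippet below is a minimal sketch of low-bit inference with `ipex-llm`'s HuggingFace-style API (the model id, prompt and `"xpu"` device string are placeholders; see the Quickstart and the linked examples for the authoritative usage):

```python
# A minimal sketch (not the official example): load a HuggingFace model with
# ipex-llm low-bit (INT4) optimization and generate a short completion.
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM  # ipex-llm's drop-in AutoModel

model_path = "meta-llama/Llama-2-7b-chat-hf"   # placeholder model id

# Convert to 4-bit on the fly while loading
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True)
model = model.to("xpu")                        # Intel GPU; omit this line to stay on CPU

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
inputs = tokenizer("What is IPEX-LLM?", return_tensors="pt").to("xpu")

with torch.inference_mode():
    output_ids = model.generate(inputs.input_ids, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```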
Project updates
- [2024/07] We added support for running Microsoft's **GraphRAG** using local LLM on Intel GPU; see the quickstart guide [here](docs/mddocs/Quickstart/graphrag_quickstart.md).
- [2024/07] We added extensive support for Large Multimodal Models, including [StableDiffusion](https://github.com/jason-dai/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/Multimodal/StableDiffusion), [Phi-3-Vision](python/llm/example/GPU/HuggingFace/Multimodal/phi-3-vision), [Qwen-VL](python/llm/example/GPU/HuggingFace/Multimodal/qwen-vl), and [more](python/llm/example/GPU/HuggingFace/Multimodal).
- [2024/07] We added **FP6** support on Intel [GPU](python/llm/example/GPU/HuggingFace/More-Data-Types).
- [2024/06] We added experimental **NPU** support for Intel Core Ultra processors; see the examples [here](python/llm/example/NPU/HF-Transformers-AutoModels).
- [2024/06] We added extensive support of **pipeline parallel** [inference](python/llm/example/GPU/Pipeline-Parallel-Inference), which makes it easy to run large-sized LLMs using 2 or more Intel GPUs (such as Arc).
- [2024/06] We added support for running **RAGFlow** with `ipex-llm` on Intel [GPU](docs/mddocs/Quickstart/ragflow_quickstart.md).
- [2024/05] `ipex-llm` now supports **Axolotl** for LLM finetuning on Intel GPU; see the quickstart [here](docs/mddocs/Quickstart/axolotl_quickstart.md).
- [2024/05] You can now easily run `ipex-llm` inference, serving and finetuning using the **Docker** [images](#docker).
- [2024/05] You can now install `ipex-llm` on Windows using just "*[one command](docs/mddocs/Quickstart/install_windows_gpu.md#install-ipex-llm)*".
- [2024/04] You can now run **Open WebUI** on Intel GPU using `ipex-llm`; see the quickstart [here](docs/mddocs/Quickstart/open_webui_with_ollama_quickstart.md).
- [2024/04] You can now run **Llama 3** on Intel GPU using `llama.cpp` and `ollama` with `ipex-llm`; see the quickstart [here](docs/mddocs/Quickstart/llama3_llamacpp_ollama_quickstart.md).
- [2024/04] `ipex-llm` now supports **Llama 3** on both Intel [GPU](python/llm/example/GPU/HuggingFace/LLM/llama3) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/Model/llama3).
- [2024/04] `ipex-llm` now provides a C++ interface, which can be used as an accelerated backend for running [llama.cpp](docs/mddocs/Quickstart/llama_cpp_quickstart.md) and [ollama](docs/mddocs/Quickstart/ollama_quickstart.md) on Intel GPU.
- [2024/03] `bigdl-llm` has now become `ipex-llm` (see the migration guide [here](docs/mddocs/Quickstart/bigdl_llm_migration.md)); you may find the original `BigDL` project [here](https://github.com/intel-analytics/bigdl-2.x).
- [2024/02] `ipex-llm` now supports directly loading models from [ModelScope](python/llm/example/GPU/ModelScope-Models) ([魔搭](python/llm/example/CPU/ModelScope-Models)).
- [2024/02] `ipex-llm` added initial **INT2** support (based on the llama.cpp [IQ2](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF-IQ2) mechanism), which makes it possible to run large-sized LLMs (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM.
- [2024/02] Users can now use `ipex-llm` through the [Text-Generation-WebUI](https://github.com/intel-analytics/text-generation-webui) GUI.
- [2024/02] `ipex-llm` now supports *[Self-Speculative Decoding](docs/mddocs/Inference/Self_Speculative_Decoding.md)*, which in practice brings **~30% speedup** for FP16 and BF16 inference latency on Intel [GPU](python/llm/example/GPU/Speculative-Decoding) and [CPU](python/llm/example/CPU/Speculative-Decoding) respectively.
- [2024/02] `ipex-llm` now supports a comprehensive list of LLM **finetuning** methods on Intel GPU (including [LoRA](python/llm/example/GPU/LLM-Finetuning/LoRA), [QLoRA](python/llm/example/GPU/LLM-Finetuning/QLoRA), [DPO](python/llm/example/GPU/LLM-Finetuning/DPO), [QA-LoRA](python/llm/example/GPU/LLM-Finetuning/QA-LoRA) and [ReLoRA](python/llm/example/GPU/LLM-Finetuning/ReLora)).
- [2024/01] Using `ipex-llm` [QLoRA](python/llm/example/GPU/LLM-Finetuning/QLoRA), we managed to finetune LLaMA2-7B in **21 minutes** and LLaMA2-70B in **3.14 hours** on 8 Intel Max 1550 GPUs for [Stanford-Alpaca](python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora) (see the blog [here](https://www.intel.com/content/www/us/en/developer/articles/technical/finetuning-llms-on-intel-gpus-using-bigdl-llm.html)).
- [2023/12] `ipex-llm` now supports [ReLoRA](python/llm/example/GPU/LLM-Finetuning/ReLora) (see *["ReLoRA: High-Rank Training Through Low-Rank Updates"](https://arxiv.org/abs/2307.05695)*).
- [2023/12] `ipex-llm` now supports [Mixtral-8x7B](python/llm/example/GPU/HuggingFace/LLM/mixtral) on both Intel [GPU](python/llm/example/GPU/HuggingFace/LLM/mixtral) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/Model/mixtral).
- [2023/12] `ipex-llm` now supports [QA-LoRA](python/llm/example/GPU/LLM-Finetuning/QA-LoRA) (see *["QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models"](https://arxiv.org/abs/2309.14717)*).
- [2023/12] `ipex-llm` now supports [FP8 and FP4 inference](python/llm/example/GPU/HuggingFace/More-Data-Types) on Intel ***GPU***.
- [2023/11] Initial support for directly loading [GGUF](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF), [AWQ](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/AWQ) and [GPTQ](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GPTQ) models into `ipex-llm` is available.
- [2023/11] `ipex-llm` now supports [vLLM continuous batching](python/llm/example/GPU/vLLM-Serving) on both Intel [GPU](python/llm/example/GPU/vLLM-Serving) and [CPU](python/llm/example/CPU/vLLM-Serving).
- [2023/10] `ipex-llm` now supports [QLoRA finetuning](python/llm/example/GPU/LLM-Finetuning/QLoRA) on both Intel [GPU](python/llm/example/GPU/LLM-Finetuning/QLoRA) and [CPU](python/llm/example/CPU/QLoRA-FineTuning).
- [2023/10] `ipex-llm` now supports [FastChat serving](python/llm/src/ipex_llm/llm/serving) on both Intel CPU and GPU.
- [2023/09] `ipex-llm` now supports [Intel GPU](python/llm/example/GPU) (including iGPU, Arc, Flex and MAX).
- [2023/09] The `ipex-llm` [tutorial](https://github.com/intel-analytics/ipex-llm-tutorial) is released.
ipex-llm Demo
See demos of running local LLMs on Intel Core Ultra iGPU, Intel Core Ultra NPU, single-card Arc GPU, or multi-card Arc GPUs using ipex-llm below.
| Intel Core Ultra (Series 1) iGPU | Intel Core Ultra (Series 2) NPU | Intel Arc dGPU | 2-Card Intel Arc dGPUs |
|---|---|---|---|
| Ollama (Mistral-7B Q4_K) | HuggingFace (Llama3.2-3B SYM_INT4) | TextGeneration-WebUI (Llama3-8B FP8) | FastChat (QWen1.5-32B FP6) |
ipex-llm Performance
See the Token Generation Speed on Intel Core Ultra and Intel Arc GPU below [1] (and refer to [2][3][4] for more details).
You may follow the Benchmarking Guide to run the `ipex-llm` performance benchmarks yourself.
Model Accuracy
Please see the perplexity results below (tested on the Wikitext dataset using the script here); a generic sketch of how perplexity is computed follows the table.
| Perplexity | sym_int4 | q4_k | fp6 | fp8_e5m2 | fp8_e4m3 | fp16 |
|---|---|---|---|---|---|---|
| Llama-2-7B-chat-hf | 6.364 | 6.218 | 6.092 | 6.180 | 6.098 | 6.096 |
| Mistral-7B-Instruct-v0.2 | 5.365 | 5.320 | 5.270 | 5.273 | 5.246 | 5.244 |
| Baichuan2-7B-chat | 6.734 | 6.727 | 6.527 | 6.539 | 6.488 | 6.508 |
| Qwen1.5-7B-chat | 8.865 | 8.816 | 8.557 | 8.846 | 8.530 | 8.607 |
| Llama-3.1-8B-Instruct | 6.705 | 6.566 | 6.338 | 6.383 | 6.325 | 6.267 |
| gemma-2-9b-it | 7.541 | 7.412 | 7.269 | 7.380 | 7.268 | 7.270 |
| Baichuan2-13B-Chat | 6.313 | 6.160 | 6.070 | 6.145 | 6.086 | 6.031 |
| Llama-2-13b-chat-hf | 5.449 | 5.422 | 5.341 | 5.384 | 5.332 | 5.329 |
| Qwen1.5-14B-Chat | 7.529 | 7.520 | 7.367 | 7.504 | 7.297 | 7.334 |
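For context, the perplexity numbers above are the exponentiated average negative log-likelihood over the test text. A generic sliding-window computation (independent of the linked script; the `max_len` and `stride` values here are illustrative) looks roughly like this:

```python
# Generic sliding-window perplexity sketch (not the linked benchmark script);
# max_len/stride are illustrative values.
import torch

def perplexity(model, tokenizer, text, device="xpu", max_len=2048, stride=512):
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    seq_len = ids.size(1)
    nlls, n_scored, prev_end = [], 0, 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_len, seq_len)
        trg_len = end - prev_end              # tokens not yet scored in a previous window
        input_ids = ids[:, begin:end]
        labels = input_ids.clone()
        labels[:, :-trg_len] = -100           # mask the overlapping context from the loss
        with torch.inference_mode():
            loss = model(input_ids, labels=labels).loss  # mean NLL over scored tokens
        nlls.append(loss * trg_len)
        n_scored += trg_len
        prev_end = end
        if end == seq_len:
            break
    return torch.exp(torch.stack(nlls).sum() / n_scored).item()
```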
ipex-llm Quickstart
Docker
- GPU Inference in C++: running `llama.cpp`, `ollama`, etc., with `ipex-llm` on Intel GPU
- GPU Inference in Python: running HuggingFace `transformers`, `LangChain`, `LlamaIndex`, `ModelScope`, etc. with `ipex-llm` on Intel GPU
- vLLM on GPU: running `vLLM` serving with `ipex-llm` on Intel GPU
- vLLM on CPU: running `vLLM` serving with `ipex-llm` on Intel CPU
- FastChat on GPU: running `FastChat` serving with `ipex-llm` on Intel GPU
- VSCode on GPU: running and developing `ipex-llm` applications in Python using VSCode on Intel GPU
Use
- NPU: running `ipex-llm` on Intel NPU in both Python and C++
- llama.cpp: running llama.cpp (using the C++ interface of `ipex-llm`) on Intel GPU
- Ollama: running ollama (using the C++ interface of `ipex-llm`) on Intel GPU
- PyTorch/HuggingFace: running PyTorch, HuggingFace, LangChain, LlamaIndex, etc. (using the Python interface of `ipex-llm`) on Intel GPU for Windows and Linux
- vLLM: running `ipex-llm` in vLLM on both Intel GPU and CPU
- FastChat: running `ipex-llm` in FastChat serving on both Intel GPU and CPU
- Serving on multiple Intel GPUs: running `ipex-llm` serving on multiple Intel GPUs by leveraging DeepSpeed AutoTP and FastAPI
- Text-Generation-WebUI: running `ipex-llm` in `oobabooga` WebUI
- Axolotl: running `ipex-llm` in Axolotl for LLM finetuning
- Benchmarking: running (latency and throughput) benchmarks for `ipex-llm` on Intel CPU and GPU
Applications
- GraphRAG: running Microsoft's `GraphRAG` using local LLM with `ipex-llm`
- RAGFlow: running `RAGFlow` (an open-source RAG engine) with `ipex-llm`
- LangChain-Chatchat: running `LangChain-Chatchat` (Knowledge Base QA using a RAG pipeline) with `ipex-llm`
- Coding copilot: running `Continue` (coding copilot in VSCode) with `ipex-llm`
- Open WebUI: running `Open WebUI` with `ipex-llm`
- PrivateGPT: running `PrivateGPT` to interact with documents with `ipex-llm`
- Dify platform: running `ipex-llm` in `Dify` (production-ready LLM app development platform)
Install
- Windows GPU: installing `ipex-llm` on Windows with Intel GPU
- Linux GPU: installing `ipex-llm` on Linux with Intel GPU
- For more details, please refer to the full installation guide
Code Examples
- Low-bit inference
  - INT4 inference: INT4 LLM inference on Intel GPU and CPU
  - FP8/FP6/FP4 inference: FP8, FP6 and FP4 LLM inference on Intel GPU
  - INT8 inference: INT8 LLM inference on Intel GPU and CPU
  - INT2 inference: INT2 LLM inference (based on the llama.cpp IQ2 mechanism) on Intel GPU
- FP16/BF16 inference
  - FP16 LLM inference on Intel GPU, with possible self-speculative decoding optimization
  - BF16 LLM inference on Intel CPU, with possible self-speculative decoding optimization
- Distributed inference
- Save and load (see the sketch after this list)
  - Low-bit models: saving and loading `ipex-llm` low-bit models (INT4/FP4/FP6/INT8/FP8/FP16/etc.)
  - GGUF: directly loading GGUF models into `ipex-llm`
  - AWQ: directly loading AWQ models into `ipex-llm`
  - GPTQ: directly loading GPTQ models into `ipex-llm`
- Finetuning
- Integration with community libraries
- Tutorials
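To illustrate the save-and-load entries above, here is a rough sketch of persisting a converted low-bit model and reloading it later (the method names follow `ipex-llm`'s low-bit API as used in the linked examples; the paths and model id are placeholders):

```python
# Sketch: save a low-bit converted model once, then reload it on later runs
# without re-quantizing. Paths and model id are placeholders.
from ipex_llm.transformers import AutoModelForCausalLM

save_dir = "./llama-2-7b-int4"   # placeholder output directory

# First run: convert to INT4 while loading, then persist the low-bit weights
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                             load_in_4bit=True,
                                             trust_remote_code=True)
model.save_low_bit(save_dir)

# Later runs: load the saved low-bit weights directly (much faster start-up)
model = AutoModelForCausalLM.load_low_bit(save_dir, trust_remote_code=True)
model = model.to("xpu")          # optional: move to Intel GPU
```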
API Doc
FAQ
Verified Models
Over 70 models have been optimized/verified on ipex-llm, including LLaMA/LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM2/ChatGLM3, Baichuan/Baichuan2, Qwen/Qwen-1.5, InternLM and more; see the list below.
| Model | CPU Example | GPU Example | NPU Example |
|---|---|---|---|
| LLaMA | link1, link2 | link | |
| LLaMA 2 | link1, link2 | link | Python link, C++ link |
| LLaMA 3 | link | link | Python link, C++ link |
| LLaMA 3.1 | link | link | |
| LLaMA 3.2 | link | Python link, C++ link | |
| LLaMA 3.2-Vision | link | ||
| ChatGLM | link | ||
| ChatGLM2 | link | link | |
| ChatGLM3 | link | link | |
| GLM-4 | link | link | |
| GLM-4V | link | link | |
| Mistral | link | link | |
| Mixtral | link | link | |
| Falcon | link | link | |
| MPT | link | link | |
| Dolly-v1 | link | link | |
| Dolly-v2 | link | link | |
| Replit Code | link | link | |
| RedPajama | link1, link2 | ||
| Phoenix | link1, link2 | ||
| StarCoder | link1, link2 | link | |
| Baichuan | link | link | |
| Baichuan2 | link | link | Python link |
| InternLM | link | link | |
| InternVL2 | link | ||
| Qwen | link | link | |
| Qwen1.5 | link | link | |
| Qwen2 | link | link | Python link, C++ link |
| Qwen2.5 | link | Python link, C++ link | |
| Qwen-VL | link | link | |
| Qwen2-VL | link | ||
| Qwen2-Audio | link | ||
| Aquila | link | link | |
| Aquila2 | link | link | |
| MOSS | link | ||
| Whisper | link | link | |
| Phi-1_5 | link | link | |
| Flan-t5 | link | link | |
| LLaVA | link | link | |
| CodeLlama | link | link | |
| Skywork | link | ||
| InternLM-XComposer | link | ||
| WizardCoder-Python | link | ||
| CodeShell | link | ||
| Fuyu | link | ||
| Distil-Whisper | link | link | |
| Yi | link | link | |
| BlueLM | link | link | |
| Mamba | link | link | |
| SOLAR | link | link | |
| Phixtral | link | link | |
| InternLM2 | link | link | |
| RWKV4 | link | ||
| RWKV5 | link | ||
| Bark | link | link | |
| SpeechT5 | link | ||
| DeepSeek-MoE | link | ||
| Ziya-Coding-34B-v1.0 | link | ||
| Phi-2 | link | link | |
| Phi-3 | link | link | |
| Phi-3-vision | link | link | |
| Yuan2 | link | link | |
| Gemma | link | link | |
| Gemma2 | link | ||
| DeciLM-7B | link | link | |
| Deepseek | link | link | |
| StableLM | link | link | |
| CodeGemma | link | link | |
| Command-R/cohere | link | link | |
| CodeGeeX2 | link | link | |
| MiniCPM | link | link | Python link, C++ link |
| MiniCPM3 | link | ||
| MiniCPM-V | link | ||
| MiniCPM-V-2 | link | link | |
| MiniCPM-Llama3-V-2_5 | link | Python link | |
| MiniCPM-V-2_6 | link | link | Python link |
| StableDiffusion | link | ||
| Bce-Embedding-Base-V1 | Python link | ||
| Speech_Paraformer-Large | Python link |
Get Support
- Please report a bug or raise a feature request by opening a GitHub Issue
- Please report a vulnerability by opening a draft GitHub Security Advisory
1. Performance varies by use, configuration and other factors. `ipex-llm` may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.