Important
`bigdl-llm` has now become `ipex-llm` (see the migration guide here); you may find the original `BigDL` project here.
💫 Intel® LLM Library for PyTorch*
< English | 中文 >
IPEX-LLM is an LLM acceleration library for Intel GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max), NPU and CPU [1].
Note
- It is built on top of the excellent work of `llama.cpp`, `transformers`, `bitsandbytes`, `vLLM`, `qlora`, `AutoGPTQ`, `AutoAWQ`, etc.
- It provides seamless integration with llama.cpp, Ollama, HuggingFace transformers, LangChain, LlamaIndex, vLLM, Text-Generation-WebUI, DeepSpeed-AutoTP, FastChat, Axolotl, HuggingFace PEFT, HuggingFace TRL, AutoGen, ModelScope, etc. (a minimal usage sketch is shown right after this note).
- 70+ models have been optimized/verified on `ipex-llm` (e.g., Llama, Phi, Mistral, Mixtral, Whisper, Qwen, MiniCPM, Qwen-VL, MiniCPM-V and more), with state-of-the-art LLM optimizations, XPU acceleration and low-bit (FP8/FP6/FP4/INT4) support; see the complete list here.
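For a first taste of the Python interface mentioned above, the snippet below is a minimal sketch of low-bit inference with `ipex-llm`'s HuggingFace-style API (the model id, prompt and `"xpu"` device string are placeholders; see the Quickstart and the linked examples for the authoritative usage):

```python
# A minimal sketch (not the official example): load a HuggingFace model with
# ipex-llm low-bit (INT4) optimization and generate a short completion.
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM  # ipex-llm's drop-in AutoModel

model_path = "meta-llama/Llama-2-7b-chat-hf"   # placeholder model id

# Convert to 4-bit on the fly while loading
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True)
model = model.to("xpu")                        # Intel GPU; omit this line to stay on CPU

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
inputs = tokenizer("What is IPEX-LLM?", return_tensors="pt").to("xpu")

with torch.inference_mode():
    output_ids = model.generate(inputs.input_ids, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```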
Project updates
- [2024/07] We added support for running Microsoft's **GraphRAG** using local LLM on Intel GPU; see the quickstart guide [here](docs/mddocs/Quickstart/graphrag_quickstart.md).
- [2024/07] We added extensive support for Large Multimodal Models, including [StableDiffusion](https://github.com/jason-dai/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/Multimodal/StableDiffusion), [Phi-3-Vision](python/llm/example/GPU/HuggingFace/Multimodal/phi-3-vision), [Qwen-VL](python/llm/example/GPU/HuggingFace/Multimodal/qwen-vl), and [more](python/llm/example/GPU/HuggingFace/Multimodal).
- [2024/07] We added **FP6** support on Intel [GPU](python/llm/example/GPU/HuggingFace/More-Data-Types).
- [2024/06] We added experimental **NPU** support for Intel Core Ultra processors; see the examples [here](python/llm/example/NPU/HF-Transformers-AutoModels).
- [2024/06] We added extensive support of **pipeline parallel** [inference](python/llm/example/GPU/Pipeline-Parallel-Inference), which makes it easy to run large-sized LLMs using 2 or more Intel GPUs (such as Arc).
- [2024/06] We added support for running **RAGFlow** with `ipex-llm` on Intel [GPU](docs/mddocs/Quickstart/ragflow_quickstart.md).
- [2024/05] `ipex-llm` now supports **Axolotl** for LLM finetuning on Intel GPU; see the quickstart [here](docs/mddocs/Quickstart/axolotl_quickstart.md).
- [2024/05] You can now easily run `ipex-llm` inference, serving and finetuning using the **Docker** [images](#docker).
- [2024/05] You can now install `ipex-llm` on Windows using just "*[one command](docs/mddocs/Quickstart/install_windows_gpu.md#install-ipex-llm)*".
- [2024/04] You can now run **Open WebUI** on Intel GPU using `ipex-llm`; see the quickstart [here](docs/mddocs/Quickstart/open_webui_with_ollama_quickstart.md).
- [2024/04] You can now run **Llama 3** on Intel GPU using `llama.cpp` and `ollama` with `ipex-llm`; see the quickstart [here](docs/mddocs/Quickstart/llama3_llamacpp_ollama_quickstart.md).
- [2024/04] `ipex-llm` now supports **Llama 3** on both Intel [GPU](python/llm/example/GPU/HuggingFace/LLM/llama3) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/Model/llama3).
- [2024/04] `ipex-llm` now provides a C++ interface, which can be used as an accelerated backend for running [llama.cpp](docs/mddocs/Quickstart/llama_cpp_quickstart.md) and [ollama](docs/mddocs/Quickstart/ollama_quickstart.md) on Intel GPU.
- [2024/03] `bigdl-llm` has now become `ipex-llm` (see the migration guide [here](docs/mddocs/Quickstart/bigdl_llm_migration.md)); you may find the original `BigDL` project [here](https://github.com/intel-analytics/bigdl-2.x).
- [2024/02] `ipex-llm` now supports directly loading models from [ModelScope](python/llm/example/GPU/ModelScope-Models) ([魔搭](python/llm/example/CPU/ModelScope-Models)).
- [2024/02] `ipex-llm` added initial **INT2** support (based on the llama.cpp [IQ2](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF-IQ2) mechanism), which makes it possible to run large-sized LLMs (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM.
- [2024/02] Users can now use `ipex-llm` through the [Text-Generation-WebUI](https://github.com/intel-analytics/text-generation-webui) GUI.
- [2024/02] `ipex-llm` now supports *[Self-Speculative Decoding](docs/mddocs/Inference/Self_Speculative_Decoding.md)*, which in practice brings **~30% speedup** for FP16 and BF16 inference latency on Intel [GPU](python/llm/example/GPU/Speculative-Decoding) and [CPU](python/llm/example/CPU/Speculative-Decoding) respectively.
- [2024/02] `ipex-llm` now supports a comprehensive list of LLM **finetuning** methods on Intel GPU (including [LoRA](python/llm/example/GPU/LLM-Finetuning/LoRA), [QLoRA](python/llm/example/GPU/LLM-Finetuning/QLoRA), [DPO](python/llm/example/GPU/LLM-Finetuning/DPO), [QA-LoRA](python/llm/example/GPU/LLM-Finetuning/QA-LoRA) and [ReLoRA](python/llm/example/GPU/LLM-Finetuning/ReLora)).
- [2024/01] Using `ipex-llm` [QLoRA](python/llm/example/GPU/LLM-Finetuning/QLoRA), we managed to finetune LLaMA2-7B in **21 minutes** and LLaMA2-70B in **3.14 hours** on 8 Intel Max 1550 GPUs for [Stanford-Alpaca](python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora) (see the blog [here](https://www.intel.com/content/www/us/en/developer/articles/technical/finetuning-llms-on-intel-gpus-using-bigdl-llm.html)).
- [2023/12] `ipex-llm` now supports [ReLoRA](python/llm/example/GPU/LLM-Finetuning/ReLora) (see *["ReLoRA: High-Rank Training Through Low-Rank Updates"](https://arxiv.org/abs/2307.05695)*).
- [2023/12] `ipex-llm` now supports [Mixtral-8x7B](python/llm/example/GPU/HuggingFace/LLM/mixtral) on both Intel [GPU](python/llm/example/GPU/HuggingFace/LLM/mixtral) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/Model/mixtral).
- [2023/12] `ipex-llm` now supports [QA-LoRA](python/llm/example/GPU/LLM-Finetuning/QA-LoRA) (see *["QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models"](https://arxiv.org/abs/2309.14717)*).
- [2023/12] `ipex-llm` now supports [FP8 and FP4 inference](python/llm/example/GPU/HuggingFace/More-Data-Types) on Intel ***GPU***.
- [2023/11] Initial support for directly loading [GGUF](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF), [AWQ](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/AWQ) and [GPTQ](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GPTQ) models into `ipex-llm` is available.
- [2023/11] `ipex-llm` now supports [vLLM continuous batching](python/llm/example/GPU/vLLM-Serving) on both Intel [GPU](python/llm/example/GPU/vLLM-Serving) and [CPU](python/llm/example/CPU/vLLM-Serving).
- [2023/10] `ipex-llm` now supports [QLoRA finetuning](python/llm/example/GPU/LLM-Finetuning/QLoRA) on both Intel [GPU](python/llm/example/GPU/LLM-Finetuning/QLoRA) and [CPU](python/llm/example/CPU/QLoRA-FineTuning).
- [2023/10] `ipex-llm` now supports [FastChat serving](python/llm/src/ipex_llm/llm/serving) on both Intel CPU and GPU.
- [2023/09] `ipex-llm` now supports [Intel GPU](python/llm/example/GPU) (including iGPU, Arc, Flex and MAX).
- [2023/09] The `ipex-llm` [tutorial](https://github.com/intel-analytics/ipex-llm-tutorial) is released.
ipex-llm Demo
See demos of running local LLMs on Intel Core Ultra iGPU, Intel Core Ultra NPU, single-card Arc GPU, or multi-card Arc GPUs using ipex-llm below.
| Intel Core Ultra (Series 1) iGPU | Intel Core Ultra (Series 2) NPU | Intel Arc dGPU | 2-Card Intel Arc dGPUs |
|---|---|---|---|
| Ollama (Mistral-7B Q4_K) | HuggingFace (Llama3.2-3B SYM_INT4) | TextGeneration-WebUI (Llama3-8B FP8) | FastChat (QWen1.5-32B FP6) |
ipex-llm Performance
See the Token Generation Speed on Intel Core Ultra and Intel Arc GPU below [1] (and refer to [2][3][4] for more details).
You may follow the Benchmarking Guide to run the `ipex-llm` performance benchmarks yourself.
Model Accuracy
Please see the perplexity results below (tested on the Wikitext dataset using the script here); a generic sketch of how perplexity is computed follows the table.
| Perplexity | sym_int4 | q4_k | fp6 | fp8_e5m2 | fp8_e4m3 | fp16 |
|---|---|---|---|---|---|---|
| Llama-2-7B-chat-hf | 6.364 | 6.218 | 6.092 | 6.180 | 6.098 | 6.096 |
| Mistral-7B-Instruct-v0.2 | 5.365 | 5.320 | 5.270 | 5.273 | 5.246 | 5.244 |
| Baichuan2-7B-chat | 6.734 | 6.727 | 6.527 | 6.539 | 6.488 | 6.508 |
| Qwen1.5-7B-chat | 8.865 | 8.816 | 8.557 | 8.846 | 8.530 | 8.607 |
| Llama-3.1-8B-Instruct | 6.705 | 6.566 | 6.338 | 6.383 | 6.325 | 6.267 |
| gemma-2-9b-it | 7.541 | 7.412 | 7.269 | 7.380 | 7.268 | 7.270 |
| Baichuan2-13B-Chat | 6.313 | 6.160 | 6.070 | 6.145 | 6.086 | 6.031 |
| Llama-2-13b-chat-hf | 5.449 | 5.422 | 5.341 | 5.384 | 5.332 | 5.329 |
| Qwen1.5-14B-Chat | 7.529 | 7.520 | 7.367 | 7.504 | 7.297 | 7.334 |
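For context, the perplexity numbers above are the exponentiated average negative log-likelihood over the test text. A generic sliding-window computation (independent of the linked script; the `max_len` and `stride` values here are illustrative) looks roughly like this:

```python
# Generic sliding-window perplexity sketch (not the linked benchmark script);
# max_len/stride are illustrative values.
import torch

def perplexity(model, tokenizer, text, device="xpu", max_len=2048, stride=512):
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    seq_len = ids.size(1)
    nlls, n_scored, prev_end = [], 0, 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_len, seq_len)
        trg_len = end - prev_end              # tokens not yet scored in a previous window
        input_ids = ids[:, begin:end]
        labels = input_ids.clone()
        labels[:, :-trg_len] = -100           # mask the overlapping context from the loss
        with torch.inference_mode():
            loss = model(input_ids, labels=labels).loss  # mean NLL over scored tokens
        nlls.append(loss * trg_len)
        n_scored += trg_len
        prev_end = end
        if end == seq_len:
            break
    return torch.exp(torch.stack(nlls).sum() / n_scored).item()
```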
ipex-llm Quickstart
Docker
- GPU Inference in C++: running `llama.cpp`, `ollama`, etc., with `ipex-llm` on Intel GPU
- GPU Inference in Python: running HuggingFace `transformers`, `LangChain`, `LlamaIndex`, `ModelScope`, etc. with `ipex-llm` on Intel GPU
- vLLM on GPU: running `vLLM` serving with `ipex-llm` on Intel GPU
- vLLM on CPU: running `vLLM` serving with `ipex-llm` on Intel CPU
- FastChat on GPU: running `FastChat` serving with `ipex-llm` on Intel GPU
- VSCode on GPU: running and developing `ipex-llm` applications in Python using VSCode on Intel GPU
Use
- NPU: running `ipex-llm` on Intel NPU in both Python and C++
- llama.cpp: running llama.cpp (using the C++ interface of `ipex-llm`) on Intel GPU
- Ollama: running ollama (using the C++ interface of `ipex-llm`) on Intel GPU
- PyTorch/HuggingFace: running PyTorch, HuggingFace, LangChain, LlamaIndex, etc. (using the Python interface of `ipex-llm`) on Intel GPU for Windows and Linux
- vLLM: running `ipex-llm` in vLLM on both Intel GPU and CPU
- FastChat: running `ipex-llm` in FastChat serving on both Intel GPU and CPU
- Serving on multiple Intel GPUs: running `ipex-llm` serving on multiple Intel GPUs by leveraging DeepSpeed AutoTP and FastAPI
- Text-Generation-WebUI: running `ipex-llm` in `oobabooga` WebUI
- Axolotl: running `ipex-llm` in Axolotl for LLM finetuning
- Benchmarking: running (latency and throughput) benchmarks for `ipex-llm` on Intel CPU and GPU
Applications
- GraphRAG: running Microsoft's `GraphRAG` using local LLM with `ipex-llm`
- RAGFlow: running `RAGFlow` (an open-source RAG engine) with `ipex-llm`
- LangChain-Chatchat: running `LangChain-Chatchat` (Knowledge Base QA using a RAG pipeline) with `ipex-llm`
- Coding copilot: running `Continue` (coding copilot in VSCode) with `ipex-llm`
- Open WebUI: running `Open WebUI` with `ipex-llm`
- PrivateGPT: running `PrivateGPT` to interact with documents with `ipex-llm`
- Dify platform: running `ipex-llm` in `Dify` (production-ready LLM app development platform)
Install
- Windows GPU: installing `ipex-llm` on Windows with Intel GPU
- Linux GPU: installing `ipex-llm` on Linux with Intel GPU
- For more details, please refer to the full installation guide
Code Examples
- Low-bit inference
  - INT4 inference: INT4 LLM inference on Intel GPU and CPU
  - FP8/FP6/FP4 inference: FP8, FP6 and FP4 LLM inference on Intel GPU
  - INT8 inference: INT8 LLM inference on Intel GPU and CPU
  - INT2 inference: INT2 LLM inference (based on the llama.cpp IQ2 mechanism) on Intel GPU
- FP16/BF16 inference
  - FP16 LLM inference on Intel GPU, with possible self-speculative decoding optimization
  - BF16 LLM inference on Intel CPU, with possible self-speculative decoding optimization
- Distributed inference
- Save and load (see the sketch after this list)
  - Low-bit models: saving and loading `ipex-llm` low-bit models (INT4/FP4/FP6/INT8/FP8/FP16/etc.)
  - GGUF: directly loading GGUF models into `ipex-llm`
  - AWQ: directly loading AWQ models into `ipex-llm`
  - GPTQ: directly loading GPTQ models into `ipex-llm`
- Finetuning
- Integration with community libraries
- Tutorials
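To illustrate the save-and-load entries above, here is a rough sketch of persisting a converted low-bit model and reloading it later (the method names follow `ipex-llm`'s low-bit API as used in the linked examples; the paths and model id are placeholders):

```python
# Sketch: save a low-bit converted model once, then reload it on later runs
# without re-quantizing. Paths and model id are placeholders.
from ipex_llm.transformers import AutoModelForCausalLM

save_dir = "./llama-2-7b-int4"   # placeholder output directory

# First run: convert to INT4 while loading, then persist the low-bit weights
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                             load_in_4bit=True,
                                             trust_remote_code=True)
model.save_low_bit(save_dir)

# Later runs: load the saved low-bit weights directly (much faster start-up)
model = AutoModelForCausalLM.load_low_bit(save_dir, trust_remote_code=True)
model = model.to("xpu")          # optional: move to Intel GPU
```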
API Doc
FAQ
Verified Models
Over 70 models have been optimized/verified on ipex-llm, including LLaMA/LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM2/ChatGLM3, Baichuan/Baichuan2, Qwen/Qwen-1.5, InternLM and more; see the list below.
| Model | CPU Example | GPU Example | NPU Example |
|---|---|---|---|
| LLaMA | link1, link2 | link | |
| LLaMA 2 | link1, link2 | link | Python link, C++ link |
| LLaMA 3 | link | link | Python link, C++ link |
| LLaMA 3.1 | link | link | |
| LLaMA 3.2 | link | Python link, C++ link | |
| LLaMA 3.2-Vision | link | ||
| ChatGLM | link | ||
| ChatGLM2 | link | link | |
| ChatGLM3 | link | link | |
| GLM-4 | link | link | |
| GLM-4V | link | link | |
| Mistral | link | link | |
| Mixtral | link | link | |
| Falcon | link | link | |
| MPT | link | link | |
| Dolly-v1 | link | link | |
| Dolly-v2 | link | link | |
| Replit Code | link | link | |
| RedPajama | link1, link2 | ||
| Phoenix | link1, link2 | ||
| StarCoder | link1, link2 | link | |
| Baichuan | link | link | |
| Baichuan2 | link | link | Python link |
| InternLM | link | link | |
| InternVL2 | link | ||
| Qwen | link | link | |
| Qwen1.5 | link | link | |
| Qwen2 | link | link | Python link, C++ link |
| Qwen2.5 | link | Python link, C++ link | |
| Qwen-VL | link | link | |
| Qwen2-VL | link | ||
| Qwen2-Audio | link | ||
| Aquila | link | link | |
| Aquila2 | link | link | |
| MOSS | link | ||
| Whisper | link | link | |
| Phi-1_5 | link | link | |
| Flan-t5 | link | link | |
| LLaVA | link | link | |
| CodeLlama | link | link | |
| Skywork | link | ||
| InternLM-XComposer | link | ||
| WizardCoder-Python | link | ||
| CodeShell | link | ||
| Fuyu | link | ||
| Distil-Whisper | link | link | |
| Yi | link | link | |
| BlueLM | link | link | |
| Mamba | link | link | |
| SOLAR | link | link | |
| Phixtral | link | link | |
| InternLM2 | link | link | |
| RWKV4 | link | ||
| RWKV5 | link | ||
| Bark | link | link | |
| SpeechT5 | link | ||
| DeepSeek-MoE | link | ||
| Ziya-Coding-34B-v1.0 | link | ||
| Phi-2 | link | link | |
| Phi-3 | link | link | |
| Phi-3-vision | link | link | |
| Yuan2 | link | link | |
| Gemma | link | link | |
| Gemma2 | link | ||
| DeciLM-7B | link | link | |
| Deepseek | link | link | |
| StableLM | link | link | |
| CodeGemma | link | link | |
| Command-R/cohere | link | link | |
| CodeGeeX2 | link | link | |
| MiniCPM | link | link | Python link, C++ link |
| MiniCPM3 | link | ||
| MiniCPM-V | link | ||
| MiniCPM-V-2 | link | link | |
| MiniCPM-Llama3-V-2_5 | link | Python link | |
| MiniCPM-V-2_6 | link | link | Python link |
| StableDiffusion | link | ||
| Bce-Embedding-Base-V1 | Python link | ||
| Speech_Paraformer-Large | Python link |
Get Support
- Please report a bug or raise a feature request by opening a GitHub Issue
- Please report a vulnerability by opening a draft GitHub Security Advisory
1. Performance varies by use, configuration and other factors. `ipex-llm` may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.