IPEX-LLM
ipex-llm is a library for running LLMs (large language models) on Intel XPU (from laptop to GPU to cloud) using INT4/FP4/INT8/FP8 with very low latency[^1] (for any PyTorch model).
It is built on top of the excellent work of llama.cpp, bitsandbytes, qlora, gptq, AutoGPTQ, awq, AutoAWQ, vLLM, llama-cpp-python, gptq_for_llama, chatglm.cpp, redpajama.cpp, gptneox.cpp, bloomz.cpp, etc.
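Besides the Hugging Face Transformers-style API shown in the quickstart below, the claim that the optimizations apply to "any PyTorch model" can be illustrated with a minimal sketch. Note that the `optimize_model` helper and the placeholder path below are assumptions based on the ipex-llm API docs, not something shown elsewhere in this README:

```python
# Minimal sketch: apply ipex-llm low-bit optimization to a generic PyTorch model.
# `optimize_model` is assumed here; check the ipex-llm API docs for the exact interface.
import torch
from transformers import AutoModelForCausalLM  # stock Hugging Face class, not the ipex-llm one
from ipex_llm import optimize_model

model = AutoModelForCausalLM.from_pretrained("/path/to/model/")  # placeholder path
model = optimize_model(model)  # assumed to convert the weights to a low-bit format (INT4 by default)

with torch.inference_mode():
    ...  # run inference as usual; compute now uses the low-bit weights
```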
Latest update 🔥
- [2024/03] LangChain added support for ipex-llm; see the details here.
- [2024/02] ipex-llm now supports directly loading models from ModelScope (魔搭).
- [2024/02] ipex-llm added initial INT2 support (based on the llama.cpp IQ2 mechanism), which makes it possible to run large LLMs (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM.
- [2024/02] Users can now use ipex-llm through the Text-Generation-WebUI GUI.
- [2024/02] ipex-llm now supports Self-Speculative Decoding, which in practice brings ~30% speedup for FP16 and BF16 inference latency on Intel GPU and CPU respectively.
- [2024/02] ipex-llm now supports a comprehensive list of LLM finetuning methods on Intel GPU (including LoRA, QLoRA, DPO, QA-LoRA and ReLoRA).
- [2024/01] Using ipex-llm QLoRA, we managed to finetune LLaMA2-7B in 21 minutes and LLaMA2-70B in 3.14 hours on 8 Intel Max 1550 GPUs for Stanford-Alpaca (see the blog here).
- [2024/01] 🔔🔔🔔 The default ipex-llm GPU Linux installation has switched from PyTorch 2.0 to PyTorch 2.1, which requires new oneAPI and GPU driver versions. (See the GPU installation guide for more details.)
- [2023/12] ipex-llm now supports ReLoRA (see "ReLoRA: High-Rank Training Through Low-Rank Updates").
- [2023/12] ipex-llm now supports Mixtral-8x7B on both Intel GPU and CPU.
- [2023/12] ipex-llm now supports QA-LoRA (see "QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models").
- [2023/12] ipex-llm now supports FP8 and FP4 inference on Intel GPU.
- [2023/11] Initial support for directly loading GGUF, AWQ and GPTQ models into ipex-llm is available.
- [2023/11] ipex-llm now supports vLLM continuous batching on both Intel GPU and CPU.
- [2023/10] ipex-llm now supports QLoRA finetuning on both Intel GPU and CPU.
- [2023/10] ipex-llm now supports FastChat serving on both Intel CPU and GPU.
- [2023/09] ipex-llm now supports Intel GPU (including iGPU, Arc, Flex and MAX).
- [2023/09] The ipex-llm tutorial is released.
- [2023/09] Over 40 models have been optimized/verified on ipex-llm, including LLaMA/LLaMA2, ChatGLM2/ChatGLM3, Mistral, Falcon, MPT, LLaVA, WizardCoder, Dolly, Whisper, Baichuan/Baichuan2, InternLM, Skywork, QWen/Qwen-VL, Aquila, MOSS, and more; see the complete list here.
ipex-llm Demos
See the optimized performance of chatglm2-6b and llama-2-13b-chat models on 12th Gen Intel Core CPU and Intel Arc GPU below.
| 12th Gen Intel Core CPU | 12th Gen Intel Core CPU | Intel Arc GPU | Intel Arc GPU |
|---|---|---|---|
| chatglm2-6b | llama-2-13b-chat | chatglm2-6b | llama-2-13b-chat |
ipex-llm quickstart
- Windows GPU installation
- Run IPEX-LLM in Text-Generation-WebUI
- Run IPEX-LLM using Docker
- CPU INT4
- GPU INT4
- More Low-Bit support
- Verified models
CPU INT4
Install
You may install ipex-llm on Intel CPU as follows:
Note: See the CPU installation guide for more details.
pip install --pre --upgrade ipex-llm[all]
Note: ipex-llm has been tested on Python 3.9, 3.10 and 3.11.
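After installation, a quick import check (not part of the official instructions, just a convenient sanity test) confirms the package is usable:

```python
# Post-install sanity check: this import should succeed after
# `pip install --pre --upgrade ipex-llm[all]`.
from ipex_llm.transformers import AutoModelForCausalLM
print("ipex-llm import OK:", AutoModelForCausalLM.__name__)
```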
Run Model
You may apply INT4 optimizations to any Hugging Face Transformers model as follows:
# Load a Hugging Face Transformers model with INT4 optimizations
from ipex_llm.transformers import AutoModelForCausalLM
model_path = '/path/to/model/'
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
# Run the optimized model on CPU
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer.encode(input_str, ...)
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids)
See the complete examples here.
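To make the placeholders above concrete, here is one way the full CPU flow might look. The model name, prompt and generation settings below are illustrative assumptions, not requirements:

```python
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-chat-hf"   # hypothetical model; any verified model should work
prompt = "What is AI?"                          # hypothetical prompt

# Load with INT4 optimizations and run generation on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

input_ids = tokenizer.encode(prompt, return_tensors="pt")
output_ids = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```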
GPU INT4
Install
You may install ipex-llm on Intel GPU as follows:
Note: See the GPU installation guide for more details.
# The command below installs intel_extension_for_pytorch==2.1.10+xpu by default
pip install --pre --upgrade ipex-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
Note: ipex-llm has been tested on Python 3.9, 3.10 and 3.11.
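Before running models, it can help to confirm that PyTorch can see the Intel GPU. This check is an assumption based on the intel_extension_for_pytorch package installed with the xpu option, not part of the official instructions:

```python
# Optional sanity check that the 'xpu' device is usable after installing ipex-llm[xpu]
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  # registers the 'xpu' backend with PyTorch

print("XPU available:", torch.xpu.is_available())   # True if the GPU driver and oneAPI setup work
print("XPU device count:", torch.xpu.device_count())
```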
Run Model
You may apply INT4 optimizations to any Hugging Face Transformers model as follows:
# Load a Hugging Face Transformers model with INT4 optimizations
from ipex_llm.transformers import AutoModelForCausalLM
model_path = '/path/to/model/'
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
# Run the optimized model on Intel GPU
model = model.to('xpu')
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer.encode(input_str, ...).to('xpu')
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids.cpu())
See the complete examples here.
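As with the CPU path, the placeholders can be filled in. The sketch below additionally streams tokens during GPU generation using Hugging Face's TextStreamer; the model name, prompt and settings are illustrative assumptions:

```python
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer, TextStreamer

model_path = "meta-llama/Llama-2-7b-chat-hf"   # hypothetical model path
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True).to('xpu')
tokenizer = AutoTokenizer.from_pretrained(model_path)

input_ids = tokenizer.encode("What is AI?", return_tensors="pt").to('xpu')
streamer = TextStreamer(tokenizer, skip_prompt=True)   # prints tokens as they are generated
output_ids = model.generate(input_ids, streamer=streamer, max_new_tokens=32)
```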
More Low-Bit Support
Save and load
After the model is optimized using ipex-llm, you may save and load the model as follows:
model.save_low_bit(model_path)
new_model = AutoModelForCausalLM.load_low_bit(model_path)
See the complete example here.
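Putting the two calls together, a round trip might look like the sketch below; the paths are hypothetical, while save_low_bit/load_low_bit are the ipex-llm methods shown above:

```python
from ipex_llm.transformers import AutoModelForCausalLM

# Quantize once and persist the low-bit weights (paths are hypothetical)
model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)
model.save_low_bit('/path/to/low-bit-model/')

# Later, load the already-quantized weights directly, skipping re-quantization
new_model = AutoModelForCausalLM.load_low_bit('/path/to/low-bit-model/')
```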
Additional data types
In addition to INT4, you may apply other low-bit optimizations (such as INT8, INT5, NF4, etc.) as follows:
model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int8")
See the complete example here.
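The same `load_in_low_bit` argument covers the other formats mentioned above. The exact set of accepted strings is version-dependent, so the names beyond "sym_int8" below are assumptions to verify against the docs:

```python
from ipex_llm.transformers import AutoModelForCausalLM

# INT8 (shown above) plus two of the other formats mentioned; the accepted strings
# ("sym_int5", "nf4", ...) should be checked against your ipex-llm version.
model_int8 = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int8")
model_int5 = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int5")
model_nf4  = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="nf4")
```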
Verified Models
Over 40 models have been optimized/verified on ipex-llm, including LLaMA/LLaMA2, ChatGLM/ChatGLM2, Mistral, Falcon, MPT, Baichuan/Baichuan2, InternLM, QWen and more; see the example list below.
| Model | CPU Example | GPU Example |
|---|---|---|
| LLaMA (such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.) | link1, link2 | link |
| LLaMA 2 | link1, link2 | link1, link2 (low GPU memory example) |
| ChatGLM | link | |
| ChatGLM2 | link | link |
| ChatGLM3 | link | link |
| Mistral | link | link |
| Mixtral | link | link |
| Falcon | link | link |
| MPT | link | link |
| Dolly-v1 | link | link |
| Dolly-v2 | link | link |
| Replit Code | link | link |
| RedPajama | link1, link2 | |
| Phoenix | link1, link2 | |
| StarCoder | link1, link2 | link |
| Baichuan | link | link |
| Baichuan2 | link | link |
| InternLM | link | link |
| Qwen | link | link |
| Qwen1.5 | link | link |
| Qwen-VL | link | link |
| Aquila | link | link |
| Aquila2 | link | link |
| MOSS | link | |
| Whisper | link | link |
| Phi-1_5 | link | link |
| Flan-t5 | link | link |
| LLaVA | link | link |
| CodeLlama | link | link |
| Skywork | link | |
| InternLM-XComposer | link | |
| WizardCoder-Python | link | |
| CodeShell | link | |
| Fuyu | link | |
| Distil-Whisper | link | link |
| Yi | link | link |
| BlueLM | link | link |
| Mamba | link | link |
| SOLAR | link | link |
| Phixtral | link | link |
| InternLM2 | link | link |
| RWKV4 | link | |
| RWKV5 | link | |
| Bark | link | link |
| SpeechT5 | link | |
| DeepSeek-MoE | link | |
| Ziya-Coding-34B-v1.0 | link | |
| Phi-2 | link | link |
| Yuan2 | link | link |
| Gemma | link | link |
| DeciLM-7B | link | link |
| Deepseek | link | link |
For more details, please refer to the ipex-llm Documentation, Readme, Tutorial and API Doc.