BigDL-LLM

bigdl-llm is a library for running LLM (large language model) on your Intel laptop using INT4 with very low latency (for any Hugging Face Transformers model).

(It is built on top of the excellent work of llama.cpp, gptq, ggml, llama-cpp-python, gptq_for_llama, bitsandbytes, redpajama.cpp, gptneox.cpp, bloomz.cpp, etc.)

Demos

See the optimized performance of phoenix-inst-chat-7b, vicuna-13b-v1.1, and starcoder-15b models on a 12th Gen Intel Core CPU below.

Verified models

You may use any Hugging Face Transformers model with bigdl-llm, and the following models have been verified on Intel laptops.

Model | Example
LLaMA (such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.) | link1, link2
LLaMA 2 | link
MPT | link
Falcon | link
ChatGLM | link
ChatGLM2 | link
MOSS | link
Baichuan | link
Dolly-v1 | link
Dolly-v2 | link
RedPajama | link1, link2
Phoenix | link1, link2
StarCoder | link1, link2
InternLM | link
Whisper | link

Working with bigdl-llm

Table of Contents

  • Install
  • Download Model
  • Run Model
  • CLI Tool
  • bigdl-llm Dependencies

Install

You may install bigdl-llm as follows:

pip install --pre --upgrade bigdl-llm[all]

Download Model

You may download any PyTorch model in Hugging Face Transformers format (FP16, FP32, or GPTQ 4-bit).
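For example, one common way to fetch a checkpoint is with the huggingface_hub package (a minimal sketch; the repo id and target directory are placeholders, and huggingface_hub is assumed to be installed):

from huggingface_hub import snapshot_download

# download a Hugging Face Transformers checkpoint into a local directory
model_path = snapshot_download(repo_id="meta-llama/Llama-2-7b-chat-hf",  # placeholder repo id
                               local_dir="/path/to/model/")
print(model_path)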

Run Model

You may run the models using bigdl-llm through one of the following APIs:

  1. Hugging Face transformers API
  2. LangChain API
  3. CLI (command line interface) Tool

Hugging Face transformers API

You may run the models using transformers-style API in bigdl-llm.

  • Using Hugging Face transformers INT4 format

    You may apply INT4 optimizations to any Hugging Face Transformers models as follows.

    # load a Hugging Face Transformers model with INT4 optimizations
    from bigdl.llm.transformers import AutoModelForCausalLM
    model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
    
    # run the optimized model
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    input_ids = tokenizer.encode(input_str, ...)
    output_ids = model.generate(input_ids, ...)
    output = tokenizer.batch_decode(output_ids)
    

    See the complete examples here.
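    As a more concrete illustration, here is a minimal self-contained sketch (the checkpoint path, prompt, and generation length are placeholder assumptions):

    from bigdl.llm.transformers import AutoModelForCausalLM
    from transformers import AutoTokenizer

    model_path = '/path/to/model/'   # placeholder: any local Hugging Face Transformers checkpoint
    model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # encode a prompt, generate a short continuation, and decode it back to text
    input_ids = tokenizer("Once upon a time,", return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])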

    Note: You may apply further low-bit optimizations (including INT8, INT5 and INT4) as follows:

    model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int5")
    

    See the complete example here.
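    For instance, a hedged sketch loading the same model with INT8 instead (the exact set of accepted load_in_low_bit values may vary by release; "sym_int8" is assumed here):

    from bigdl.llm.transformers import AutoModelForCausalLM

    # "sym_int8" is assumed to be a valid low-bit option; consult the release notes for the full list
    model_int8 = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int8")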

    After the model is optimized with INT4 (or INT5/INT8), you may save and load the optimized model as follows:

    model.save_low_bit(model_path)
    
    new_model = AutoModelForCausalLM.load_low_bit(model_path)
    

    See the example here.
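    For instance, a minimal sketch that quantizes a model once, saves it, and later reloads it without repeating the conversion (the output directory is a placeholder):

    from bigdl.llm.transformers import AutoModelForCausalLM

    saved_path = './llama-2-7b-int4'   # placeholder output directory
    model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)
    model.save_low_bit(saved_path)     # persists the already-quantized weights

    # later (or in another process): load the low-bit checkpoint directly
    new_model = AutoModelForCausalLM.load_low_bit(saved_path)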

  • Using native INT4 format

    You may also convert Hugging Face Transformers models into native INT4 format for maximum performance as follows.

    Note: Currently only the llama/bloom/gptneox/starcoder model families are supported; for other models, you may use the Transformers INT4 format as described above.

    # convert the model
    from bigdl.llm import llm_convert
    bigdl_llm_path = llm_convert(model='/path/to/model/',
                                 outfile='/path/to/output/', outtype='int4', model_family="llama")
    
    # load the converted model
    from bigdl.llm.transformers import BigdlNativeForCausalLM
    llm = BigdlNativeForCausalLM.from_pretrained("/path/to/output/model.bin", ...)
    
    # run the converted model
    input_ids = llm.tokenize(prompt)
    output_ids = llm.generate(input_ids, ...)
    output = llm.batch_decode(output_ids)
    

    See the complete example here.
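    As a concrete illustration, a minimal sketch converting a LLaMA-family checkpoint and running a short completion (all paths and the prompt are placeholders; llm_convert is assumed to return the path of the converted .bin file, and additional keyword arguments such as the model family may be needed for non-llama models):

    from bigdl.llm import llm_convert
    from bigdl.llm.transformers import BigdlNativeForCausalLM

    # convert a Hugging Face LLaMA checkpoint into a native INT4 binary
    bigdl_llm_path = llm_convert(model='/path/to/llama/model/',
                                 outfile='/path/to/output/', outtype='int4', model_family="llama")

    # load the converted model, tokenize a prompt, generate, and decode
    llm = BigdlNativeForCausalLM.from_pretrained(bigdl_llm_path)
    input_ids = llm.tokenize("Once upon a time,")
    output_ids = llm.generate(input_ids)
    print(llm.batch_decode(output_ids))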

LangChain API

You may run the models using the LangChain API in bigdl-llm.

  • Using Hugging Face transformers INT4 format

    You may run any Hugging Face Transformers model (with INT4 optimizations applied) using the LangChain API as follows:

    from bigdl.llm.langchain.llms import TransformersLLM
    from bigdl.llm.langchain.embeddings import TransformersEmbeddings
    from langchain.chains.question_answering import load_qa_chain
    
    embeddings = TransformersEmbeddings.from_model_id(model_id=model_path)
    bigdl_llm = TransformersLLM.from_model_id(model_id=model_path, ...)
    
    doc_chain = load_qa_chain(bigdl_llm, ...)
    output = doc_chain.run(...)
    

    See the examples here.
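    For illustration, a small self-contained sketch of the question-answering flow (the model path, document text, and chain type are placeholder assumptions; it uses LangChain's classic load_qa_chain interface):

    from bigdl.llm.langchain.llms import TransformersLLM
    from langchain.chains.question_answering import load_qa_chain
    from langchain.docstore.document import Document

    model_path = '/path/to/model/'   # placeholder: any Hugging Face Transformers checkpoint
    bigdl_llm = TransformersLLM.from_model_id(model_id=model_path)

    # wrap some context text as LangChain documents and ask a question about it
    docs = [Document(page_content="bigdl-llm runs large language models on Intel laptops using INT4.")]
    doc_chain = load_qa_chain(bigdl_llm, chain_type="stuff")
    output = doc_chain.run(input_documents=docs, question="What does bigdl-llm do?")
    print(output)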

  • Using native INT4 format

    You may also convert Hugging Face Transformers models into native INT4 format, and then run the converted models using the LangChain API as follows.

    Note: Currently only the llama/bloom/gptneox/starcoder model families are supported; for other models, you may use the Transformers INT4 format as described above.

    from bigdl.llm.langchain.llms import BigdlNativeLLM
    from bigdl.llm.langchain.embeddings import BigdlNativeEmbeddings
    from langchain.chains.question_answering import load_qa_chain
    
    embeddings = BigdlNativeEmbeddings(model_path='/path/to/converted/model.bin',
                              model_family="llama",...)
    bigdl_llm = BigdlNativeLLM(model_path='/path/to/converted/model.bin',
                         model_family="llama",...)
    
    doc_chain = load_qa_chain(bigdl_llm, ...)
    doc_chain.run(...)
    

    See the examples here.

CLI Tool

Note: Currently the bigdl-llm CLI supports the LLaMA (e.g., vicuna), GPT-NeoX (e.g., redpajama), BLOOM (e.g., phoenix) and GPT2 (e.g., starcoder) model architectures; for other models, you may use the transformers-style or LangChain APIs.

  • Convert model

    You may convert the downloaded model into native INT4 format using llm-convert.

    #convert PyTorch (fp16 or fp32) model; 
    #llama/bloom/gptneox/starcoder model family is currently supported
    llm-convert "/path/to/model/" --model-format pth --model-family "bloom" --outfile "/path/to/output/"
    
    #convert GPTQ-4bit model
    #only llama model family is currently supported
    llm-convert "/path/to/model/" --model-format gptq --model-family "llama" --outfile "/path/to/output/"
    
  • Run model

    You may run the converted model using llm-cli or llm-chat (both built on top of main.cpp in llama.cpp).

    #help
    #llama/bloom/gptneox/starcoder model family is currently supported
    llm-cli -x gptneox -h
    
    #text completion
    #llama/bloom/gptneox/starcoder model family is currently supported
    llm-cli -t 16 -x gptneox -m "/path/to/output/model.bin" -p 'Once upon a time,'
    
    #chat mode
    #llama/gptneox model family is currently supported
    llm-chat -m "/path/to/output/model.bin" -x llama
    

bigdl-llm Dependencies

The native code/libraries in bigdl-llm have been built with the following toolchains; in particular, a lower GLIBC version on your Linux system may be incompatible with bigdl-llm (a quick way to check your GLIBC version is sketched after the table below).

Model family | Platform | Compiler           | GLIBC
llama        | Linux    | GCC 11.2.1         | 2.17
llama        | Windows  | MSVC 19.36.32532.0 |
llama        | Windows  | GCC 13.1.0         |
gptneox      | Linux    | GCC 11.2.1         | 2.17
gptneox      | Windows  | MSVC 19.36.32532.0 |
gptneox      | Windows  | GCC 13.1.0         |
bloom        | Linux    | GCC 11.2.1         | 2.29
bloom        | Windows  | MSVC 19.36.32532.0 |
bloom        | Windows  | GCC 13.1.0         |
starcoder    | Linux    | GCC 11.2.1         | 2.29
starcoder    | Windows  | MSVC 19.36.32532.0 |
starcoder    | Windows  | GCC 13.1.0         |
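If you are unsure which GLIBC version your Linux machine provides, one quick way to check from Python is:

import platform

# returns e.g. ('glibc', '2.31') on most Linux distributions (empty strings on other platforms)
libc, version = platform.libc_ver()
print(libc, version)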