## BigDL-LLM
bigdl-llm is a library for running LLMs (large language models) on your Intel laptop or GPU using INT4 with very low latency[^1] (for any Hugging Face Transformers model).
It is built on top of the excellent work of llama.cpp, gptq, ggml, llama-cpp-python, bitsandbytes, qlora, gptq_for_llama, chatglm.cpp, redpajama.cpp, gptneox.cpp, bloomz.cpp, etc.
### Latest update
`bigdl-llm` now supports Intel Arc and Flex GPUs; see the latest GPU examples here.
### Demos
See the optimized performance of chatglm2-6b, llama-2-13b-chat, and starcoder-15.5b models on a 12th Gen Intel Core CPU below.
### Verified models
You may use any Hugging Face Transformers model with bigdl-llm, and the following models have been verified on Intel laptops.
| Model | Example |
|---|---|
| LLaMA (such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.) | link1, link2 |
| LLaMA 2 | link |
| MPT | link |
| Falcon | link |
| ChatGLM | link |
| ChatGLM2 | link |
| Qwen | link |
| MOSS | link |
| Baichuan | link |
| Dolly-v1 | link |
| Dolly-v2 | link |
| RedPajama | link1, link2 |
| Phoenix | link1, link2 |
| StarCoder | link1, link2 |
| InternLM | link |
| Whisper | link |
### Working with bigdl-llm
#### Install
You may install bigdl-llm as follows:
```bash
pip install --pre --upgrade bigdl-llm[all]
```
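As a quick sanity check (not part of the official instructions), you may verify that the package imports cleanly after installation:

```bash
# hypothetical sanity check: confirm the bigdl.llm package is importable
python -c "import bigdl.llm; print('bigdl-llm is installed')"
```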
#### Download Model
You may download any PyTorch model in Hugging Face Transformers format (including FP16, FP32, or GPTQ-4bit).
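For example, one common way to fetch a checkpoint locally is with the `huggingface_hub` package; the repo id below is only an illustration, and any Transformers-format model works.

```python
# download a Hugging Face checkpoint to a local directory
# (hypothetical repo id; replace with the model you want to run)
from huggingface_hub import snapshot_download

snapshot_download(repo_id="meta-llama/Llama-2-7b-chat-hf", local_dir="/path/to/model/")
```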
#### Run Model
You may run the models using bigdl-llm through one of the following APIs:
##### Hugging Face transformers API
You may run the models using the `transformers`-style API in bigdl-llm.
- **Using Hugging Face `transformers` INT4 format**

  You may apply INT4 optimizations to any Hugging Face Transformers model as follows.

  ```python
  # load Hugging Face Transformers model with INT4 optimizations
  from bigdl.llm.transformers import AutoModelForCausalLM

  model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)
  ```

  After loading the Hugging Face Transformers model, you may easily run the optimized model as follows.

  ```python
  # run the optimized model
  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained(model_path)
  input_ids = tokenizer.encode(input_str, ...)
  output_ids = model.generate(input_ids, ...)
  output = tokenizer.batch_decode(output_ids)
  ```

  See the complete examples here.
  **Note**: You may apply more low-bit optimizations (including INT8, INT5 and INT4) as follows:

  ```python
  model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int5")
  ```

  See the complete example here.
  After the model is optimized using INT4 (or INT8/INT5), you may save and load the optimized model as follows:

  ```python
  model.save_low_bit(model_path)
  new_model = AutoModelForCausalLM.load_low_bit(model_path)
  ```

  See the example here. (A combined end-to-end sketch follows this list.)
- **Using native INT4 format**

  You may also convert Hugging Face Transformers models into native INT4 format for maximum performance as follows.

  **Notes**: Currently only llama/bloom/gptneox/starcoder/chatglm model families are supported; you may use the corresponding API to load the converted model. (For other models, you may use the Hugging Face `transformers` INT4 format as described above.)

  ```python
  # convert the model
  from bigdl.llm import llm_convert
  bigdl_llm_path = llm_convert(model='/path/to/model/',
                               outfile='/path/to/output/',
                               outtype='int4',
                               model_family="llama")

  # load the converted model
  # (switch to ChatGLMForCausalLM/GptneoxForCausalLM/BloomForCausalLM/StarcoderForCausalLM to load other models)
  from bigdl.llm.transformers import LlamaForCausalLM
  llm = LlamaForCausalLM.from_pretrained("/path/to/output/model.bin", native=True, ...)

  # run the converted model
  input_ids = llm.tokenize(prompt)
  output_ids = llm.generate(input_ids, ...)
  output = llm.batch_decode(output_ids)
  ```

  See the complete example here.
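For reference, the `transformers` INT4 snippets above can be combined into a single script. The sketch below is a minimal end-to-end example under stated assumptions: a local Transformers checkpoint at a placeholder path and a hypothetical prompt; it uses only the calls shown above plus standard `transformers` generation arguments.

```python
# minimal end-to-end sketch of the transformers-style INT4 API
# (placeholder model path and prompt; not an official example)
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = '/path/to/model/'   # assumption: a local Hugging Face Transformers checkpoint

# load with INT4 optimizations applied on the fly
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

prompt = "Once upon a time,"     # hypothetical prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output_ids = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```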
##### LangChain API
You may run the models using the LangChain API in bigdl-llm.
- **Using Hugging Face `transformers` INT4 format**

  You may run any Hugging Face Transformers model (with INT4 optimizations applied) using the LangChain API as follows:

  ```python
  from bigdl.llm.langchain.llms import TransformersLLM
  from bigdl.llm.langchain.embeddings import TransformersEmbeddings
  from langchain.chains.question_answering import load_qa_chain

  embeddings = TransformersEmbeddings.from_model_id(model_id=model_path)
  bigdl_llm = TransformersLLM.from_model_id(model_id=model_path, ...)

  doc_chain = load_qa_chain(bigdl_llm, ...)
  output = doc_chain.run(...)
  ```

  See the examples here. (A self-contained usage sketch follows this list.)
- **Using native INT4 format**

  You may also convert Hugging Face Transformers models into native INT4 format, and then run the converted models using the LangChain API as follows.

  **Notes**:

  - Currently only llama/bloom/gptneox/starcoder/chatglm model families are supported; for other models, you may use the Hugging Face `transformers` INT4 format as described above.
  - You may choose the corresponding API developed for specific native models to load the converted model.

  ```python
  from bigdl.llm.langchain.llms import LlamaLLM
  from bigdl.llm.langchain.embeddings import LlamaEmbeddings
  from langchain.chains.question_answering import load_qa_chain

  # switch to ChatGLMEmbeddings/GptneoxEmbeddings/BloomEmbeddings/StarcoderEmbeddings to load other models
  embeddings = LlamaEmbeddings(model_path='/path/to/converted/model.bin')
  # switch to ChatGLMLLM/GptneoxLLM/BloomLLM/StarcoderLLM to load other models
  bigdl_llm = LlamaLLM(model_path='/path/to/converted/model.bin')

  doc_chain = load_qa_chain(bigdl_llm, ...)
  doc_chain.run(...)
  ```

  See the examples here.
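To show how such a QA chain is actually invoked, here is a minimal, self-contained sketch using the Hugging Face `transformers` INT4 LangChain wrappers; the model path, document text, and question are placeholders, and `chain_type="stuff"` / `input_documents` come from LangChain itself rather than from bigdl-llm.

```python
# hypothetical, self-contained question-answering sketch
from bigdl.llm.langchain.llms import TransformersLLM
from langchain.chains.question_answering import load_qa_chain
from langchain.docstore.document import Document

model_path = '/path/to/model/'  # placeholder: local Transformers checkpoint
bigdl_llm = TransformersLLM.from_model_id(model_id=model_path)

# wrap some context text as LangChain documents (placeholder content)
docs = [Document(page_content="bigdl-llm runs large language models on Intel laptops and GPUs using INT4.")]

doc_chain = load_qa_chain(bigdl_llm, chain_type="stuff")
print(doc_chain.run(input_documents=docs, question="What does bigdl-llm do?"))
```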
##### CLI Tool
**Note**: Currently the `bigdl-llm` CLI supports LLaMA (e.g., vicuna), GPT-NeoX (e.g., redpajama), BLOOM (e.g., phoenix) and GPT2 (e.g., starcoder) model architectures; for other models, you may use the `transformers`-style or LangChain APIs.
- **Convert model**

  You may convert the downloaded model into native INT4 format using `llm-convert`.

  ```bash
  # convert PyTorch (fp16 or fp32) model;
  # llama/bloom/gptneox/starcoder model family is currently supported
  llm-convert "/path/to/model/" --model-format pth --model-family "bloom" --outfile "/path/to/output/"

  # convert GPTQ-4bit model
  # only llama model family is currently supported
  llm-convert "/path/to/model/" --model-format gptq --model-family "llama" --outfile "/path/to/output/"
  ```

- **Run model**

  You may run the converted model using `llm-cli` or `llm-chat` (built on top of `main.cpp` in llama.cpp). A combined convert-and-run example is sketched after this list.

  ```bash
  # help
  # llama/bloom/gptneox/starcoder model family is currently supported
  llm-cli -x gptneox -h

  # text completion
  # llama/bloom/gptneox/starcoder model family is currently supported
  llm-cli -t 16 -x gptneox -m "/path/to/output/model.bin" -p 'Once upon a time,'

  # chat mode
  # llama/gptneox model family is currently supported
  llm-chat -m "/path/to/output/model.bin" -x llama
  ```
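Putting the two steps together, a complete CLI workflow for a LLaMA-family model might look like the sketch below; the paths, thread count, and prompt are placeholders rather than values from the official docs.

```bash
# hypothetical end-to-end CLI workflow for a llama-family model
llm-convert "/path/to/llama-model/" --model-format pth --model-family "llama" --outfile "/path/to/output/"

# text completion with the converted model (adjust -t to your core count)
llm-cli -t 16 -x llama -m "/path/to/output/model.bin" -p 'Once upon a time,'

# interactive chat with the converted model
llm-chat -m "/path/to/output/model.bin" -x llama
```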
#### bigdl-llm API Doc
See the initial bigdl-llm API Doc here.
#### bigdl-llm Dependencies
The native code/lib in bigdl-llm has been built using the following tools.
Note that a lower GLIBC version on your Linux system may be incompatible with bigdl-llm.
| Model family | Platform | Compiler | GLIBC |
|---|---|---|---|
| llama | Linux | GCC 11.2.1 | 2.17 |
| llama | Windows | MSVC 19.36.32532.0 | |
| llama | Windows | GCC 13.1.0 | |
| gptneox | Linux | GCC 11.2.1 | 2.17 |
| gptneox | Windows | MSVC 19.36.32532.0 | |
| gptneox | Windows | GCC 13.1.0 | |
| bloom | Linux | GCC 11.2.1 | 2.29 |
| bloom | Windows | MSVC 19.36.32532.0 | |
| bloom | Windows | GCC 13.1.0 | |
| starcoder | Linux | GCC 11.2.1 | 2.29 |
| starcoder | Windows | MSVC 19.36.32532.0 | |
| starcoder | Windows | GCC 13.1.0 | |
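To check whether your Linux system meets the GLIBC requirements above, you may query the installed version, for example:

```bash
# print the installed GLIBC version (Linux only)
ldd --version | head -n 1
```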
[^1]: Performance varies by use, configuration and other factors. `bigdl-llm` may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.